ICM Manual v.3.9
by Ruben Abagyan, Eugene Raush and Max Totrov
Copyright © 2020, Molsoft LLC
Feb 15 2024

Regular expressions (regexp)


Functions supporting regular expressions:

See regexp syntax.

ICM regular expression syntax



Simple expressions


  • . any character except a newline (to match any character at all, use (.|\n) or put (?n) at the beginning of the expression)
  • ^ the beginning of the line
  • $ the end of the line
  • [abc] any character from the list
  • [^abc] any character NOT in the list
  • [a-z] a range, e.g. [0-9] or [0-9A-Z]
  • \c backslash suppresses special meaning of a character
  • \\ backslash itself
  • (string) enclose a simple expression in parentheses to write repetitions, back-references, or field=number expressions in the Split, Match and Replace functions.
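
The simple expressions above can be tried out in any Perl-style regexp engine. The following is a minimal sketch that uses Python's `re` module as a stand-in for the ICM functions (an assumption made for illustration; the variable names and the sample text are hypothetical, and ICM's own Match, Split and Replace calls are not shown here).

```python
import re

text = "chain A\nchain B\nsite 1.5"

# '.' matches any character except a newline, so '.*' stops at the end of the line.
first_line = re.search(r"chain.*", text).group(0)

# '^' anchors to the beginning of a line (Python needs re.MULTILINE for per-line anchors).
line_starts = re.findall(r"^[a-z]+", text, flags=re.MULTILINE)

# [AB] lists the allowed characters; the parentheses capture the matched one.
chain_ids = re.findall(r"chain ([AB])", text)

# [^...] matches any character NOT in the list; a-z and 0-9 are ranges.
not_listed = re.findall(r"[^a-z0-9 .\n]", text)

# A backslash suppresses the special meaning of '.', so '\.' is a literal dot.
literal_dot = re.search(r"[0-9]\.[0-9]", text).group(0)

print(first_line, line_starts, chain_ids, not_listed, literal_dot)
```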

Inline modifiers of regular expressions:

  • (?i) ignore case until the end of the enclosing group, e.g. 'aBc' ~ '(?i)abc'; 'a((?i)bc)d' matches 'aBCd', 'abcd' and 'aBcd', but not 'Abcd' or 'abcD'
  • (?-i) match case-sensitively until the end of the enclosing group, e.g. 'a(?i)bc(?-i)d' matches 'aBCd', but not 'Abcd' or 'abcD'
  • (?n) from this point on, the dot '.' also matches the newline character, e.g. "1bc\nd2" ~ '(?n)1.*2'
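
The effect of these modifiers can be illustrated outside ICM, with the caveat that other engines spell them differently. The sketch below uses Python's `re` module (an assumption, not the ICM implementation): Python writes a group-scoped ignore-case as `(?i:...)` instead of `((?i)...)`, and its flag for letting the dot match a newline is `(?s)` rather than ICM's `(?n)`. The variable names are hypothetical.

```python
import re

# Scoped ignore-case: ICM's 'a((?i)bc)d' corresponds to 'a(?i:bc)d' in Python.
scoped = re.compile(r"a(?i:bc)d")
candidates = ["aBCd", "abcd", "aBcd", "Abcd", "abcD"]
scoped_matches = [s for s in candidates if scoped.fullmatch(s)]

# Ignore-case for the whole pattern, as in 'aBc' ~ '(?i)abc'.
whole_pattern = re.fullmatch(r"(?i)abc", "aBc") is not None

# Dot versus newline: Python's (?s) plays the role of ICM's (?n).
text = "1bc\nd2"
dot_stops_at_newline = re.fullmatch(r"1.*2", text) is None
dot_matches_newline = re.fullmatch(r"(?s)1.*2", text) is not None

print(scoped_matches, whole_pattern, dot_stops_at_newline, dot_matches_newline)
```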

Shortcuts


  • \d matches a digit ( '[0-9]' ). '\d+' matches one or more digits.
  • \D matches a NON-digit. '\D+' matches the text between numbers
  • \w matches a character in a word ( [a-zA-Z_] ). '\w+' matches a word
  • \W matches a NON-word character. '\W+' matches the interword space
  • \s matches a whitespace character, or a separator ( [ \r\t\n\f] )
  • \S matches a non-separator symbol
  • \b matches a word boundary, i.e. a boundary between \w and \W symbols; for example, '\bedge\b' matches inside 'the edge' and does not match inside 'the hedge'
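
A short sketch of the shortcuts, again using Python's `re` module as an assumed stand-in for the ICM functions. One difference to keep in mind: Python's \w also matches digits, whereas the class listed above is [a-zA-Z_], so the word examples below deliberately avoid digits. The sample strings and variable names are hypothetical.

```python
import re

numbers_text = "mass 12 kDa, pI 7"

digits = re.findall(r"\d+", numbers_text)      # runs of digits
non_digits = re.findall(r"\D+", numbers_text)  # the text between the numbers

words = re.findall(r"\w+", "the edge")                 # words
interword = re.findall(r"\W+", "the edge, the hedge")  # interword space and punctuation

fields = re.split(r"\s+", "a \t b\nc")     # split on runs of whitespace
tokens = re.findall(r"\S+", "a \t b\nc")   # runs of non-separator symbols

# \b requires a boundary between a word and a non-word character.
edge_in_edge = re.search(r"\bedge\b", "the edge") is not None
edge_in_hedge = re.search(r"\bedge\b", "the hedge") is not None

print(digits, non_digits, words, interword, fields, tokens, edge_in_edge, edge_in_hedge)
```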

Repetitions and back-references

( a and b are simple regular expressions, e.g. a DNA base [ACTG], or ([hp]anky.*) ):

  • a? - nothing or a single occurrence of a
  • a* - nothing or any number of repetitions of a
  • a+ - matches a one or more times
  • a{n,m} - matches a from n to m times
  • a|b - matches a or b
  • ab - matches a followed by b
  • (a)\1 - \1 is a back-reference: matches a, then matches exactly the same string. Back-references can go from \1 to \9.
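
The repetition operators and back-references can be sketched as follows. Python's `re` module is used as an assumed stand-in for the ICM functions; the DNA-like strings and the variable names are hypothetical.

```python
import re

# a? : nothing or a single occurrence of 'u'
optional = [s for s in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", s)]

# a* accepts the empty string, a+ needs at least one occurrence
star_accepts_empty = re.fullmatch(r"[ACTG]*", "") is not None
plus_accepts_empty = re.fullmatch(r"[ACTG]+", "") is not None

# a{n,m} : from 3 to 4 bases
three_or_four = [s for s in ["AC", "ACT", "ACTG", "ACTGA"] if re.fullmatch(r"[ACTG]{3,4}", s)]

# a|b : either alternative
alternatives = re.findall(r"hanky|panky", "hanky-panky")

# (a)\1 : a two-base unit followed by exactly the same two bases
repeat = re.search(r"([ACTG]{2})\1", "GGACACTT")
repeat_span = repeat.group(0)
repeat_unit = repeat.group(1)

print(optional, star_accepts_empty, plus_accepts_empty, three_or_four, alternatives, repeat_span, repeat_unit)
```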

A problem with the POSIX repetitions


Imagine that you want to match the text between two tags, e.g. <i>one</i>, in a text that contains two items of the same kind (<i>one</i> and <i>two</i>). Unfortunately, we cannot simply use <i>.*</i> to match <i>one</i>, because the POSIX standard tries to match the MAXIMAL LENGTH expression between the italic tags: the match runs from the first <i> to the last </i> and covers all of <i>one</i> and <i>two</i>.
A straightforward solution to this problem is to make the definition of the word between the tags more specific, by saying that the italicized word must not contain the '<' symbol.

ICM followed Perl in using the question mark (?) after the repetition symbol to enforce the minimal match. The minimal match expressions will look like this (a is a simple regular expression, like a character or a string in parentheses ):

  • a?? - nothing or a single occurrence of a, preferring nothing
  • a*? - nothing or any number of repetitions of a, as few as possible (e.g. Match(s,'tag(.*?)endtag':n))
  • a+? - one or more repetitions of a, as few as possible

Therefore:

  • '<i>.*</i>' - matches the entire '<i>one</i> and <i>two</i>' (the .* part takes 'one</i> and <i>two')
  • '<i>[^<]*</i>' - explicitly prohibits a tag inside; matches only the first item, '<i>one</i>'
  • '<i>.*?</i>' - the '*?' expression enforces the smallest match, '<i>one</i>'
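
The three patterns can be compared side by side. The sketch below uses Python's `re` module, which shares the Perl-style '?' suffix for minimal matching; it is an assumed stand-in for the ICM functions, and the variable names are hypothetical.

```python
import re

text = "<i>one</i> and <i>two</i>"

# Maximal match: '.*' runs to the last closing tag.
greedy = re.search(r"<i>.*</i>", text).group(0)

# Prohibiting '<' inside the item stops the match at the first closing tag.
no_tag_inside = re.search(r"<i>[^<]*</i>", text).group(0)

# Minimal match: '.*?' takes as few characters as possible.
minimal = re.search(r"<i>.*?</i>", text).group(0)

# With a minimal match, each tagged item can be extracted separately.
items = re.findall(r"<i>(.*?)</i>", text)

print(greedy, no_tag_inside, minimal, items)
```

The minimal form is usually the more convenient of the two fixes, because it does not require knowing which characters are forbidden inside the item.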



Copyright© 1989-2024, Molsoft, LLC - All Rights Reserved. This document contains proprietary and confidential information of Molsoft, LLC. The content of this document may not be disclosed to third parties, copied or duplicated in any form, in whole or in part, without the prior written permission from Molsoft, LLC.