ICM Manual v.3.9
by Ruben Abagyan, Eugene Raush and Max Totrov. Copyright © 2020, Molsoft LLC. Jun 5 2024
Functions supporting regular expressions:
See regexp syntax.
ICM regular expression syntax
- . any character except a newline (to match any character at all, use (.|\n)
or put (?n) at the beginning of the expression)
- ^ the beginning of the line
- $ the end of the line
- [abc] any character from the list
- [^abc] any character NOT in the list
- [a-z] a range, e.g. [0-9] or [0-9A-Z]
- \c backslash suppresses special meaning of a character
- \\ backslash itself
- (string) enclose a simple expression in parentheses to write
repetitions, back-references, or field=number expressions in
the Split, Match and Replace functions.
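The simple expressions above can be tried out in any Perl-style regexp engine. The following is a minimal sketch using Python's re module purely for illustration; it is not ICM code, the sample string is made up, and ICM's own engine may differ in details.

```python
import re

# Hypothetical sample text with two lines.
text = "chain A: 12.5\nchain B: 7"

# '.' does not match a newline, so '.*' stops at the end of the first line.
first_line = re.match(r".*", text).group(0)

# '(.|\n)*' matches any character at all, newlines included.
everything = re.match(r"(.|\n)*", text).group(0)

# '[0-9]' matches one character from the range.
digits = re.findall(r"[0-9]", text)

# '[^0-9]*' matches a run of characters NOT in the list.
prefix = re.match(r"[^0-9]*", text).group(0)

# '\.' suppresses the special meaning of the dot.
decimal_points = re.findall(r"[0-9]\.[0-9]", text)

print(first_line, digits, prefix, decimal_points, sep="\n")
```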
Inline modifiers of regular expressions:
- (?i) ignore case until the end of the enclosing group, e.g. 'aBc' ~ '(?i)abc'; 'a((?i)bc)d' matches 'aBCd', 'abcd' and 'aBcd', but not 'Abcd' or 'abcD'
- (?-i) match case-sensitively until the end of the enclosing group, e.g. 'a(?i)bc(?-i)d' matches 'aBCd', but not 'Abcd' or 'abcD'
- (?n) make the dot '.' match the newline character as well: "1bc\nd2" ~ '(?n)1.*2'
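The effect of the modifiers can be sketched in Python, with two spelling differences that are assumptions of this illustration rather than ICM syntax: Python writes a scoped case-insensitive group as (?i:...), and it uses (?s) where this manual uses (?n) to let the dot match a newline.

```python
import re

# Only the 'bc' part ignores case; 'a' and 'd' stay case-sensitive.
scoped = re.compile(r"a(?i:bc)d")
candidates = ["aBCd", "abcd", "aBcd", "Abcd", "abcD"]
results = {s: bool(scoped.fullmatch(s)) for s in candidates}

# Without the flag the dot stops at the newline, so no match is found.
without_flag = bool(re.search(r"1.*2", "1bc\nd2"))

# With Python's (?s) the dot also matches the newline.
with_flag = bool(re.search(r"(?s)1.*2", "1bc\nd2"))

print(results, without_flag, with_flag)
```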
- \d matches a digit ( '[0-9]' ).
'\d+' matches one or more digits.
- \D matches a NON-digit. '\D+' matches the text between numbers
- \w matches a character in a word ( [a-zA-Z_] ). '\w+' matches a word
- \W matches a NON-word character. '\W+' matches the interword space
- \s matches a whitespace character, or a separator ( [ \r\t\n\f] )
- \S matches a non-separator symbol
- \b matches a word boundary, i.e. a boundary between \w and \W symbols; for example,
'\bedge\b' matches inside 'the edge' but does not match inside 'the hedge'
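A short sketch of the shortcuts, again in Python's re module for illustration only. Note that Python's \w also accepts digits and non-ASCII letters, which may differ from the character set listed above, so the sample strings are made up to avoid those cases.

```python
import re

sample = "x=10, y=205"

# '\d+' picks out the numbers; '\D+' picks out the text between them.
numbers = re.findall(r"\d+", sample)
between = re.findall(r"\D+", sample)

# '\w+' matches words; '\s+' matches the separators between tokens.
words = re.findall(r"\w+", "the hedge row")
tokens = re.split(r"\s+", "a \t b\nc")

# '\b' requires a word boundary on both sides of 'edge'.
in_edge = bool(re.search(r"\bedge\b", "the edge"))
in_hedge = bool(re.search(r"\bedge\b", "the hedge"))

print(numbers, between, words, tokens, in_edge, in_hedge)
```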
Repetitions and back-references (a and b are simple regular expressions, e.g. a DNA base [ACTG], or ([hp]anky.*)):
- a? - nothing or a single occurrence of a
- a* - nothing or any number of repetitions of a
- a+ - matches a one or more times
- a{n,m} - matches a from n to m times
- a|b - matches a or b
- ab - matches a followed by b
- (a)\1 - \1 is a back-reference: matches a, then matches exactly the same string.
Back-references can go from \1 to \9.
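The repetition operators and a back-reference can be illustrated as follows. This is a sketch in Python's re module, not ICM code, and the DNA strings and words are made-up test inputs.

```python
import re

# 'u?' makes the 'u' optional: nothing or a single occurrence.
optional = [bool(re.fullmatch(r"colou?r", s)) for s in ("color", "colour", "colouur")]

# '{2,4}' allows from two to four DNA bases.
bounded = [bool(re.fullmatch(r"[ACTG]{2,4}", s)) for s in ("A", "AC", "ACTG", "ACTGA")]

# '+' requires at least one base, '*' also accepts the empty string.
plus_empty = bool(re.fullmatch(r"[ACTG]+", ""))
star_empty = bool(re.fullmatch(r"[ACTG]*", ""))

# 'a|b' matches either alternative.
either = re.findall(r"hanky|panky", "hanky panky")

# '([ACTG])\1' matches a base followed by exactly the same base again;
# findall returns the captured base of each doubled pair.
doubled = re.findall(r"([ACTG])\1", "AATCGGT")

print(optional, bounded, plus_empty, star_empty, either, doubled)
```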
A problem with the POSIX repetitions
Imagine that you want to match the text between
two tags, e.g. <i>one</i>, in a text that has
two items of the same kind (<i>one</i> and <i>two</i>).
Unfortunately, we cannot simply use <i>.*</i> to match <i>one</i>,
since the POSIX standard looks for the MAXIMAL LENGTH match:
the expression matches everything from the first <i> to the last </i>,
i.e. the whole of '<i>one</i> and <i>two</i>'.
A straightforward solution to this problem is to define the word between
the tags more precisely, by saying that the italicized word must not contain the '<' symbol.
ICM followed Perl in using the question mark (?) after the repetition symbol to enforce the minimal match.
The minimal match expressions will look like this (a is a simple regular expression, like a character or a string in parentheses ):
- a?? - nothing or a single occurrence of a, preferring nothing
- a*? - nothing or any number of repetitions of a, as few as possible
(e.g. Match(s,'tag(.*?)endtag':n))
- a+? - one or more repetitions of a, as few as possible
Therefore:
- '<i>.*</i>' - the '.*' part matches the entire 'one</i> and <i>two'
- '<i>[^<]*</i>' - explicitly prohibits a tag inside; matches only the first word
- '<i>.*?</i>' - the '*?' expression enforces the smallest match
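The three variants can be compared side by side. The sketch below uses Python's re module, which shares the Perl-style '*?' syntax; it illustrates the behaviour described above and is not ICM code.

```python
import re

text = "<i>one</i> and <i>two</i>"

# Greedy: '.*' runs to the last '</i>' in the text.
greedy = re.search(r"<i>.*</i>", text).group(0)

# Excluding '<' inside the tags stops the match at the first closing tag.
no_tag_inside = re.search(r"<i>[^<]*</i>", text).group(0)

# Minimal: '.*?' stops at the first '</i>' it can reach.
minimal = re.search(r"<i>.*?</i>", text).group(0)

# With a minimal group, each italicized word is found separately.
all_words = re.findall(r"<i>(.*?)</i>", text)

print(greedy, no_tag_inside, minimal, all_words, sep="\n")
```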