ICM regular expression syntax

Simple expressions

. any character except a newline ( to match anything, use (.|\n) or put (?n) at the beginning of the expression )
^ the beginning of the line
$ the end of the line
[abc] any character from the list
[^abc] any character NOT in the list
[a-z] a range, e.g. [0-9] or [0-9A-Z]
\c backslash suppresses special meaning of a character
\\ backslash itself
(string) enclose a simple expression in parentheses to write repetitions, back-references, or field=number expressions in the Split, Match and Replace functions.
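
The simple expressions above can be tried out in any regex engine that shares this core syntax. The sketch below uses Python's re module as a stand-in, not ICM itself; the sample strings and variable names are made up for illustration, and ICM's own functions may behave differently in details.

```python
import re

text = "alpha 42\nbeta-7"  # hypothetical two-line sample

# '.' stops at a newline, so the match ends with the first line.
first_line = re.search(r"a.*", text).group(0)

# (.|\n) also accepts the newline, so the match can span both lines.
both_lines = re.search(r"4(.|\n)*7", text).group(0)

# [0-9] is a range; '+' repeats it to collect whole numbers.
numbers = re.findall(r"[0-9]+", text)

# [^...] negates the list: anything that is not a lowercase letter,
# a digit or a newline.
others = re.findall(r"[^a-z0-9\n]", text)

# A backslash removes the special meaning of '.', so only a literal dot matches.
literal_dot = re.findall(r"\.", "a.b,c")

print(repr(first_line), repr(both_lines), numbers, others, literal_dot)
```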

Inline modifiers of regular expressions:

(?i) ignore case until the end of the enclosing group, e.g. 'aBc' ~ '(?i)abc'; 'a((?i)bc)d' matches 'aBCd', 'abcd' and 'aBcd', but not 'Abcd' or 'abcD'
(?-i) match case-sensitively until the end of the enclosing group, e.g. 'a(?i)bc(?-i)d' matches 'aBCd', but not 'Abcd' or 'abcD'
(?n) let the dot '.' also match the newline character, e.g. "1bc\nd2" ~ '(?n)1.*2'
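
As a rough cross-check of the modifier examples, here is a sketch in Python's re module. Python is only a stand-in and spells two things differently, which is an assumption of this sketch rather than ICM syntax: a flag limited to a group is written (?i:...), and the counterpart of (?n) is (?s), placed at the start of the pattern.

```python
import re

# Whole-pattern case-insensitive match, as in 'aBc' ~ '(?i)abc'.
whole = re.fullmatch(r"(?i)abc", "aBc") is not None

# ICM writes 'a((?i)bc)d'; in Python the scoped form is (?i:bc).
scoped = re.compile(r"a(?i:bc)d")
candidates = ["aBCd", "abcd", "aBcd", "Abcd", "abcD"]
accepted = [s for s in candidates if scoped.fullmatch(s)]

# ICM writes '(?n)1.*2'; in Python the dot-matches-newline flag is (?s).
spans_newline = re.search(r"(?s)1.*2", "1bc\nd2") is not None
without_flag = re.search(r"1.*2", "1bc\nd2") is not None

print(whole, accepted, spans_newline, without_flag)
```

Only the 'bc' part ignores case in the scoped pattern, which is why 'Abcd' and 'abcD' are rejected.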

Shortcuts

\d matches a digit ( '[0-9]' ). '\d+' matches one or more digits.
\D matches a NON-digit. '\D+' matches the text between numbers
\w matches a character in a word ( [a-zA-Z_] ). '\w+' matches a word
\W matches a NON-word character. '\W+' matches the interword space
\s matches a whitespace character, or a separator ( [ \r\t\n\f] )
\S matches a non-separator symbol
\b matches a word boundary, i.e. a boundary between \w and \W symbols. For example, '\bedge\b' matches inside 'the edge' and does not match inside 'the hedge'
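
The shortcuts can be illustrated with a short sketch in Python's re module, used here as a stand-in for ICM. The sample strings are hypothetical. Python's \w also accepts digits, so the word examples below deliberately avoid digits.

```python
import re

# \d+ collects runs of digits; \D+ matches the text between them.
digits = re.findall(r"\d+", "12, 7 and 305")
pieces = re.split(r"\D+", "12, 7 and 305")

# \w+ matches words; \W+ matches the gaps between words.
words = re.findall(r"\w+", "the edge, the hedge")

# \s+ matches any run of whitespace separators.
fields = re.split(r"\s+", "a  b\tc\nd")

# \b requires a word boundary on both sides of 'edge'.
in_edge = re.search(r"\bedge\b", "the edge") is not None
in_hedge = re.search(r"\bedge\b", "the hedge") is not None

print(digits, pieces, words, fields, in_edge, in_hedge)
```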

Repetitions and back-references

( a and b are simple regular expressions, e.g. a DNA base [ACTG], or ([hp]anky.*) ):

a? - nothing or a single occurrence of a
a* - nothing or any number of repetitions of a
a+ - matches a one or more times
a{n,m} - matches a from n to m times
a|b - matches a or b
ab - matches a followed by b
(a)\1 - \1 is a back-reference: matches a, then matches exactly the same string. Back-references can go from \1 to \9.
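
A small sketch of the repetition operators and a back-reference, again using Python's re module as a stand-in for ICM. The DNA string and the other inputs are invented for illustration.

```python
import re

# a? : the 'u' is optional, so both spellings match.
optional = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

# a{n,m} : between 3 and 5 DNA bases.
bounded = [s for s in ["AC", "ACGT", "ACGTAC"] if re.fullmatch(r"[ACTG]{3,5}", s)]

# a|b : either alternative.
either = re.findall(r"cat|dog", "cat, dog, cow")

# (a)\1 : two bases followed by exactly the same two bases again.
m = re.search(r"([ACTG]{2})\1", "TTACACGG")
repeat, unit = m.group(0), m.group(1)

print(optional, bounded, either, repeat, unit)
```

The back-reference \1 matches the text that the group actually captured, not the group's pattern, which is why 'ACAC' is found but 'TTAC' is not.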

A problem with POSIX repetitions

Imagine that you want to match the text between two tags, e.g. <i>one</i>, in a text that has two items of the same kind ( <i>one</i> and <i>two</i> ). Unfortunately, we cannot just use <i>.*</i> to match <i>one</i>, since the POSIX standard tries to match the MAXIMAL LENGTH expression between the italic tags: the match starts at the first <i> and ends at the last </i>, so it covers both items.
A straightforward solution to this problem is to define the word between the tags more strictly, by saying that the italicized word should not contain the '<' symbol.

ICM followed Perl in using the question mark (?) after the repetition symbol to enforce the minimal match. The minimal match expressions will look like this (a is a simple regular expression, like a character or a string in parentheses ):

a?? - nothing or a single occurrence of a, preferring nothing
a*? - nothing or any number of repetitions of a, as few as possible (e.g. Match(s,'tag(.*?)endtag':n))
a+? - matches a one or more times, as few as possible

Therefore:

'<i>.*</i>' - matches the entire '<i>one</i> and <i>two</i>'
'<i>[^<]*</i>' - explicitly prohibits a tag inside; matches only the first word
'<i>.*?</i>' - the '*?' expression enforces the smallest match
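
The three alternatives can be compared side by side. This is a minimal sketch in Python's re module, which shares the Perl-style '*?' syntax; it is a stand-in for ICM's Match function, and the sample text with italic tags is assumed for illustration.

```python
import re

text = "<i>one</i> and <i>two</i>"

# Greedy: '.*' runs to the LAST closing tag, swallowing both items.
greedy = re.search(r"<i>(.*)</i>", text).group(1)

# Negated class: the word may not contain '<', so it stops at the first tag.
no_tag = re.search(r"<i>([^<]*)</i>", text).group(1)

# Minimal: '.*?' stops at the FIRST closing tag.
minimal = re.search(r"<i>(.*?)</i>", text).group(1)

# With a minimal match, every item can be collected separately.
all_items = re.findall(r"<i>(.*?)</i>", text)

print(repr(greedy), repr(no_tag), repr(minimal), all_items)
```

The negated class and the minimal match agree here, but they differ when the enclosed text itself contains a '<': only the minimal form would still match.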