ICM Language Reference : Regular expressions (regexp)

ICM Manual v.3.9 by Ruben Abagyan, Eugene Raush and Max Totrov
Copyright © 2020, Molsoft LLC
Feb 15 2024

[ Regexp syntax ]

Functions supporting regular expressions:

Match - match expressions in a string or sarray
Replace - replace expressions in a string or sarray
Index - find substring position and length
Split - by a regular expression

See regexp syntax.

ICM regular expression syntax

[ Simple expressions | Shortcuts | Regexp back references | Greedy matching ]

Simple expressions

. any character except a newline (to match any character at all, use (.|\n) or put (?n) at the beginning of the expression)
^ the beginning of the line
$ the end of the line
[abc] any character from the list
[^abc] any character NOT in the list
[a-z] a range, e.g. [0-9] or [0-9A-Z]
\c backslash suppresses special meaning of a character
\\ backslash itself
(string) enclose a simple expression in parentheses to write repetitions, back-references, or field=number expressions in the Split, Match and Replace functions.
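
The simple expressions above behave the same way in most regular expression engines. The sketch below illustrates them with Python's re module; it is not ICM code, and the sample strings are made up for the illustration.

```python
import re

# '.' matches any character except a newline.
dot_hits = re.findall(r"a.c", "abc a-c a\nc")            # ['abc', 'a-c']

# '^' and '$' anchor the match to the beginning and the end of the line.
anchored = re.search(r"^ATG[ACGT]*$", "ATGCCGTA") is not None   # True

# [abc], [^abc] and [a-z]: a list, a negated list and a range.
vowels = re.findall(r"[aeiou]", "regexp")                # ['e', 'e']
non_digits = re.findall(r"[^0-9]", "a1b2")               # ['a', 'b']
hex_runs = re.findall(r"[0-9A-F]+", "7F and 1C")         # ['7F', '1C']

# A backslash suppresses the special meaning of the next character.
literal_dot = re.findall(r"1\.5", "1.5 1x5")             # ['1.5']
```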

Inline modifiers of regular expressions:

(?i) ignore case until the end of the same enclosing group, e.g. 'aBc' ~ '(?i)abc'; 'a((?i)bc)d' matches 'aBCd', 'abcd' and 'aBcd', but not 'Abcd' or 'abcD'
(?-i) match case-sensitively until the end of the same enclosing group, e.g. 'a(?i)bc(?-i)d' matches 'aBCd', but not 'Abcd' or 'abcD'
(?n) make the dot '.' match the newline character as well, e.g. "1bc\nd2" ~ '(?n)1.*2'
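
For comparison only, here is how the same effects can be reproduced with Python's re module. The spelling differs from ICM and is an assumption of this sketch: Python scopes case-insensitivity with (?i:...) rather than toggling it with (?i) and (?-i) in mid-pattern, and it uses (?s) where ICM uses (?n).

```python
import re

# (?i) at the start of the pattern ignores case for the whole expression.
whole = re.fullmatch(r"(?i)abc", "aBc") is not None      # True

# Python's group-scoped form: only 'bc' is matched case-insensitively.
scoped = re.compile(r"a(?i:bc)d")
candidates = ["aBCd", "abcd", "aBcd", "Abcd", "abcD"]
accepted = [s for s in candidates if scoped.fullmatch(s)]
# accepted is ['aBCd', 'abcd', 'aBcd']

# Python's counterpart of (?n) is (?s): the dot then also matches a newline.
without_flag = re.search(r"1.*2", "1bc\nd2") is not None     # False
with_flag = re.search(r"(?s)1.*2", "1bc\nd2") is not None    # True
```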

Shortcuts

\d matches a digit ( '[0-9]' ). '\d+' matches one or more digits.
\D matches a NON-digit. '\D+' matches the text between numbers
\w matches a character in a word ( [a-zA-Z_] ). '\w+' matches a word
\W matches a NON-word character. '\W+' matches the interword space
\s matches a whitespace character, or a separator ( [ \r\t\n\f] )
\S matches a non-separator symbol
\b matches a word boundary, i.e. a boundary between \w and \W symbols, for example, '\bedge\b' matches inside 'the edge' and does not match inside 'the hedge'
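
The shortcuts can be tried out in any engine that supports them. The following sketch uses Python's re module, not ICM. One difference to keep in mind: Python's \w also matches digits, whereas the definition above lists only [a-zA-Z_], so the sample strings avoid digits inside words.

```python
import re

numbers = re.findall(r"\d+", "chain A has 125 residues, chain B has 98")
# ['125', '98']

# \D+ as a separator: everything between the numbers is skipped.
fields = re.split(r"\D+", "10, 20 and 30")               # ['10', '20', '30']

words = re.findall(r"\w+", "the edge")                   # ['the', 'edge']
tokens = re.split(r"\W+", "one, two; three")             # ['one', 'two', 'three']

# \s+ splits on runs of whitespace, \S+ collects the runs in between.
columns = re.split(r"\s+", "a  b\tc\nd")                 # ['a', 'b', 'c', 'd']
chunks = re.findall(r"\S+", "x=1  y=2")                  # ['x=1', 'y=2']

# \b requires a word boundary on both sides of 'edge'.
in_edge = re.search(r"\bedge\b", "the edge") is not None     # True
in_hedge = re.search(r"\bedge\b", "the hedge") is not None   # False
```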

Repetitions and back-references
( a and b are simple regular expressions, e.g. a DNA base [ACTG], or ([hp]anky.*) ):

a? - nothing or a single occurrence of a
a* - nothing or any number of repetitions of a
a+ - matches a one or more times
a{n,m} - matches a from n to m times
a|b - matches a or b
ab - matches a followed by b
(a)\1 - \1 is a back-reference: matches a, then matches exactly the same string. Back-references can go from \1 to \9.
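
A short illustration of repetitions and back-references, written with Python's re module rather than ICM. The DNA strings are invented for the example.

```python
import re

optional = re.findall(r"colou?r", "color colour")        # ['color', 'colour']

# b* accepts zero repetitions, b+ requires at least one.
star = re.fullmatch(r"ab*", "a") is not None             # True
plus = re.fullmatch(r"ab+", "a") is not None             # False

# {3,4}: from three to four bases in a row.
counted = re.findall(r"[ACGT]{3,4}", "ACGTAC GT ACG")    # ['ACGT', 'ACG']

# a|b: either alternative.
either = re.findall(r"hanky|panky", "hanky-panky")       # ['hanky', 'panky']

# (a)\1: two bases followed by exactly the same two bases.
repeat = re.search(r"([ACGT]{2})\1", "TTACACGG")
repeated_unit = repeat.group(1)                          # 'AC'
whole_repeat = repeat.group(0)                           # 'ACAC'
```

Note that \1 repeats the text that the group actually matched, not the pattern: ([ACGT]{2})\1 finds 'ACAC' but would not accept 'ACGT'.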

A problem with the POSIX repetitions

Imagine that you want to match the text between two tags, e.g. <i>one</i>, in a text which has two items of the same kind ( <i>one</i> and <i>two</i> ). Unfortunately, we cannot just use .* between the tags to match one, since the POSIX standard tries to match the MAXIMAL LENGTH expression between the italic tags: the match starts at the first opening tag and ends at the last closing tag, so it covers both one and two.
A straightforward solution to this problem is to make a more specific definition of the word between the tags, by saying that the italicized word should not contain the '<' symbol.

ICM followed Perl in using the question mark (?) after the repetition symbol to enforce the minimal match. The minimal match expressions will look like this (a is a simple regular expression, like a character or a string in parentheses ):

a?? - nothing or a single occurrence of a, preferring nothing
a*? - nothing or any number of repetitions of a, as few as possible (e.g. Match(s,'tag(.*?)endtag':n))
a+? - matches a one or more times, as few times as possible

Therefore:

'<i>.*</i>' - matches the entire '<i>one</i> and <i>two</i>'
'<i>[^<]*</i>' - explicitly prohibits a tag inside; matches only the first word
'<i>.*?</i>' - the '*?' expression enforces the smallest match; matches only the first word
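
The difference between the maximal and the minimal match can be checked with any Perl-style engine. The sketch below uses Python's re module, not ICM, and assumes that the tags in question are the HTML italic tags <i> and </i>.

```python
import re

text = "<i>one</i> and <i>two</i>"

# Greedy: .* runs to the LAST closing tag, swallowing both items.
greedy = re.findall(r"<i>(.*)</i>", text)        # ['one</i> and <i>two']

# Excluding '<' keeps each match inside a single pair of tags.
no_tag = re.findall(r"<i>([^<]*)</i>", text)     # ['one', 'two']

# Minimal: .*? stops at the FIRST closing tag it can reach.
minimal = re.findall(r"<i>(.*?)</i>", text)      # ['one', 'two']

# The first match alone is the first word.
first = re.search(r"<i>(.*?)</i>", text).group(1)    # 'one'
```

The [^<]* form is the more restrictive of the two fixes: it fails on an item that legitimately contains '<', while the minimal match does not.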

Copyright© 1989-2024, Molsoft,LLC - All Rights Reserved. This document contains proprietary and confidential information of Molsoft, LLC. The content of this document may not be disclosed to third parties, copied or duplicated in any form, in whole or in part, without the prior written permission from Molsoft, LLC.