ICM Language Reference. Parsing XML example: DrugBank.
The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6826 drug entries including 1431 FDA-approved small molecule drugs, 133 FDA-approved biotech (protein/peptide) drugs, 83 nutraceuticals and 5211 experimental drugs. Additionally, 4435 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data.
The most complete drug information (target, transporter, carrier, and enzyme information) is provided in XML format. Chemical structures are provided separately in SDF format.
The following example will demonstrate how to deal with such data in ICM.
    read xml "http://www.drugbank.ca/system/downloads/current/drugbank.xml.zip" name="drugbank"

The command above creates a collection object named "drugbank".
    icm/def> Name( drugbank )
    #>S string_array
    drugs

This shows that the collection contains a single root node called "drugs".
    icm/def> Name( drugbank["drugs"] )
    #>S string_array
    drug
    partners
    xmlns
    xmlns:xs
    xs:schemaLocation
    icm/def> Type( drugbank["drugs","drug"] )
    array
    icm/def> Type( drugbank["drugs","partners"] )
    collection
    icm/def> Name( drugbank["drugs","partners"] )
    #>S string_array
    partner
    icm/def> Type( drugbank["drugs","partners","partner"] )
    array

This means that drugbank["drugs","drug"] is an array in which each entry contains the information about a particular drug. In addition, there is another array, drugbank["drugs","partners","partner"], which contains additional information about targets.
Individual entries can be displayed by index:

    drugbank["drugs","drug"][1]
    drugbank["drugs","drug"][2]
    drugbank["drugs","partners","partner"][1]
    drugbank["drugs","partners","partner"][2]

The default output format for displaying a collection is JSON, which gives nicely formatted, easy-to-read text. Looking at the output, it is easy to find the fields of interest.
WARNING: do not try to print the entire array to the terminal window. It will take a very long time, and you will most likely need to kill the window.
Let's create a table with a single column containing an array with drug cards.
    add column drugs drugbank["drugs","drug"]
Hint: in the GUI you can resize all rows simultaneously by holding the 'CTRL' key while resizing an individual row.
A single field can be extracted by providing a dot-separated path to it. Note that field names containing non-alphanumeric characters must be quoted.

    # extract drugbank-id into a separate column
    add column drugs function="A.'drugbank-id'" name="drugbank_id"
    # extract name into a separate column
    add column drugs function="A.name" name="name"
Multiple properties will be extracted as an array for each drug entry.
    # display target information for the second entry
    drugs.A[2]["targets","target"]
    # extract the array of partner IDs for each drug into a separate column
    add column drugs function = "A.targets.target.partner" name="partner_id"
    Type( drugs.partner_id[2] ) # array

This way of extracting multiple properties has one problem: for entries with only one property, the result is not an array but an individual value (e.g. check Type( drugs.partner_id[1] )). This prevents uniform access to the column later. In such cases it is recommended to use the ':' operator instead of '.'. The result of this operation is always an array, even for single entries.
    delete drugs.partner_id
    # ':' creates an array for all entries
    add column drugs function="A.targets.target:partner" name="partner_id"
    Type( drugs.partner_id[1] ) # array (even for single entries)
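The difference between '.' and ':' described above can be pictured with a small Python sketch. This is not ICM code and says nothing about how ICM implements the operators; it assumes (as the Type() results above suggest) that an XML element occurring once is stored as a single collection and a repeated element as an array. The helper names `extract` and `extract_as_array` and the partner IDs are hypothetical.

```python
def extract(node, path):
    """Mimic a '.' path: follow keys, mapping over any list met on the way."""
    for key in path:
        if isinstance(node, list):
            node = [item[key] for item in node]
        else:
            node = node[key]
    return node


def extract_as_array(node, path):
    """Mimic a ':' final step: same lookup, but the result is always a list."""
    value = extract(node, path)
    return value if isinstance(value, list) else [value]


PATH = ["targets", "target", "partner"]

# Hypothetical drug cards: a lone <target> parses to a dict, repeated ones to a list.
one_target = {"targets": {"target": {"partner": 54}}}
two_targets = {"targets": {"target": [{"partner": 54}, {"partner": 101}]}}

print(extract(two_targets, PATH))          # [54, 101]
print(extract(one_target, PATH))           # 54
print(extract_as_array(one_target, PATH))  # [54]
```

With the always-array form, code that later iterates over the column does not need a special case for single-valued rows.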
Let's say you want to extract the value of a property whose name starts with "logP". This can be done much like ICM table filtering operations. The only difference is that a colon ':' (instead of a dot) must be used to separate the queried field name.
The general filtering syntax:

    <field1>.<field2>:<queryField> <op> <value>

The following operations are supported in array filtering: ==, !=, >, <, >=, <=, ~, !~
Example:

    # query and extract the logP property
    add column drugs function="(A.'experimental-properties'.property:kind ~ '^logP').value[1]" name="logP"

Note that some entries contain text information ('0.61 [HANSCH,C ET AL. (1995)]'), so the resulting column will not be automatically converted to an rarray. You can convert it explicitly:
    # empty or 'bad' entries will be marked as 'ND'
    add column drugs Rarray( drugs.logP ) name="logPNum"
    delete drugs.logP

Another example extracts Wikipedia links:
    add column drugs \
      function="(A.'external-links'.'external-link':resource == 'Wikipedia')[1].url" \
      name="wiki"
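As an illustration of what the two queries above select, here is a minimal Python sketch of array filtering. It is an analogy, not ICM's implementation. It assumes that '~' and '!~' are regular-expression match and non-match (suggested by the '^logP' pattern), and the function `query`, the sample records, and the URLs are hypothetical. Note also that ICM indices start at 1, while Python indices start at 0.

```python
import operator
import re

# One Python callable per filtering operation listed above.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
    "~": lambda value, pattern: re.search(pattern, value) is not None,
    "!~": lambda value, pattern: re.search(pattern, value) is None,
}


def query(items, field, op, value):
    """Keep the records whose `field` satisfies `op value`; always returns a list."""
    if not isinstance(items, list):
        items = [items]
    return [item for item in items if OPS[op](item[field], value)]


# Hypothetical records shaped like the fields used in the examples above.
properties = [
    {"kind": "Melting Point", "value": "100"},
    {"kind": "logP", "value": "0.61 [HANSCH,C ET AL. (1995)]"},
]
links = [
    {"resource": "RxList", "url": "http://example.org/rxlist"},
    {"resource": "Wikipedia", "url": "http://example.org/wiki"},
]

print(query(properties, "kind", "~", "^logP")[0]["value"])
print(query(links, "resource", "==", "Wikipedia")[0]["url"])
```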
For each drug entry we have a list of partner IDs that refer to entries in the drugbank["drugs","partners","partner"] array. To join them, we need to add this array to another table and extract the fields that will be used in the join.
    # create a table and put the partner entries there
    add column partners drugbank["drugs","partners","partner"]
    # extract the ID column, which will be used to join with drugs.partner_id
    add column partners function = "A.id" name="id"
    # extract uniprot-id from the "external-identifiers" array using a query
    add column partners \
      function = '(A."external-identifiers"."external-identifier":resource ~ "UniProtKB")."identifier"[1]' \
      name = "uniprot_id"

Finally, we need to join drugs.partner_id with partners.id.
    join drugs.partner_id partners.id column ="drugs.*,partners.uniprot_id" name="drugs"

Note that since drugs.partner_id contains multiple entries for each row, the resulting drugs.uniprot_id will also contain multiple entries for each row. You can use the set format command to define a special action that is executed when a particular UniProt entry is clicked.
    # load sequence
    set format drugs.uniprot_id \
      "<!--icmscript name=\"1\"\nread sequence swiss \"http://www.uniprot.org/uniprot/%1.txt\"\n--><a href=#_>%1</a>"
    # or simply go to the website
    set format drugs.uniprot_id "<a href=http://www.expasy.org/uniprot/%1>%1</a>"
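The effect of joining on a multi-valued key, as in the join of drugs.partner_id with partners.id above, can be sketched in Python. This is only an analogy: the function `join_multi`, the sample rows, and the identifier strings are hypothetical, and dropping partner IDs that have no match is a choice made for this sketch, not documented ICM behavior.

```python
def join_multi(drugs, partners):
    """Attach to each drug row the list of uniprot_id values of its partners."""
    uniprot_by_id = {p["id"]: p["uniprot_id"] for p in partners}
    joined = []
    for drug in drugs:
        row = dict(drug)  # copy so the input rows stay untouched
        # One output value per partner ID; unmatched IDs are skipped (sketch choice).
        row["uniprot_id"] = [
            uniprot_by_id[pid] for pid in drug["partner_id"] if pid in uniprot_by_id
        ]
        joined.append(row)
    return joined


# Hypothetical rows; the identifiers are placeholders, not real accessions.
drugs = [
    {"name": "drug A", "partner_id": [54, 101]},
    {"name": "drug B", "partner_id": [54]},
]
partners = [
    {"id": 54, "uniprot_id": "UNIPROT_X"},
    {"id": 101, "uniprot_id": "UNIPROT_Y"},
]

for row in join_multi(drugs, partners):
    print(row["name"], row["uniprot_id"])
```

Each joined row carries as many uniprot_id values as it has matching partner IDs, which is why a per-entry click action is useful for that column.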
Chemical structures can be added from the separately provided SDF file:

    # read SDF from the website
    read table mol "http://www.drugbank.ca/system/downloads/current/structures/all.sdf.zip" name="drugs_chem"
    # join 'mol' column
    join drugs.drugbank_id drugs_chem.DRUGBANK_ID column="drugs.*,drugs_chem.mol" name="drugs"

A little more rearrangement and your table is ready to be exported to an SDF file.
    move drugs.mol 1          # move structure column to the first position
    delete drugs.A            # delete drug-card information
    delete drugs.partner_id   # delete partner id information
    write table mol drugs "mydrugs.sdf"
See also: collection, read xml