THOR 'tree' format explained


Abstract

This document describes the organization and notation of 'THOR data trees': 
how THOR data is logically structured, and how it is actually written.  In 
theory, it provides all the information required to design new datatypes 
for THOR and to write programs for parsing and transforming existing THOR 
trees.

This document will proceed from the abstract fundamentals of THOR data 
representation to actual examples with real chemical data.  If this style 
does not suit you, perhaps you should read it from back to front.


Terminology

A THOR (= THesaurus ORiented) database stores chemical information 
based on the primary structural key of SMILES.  All information pertaining 
to a given molecular topology is gathered together under a SMILES.  This 
fundamental unit of storage is referred to as a 'tree'.  The tree is said 
to be 'rooted' at the SMILES.  Synonyms for SMILES include 'structure' and 
'molecule'.

Each tree may store many kinds of information about the structure--measured
and calculated logP (in various solvent systems), molecular formula, CAS 
number, and so on.  Each 'kind' of information is referred to as a 
'datatype'.  Each datatype may be present one or more times in the tree, or 
may be absent entirely, with the sole exception of the SMILES datatype which 
should be present once and only once per tree.

[Note: this last statement is something of a fudge, since 'indirect' trees
do not contain SMILES, but more on that later] 

Datatype definition

THOR does not have fixed or built-in datatypes--each kind of information 
stored in THOR is 'defined' by a second source.  The utility of this is 
that one is free to add any new datatype one desires.

THTAG.FMT is the file that defines the contents of each datatype.  
It contains 'prototypes' of the general format:

        [$]internal_tag ;tag;field;#field;

where
        internal_tag    regular datatype
        $internal_tag   identifier datatype
        tag             name by which one refers to datatype
        field           direct data
        #field          indirect data
        [digit]field    numeric data (i.e. many values)

Each sub-field is delimited by a semi-colon.  Note that this implies that 
the ';' is an illegal character in THOR data!

Some examples:

REMARK is a regular datatype with a single field.
        REM  ;REMARK;

NAME is an identifier datatype with a single field.
        $NAM ;NAME;

CLOGP has an indirect reference to 'error level', and a direct version stamp.
        CP   ;CLOGP;#ERROR LEV;VERSION;

LOGP is a complex datatype with 7 fields: 4 direct (counting the logP value 
itself) and 3 indirect.
        P    ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;

IKEY is a special datatype that references indirectly stored data.  See the
section on indirect trees below.

Notes:

Originally, 'internal_tags' were made short and cryptic in order to save 
storage space.  There is no fundamental reason why they couldn't be longer 
or more descriptive.

Any line in THTAG.FMT beginning with one of '#', '!' or ' ' is a comment, 
and blank lines are likewise ignored.
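
The prototype rules above can be sketched in a few lines of Python.  This is 
only an illustration, not part of THOR: the names Prototype and 
parse_prototype are invented here, the datatype name is counted as the first 
field (which is how the examples above count fields), and the '[digit]field' 
form is not handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prototype:
    internal_tag: str     # tag as written in trees, e.g. "P" or "$NAM"
    is_identifier: bool   # True when the internal tag starts with '$'
    fields: List[str]     # field names in order; the first is the datatype name
    indirect: List[bool]  # parallel to fields; True for '#field' entries


def parse_prototype(line: str) -> Optional[Prototype]:
    """Parse one THTAG.FMT line; return None for comment and blank lines."""
    line = line.rstrip("\r\n")
    if not line.strip() or line[0] in "#! ":
        return None
    parts = line.split(";")
    if parts[-1].strip() == "":
        parts.pop()               # drop the empty piece after the final ';'
    if len(parts) < 2:
        raise ValueError(f"not a prototype: {line!r}")
    internal_tag = parts[0].strip()
    names = [p.strip() for p in parts[1:]]
    return Prototype(
        internal_tag=internal_tag,
        is_identifier=internal_tag.startswith("$"),
        fields=[n.lstrip("#") for n in names],
        indirect=[n.startswith("#") for n in names],
    )
```

Note that the prototypes in this document are indented for display only; a 
line that really began with a blank would be treated as a comment.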

Representation of Trees

Recall that a 'tree' is a collection of datatypes, 'rooted' at a 
fundamental identifier datatype.  The tree notation must delimit the 
datatypes from each other, and the content (i.e., numeric and text data) of 
each datatype from the tag itself.  THOR notation employs three special 
characters: 
        <       to begin data
        >       to end data
        |       to terminate the tree

Hence datatypes in a tree are written in the form:
        tag<data>

A complete tree consists of one or more datatypes followed by the terminating 
delimiter '|', so a complete, if short, THOR tree would be:
        $SMI<CCO>       tag = $SMI      data = CCO
        PCN<ETHANOL>    tag = PCN       data = ETHANOL
        |
        
Notes: 

Trees are usually stored in 'list' format, with one datatype per line, but 
some older files may have the entire tree on a single line.  The conversion 
to a datatype-per-line format involves merely adding a newline after each 
'>'.

The complete set of reserved characters in THOR data is '<', '>', '|' and ';'.
At present there is no mechanism for quoting these items to be treated as 
regular data. 
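
Because the reserved characters can never appear in data, a tree can be 
tokenized with a single regular expression.  The following Python sketch 
makes that concrete; the names iter_trees and format_tree are invented for 
this example, and it assumes that whitespace between datatypes is 
insignificant, so the same code reads both the list format and the older 
single-line format.

```python
import re
from typing import Iterator, List, Tuple

Datatype = Tuple[str, str]   # (tag, data)

# Either tag<data> or the tree terminator '|'.  Tags contain no whitespace;
# data may contain blanks but none of the reserved characters '<', '>', '|'.
_TOKEN = re.compile(r"\s*(?:([^<>|\s]+)<([^<>|]*)>|(\|))")


def iter_trees(text: str) -> Iterator[List[Datatype]]:
    """Yield each tree in text as a list of (tag, data) pairs."""
    tree: List[Datatype] = []
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            if text[pos:].strip():
                raise ValueError(f"unparseable input at offset {pos}")
            break                      # only trailing whitespace is left
        if m.group(3):                 # '|' closes the current tree
            yield tree
            tree = []
        else:
            tree.append((m.group(1), m.group(2)))
        pos = m.end()
    if tree:
        raise ValueError("last tree is missing its '|' terminator")


def format_tree(tree: List[Datatype]) -> str:
    """Write a tree in list format: one datatype per line, then '|'."""
    return "".join(f"{tag}<{data}>\n" for tag, data in tree) + "|\n"
```

Splitting each data string on ';' then gives the individual fields.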

Taxonomy of trees

Here is an overview of types of tree structures:

primary         rooted at SMILES

   simple       no subsets

   complex      with subsets

      explicit  subsets marked ($SS internal tag)

      implicit  subsets unmarked ($WLN internal tag)

indirect        SMILES not present

The identifiers WLN (Wiswesser Line Notation) and SS (subset)
create 'logical' subsets.  
Subsets are typically used to store variations on the root SMILES such as
isomers and complexes like salts.

Subsets are not nested--no subsets inside subsets, so a tree with the 
following sequence of identifiers:

        $SMI $SS $WLN $NAM $SS $WLN $NAM

should be interpreted as:       and NOT:

        $SMI                    $SMI
        $SS                     $SS
           $WLN                    $WLN
           $NAM                        $NAM
        $SS                     $SS
           $WLN                    $WLN
           $NAM                        $NAM

It is up to the person/program parsing the tree to correctly identify the 
beginning and end of subsets.

Primary

Our primary trees are always rooted at the structural identifier, SMILES.

In this tree there is only one WLN, hence there is no subset.  Identifiers 
like $NAM and $CAS do not create subsets.

        $SMI<COCCOc2c(OC)cc(Cc1cnc(N)nc1N)cc2OC>
        CP<0.454;-0P;2.10>
        CR<8.9102;-0R;3.55>
        TS<921208123435>
        MF<C16H22N4O4>
-->     $WLN<T6N CNJ BZ DZ E1R CO1 EO1 DO2O1>
        PCN<TETROXOPRIM>
        AC<ANTIBACTERIAL>
        P<1.81;S1;R547;F342;;1.0>
        P<0.56;S1;R1485;F2~F314;*;7.4>
        P1<0.56>
        P<0.38;S1;R1485;F1~F314;;7.4>
        $NAM<24DIAMPYRIMIDINE535DIMEO4OCH2CH2OMEBENZYL>
        $NAM<STERINOR>
        $CAS<53808-87-0>
        $NAM<TETROXOPRIM>

Primary & Complex & Implicit

The two WLNs function as implicit subset markers.

        $SMI<ClC1CCCCC1Cl>
|       CP<3.300;-0P;2.10>
|       CR<3.7656;-0R;3.55>
|       TS<861212025624>
|       MF<C6H10CL2>
|       $WLN<L6TJ AG BG -C>
  |     PCN<1,2-DICHLOROCYCLOHEXANE-C>
  |     P<3.18;S1;R1553;;*>
  |     $NAM<12DICHLOROCYCLOHEXANEC>
|       $WLN<L6TJ AG BG -T>
  |     PCN<1,2-DICHLOROCYCLOHEXANE-TRANS>
  |     P<3.21;S1;R1553;;*>
  |     $NAM<12DICHLOROCYCLOHEXANETRANS>


Primary & Complex & Explicit

There is a pair (a single one wouldn't be very useful) of $SS datatypes.  
Each is followed by a $WLN--in this case the WLN does not create a second 
subset.

        $SMI<OS(=O)c1ccccc1>
        TS<911119224306>
-->     $SS<NEUTRAL>
        MF<C6H6O2S1>
        $WLN<QSO&R>
        PCN<BENZENESULFINIC ACID>
        PKA<1.84;S1000;RM73>
        $NAM<BENZENESULFINICACID>
        $CAS<618-41-7>
-->     $SS<SALTS>
        MF<C6H5NA1O2S1>
        $WLN<WSR &-NA- &7/2>
        PCN<BENZENESULFINICACID,SODIUMSALT>
        P<3.54;S1;R547;F820;!>
        P2<3.54>
        $CAS<873-55-2>
        $NAM<BENZENESULFINICACIDSODIUMSALT>
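
The three primary examples above suggest one way a program might find subset 
boundaries.  The Python sketch below is an interpretation of those examples, 
not a documented THOR algorithm: it assumes that any $SS makes the tree 
explicit (so $WLN no longer starts a subset), that two or more $WLN without 
$SS make it implicit, and that everything before the first marker belongs to 
the root.  The name split_subsets is invented here.

```python
from typing import List, Tuple

Datatype = Tuple[str, str]   # (tag, data)


def split_subsets(tree: List[Datatype]) -> Tuple[List[Datatype], List[List[Datatype]]]:
    """Return (root datatypes, list of subsets) for one primary tree."""
    tags = [tag for tag, _ in tree]
    if "$SS" in tags:
        marker = "$SS"                 # explicit subsets
    elif tags.count("$WLN") > 1:
        marker = "$WLN"                # implicit subsets
    else:
        return list(tree), []          # simple tree: no subsets

    root: List[Datatype] = []
    subsets: List[List[Datatype]] = []
    for item in tree:
        if item[0] == marker:
            subsets.append([item])     # a marker opens a new, un-nested subset
        elif subsets:
            subsets[-1].append(item)   # belongs to the most recent subset
        else:
            root.append(item)          # seen before any marker
    return root, subsets
```

Since subsets are never nested, a flat list of subsets is enough; no 
recursion is needed.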

Indirect

Indirect trees (those with the internal tag 'I') are not rooted at SMILES.
Instead, they begin with an 'indirect key' which is pointed at by datatypes
in the primary trees.  Recall the prototype for the IKEY datatype:
        I    ;IKEY;CONTENT;

The first field is the key, the second the data to be substituted.

For example, the following LOGP data:
        P<1.90;S1;R680;F462;>

defined as:
        P    ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;

refers to the three trees:

 field name  ikey       indirect tree
 ----------  ----       -------------
 solv pair = S1    -->  I<S1;Octanol>
 reference = R680  -->  I<R680;Seiler,P., Eur. J. Med. Chem., (1974) 9, 663>
 footnote  = F462  -->  I<F462;Brandstrom analysis (refs 29 & 680)>
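
The substitution can be sketched in Python as follows.  The function names 
load_indirect and resolve are invented for this example, field names are 
taken from the prototype in the order given, and any key with no matching 
indirect tree is simply left as it was found.

```python
from typing import Dict, List

LOGP_PROTOTYPE = "P    ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;"


def load_indirect(ikey_data: List[str]) -> Dict[str, str]:
    """Build a key -> content table from the data of I<key;content> trees."""
    table: Dict[str, str] = {}
    for data in ikey_data:
        key, _, content = data.partition(";")   # first field is the key
        table[key] = content
    return table


def resolve(data: str, prototype: str, table: Dict[str, str]) -> Dict[str, str]:
    """Map field names to values, replacing indirect keys by their content."""
    parts = prototype.split(";")
    if parts[-1].strip() == "":
        parts.pop()                  # empty piece after the final ';'
    names = parts[1:]                # datatype name first, then sub-fields
    resolved: Dict[str, str] = {}
    for name, value in zip(names, data.split(";")):
        if name.startswith("#") and value:
            resolved[name[1:]] = table.get(value, value)
        else:
            resolved[name.lstrip("#")] = value
    return resolved
```

Fields missing from the end of the data are simply absent from the result.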



Miscellaneous notes

Trailing null fields may be omitted

Unfilled fields at the end of a datatype are sometimes omitted.  For 
example, while the LOGP datatype is defined:
        P    ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;

and should always be written with the trailing fields defined:
        P<3.54;S1;R547;;>

it is sometimes found as:
        P<3.54;S1;R547>
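
A parser can hide this variation by padding every datatype out to its full 
field count before use.  A minimal Python sketch, with the invented name 
pad_fields, and assuming LOGP has seven fields in all (the value itself plus 
the six sub-fields of the prototype):

```python
from typing import List


def pad_fields(data: str, n_fields: int) -> List[str]:
    """Split data on ';' and pad with empty strings up to n_fields."""
    values = data.split(";")
    if len(values) > n_fields:
        raise ValueError(f"expected at most {n_fields} fields, got {len(values)}")
    return values + [""] * (n_fields - len(values))


LOGP_FIELDS = 7   # assumption: LOGP value + 6 sub-fields from the prototype

# Both spellings of the same datatype give the same padded result.
full = pad_fields("3.54;S1;R547;;", LOGP_FIELDS)
short = pad_fields("3.54;S1;R547", LOGP_FIELDS)
```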

Weaknesses, bugs, and gotchas in the THOR format

There is no mechanism for enumerating the valid contents of fields.

For example, the 'SELECTED' field of the LOGP datatype should only contain 
a '*' if the value represents the best measured logP ('star' value), or 
contain '!' if the value is pretty good, but not the best.  Otherwise it 
should be null.  At times, this field will contain blanks, tabs, and 
assorted entry errors.
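
Since the format itself cannot enforce this, the check has to live in the 
program reading the data.  A minimal Python sketch, with the invented name 
normalize_selected, and assuming that a field holding only blanks or tabs 
was meant to be null:

```python
VALID_SELECTED = ("", "*", "!")   # null, 'star' value, good but not best


def normalize_selected(raw: str) -> str:
    """Return '', '*' or '!' for a SELECTED field, or raise ValueError."""
    value = raw.strip()           # assumption: stray blanks and tabs mean null
    if value in VALID_SELECTED:
        return value
    raise ValueError(f"invalid SELECTED field: {raw!r}")
```

Anything that raises here is an entry error that needs a human decision 
rather than a silent fix.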