THOR 'tree' format explained
Abstract
This document describes the organization and notation of 'THOR data trees',
that is, how THOR data is logically structured and how it is actually
written. In theory, it provides all the information required to design new
datatypes for THOR and write programs for parsing and transforming existing
THOR trees.
This document will proceed from the abstract fundamentals of THOR data
representation to actual examples with real chemical data. If this style
does not suit you, perhaps you should read it from back to front.
Terminology
A THOR (= THesaurus ORiented) database stores chemical information
based on the primary structural key of SMILES. All information pertaining
to a given molecular topology is gathered together under a SMILES. This
fundamental unit of storage is referred to as a 'tree'. The tree is said
to be 'rooted' at the SMILES. Synonyms for SMILES include 'structure' and
'molecule'.
Each tree may store many kinds of information about the structure--measured
and calculated logP (in various solvent systems), molecular formula, CAS
number, and so on. Each 'kind' of information is referred to as a
'datatype'. Each datatype may be present one or more times in the tree, or
may be absent entirely, with the sole exception of the SMILES datatype which
should be present once and only once per tree.
[Note: this last statement is something of a fudge, since 'indirect' trees
do not contain SMILES, but more on that later]
Datatype definition
THOR does not have fixed or built-in datatypes--each kind of information
stored in THOR is 'defined' by a second source. The utility of this is
that one is free to add any new datatype one desires.
THTAG.FMT is the file that defines the contents of each datatype.
It contains 'prototypes' of the general format:
[$]internal_tag ;tag;field;#field;
where
internal_tag     regular datatype
$internal_tag    identifier datatype
tag              name by which one refers to datatype
field            direct data
#field           indirect data
[digit]field     numeric data (i.e. many values)
Each sub-field is delimited by a semi-colon. Note that this implies that
the ';' is an illegal character in THOR data!
Some examples:
REMARK is a regular datatype with a single field.
REM ;REMARK;
NAME is an identifier datatype with a single field.
$NAM ;NAME;
CLOGP has an indirect reference to 'error level', and a direct version stamp.
CP ;CLOGP;#ERROR LEV;VERSION;
LOGP is a complex datatype with 7 fields: 4 direct and 3 indirect.
P ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;
IKEY is a special datatype that references indirectly stored data. See the
section on indirect trees below.
Notes:
Originally, 'internal_tags' were made short and cryptic in order to save
storage space. There is no fundamental reason why they couldn't be longer
or more descriptive.
Any line in THTAG.FMT beginning with one of '#', '!' or ' ' is a comment,
and null lines are likewise ignored.
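As an illustration of the prototype format described above, here is a minimal
Python sketch of a THTAG.FMT line parser. The names parse_prototype, Prototype
and Field are hypothetical and are not part of THOR. The sketch assumes that
the first sub-field is the internal tag, that the remaining sub-fields are the
datatype's fields in order, and it does not interpret the '[digit]field' form.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    name: str
    indirect: bool  # True when the prototype writes the field as '#field'


@dataclass
class Prototype:
    internal_tag: str    # kept as written, including any leading '$'
    identifier: bool     # True when the internal tag starts with '$'
    fields: List[Field]  # fields[0] is named after the datatype's tag


def parse_prototype(line: str) -> Optional[Prototype]:
    """Parse one THTAG.FMT line; return None for comments and null lines."""
    line = line.rstrip("\n")
    # Lines starting with '#', '!' or ' ' are comments; null lines are ignored.
    if not line.strip() or line[0] in "#! ":
        return None
    parts = line.split(";")
    if parts[-1].strip() == "":
        parts.pop()  # drop the empty piece after the closing ';'
    head = parts[0].strip()
    fields = []
    for raw in parts[1:]:
        name = raw.strip()
        fields.append(Field(name.lstrip("#").strip(), name.startswith("#")))
    return Prototype(head, head.startswith("$"), fields)
```

For the LOGP prototype shown above, this sketch yields seven fields, three of
which (SOLV PAIR, REFERENCE, FOOTNOTE) are flagged as indirect.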
Representation of Trees
Recall that a 'tree' is a collection of datatypes, 'rooted' at a
fundamental identifier datatype. The tree notation must delimit the
datatypes from each other, and the content (i.e., numeric and text data) of
each datatype from the tag itself. THOR notation employs three special
characters:
< to begin data
> to end data
| to terminate the tree
Hence datatypes in a tree are written in the form:
tag<data>
A complete tree consists of one or more datatypes followed by the terminating
delimiter '|', so a complete, if short, THOR tree would be:
$SMI<CCO>        tag = $SMI    data = CCO
PCN<ETHANOL>     tag = PCN     data = ETHANOL
|
Notes:
Trees are usually stored in 'list' format, with one datatype per line, but
some older files may have the entire tree on a single line. The conversion
to a datatype-per-line format involves merely adding a newline after each
'>'.
The complete set of reserved characters in THOR data is '<', '>', '|' and ';'.
At present there is no mechanism for quoting these items to be treated as
regular data.
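The notation above is simple enough to tokenize with one regular expression.
The following Python sketch is only an illustration; parse_trees is a
hypothetical name. It assumes that tags contain no whitespace and that the
reserved characters never occur inside data, which holds as long as there is
no quoting mechanism. Because it ignores whitespace between datatypes, it
reads both the one-datatype-per-line and the single-line layouts.

```python
import re
from typing import Iterator, List, Tuple

# One datatype: a tag without whitespace or reserved characters,
# then '<', the data (no reserved '<', '>' or '|'), then '>'.
_DATATYPE = re.compile(r"([^\s<>|]+)<([^<>|]*)>")


def parse_trees(text: str) -> Iterator[List[Tuple[str, List[str]]]]:
    """Yield each tree as a list of (tag, fields) pairs."""
    for chunk in text.split("|"):  # '|' terminates a tree
        tree = [(tag, data.split(";"))  # ';' delimits the fields
                for tag, data in _DATATYPE.findall(chunk)]
        if tree:  # skip the empty remainder after the last '|'
            yield tree
```

Applied to the short ethanol tree above, this gives a single tree holding the
pairs ('$SMI', ['CCO']) and ('PCN', ['ETHANOL']).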
Taxonomy of trees
Here is an overview of types of tree structures:
primary              rooted at SMILES
  simple             no subsets
  complex            with subsets
    explicit         subsets marked ($SS internal tag)
    implicit         subsets unmarked ($WLN internal tag)
indirect             SMILES not present
The identifiers WLN (Wiswesser line notation) and SS (subset)
create 'logical' subsets.
Subsets are typically used to store variations on the root SMILES such as
isomers and complexes like salts.
Subsets are not nested--no subsets inside subsets, so a tree with the
following sequence of identifiers:
$SMI $SS $WLN $NAM $SS $WLN $NAM
should be interpreted as:      and NOT:
$SMI                           $SMI
  $SS                            $SS
    $WLN                           $WLN
    $NAM                           $NAM
  $SS                              $SS
    $WLN                             $WLN
    $NAM                             $NAM
It is up to the person/program parsing the tree to correctly identify the
beginning and end of subsets.
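One possible way to identify subsets is sketched below in Python. The function
name split_subsets is hypothetical, and the rule it applies is an assumption
drawn from this document rather than a formal definition: if a tree contains
$SS, every $SS starts a subset; otherwise, if it contains more than one $WLN,
every $WLN starts a subset; otherwise the tree has no subsets. A subset is
assumed to run until the next marker or the end of the tree, and subsets are
never nested.

```python
from typing import List, Tuple

Datatype = Tuple[str, List[str]]  # (tag, fields)


def split_subsets(tree: List[Datatype]) -> Tuple[List[Datatype], List[List[Datatype]]]:
    """Return (root datatypes, list of subsets) for one parsed tree."""
    tags = [tag for tag, _ in tree]
    if "$SS" in tags:
        marker = "$SS"      # explicit subsets
    elif tags.count("$WLN") > 1:
        marker = "$WLN"     # implicit subsets
    else:
        return list(tree), []  # simple tree: no subsets

    root: List[Datatype] = []
    subsets: List[List[Datatype]] = []
    for item in tree:
        if item[0] == marker:
            subsets.append([item])    # a marker always opens a new, flat subset
        elif subsets:
            subsets[-1].append(item)  # belongs to the most recent subset
        else:
            root.append(item)         # precedes the first marker
    return root, subsets
```

With the identifier sequence $SMI $SS $WLN $NAM $SS $WLN $NAM, this sketch
returns $SMI as the root and two sibling subsets of three datatypes each.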
Primary
Our primary trees are always rooted at the structural identifier, SMILES.
In this tree there is only one WLN, hence there is no subset. Identifiers
like $NAM and $CAS do not create subsets.
$SMI<COCCOc2c(OC)cc(Cc1cnc(N)nc1N)cc2OC>
CP<0.454;-0P;2.10>
CR<8.9102;-0R;3.55>
TS<921208123435>
MF<C16H22N4O4>
--> $WLN<T6N CNJ BZ DZ E1R CO1 EO1 DO2O1>
PCN<TETROXOPRIM>
AC<ANTIBACTERIAL>
P<1.81;S1;R547;F342;;1.0>
P<0.56;S1;R1485;F2~F314;*;7.4>
P1<0.56>
P<0.38;S1;R1485;F1~F314;;7.4>
$NAM<24DIAMPYRIMIDINE535DIMEO4OCH2CH2OMEBENZYL>
$NAM<STERINOR>
$CAS<53808-87-0>
$NAM<TETROXOPRIM>
Primary & Complex & Implicit
The two WLNs function as implicit subset markers.
$SMI<ClC1CCCCC1Cl>
| CP<3.300;-0P;2.10>
| CR<3.7656;-0R;3.55>
| TS<861212025624>
| MF<C6H10CL2>
| $WLN<L6TJ AG BG -C>
| PCN<1,2-DICHLOROCYCLOHEXANE-C>
| P<3.18;S1;R1553;;*>
| $NAM<12DICHLOROCYCLOHEXANEC>
| $WLN<L6TJ AG BG -T>
| PCN<1,2-DICHLOROCYCLOHEXANE-TRANS>
| P<3.21;S1;R1553;;*>
| $NAM<12DICHLOROCYCLOHEXANETRANS>
Primary & Complex & Explicit
There is a pair of $SS datatypes (a single one wouldn't be very useful).
Each is followed by a $WLN--in this case the WLN does not create a second
subset.
$SMI<OS(=O)c1ccccc1>
TS<911119224306>
--> $SS<NEUTRAL>
MF<C6H6O2S1>
$WLN<QSO&R>
PCN<BENZENESULFINIC ACID>
PKA<1.84;S1000;RM73>
$NAM<BENZENESULFINICACID>
$CAS<618-41-7>
--> $SS<SALTS>
MF<C6H5NA1O2S1>
$WLN<WSR &-NA- &7/2>
PCN<BENZENESULFINICACID,SODIUMSALT>
P<3.54;S1;R547;F820;!>
P2<3.54>
$CAS<873-55-2>
$NAM<BENZENESULFINICACIDSODIUMSALT>
Indirect
Indirect trees (those with the internal tag 'I') are not rooted at SMILES.
Instead, they begin with an 'indirect key' which is pointed at by datatypes
in the primary trees. Recall the prototype for the IKEY datatype:
I ;IKEY;CONTENT;
The first field is the key, the second the data to be substituted.
For example, the following LOGP data:
P<1.90;S1;R680;F462;>
defined as:
P ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;
refers to the three trees:
field name      ikey       indirect tree
----------      ----       -------------
solv pair   =   S1    -->  I<S1;Octanol>
reference   =   R680  -->  I<R680;Seiler,P., Eur. J. Med. Chem., (1974) 9, 663>
footnote    =   F462  -->  I<F462;Brandstrom analysis (refs 29 & 680)>
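The substitution of indirect data can be sketched in Python as follows. The
names load_indirect and resolve_fields are hypothetical. The sketch assumes
that field names are taken from the prototype in order, that only fields
written as '#field' are looked up, and that keys with no matching indirect
tree are left as written.

```python
from typing import Dict, List


def load_indirect(datatypes: List[str]) -> Dict[str, str]:
    """Build a key -> content table from datatypes such as 'I<S1;Octanol>'."""
    table = {}
    for text in datatypes:
        body = text[text.index("<") + 1:text.rindex(">")]
        key, _, content = body.partition(";")  # first field is the key
        table[key] = content
    return table


def resolve_fields(prototype: str, data: str, table: Dict[str, str]) -> Dict[str, str]:
    """Map prototype field names to values, substituting indirect data."""
    names = [p.strip() for p in prototype.split(";")[1:] if p.strip()]
    values = data.split(";")
    values += [""] * (len(names) - len(values))  # missing trailing fields are null
    resolved = {}
    for name, value in zip(names, values):
        if name.startswith("#") and value:
            value = table.get(value, value)  # unknown keys stay as written
        resolved[name.lstrip("#")] = value
    return resolved


table = load_indirect([
    "I<S1;Octanol>",
    "I<R680;Seiler,P., Eur. J. Med. Chem., (1974) 9, 663>",
    "I<F462;Brandstrom analysis (refs 29 & 680)>",
])
logp = resolve_fields(
    "P ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;",
    "1.90;S1;R680;F462;",
    table,
)
```

Here logp["SOLV PAIR"] becomes 'Octanol', while the direct LOGP value '1.90'
is passed through unchanged.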
Miscellaneous notes
Trailing null fields may be omitted
Unfilled fields at the end of a datatype are sometimes omitted. For
example, while the LOGP datatype is defined:
P ;LOGP;#SOLV PAIR;#REFERENCE;#FOOTNOTE;S;pH;COMMENT;
and should always be written with the trailing fields defined:
P<3.54;S1;R547;;>
it is sometimes found as:
P<3.54;S1;R547>
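A parser can guard against this by padding short records when it reads them.
The Python sketch below is one way to do so; split_fields is a hypothetical
name, and the expected field count is assumed to come from the datatype's
prototype (seven for LOGP).

```python
from typing import List


def split_fields(data: str, n_fields: int) -> List[str]:
    """Split datatype content on ';' and pad missing trailing fields with ''."""
    values = data.split(";")
    # Tolerate surplus empty fields at the end, e.g. from a closing ';'.
    while len(values) > n_fields and values[-1] == "":
        values.pop()
    if len(values) > n_fields:
        raise ValueError(f"expected at most {n_fields} fields, got {len(values)}")
    return values + [""] * (n_fields - len(values))
```

Both 'P<3.54;S1;R547;;>' and 'P<3.54;S1;R547>' then produce the same
seven-element list, with empty strings for the unfilled fields.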
Weaknesses, bugs, and gotchas in the THOR format
There is no mechanism for enumerating the valid contents of fields.
For example, the 'S' (selected) field of the LOGP datatype should only contain
a '*' if the value represents the best measured logP ('star' value), or
contain '!' if the value is pretty good, but not the best. Otherwise it
should be null. At times, this field will contain blanks, tabs, and
assorted entry errors.
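Since the format itself cannot enforce this, a program reading THOR data has
to check such fields itself. A minimal Python sketch follows;
normalize_selected is a hypothetical name, and treating whitespace-only
content as null is an assumption about how one might clean the data.

```python
from typing import Optional


def normalize_selected(raw: str) -> Optional[str]:
    """Return '*', '!' or '' for a valid field, or None for an entry error."""
    value = raw.strip()  # blanks and tabs are treated as padding
    if value in ("*", "!", ""):
        return value
    return None  # unrecognized content: let the caller decide what to do
```

Returning None rather than raising leaves it to the caller whether to reject
the record, log it, or treat the field as null.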