CHEMICAL ONTOLOGIES
• ANNOTATION OF DATABASES AND TEXT WITH CHEMICAL COMPOUND CLASSES
• CREATE AND DEVELOP YOUR OWN CHEMICAL ONTOLOGIES
OntoChem GmbH is a pioneer in the field of chemical ontologies. While biology-related ontologies have had a great impact on knowledge and data mining in the life sciences, chemical ontologies are just at their beginning. Searching for chemical compound classes and related data has traditionally been the domain of chemistry experts. OntoChem’s chemical ontologies make this knowledge available to a broad community of scientists. Classifying and retrieving data on compounds and classes becomes straightforward and easy to understand, also for non-chemists. In addition, chemical ontologies enable new ways of knowledge discovery by extracting relationships between compound classes and data from other domains, traditionally known as structure-activity relationships (SAR) or structure-property relationships (SPR).
● OntoChem’s chemical ontologies are used to annotate databases or text collections such as scientific literature or patents: individual compounds are annotated with their compound classes
● Chemistry-enabled, text-based web search engines can be used to search for compounds in data collections, using either chemical structure search or compound classes as search terms (e.g. www.ocminer.com)
● Add, delete and edit compounds and compound class structures using a chemistry editor
● Our extended SMARTS-based logic (http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html) for describing compound classes significantly extends the query capabilities of current chemistry search engines
● Structure-Activity-Relationship (SAR) studies with ontology compound classes and OCMiner® relationship extraction from large data collections
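The idea of searching by compound class can be illustrated with a minimal Python sketch. It is not OntoChem's implementation: the tiny ontology, the compound-to-class assignments, the example documents and the function names `all_classes` and `search_by_class` are all hypothetical. The sketch only shows why annotating a compound with its class, and with every ancestor of that class, lets a plain class term retrieve documents that never mention the class by name.

```python
# Hypothetical toy ontology: compound class -> parent classes (is_a links).
PARENTS = {
    "acridines": ["heterocyclic compounds"],
    "pyridines": ["heterocyclic compounds"],
    "heterocyclic compounds": ["organic compounds"],
    "steroids": ["organic compounds"],
    "organic compounds": [],
}

# Hypothetical direct assignments: compound name -> directly matched classes.
DIRECT_CLASSES = {
    "acridine orange": ["acridines"],
    "nicotine": ["pyridines"],
    "cholesterol": ["steroids"],
}


def all_classes(compound):
    """Return the direct classes of a compound plus all ancestor classes."""
    seen = set()
    stack = list(DIRECT_CLASSES.get(compound, []))
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            stack.extend(PARENTS.get(cls, []))
    return seen


def search_by_class(documents, class_term):
    """Return ids of documents mentioning a compound annotated with class_term."""
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if any(name in lowered and class_term in all_classes(name)
               for name in DIRECT_CLASSES):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical example documents.
DOCS = {
    "doc1": "Nicotine binds to acetylcholine receptors.",
    "doc2": "Cholesterol levels were measured in serum.",
    "doc3": "Staining with acridine orange was performed.",
}

print(search_by_class(DOCS, "heterocyclic compounds"))
```

In this toy setup the class term "heterocyclic compounds" finds the nicotine and acridine orange documents although neither text contains the word "heterocyclic". A production system would match compounds by structure rather than by substring, which this sketch does not attempt.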
SODIAC
To aid the development of chemical ontologies, we have developed the first chemistry ontology editor, SODIAC (Structure Ontology Development and Individual Assignment Center). SODIAC can be used to build, edit and validate chemical ontologies that contain chemical query structures representing chemical compound classes. The program automatically assigns compound classes to the chemical compounds in a database. In text and information retrieval, this annotation makes it possible to retrieve all compounds belonging to an ontology class by simple text search instead of using expert structure search tools. SODIAC is built on ChemAxon’s Java chemistry libraries (www.chemaxon.com). Using the OBO format, it implements the typical functionality of an ontology editor together with chemical functionality such as structure-based addition and editing of chemical compound classes, as well as logical chemistry-based ontology checks:
● Read or write SMILES or MOL files to annotate individual compounds with their compound classes
● Connect with databases to assign compound classes to individual compounds
● Add and edit ontology definitions, custom tags, synonyms, references to external databases and further data for each compound class concept
Screenshot showing the acridines compound class and the custom tags definition window in SODIAC
SODIAC FEATURES
● Create, edit and validate your own chemistry ontology
● Integrated chemistry structure editor
● Automated structure checking
● Check for loops or dubious ontology hierarchy associations
● Annotation of individual compounds in files (SDF, MOL or SMILES format) with compound classes
● Annotation of individual compounds in SQL databases
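As an illustration of what a loop check on an ontology hierarchy involves, the following Python sketch reads `is_a` links from OBO-style `[Term]` stanzas and looks for a cycle with a depth-first search. It is not SODIAC code: the function names `parse_is_a` and `find_loop`, the `EX:` identifiers and the example stanzas are assumptions made for this example, and the parser handles only the `id:` and `is_a:` tags.

```python
def parse_is_a(obo_text):
    """Collect is_a links from [Term] stanzas: term id -> list of parent ids."""
    parents = {}
    current = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            in_term = (line == "[Term]")
            current = None
        elif in_term and line.startswith("id:"):
            current = line[len("id:"):].strip()
            parents.setdefault(current, [])
        elif in_term and current and line.startswith("is_a:"):
            # Drop the trailing "! human-readable name" comment, if any.
            target = line[len("is_a:"):].split("!")[0].strip()
            parents[current].append(target)
    return parents


def find_loop(parents):
    """Return one cycle as a list of ids (first id repeated last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for parent in parents.get(node, []):
            status = state.get(parent, WHITE)
            if status == GREY:  # parent is already on the current path
                return path[path.index(parent):] + [parent]
            if status == WHITE:
                loop = visit(parent)
                if loop:
                    return loop
        path.pop()
        state[node] = BLACK
        return None

    for node in parents:
        if state.get(node, WHITE) == WHITE:
            loop = visit(node)
            if loop:
                return loop
    return None


# Hypothetical example stanzas; the EX: identifiers are invented.
EXAMPLE_OBO = """
[Term]
id: EX:0001
name: heterocyclic compounds

[Term]
id: EX:0002
name: acridines
is_a: EX:0001 ! heterocyclic compounds
"""

BROKEN_OBO = EXAMPLE_OBO + """
[Term]
id: EX:0003
name: class a
is_a: EX:0004 ! class b

[Term]
id: EX:0004
name: class b
is_a: EX:0003 ! class a
"""

print(find_loop(parse_is_a(EXAMPLE_OBO)))
print(find_loop(parse_is_a(BROKEN_OBO)))
```

The recursive search keeps the sketch short; for very deep hierarchies an iterative traversal would avoid Python's recursion limit.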
CHEMICAL ONTOLOGIES
Using SODIAC, we have developed chemical ontologies that comprise structure-based classifications as well as biology-related classifications of chemical compounds. Particular emphasis has been given to natural products, e.g. steroids or sugars, but also to classes of heterocycles and other compound classes that are of special interest for biomedical research. In addition, property-based classifications such as vitamins, food and flavor, cosmetics, drugs and FEMA compounds have been added.
OntoChem’s chemical ontologies contain up to several thousand compound classes, together with individual compounds and their additional information. When all names and synonyms of the compounds are included, for example, these ontologies may contain more than 100 million terms, depending on the underlying compound database. This is larger than any other ontology, but our patented OCMiner® technology can use this data in a straightforward way to annotate huge data collections. One example of the use of a chemical ontology is the annotation of PubMed abstracts with compound classes: the chemistry search term “heterocyclic compounds” in www.ocminer.com retrieves 3,124,129 hit documents, while a native PubMed search (http://www.ncbi.nlm.nih.gov/pubmed?term=heterocyclic) finds 24,524 hit documents. In addition, through ChemAxon’s Marvin search interface, one may perform conventional chemical substructure and similarity searches over the entire PubMed abstract collection.
USE OF CHEMICAL ONTOLOGIES
Our data and knowledge extraction technology uses chemical ontologies for a variety of information retrieval tasks:
● Classification of entities, for example assigning specific compounds to their related compound classes, which enables chemistry class term searches over data collections
● Finding specific relationships between compounds and compound classes and other knowledge domains. This makes it possible to ask questions such as “Which compounds have been isolated from the plant family Rosaceae?” Such information was previously available only from manually curated databases; now it is generated on the fly.
● Similarity search and ranking of documents based on compound classes (e.g. documents dealing with steroids). This gives more relevant results than conventional technologies based on word frequencies or keywords.
● Thematic search engines, for example the retrieval and ranking of all documents dealing with the use of natural products for the treatment of diseases
To annotate chemical terms in data collections (for example office documents, XML, HTML, PDF or databases), we have developed a range of software tools that are modules in our OCMiner® pipeline. For example, chemical names, structures or InChIs are recognized with high performance either through our chemical term dictionaries or via name-to-structure tools. New compounds can be stored in a chemistry database. As chemical names often vary in their use of brackets, hyphens or spaces, we have a tunable module that recognizes these variations. Finally, compounds are annotated in the text together with their compound classes and chemical structures.
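How spelling variations in brackets, hyphens and spaces can be reconciled is sketched below in Python. This is not the OCMiner® module, whose internals are not described here; the functions `normalize_name`, `build_index` and `lookup`, the `ignore_brackets` switch and the example names are assumptions for illustration. The sketch reduces every name to a lookup key, so that different spellings of the same name meet at the same dictionary entry.

```python
import re

# Map square and curly brackets to round ones.
_BRACKETS = str.maketrans("[]{}", "()()")
# Whitespace, ASCII hyphen and a few Unicode hyphens and dashes.
_SEPARATORS = re.compile(r"[\s\-\u2010\u2011\u2012\u2013]+")


def normalize_name(name, ignore_brackets=False):
    """Reduce a chemical name to a key that ignores common spelling variants."""
    key = name.lower().translate(_BRACKETS)
    key = _SEPARATORS.sub("", key)
    if ignore_brackets:
        # Looser setting: also treat bracketed and unbracketed forms as equal.
        key = key.replace("(", "").replace(")", "")
    return key


def build_index(names, **options):
    """Build a dictionary from normalized key to the preferred spelling."""
    index = {}
    for name in names:
        index.setdefault(normalize_name(name, **options), name)
    return index


def lookup(index, mention, **options):
    """Return the dictionary name matching a text mention, or None."""
    return index.get(normalize_name(mention, **options))


# Hypothetical term dictionary.
NAMES = ["2-(acetyloxy)benzoic acid", "acetylsalicylic acid"]

strict = build_index(NAMES)
loose = build_index(NAMES, ignore_brackets=True)

print(lookup(strict, "2-[acetyloxy] benzoic  acid"))
print(lookup(strict, "2-acetyloxybenzoic acid"))
print(lookup(loose, "2-acetyloxybenzoic acid", ignore_brackets=True))
```

The `ignore_brackets` switch stands in for the tunable part: looser settings find more variants but raise the risk of merging names that denote different compounds, so the right setting depends on the data collection.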
AVAILABILITY
SODIAC is commercially available as a Java-based application. Please contact us at info@ontochem.com for more information or for a test installation.
The screenshot shows the assigned parent ontology classification of the D-glucoheptopyranoses compound class.
OntoChem GmbH Heinrich-Damerow-Str. 4 06120 Halle Germany info@ontochem.com www.ontochem.com