ONTOCHEM ONTOLOGIES
• AVAILABLE ONTOLOGIES FOR YOUR BIOMEDICAL TEXT MINING: proteins, genes, chemicals, diseases, cell lines, general species, plants, anatomy, physiological effects, cosmetology, geopolitical regions, authors, relationships...
• WE CREATE CUSTOM ONTOLOGIES WITH OUR UNIQUE AND POWERFUL SOFTWARE TOOLS
Ontologies provide the basis for identifying concepts in text mining technologies. Subsequent extraction of facts and relationships between these concepts enables data mining and provides the foundation for novel “in silico” knowledge discovery methods. OntoChem uses ontologies to extract implicit, previously unknown and useful information from databases and document collections such as patents or scientific literature.
Ontology (derived from onto- the Greek ὤν, ὄντος “being; that which is”, present participle of the verb εἰμί “be”, and -λογία , -logia: science, study, theory) is the philosophical study of the nature of being, existence, or reality as such, as well as the basic categories of being and their relations. In computer science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts, enabling semantic data integration, data mining and knowledge generation. Ontologies are explicit specifications of a topic including a vocabulary of terms and concepts with defined logical relationships to each other. http://en.wikipedia.org/wiki/Ontology_(information_science)
● Finding specific relationships between domains, e.g. which compounds have been isolated from plants. Information that was previously only available from manually curated databases is now generated on the fly
● Similarity search and ranking of documents based on ontology concept metrics. This gives more relevant results than conventional approaches based on word frequencies or keywords.
OntoChem develops ontologies in the areas of chemistry, species, diseases, anatomy, cell lines, proteins, pharmacological effects, languages, geopolitical and climate zones, company information for business intelligence and others.
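The concept-based ranking mentioned above can be illustrated with a small sketch. The brochure does not say which concept metric is used, so the Jaccard overlap below is only a stand-in, and the concept IDs and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of concept IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(query_concepts, annotated_docs):
    """Rank documents by concept overlap with the query.

    annotated_docs maps a document id to the concept IDs found in it
    (assumed to come from a prior annotation step).
    """
    scored = [(jaccard(query_concepts, concepts), doc_id)
              for doc_id, concepts in annotated_docs.items()]
    # Highest score first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored]


# Hypothetical annotations: document id -> concept IDs.
ANNOTATED = {
    "a": {"C1", "C2"},
    "b": {"C2", "C3", "C4"},
    "c": {"C9"},
}
RANKED = rank({"C1", "C2"}, ANNOTATED)
```

Because the comparison works on concept IDs rather than surface words, two documents that use different synonyms for the same concept would still score as similar.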
INTEGRATED APPROACH
OntoChem has an integrated approach, from custom-made novel tools and algorithms to ready-to-use ontologies and text annotation with OCMiner®. We build, update, validate and merge general, chemical and biological ontologies for biomedical data mining applications. OntoChem’s ontology approach allows for stable concept IDs, which makes updates easier and keeps past annotations interpretable. Our modular software enables quick assembly of derived meta-ontologies that are quality checked. A further unique selling point is the scalability of OntoChem’s patented methods for high-performance text processing, which allows ontologies with up to a billion terms to be used for very fast text annotation.
EXAMPLE
Ontologies together with heuristic and linguistic methods are applied for semantic processing of unstructured information sources such as scientific articles and patents. Using, for example, our species, chemistry and geographical ontologies, one may retrieve relationships for the white willow (Salix alba), as illustrated in the example figure.
USE OF ONTOLOGIES
Our data and knowledge extraction technology OCMiner® uses ontologies for a variety of information retrieval tasks:
● Classification of entities, for example assigning specific compounds to compound classes, relating physiological symptoms to a disease, or defining specific relationship types using a custom-developed regular expression syntax language
● Ontology-aware search engines, such as our demo server www.ocminer.com, allow searching for concepts: for example, the search term “plants” returns documents mentioning specific plants such as “salix” or “Filipendula ulmaria”
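A minimal sketch of the concept search described above: the query concept is expanded to all of its descendants in an "is a" hierarchy before documents are matched. The toy hierarchy, the documents and the function names are assumptions for illustration only; this is not OCMiner® code, and plain substring matching stands in for real text annotation.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent), simplified for illustration.
IS_A = {
    "Salix alba": "Salix",
    "Salix": "plants",
    "Filipendula ulmaria": "Filipendula",
    "Filipendula": "plants",
}


def descendants(concept, is_a=IS_A):
    """Return the concept plus every concept that reaches it via is_a links."""
    children = defaultdict(set)
    for child, parent in is_a.items():
        children[parent].add(child)
    found, stack = {concept}, [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(query, documents, is_a=IS_A):
    """Return ids of documents mentioning the query concept or a descendant."""
    terms = {term.lower() for term in descendants(query, is_a)}
    return [doc_id for doc_id, text in documents.items()
            if any(term in text.lower() for term in terms)]


# Hypothetical documents.
DOCS = {
    "doc1": "Salicin was isolated from Salix alba bark.",
    "doc2": "Filipendula ulmaria extracts were tested.",
    "doc3": "A study of steel corrosion.",
}
HITS = search("plants", DOCS)
```

Neither of the two matching documents contains the word "plants"; they are found only because the hierarchy links the species names to the query concept.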
[Figure: example ontology graph for the white willow (SALIX ALBA). Node labels: regions, world, Europe, Africa, Northern Africa; species, Salicaceae, Salix, Filipendula, Filipendula ulmaria; chemistry, aromatic compounds, 6-membered heterocycles, D-(-)-Salicin; effects, anti-inflammatory agent; diseases, Rheumatic Fever. Relationship types: is a, is part of, is found in, range of distribution, treats.]
AVAILABLE ONTOLOGIES
OntoChem has implemented technologies to build dictionaries, controlled vocabularies, taxonomies or ontologies comprising more than 100 million terms from various domains. Examples are our ontologies for general species, plants and fungi, cell lines, general anatomy, plant and fungal anatomy, diseases, pharmacological and physiological effects, cosmetology, proteins, genes, chemistry, languages, geopolitical and climate zones, company information for business intelligence and domain-specific relationship ontologies. Each ontology concept contains further data, such as relationships to other concepts, links to external sources, language information, its synonyms and related updating information. OntoChem’s ontologies can be stored and used in various formats such as OBO, CSV, XML (using specific flavors such as RDF, OWL, CML, SBML or others), SKOS etc.
When ontologies are used for text mining, we have specific modules that enhance the value of ontologies, either by generating an enriched ontology with additional terms or by applying these modules at annotation time:
● Spelling variations (e.g. British-American English, plural forms)
● Diacritic character, space/hyphen/apostrophe handling
● Ontology-dependent conditional black and white lists, case-sensitive annotations
● Automated detection of acronyms and abbreviations
A unique ontology format has been developed to extract relationships between named entities (NE) in text. Domain-specific relationship ontologies are used together with the related ontologies and a new regular expression syntax language to extract relationships with high precision and recall.
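The diacritic and space/hyphen/apostrophe handling listed above can be sketched as a term normalization step. This is an illustrative assumption about how such a module might behave, not OntoChem's implementation; the function name and the optional case folding are hypothetical.

```python
import re
import unicodedata


def normalize_term(term, fold_case=True):
    """Map spelling variants of a term to one comparable key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Treat hyphens, apostrophes and runs of whitespace as a single space.
    collapsed = re.sub(r"[\s\-'\u2019]+", " ", stripped).strip()
    # Case folding is optional, since some annotations are case sensitive.
    return collapsed.casefold() if fold_case else collapsed


KEY_A = normalize_term("Sjögren's syndrome")
KEY_B = normalize_term("SJOGREN’S  SYNDROME")
```

Both variants map to the same key, so a dictionary lookup on the normalized form would match either spelling in text. Passing `fold_case=False` keeps the original capitalization for case-sensitive terms.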
Screenshot of SODIAC, our specialized ontology editor for general and chemical ontologies
Using SODIAC, we have developed chemical ontologies that comprise structure-based classifications as well as biology-related classifications of chemical compounds. Particular emphasis has been given to natural products, for example steroids or sugars, but also to all classes of heterocycles and compound classes that are of interest for biomedical research. In addition, classifications such as vitamins, food and flavor, cosmetics, drugs and FEMA compounds can be assigned.
OntViewer is designed to display, review and check very large ontologies of up to several GB, such as the chemistry or the proteins/genes ontologies. It also performs logical, statistical and consistency checks on the ontology.
ONTOLOGY TOOLS
To create, manage, update and validate ontologies, we have developed a range of software tools.
Chemistry ontology editor
We have developed the first specialized chemistry ontology editor, SODIAC (structure ontology development and individual assignment center), to support the development of chemical ontologies. Using the OBO format, it combines the familiar functionality of an ontology editor with a chemical structure editor that allows structure-based addition, editing and ontology checks. SODIAC can be used to annotate conventional structure files or chemical databases, assigning each compound to its chemical structure classes.
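For readers unfamiliar with the OBO format mentioned above, the following sketch reads `[Term]` stanzas with their `id`, `name` and `is_a` tags. The sample identifiers (`EX:...`) are hypothetical, and the parser covers only this small subset of the format; it is not related to how SODIAC handles OBO files.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: aromatic compound

[Term]
id: EX:0000002
name: salicin
is_a: EX:0000001 ! aromatic compound
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": str or None, "is_a": [parent ids]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or not line or line.startswith("!"):
            continue  # header lines, blank lines, comments, other stanzas
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop a trailing comment
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


TERMS = parse_obo_terms(SAMPLE_OBO)
```

The text after `!` on an `is_a` line is a human-readable comment, which is why the parser keeps only the parent identifier.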
Screenshot of OntViewer, showing the ontology tree of ChEBI with different relationships.
HugeEdit is a simple and fast text editor for displaying, searching and editing very large data files of up to several GB and millions of lines without the need to hold the complete data in memory. It is especially suited to column-separated data that is too large to be edited in standard spreadsheet editors.
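The general idea of searching a file without holding it in memory can be sketched with a lazy line iterator. This illustrates the principle only and makes no claim about how HugeEdit works internally; the function name and the file name in the usage comment are hypothetical.

```python
def find_lines(lines, needle):
    """Yield (line_number, line) for every line containing needle.

    `lines` can be any iterable of strings, including an open file
    handle, which Python reads one line at a time.
    """
    for number, line in enumerate(lines, start=1):
        if needle in line:
            yield number, line.rstrip("\n")


# Usage with a large file (hypothetical file name):
#     with open("compounds.tsv", encoding="utf-8") as handle:
#         for number, line in find_lines(handle, "salicin"):
#             print(number, line)

SAMPLE = ["id\tname\n", "1\tsalicin\n", "2\tglucose\n"]
MATCHES = list(find_lines(SAMPLE, "salicin"))
```

Since the generator yields matches as it goes, memory use depends on the longest line rather than on the file size.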
OntoChem has also developed a series of custom-built command line tools that aid in creating, updating and validating ontologies:
● Searching for and proposing candidate synonym concept terms in document collections
● Automated generation of spelling variations
● Checking and correcting homonyms or logical errors within or between ontologies
Together, our technologies provide a straightforward and comprehensive toolbox for various tasks when working with ontologies.
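A homonym check of the kind listed above can be sketched as a search for terms that name more than one concept, within or across ontologies. The data layout, the concept IDs and the function name are assumptions for illustration, not the interface of OntoChem's command line tools.

```python
from collections import defaultdict


def find_homonyms(ontologies):
    """Return terms that are attached to more than one concept.

    ontologies: {ontology_name: {concept_id: [terms]}}
    """
    owners = defaultdict(set)
    for onto_name, concepts in ontologies.items():
        for concept_id, terms in concepts.items():
            for term in terms:
                # Compare case-insensitively so "Iris" and "iris" collide.
                owners[term.casefold()].add((onto_name, concept_id))
    return {term: sorted(ids) for term, ids in owners.items() if len(ids) > 1}


# Hypothetical ontologies: "Iris" is both a plant genus and a part of the eye.
ONTOLOGIES = {
    "plants": {"P:1": ["Iris"], "P:2": ["Salix"]},
    "anatomy": {"A:1": ["iris"]},
}
HOMONYMS = find_homonyms(ONTOLOGIES)
```

Terms flagged this way could then be reviewed by a curator or handled with ontology-dependent black and white lists.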
ADVANTAGES
Screenshot of HugeEdit, showing a large data file of compounds and their names.
OntoChem’s ontologies, together with OCMiner®, are ideally suited for high-speed, high-quality annotation and search of large data volumes. For example, on the PubMed abstracts annotated in the demo application www.ocminer.com, the chemistry search term “heterocyclic compounds” retrieves 3,124,129 hit documents, while a native PubMed search (http://www.ncbi.nlm.nih.gov/pubmed?term=heterocyclic) finds 24,524 hit documents. Using the cell line “SKMEL-28” as a search term retrieves 296 documents, while the native PubMed search (http://www.ncbi.nlm.nih.gov/pubmed?term=skmel-28) delivers 26 hit documents.
OntoChem GmbH Heinrich-Damerow-Str. 4 06120 Halle Germany info@ontochem.com www.ontochem.com