Session 2, Track A: Automation of Knowledge Extraction and Ontology Learning

Session Chair: Gary Berg-Cross

The meeting page with the connection information is at http://bit.ly/2oryr7k
The chatroom is at http://bit.ly/2lRq4h5

Context:

Building & maintaining knowledge bases & ontologies is hard work and could use some automated help.

Perception:

Various parts of AI, such as NLP and machine learning, are developing rapidly and could offer help.

Approach:

As with Session 1, we aim to bring together various researchers to discuss the issues and the state of the art.

Sample Questions:

What is the range of methods used to extract knowledge and build ontologies and other knowledge structures? How have techniques been enhanced and expanded over time? What issues of knowledge building and reuse have been noted? Are there hybrid efforts?

Agenda and Speakers

Speakers:

Michael Yu (UCSD) - "Inferring the hierarchical structure and function of a cell from millions of biological measurements". Slides

Abstract: A cell operates at many physical scales. For example, genetic variation in nucleotides (1 nm) gives rise to functional changes in proteins (1–10 nm), which in turn affect protein complexes, cellular processes, pathways, organelles (10 nm–1 μm), and, ultimately, phenotypes observed in cells (1–10 μm). In the first half of the talk, I will present a general strategy for automatically inferring these cellular subsystems and their hierarchical organization based on millions of experimental measurements. The result, a “data-driven gene ontology”, complements the biological knowledge found by manual curation of literature. In the second half of the talk, I will also show how a gene ontology can be applied not only to describe cell structure but also to predict cell functions, such as growth rate, from this structure. Predictions made in this way outperform those by alternative methods that do not take advantage of the hierarchical knowledge in an ontology.

Short Bio: Michael Yu is a Bioinformatics Ph.D. student in Trey Ideker’s laboratory at UC San Diego. His current research focuses on designing algorithms for integrating large “omics” datasets into predictive models of molecular biology and human disease. Prior to his Ph.D., he studied comparative genomics at MIT, where he received his Bachelor’s in mathematics and Master’s in computer science.

Francesco Corcoglioniti (Post-doc at Fondazione Bruno Kessler, Italy)

"Frame-based Ontology Population from text with PIKES" Slides

Abstract: PIKES (http://pikes.fbk.eu/) is an open source tool for ontology population from natural language English text that extracts RDF triples according to FrameBase, a Semantic Web ontology derived from FrameNet. Processing is split into two decoupled phases: (i) linguistic feature extraction, where several NLP tools are used to produce an RDF graph of mentions, i.e., snippets of text denoting some entity or fact; and (ii) knowledge distillation, where the mention graph is mapped via rules to produce a knowledge graph, whose content is linked to DBpedia and organized around semantic frames, i.e., prototypical descriptions of events and situations. A single RDF/OWL representation is used, in which each triple is related to the mentions and tools it comes from. This talk provides an overview of the PIKES approach, its implementation, and related and future research developments. Full Article
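The two-phase idea in the abstract above (a mention graph, then rule-based distillation into a knowledge graph with provenance) can be illustrated with a minimal sketch in plain Python. This is not PIKES code and does not use its API: the mention identifiers, the `ex:` names, the frame and role labels, and the three rules are all hypothetical placeholders, and the phase (i) output is written by hand rather than produced by NLP tools.

```python
# Phase (i) output, hand-written for illustration: a tiny "mention graph"
# as (subject, predicate, object) triples. All names are hypothetical.
MENTION_GRAPH = [
    ("m1", "type", "FrameMention"),
    ("m1", "frame", "Buying"),
    ("m1", "text", "bought"),
    ("m2", "type", "EntityMention"),
    ("m2", "text", "Alice"),
    ("m3", "type", "EntityMention"),
    ("m3", "text", "a car"),
    ("m1", "role:Buyer", "m2"),
    ("m1", "role:Goods", "m3"),
]


def distill(mentions):
    """Phase (ii): map mentions to knowledge triples via simple rules.

    Returns a list of (triple, source_mention) pairs, so every knowledge
    triple stays linked to the mention it was derived from.
    """
    types = {s: o for s, p, o in mentions if p == "type"}
    frames = {s: o for s, p, o in mentions if p == "frame"}
    knowledge = []
    for mention, kind in types.items():
        if kind == "FrameMention":
            # Rule 1: a frame mention becomes an event instance typed by its frame.
            triple = (f"ex:event_{mention}", "rdf:type", f"ex:frame-{frames[mention]}")
            knowledge.append((triple, mention))
        elif kind == "EntityMention":
            # Rule 2: an entity mention becomes an entity instance.
            triple = (f"ex:entity_{mention}", "rdf:type", "ex:Entity")
            knowledge.append((triple, mention))
    for s, p, o in mentions:
        if p.startswith("role:"):
            # Rule 3: a role link between mentions becomes a role triple
            # between the corresponding instances.
            role = p[len("role:"):].lower()
            triple = (f"ex:event_{s}", f"ex:{role}", f"ex:entity_{o}")
            knowledge.append((triple, s))
    return knowledge


if __name__ == "__main__":
    for triple, source in distill(MENTION_GRAPH):
        print(triple, "<- derived from", source)
```

Keeping the source mention next to each derived triple is the point of the sketch: it mirrors, in a very reduced form, the abstract's statement that each triple is related to the mentions it comes from.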

Short bio: Francesco Corcoglioniti is a post-doc researcher at Fondazione Bruno Kessler (FBK), where he previously carried out the research for his Ph.D. in Computer Science, awarded by the University of Trento in 2016. His research interests span the areas of Semantic Web, Data Management and Natural Language Processing, and focus on the extraction, modeling, processing, and storage of knowledge from natural language text and social media.

Evangelos Pafilis (Hellenic Center for Marine Research [HCMR]) -

"EXTRACT 2.0: interactive extraction of environmental and biomedical contextual information." Slides

Abstract: EXTRACT (http://extract.hcmr.gr, http://extract.jensenlab.org) is an interactive web-based annotation tool that employs a basic text mining technique, Named Entity Recognition (NER), to identify and extract standards-compliant terms for the annotation of metagenomic and biomedical records. A fast dictionary-based tagger [1] constitutes EXTRACT’s core. The tagger relies on a set of dictionaries that map biological names to corresponding terms in biological ontologies or to pertinent records in public biological databases, depending on the entity type. For ontology-described entity types, the NER dictionaries are built from the ontology term names and synonyms: once extracted, these are subjected to a series of filtering, rule-based expansion, and manual curation steps. In its first version, EXTRACT supported the identification of environment descriptor, tissue, disease, and organism mentions in text [2]. Environment Ontology, Brenda Tissue Ontology, and Disease Ontology terms and NCBI Taxonomy database records, respectively, were used to this end (see [2] for references to these web resources). The aim of this effort was to explore easy-to-use methods to assist the annotation of metagenomics records with standards-compliant metadata. In this context, EXTRACT participated in the BioCreative V interactive annotation task (BCV-IAT) [3]. In its present version, and via the work described in [4], EXTRACT has been extended to support a wider scope of biological record annotation (e.g. protein function). In particular, it now also supports the identification of genes/proteins, PubChem Compound identifiers, and Gene Ontology terms.
This talk will describe the need for standards-compliant metagenomics record annotation; the EXTRACT architecture, focusing on Environment Ontology [5] term identification; the EXTRACT web interface; and its performance in BCV-IAT. It will also briefly present the current version.

References – Resources
[1] Pafilis, E. et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS One, 8, e65390.
[2] Pafilis, E. et al. (2016) EXTRACT: Interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. Database, 2016, baw005.
[3] Wang, Q. et al. (2016) Overview of the interactive task in BioCreative V. Database, 2016, baw119.
[4] Pafilis, E. et al. (2017) EXTRACT 2.0: text-mining-assisted interactive annotation of biomedical named entities and ontology terms. bioRxiv.
[5] Buttigieg, P.L. et al. (2016) The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation. J. Biomed. Semantics, 7, 57.
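The dictionary-based tagging described in the EXTRACT abstract above can be illustrated with a minimal sketch. It is not the tagger from [1]: it is a naive longest-match lookup over whitespace tokens, and the dictionary entries, the `EXAMPLE:` identifiers, and the function name `tag` are hypothetical placeholders chosen only to show how names and synonyms can map to typed identifiers.

```python
import re

# Hypothetical toy dictionary: lower-cased name or synonym -> (entity type, identifier).
# The identifiers are placeholders, not real ontology or database records.
DICTIONARY = {
    "sea water": ("environment", "EXAMPLE:0001"),
    "seawater": ("environment", "EXAMPLE:0001"),
    "liver": ("tissue", "EXAMPLE:0002"),
    "escherichia coli": ("organism", "EXAMPLE:0003"),
    "e. coli": ("organism", "EXAMPLE:0003"),
}

# Longest dictionary name, in words; bounds the lookup window.
MAX_WORDS = max(len(name.split()) for name in DICTIONARY)

# Punctuation trimmed from both ends of a candidate span before lookup.
EDGE_PUNCTUATION = ".,;:()"


def tag(text):
    """Return (start, end, matched text, entity type, identifier) tuples.

    At each token position the longest dictionary match wins, and
    matches never overlap.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the widest window first so multi-word names beat their parts.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            raw = text[start:end]
            core = raw.strip(EDGE_PUNCTUATION)
            key = " ".join(core.lower().split())
            if key in DICTIONARY:
                offset = raw.find(core)
                entity_type, identifier = DICTIONARY[key]
                hits.append((start + offset, start + offset + len(core),
                             core, entity_type, identifier))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits


if __name__ == "__main__":
    sample = "Samples of sea water contained Escherichia coli."
    for hit in tag(sample):
        print(hit)
```

Synonyms share an identifier in the toy dictionary, which is the part of the design the abstract emphasises: different surface names resolve to the same standard term. The filtering, rule-based expansion, and manual curation steps that the abstract mentions for building real dictionaries are not modelled here.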

Short Bio: Evangelos Pafilis is a Postdoctoral Researcher at the Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC) at the Hellenic Center for Marine Research (HCMR), Crete, Greece. Originally trained as a biologist, Evangelos specialized in Bioinformatics, and in particular in literature mining and data integration. He initially developed these skills in a biomedical research context (PhD in Bioinformatics, EMBL/University of Heidelberg). At IMBBC/HCMR he is exploring how text mining, data integration, and interactive web application development can be applied and/or extended to serve the information extraction needs of additional biological fields, such as microbiology, ecology and biodiversity.

Ontolog Forum
