Ontolog Forum
Ontology Summit 2014: (Track-D) "Tackling the Variety Problem in Big Data" Synthesis
Track Co-champions: Ken Baclawski and Anne Thessen
Mission Statement
Track D will discuss the integration of ontologies with Big Data. The focus will be on those aspects of Big Data that are most likely to benefit from the use of ontologies, primarily the problem of managing the complexity and variety of the data being processed and analyzed. Here are some of the sources of complexity and variety that will form the starting point for the discussion:
- Background knowledge of the domain within which the data is being produced, processed and analyzed
- Dealing with information requirements
- Data governance, including ingestion management
- The structure of the data
  - This is especially important for the columnar databases commonly used in Big Data.
- Provenance of the data, including any transformations, analyses and interpretations of the data that have been performed
- Processing workflows, including merging and mapping data from multiple sources
- Privacy concerns
The mission of the track is to:
- Identify potential solutions for tackling the sources of complexity and variety in Big Data.
- Develop specific use cases for the integration of ontologies with Big Data tools and techniques to tackle the variety problem.
Track Plan and Deliverables
- Elaborate the list of sources of Big Data variety that have the greatest potential for benefiting from the use of ontologies.
- Identify potential solutions for the list developed in the first phase.
- Find example Big Data projects for which we can propose how ontology techniques could be applied, and express these as use cases for ontologies in Big Data.
- Develop the write-up (a section) that synthesizes the results of the discourse in this track as a contribution to the OntologySummit2014.Communique.
- Gather citations to relevant resources for the OntologySummit2014_CommunityResources library.
see also: OntologySummit2014_Tackling_Variety_in_BigData_CommunityInput
Sessions
Draft Synthesis
Big Data has emerged as a field that can address important societal and commercial needs. Big Data is distinguished from traditional data processing when one or more of volume, velocity, and variety become so large or so complex that traditional techniques are inadequate. Ontologies have the potential for a significant impact on Big Data. We have identified a number of ways that ontologies and semantic technologies could benefit Big Data.
One of the more traditional, but still very important, uses of ontologies in data analytics is to formalize the domain. Usually this takes the form of a taxonomy of vocabulary terms, but increasingly these taxonomies are being enhanced with properties and axioms. A number of difficulties (which could also be viewed as opportunities) with the use of ontologies for domain formalization include:
- Existing ontologies are usually insufficient for an application and must be extended.
- Terminology used at one time for one set of data might have a different meaning than what appears to be the same terminology used at a different time for another set of data. For ontologies to deal with this effectively, they must not only evolve over time but also map the previous meanings to the new ones.
- Existing semantic tools are not always compatible with the tools and workflows already in use in a domain.
A more recent use of ontologies for data analytics that has potential for high impact is the management of data provenance. Currently, most Big Data projects handle provenance in an ad hoc rather than systematic manner. Developing standard ontologies for commonly used, but informal, process models such as the OODA (observe-orient-decide-act) loop could have a significant impact on data analytics. Standard ontologies for statistical reasoning are another area with potential for high impact.
Reusing existing ontologies is important for tool interoperability as well as for reducing the time and cost of ontology development. One underexploited source of reuse is the informal models that appear in many domains, scientific or not. An example is the OODA loop, which was originally developed for military purposes but is now recognized as relevant in virtually every domain. Opportunities for reuse depend not only on having good technologies for supporting reuse but also on developing ontologies as relatively small modules, since smaller ontologies are more likely to be reusable.
Second Draft Synthesis (1 April 2014)
Big Data has emerged as a field that can address important societal and commercial needs. Big Data is distinguished from traditional data processing when one or more of volume, velocity, and variety become so large or so complex that traditional techniques are inadequate. Ontologies have the potential for a significant impact on Big Data by addressing the challenges associated with variety. We have identified a number of ways that ontologies and semantic technologies could help address the variety problem in Big Data.
Most Big Data projects use relatively simple columnar databases. These databases usually lack the system catalogs and foreign keys of more traditional databases. Moreover, the data has a very high degree of diversity. Losing track of what the data means is commonplace. Ontologies can play a critical role in the governance of such databases. As one speaker put it, "It's all about the metadata!"
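As one concrete pattern (a minimal sketch; the catalog namespace, column names, and ontology terms are all hypothetical), the metadata for a columnar table can be kept as ontology-linked triples alongside the data, so that the meaning of each column is never lost:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

# Hypothetical namespaces for the data catalog and the domain ontology.
EX = Namespace("http://example.org/catalog/")
ONT = Namespace("http://example.org/ontology/")

g = Graph()
g.bind("ex", EX)
g.bind("ont", ONT)

# Describe one column of a columnar table that has no system catalog:
# what its values denote in the domain ontology, and in which unit.
col = EX["sales_2014.col_07"]
g.add((col, RDFS.label, Literal("quarterly_revenue")))
g.add((col, RDFS.comment, Literal("Revenue per quarter, in thousands of USD.")))
g.add((col, ONT.denotes, ONT.Revenue))
g.add((col, ONT.unit, ONT.KiloUSD))

print(g.serialize(format="turtle"))
```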
Traditionally, ontologies have taken the form of a taxonomy of vocabulary terms, but increasingly these ontologies are being enhanced with properties, axioms and rules to address the requirements of large-scale multi-source integration efforts. A single indicator can require the integration of many types of ontologies, along with measurement rules and meta-information rules. Existing ontologies are usually insufficient for an application and must be extended, and the available semantic tools are not always compatible with the tools and workflows that are commonly used in a domain.
There are many challenges in the use of ontologies on larger scales and over longer periods of time:
- Terminology used at one time for one set of data might have a different meaning than what appears to be the same terminology used at a different time for another set of data. For ontologies to deal with this effectively, they must not only evolve over time but also map the previous meanings to the new ones (see the sketch after this list).
- Domain experts use terms ambiguously and disagree on the definitions of terms.
- As knowledge changes in active areas of research, axioms and ontologies will have to change. Axioms that were true 50 years ago are not necessarily true today. Users need measures of quality that include referencing the source of an axiom or term definition.
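One way to record such meaning shifts is to give each dated sense of a term its own SKOS concept and map the old sense to the new one explicitly. Below is a minimal sketch (the vocabulary URI, the term, and the dates are all hypothetical):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

# Hypothetical vocabulary with one concept per dated sense of a term.
V = Namespace("http://example.org/vocab/")

g = Graph()
g.bind("skos", SKOS)
g.bind("v", V)

# The 2005 and 2014 senses of "sea ice extent" are distinct concepts.
old, new = V.SeaIceExtent_2005, V.SeaIceExtent_2014
for c in (old, new):
    g.add((c, SKOS.prefLabel, Literal("sea ice extent", lang="en")))

g.add((old, SKOS.changeNote,
       Literal("Superseded in 2014 when the measurement threshold changed.")))
# closeMatch: the senses are similar enough to interchange in some uses,
# but are not asserted to be identical.
g.add((old, SKOS.closeMatch, new))

print(g.serialize(format="turtle"))
```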
Ontologies can tackle variety in Big Data by aiding the annotation of data and metadata. Data sets will differ in completeness of metadata, granularity and terms used. Ontologies can reduce some of this variety by normalizing terms and filling in absent metadata.
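A minimal sketch of that idea in plain Python (the synonym table, default values, and record fields are hypothetical; the table stands in for what would be extracted from an ontology's preferred and alternate labels):

```python
# Hypothetical synonym table, as would be extracted from an ontology's
# skos:prefLabel / skos:altLabel annotations.
CANONICAL = {
    "sea surface temperature": "sea_surface_temperature",
    "sst": "sea_surface_temperature",
    "surface temp": "sea_surface_temperature",
}

DEFAULTS = {"units": "kelvin", "granularity": "daily"}

def normalize(record: dict) -> dict:
    """Map a variable name to its canonical ontology term and
    fill in metadata fields that the source left out."""
    out = dict(DEFAULTS)
    out.update(record)
    term = record.get("variable", "").strip().lower()
    out["variable"] = CANONICAL.get(term, term)  # leave unknown terms as-is
    return out

print(normalize({"variable": "SST"}))
# -> {'units': 'kelvin', 'granularity': 'daily', 'variable': 'sea_surface_temperature'}
```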
A more recent use of ontologies for data analytics that has potential for high impact is the management of data provenance, including any transformations, analyses and interpretations of the data that have been performed. Currently, most Big Data projects handle provenance in an ad hoc rather than systematic manner. Ontologies for describing data provenance do exist, such as the W3C PROV-O ontology. Developing standard ontologies for commonly used, but informal, process models such as the OODA (observe-orient-decide-act) loop and the JDL/DFIG fusion models could have a significant impact on data analytics. The KIDS framework is an example of such a formalization. Standard ontologies for statistical reasoning are another area with potential for high impact.
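As an illustration, here is a minimal sketch (using rdflib's bundled PROV-O namespace; the dataset and activity names are hypothetical) that records a single cleaning step as provenance triples:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import PROV, RDF, XSD

EX = Namespace("http://example.org/data/")

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

# raw_v1 was cleaned by one activity, producing clean_v1.
g.add((EX.raw_v1, RDF.type, PROV.Entity))
g.add((EX.clean_v1, RDF.type, PROV.Entity))
g.add((EX.cleaning_run_42, RDF.type, PROV.Activity))

g.add((EX.cleaning_run_42, PROV.used, EX.raw_v1))
g.add((EX.clean_v1, PROV.wasGeneratedBy, EX.cleaning_run_42))
g.add((EX.clean_v1, PROV.wasDerivedFrom, EX.raw_v1))
g.add((EX.cleaning_run_42, PROV.endedAtTime,
       Literal("2014-04-01T12:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```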
Reusing existing ontologies is important for tool interoperability as well as for reducing the time and cost of ontology development. One underexploited source of reuse is the informal models that appear in many domains, scientific or not. An example is the OODA loop, which was originally developed for military purposes but is now recognized as relevant in virtually every domain. Opportunities for reuse depend not only on having good technologies for supporting reuse but also on developing ontologies as relatively small modules, since smaller ontologies are more likely to be reusable.
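In OWL, such modular reuse is expressed with owl:imports. A minimal sketch (the module URIs are hypothetical) of an application ontology assembled from two small modules:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDF

# Hypothetical application ontology assembled from two reusable modules.
app = URIRef("http://example.org/onto/sea-ice-app")
modules = [
    URIRef("http://example.org/onto/ice-core"),
    URIRef("http://example.org/onto/observation-core"),
]

g = Graph()
g.bind("owl", OWL)
g.add((app, RDF.type, OWL.Ontology))
for m in modules:
    # owl:imports pulls each module into the application ontology.
    g.add((app, OWL.imports, m))

print(g.serialize(format="turtle"))
```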
At the global level, there are too many domains to have very deep semantics common to them all. Nevertheless, Schema.org has been tackling the formidable problem of developing a generally accepted vocabulary, which is now used by over 5 million domains, and is gradually introducing deeper semantics. Incorporating ontologies into the Schema.org framework is challenging but has the potential for significant benefits.
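For example, Schema.org markup is plain JSON-LD that any page can embed; a minimal sketch follows (the dataset description itself is hypothetical):

```python
import json

# Hypothetical dataset description using Schema.org's Dataset vocabulary.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Arctic Sea Ice Observations 2014",
    "description": "Daily sea ice observations merged from multiple sources.",
    "variableMeasured": "sea ice extent",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}

# Embed the result in a page as <script type="application/ld+json">.
print(json.dumps(dataset, indent=2))
```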
Use Cases
- Harvest from data partners
  - Rather than build everything yourself, make use of collaborators. You still have a lot of work to do converting and formalizing the input and integrating the sources, but the result can be of very high quality and comes with a built-in user community.
  - Nathan Wilson Presentation gave an example from the EOL community.
  - Mark Fox Presentation
  - Rosario Uceda-Sosa Presentation gave an example of harvesting data and metadata from cities.
  - Ruth Duerr Presentation combined input from native communities in the Arctic.
- Modular development
  - It is much easier to reuse a smaller ontology, and several small ontologies can be combined to create one that satisfies most of the requirements.
  - Ruth Duerr Presentation used this approach to develop an ontology for sea ice.
- Reuse
  - Every speaker in the two sessions had examples of this use case.
- Formalize existing informal models
  - Eric Chan Presentation formalized the OODA loop and then extended and generalized it.
  - Ruth Duerr Presentation formalized techniques for describing sea ice conditions.
- Develop ontology with extension points
  - Eric Chan Presentation developed a framework for observation and decision making, but rather than immediately specializing it to a particular domain, he built the framework with extension points to allow ease of reuse.
- Involve communities
  - This is similar to the Harvest from data partners use case, but is listed separately to emphasize that community involvement is important even if the communities do not directly contribute to the ontology.
  - Ruth Duerr Presentation used this in her work on the ontology of sea ice.
- Governance framework
  - This framework uses relatively deep inference over axioms and rules to ensure data quality.
  - Malcolm Chisholm Presentation
- Vocabulary pipeline
- Pattern matching
- Information ecosystem
- Bridge axioms
  - Use axioms at both the data and metadata levels to bridge the gap between the semantics of data from different sources (see the sketch after this list).
- These use cases are now on their own page along with contributions from other tracks: UseCases_Of_AppliedOntology_In_SemanticWeb_BigData
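To make the bridge-axioms use case concrete, here is a minimal sketch (the two source vocabularies and their terms are hypothetical) of axioms asserting that terms from two sources coincide, so that a reasoner can merge their data:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS

# Two hypothetical source vocabularies describing the same domain.
A = Namespace("http://example.org/sourceA/")
B = Namespace("http://example.org/sourceB/")

g = Graph()
g.bind("owl", OWL)

# Metadata-level bridges: A's class and property line up with B's.
g.add((A.WeatherStation, OWL.equivalentClass, B.ObservationSite))
g.add((A.airTemp, OWL.equivalentProperty, B.temperature))
# A weaker bridge: every A reading counts as a B observation.
g.add((A.Reading, RDFS.subClassOf, B.Observation))

print(g.serialize(format="turtle"))
```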
--
maintained by the Track co-champions ... please do not edit