Semantic Database Systems
At a glance
Massive data sets possess not just great size, but also great complexity, with many heterogeneous elements linked by relationships of many different types. Graph-theoretical representations can improve the ability of users and data owners to find meaningful patterns and query results in such data sets. They are especially powerful for pattern finding in networks that are sparse in attributes, carry qualitative attributes (labels, categories), are rich in random-access connections, and contain patterns and paths of indefinite length. The new semantic graph database (SGD) paradigm uses ontological systems for typing schema; large labeled, directed graphs for data; graph pattern matching for query; and recursive query languages for graph analysis. One popular SGD format uses the prominent OWL/RDF/SPARQL technology stack; we position these technologies against related models and languages such as Datalog. The Semantic Database Task for CASS develops high-performance software platforms for SGDs in massively multithreaded environments; advances scalable methods for analyzing and representing SGDs in terms of their structural and semantic properties; and performs R&D in hybrid data environments that pair multithreaded platforms with traditional relational databases (RDBs), key-value stores, and distributed cloud environments.
"Technologies in massive, complex graphs are essential for big data applications in eScience and national security. We are using the strengths of multithreaded environments to drive innovations providing data analysts with powerful new capabilities to find meaningful patterns in reasonable times as the size and complexity of data sets inevitably grow." - Task Lead Cliff Joslyn, Pacific Northwest National LaboratoryWhat we do
With a focus on SGDs, we are developing novel analytical techniques that help identify features of multirelational data at both the structural and semantic levels, opening new opportunities for knowledge discovery and efficient query. Their massive shared memory and multithreading make platforms like the Cray XMT ideal candidates for hosting SGDs. Our overall goal is to provide a software and algorithms base that will make it easy to use multithreaded architectures for semantic database applications. Our three foci are:
- Research in combinatorial and information-theoretical approaches to identifying prominent semantic patterns in data; graph compression and automorphism; and methods for hybrid graph/relational data, queries, formalisms, platforms, data models, and languages.
- Engineering of a prototype semantic graph database capability for large-memory, multi-threaded platforms.
- Analysis of massive SGD data, and identification of characteristic benchmark data and test suites.
Multi-threaded architectures are known to be good at solving graph problems characterized by sparseness and irregularity. Semantic graphs generalize such graphs: edges are directed, and nodes and edges have types and other attributes. In turn, these types have a logical structure, reflected in an ontological typing system, that supports semantically constrained query and inference. We are developing efficient graph data structures capable of supporting these general graphs, along with algorithms that search them for user-specified patterns. Optimizing the search queries is critical to the viability of the system.
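As a rough illustration of the ideas above (a typed, directed, labeled graph; a type hierarchy that supports inference; and search for a user-specified pattern), here is a minimal single-threaded Python sketch. It is not the task's software: the triple data, the predicate names `type` and `subClassOf`, the `?`-prefixed variable convention, and the functions `subclass_closure` and `match` are all assumptions made for the example, and the naive nested-loop search ignores the query optimization and multithreading that a real system would need.

```python
# Hypothetical toy data: (subject, predicate, object) triples forming a
# labeled, directed graph. All names here are illustrative only.
TRIPLES = {
    ("p53", "type", "Protein"),
    ("mdm2", "type", "Enzyme"),
    ("p21", "type", "Protein"),
    ("Enzyme", "subClassOf", "Protein"),
    ("Protein", "subClassOf", "Molecule"),
    ("mdm2", "inhibits", "p53"),
    ("p53", "activates", "p21"),
}


def subclass_closure(triples):
    """Return triples plus inferred 'subClassOf' and 'type' edges.

    Repeats two rules until no new edge appears:
      (a subClassOf b), (b subClassOf c) => (a subClassOf c)
      (x type b),       (b subClassOf c) => (x type c)
    """
    closed = set(triples)
    while True:
        new = set()
        for (a, p, b) in closed:
            if p not in ("subClassOf", "type"):
                continue
            for (c, q, d) in closed:
                if q == "subClassOf" and b == c:
                    new.add((a, p, d))
        if new <= closed:  # fixed point reached
            return closed
        closed |= new


def is_var(term):
    """Pattern terms starting with '?' are variables (assumed convention)."""
    return term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, triples, candidate)


# Pattern: some ?x inhibits some ?y, where ?y is (at least) a Molecule.
QUERY = [("?x", "inhibits", "?y"), ("?y", "type", "Molecule")]

raw_hits = list(match(QUERY, TRIPLES))
inferred_hits = list(match(QUERY, subclass_closure(TRIPLES)))
```

In this toy data no node is declared a `Molecule` directly, so the pattern matches only after the type hierarchy has been closed. This is one simple sense in which an ontological typing system can constrain and extend query results.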
Within the framework of an SGD, functions beyond building the graph and searching for patterns (querying) are essential to realizing the full potential of the knowledge representation and information retrieval system. These functions include inferencing, such as RDFS closure, OWL Horst semantics, and rule-based languages. In addition to these foundational features, we will investigate extensions such as expressing (e.g., in SPARQL or Datalog) and executing path and other subgraph queries.
Applications
Our prototype capabilities are being applied to a range of data sets, with an emphasis on benchmark standards but also including massive compendia of computational and systems biology information.
Bob Adolf, PNNL
Sinan al-Saffar, PNNL
Eric Goodman, Sandia National Laboratories
David Haglin, PNNL
Larry Holder, Washington State University
Bill Howe, University of Washington
Cliff Joslyn, Task Lead, PNNL