Billion Triple Challenge 2010
High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases
Abstract:
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. We present the results of our exploration for the analysis of the Billion Triple dataset with respect to its semantic factors, including basic properties, connected components, namespace interaction, and typed paths.
You may download our submission paper.
BTC2010 Base Statistics
We acquired BTC10 and verified it as an RDF graph with 3.2B <s, p, o, q> quads, which we projected to 1.4B unique <s, p, o> triples, ignoring the quad field (useful for provenance and other operations but not for analyzing the main content). We identified duplicates by hashing the triples, now of integers, into a shared hash table, in 10 min. 37 sec. The entire process of converting the data from string to integers, removing the quad field, and deduplicating, compressed BTC10 from 624 GBs to 32 GBs.
The basic statistics we extracted from the 1.4B triples include:
We measured BTC10's very low graph density of 1.8 × 10−8 links/node2
The number of unique subjects is much smaller than the number of unique objects, indicating a much larger in-degree of objects to out-degree of subjects. The most prevalent subjects are "containers", each (e.g. Bestbuy) pointing to a single category of a large number of objects (e.g. Offers). The most prevalent objects are types, virtually all (e.g. foaf:person) the destination of rdf:type predicates.
Download: Most frequent 100,000 Subjects and Counts
Download: Most frequent 100,000 Objects and Counts
Download: All 95,228 Predicates and Counts
Download: the 218 most frequent links in a graphical display
Connected Components
The Live Demo is currently not available. Check back soon ...
There are 208,300 components, one of which there is a giant component covering 278,400 vertices, or 99.8% of the total.
The Cray XMT computing Connected Components on the 1.4B-edge graph scales well as seen in the log-log plot of the time versus processors:
Namespace Architecture
To understand the relationships between the sources which generated the BTC dataset, we explore a summary metric for data linking among namespaces.
We call triples "linked" when two or more of the subject, predicate, or object map to different fully-qualified domain names.
Using our Cray XMT, we extracted domain names from each entity and computed a census of both the domains and linked triples. With this data, we crafted an interactive tool which visually presents the relationships between these namespaces.
Summary census information:
Typed Path Structures
We computed bigrams (paths of 2-edges/3-nodes) and trigrams (paths of 3 edges/4-nodes).
We note the prominence of low-frequency predicates in both the bigrams and trigrams. For example, consider the most frequent bigram <dgtwc:isPartOf, dgtwc:partial_data>, with a frequency of 35.8%. The constituent predicates have individual frequencies of only 0.0038% and 0.027% respectively.
