U.S. Department of Energy
Center for Adaptive Supercomputing - Multithreaded Architectures

Billion Triple Challenge 2010

High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases

BTC2010 Base Statistics

We acquired BTC10 and verified it as an RDF graph with 3.2B <s, p, o, q> quads, which we projected to 1.4B unique <s, p, o> triples by ignoring the quad field (useful for provenance and other operations, but not for analyzing the main content). After converting the strings to integers, we identified duplicates by hashing the integer triples into a shared hash table, which took 10 min. 37 sec. The entire process of converting the data from strings to integers, removing the quad field, and deduplicating compressed BTC10 from 624 GB to 32 GB.
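The pipeline described above can be sketched in a few lines of Python. This is only a minimal, single-threaded illustration, not the implementation used for BTC10: the function name `encode_and_dedupe`, the in-memory dictionary used to assign integer IDs, and the Python `set` standing in for the shared hash table are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple

Quad = Tuple[str, str, str, str]   # <s, p, o, q> as strings
Triple = Tuple[int, int, int]      # <s, p, o> as integer IDs


def encode_and_dedupe(quads: Iterable[Quad]) -> Tuple[Set[Triple], Dict[str, int]]:
    """Map terms to integers, drop the quad field, and remove duplicate triples.

    Hypothetical helper: a Python set plays the role of the shared hash table.
    """
    ids: Dict[str, int] = {}

    def term_id(term: str) -> int:
        # The first time a term is seen it receives the next free integer ID.
        return ids.setdefault(term, len(ids))

    triples: Set[Triple] = set()
    for s, p, o, _q in quads:      # the quad field is ignored
        triples.add((term_id(s), term_id(p), term_id(o)))
    return triples, ids


# Two quads that differ only in the quad field collapse into one triple.
quads = [
    ("a", "knows", "b", "g1"),
    ("a", "knows", "b", "g2"),
    ("b", "knows", "a", "g1"),
]
triples, ids = encode_and_dedupe(quads)
```

Storing fixed-width integers instead of repeated URI strings is what makes the deduplicated data so much smaller than the original text.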

The basic statistics we extracted from the 1.4B triples include:

  • We measured BTC10's very low graph density of 1.8 × 10⁻⁸ links/node²

  • We use these prefix abbreviations:

  • The number of unique subjects is much smaller than the number of unique objects, indicating a much larger in-degree of objects relative to the out-degree of subjects. The most prevalent subjects are "containers", each of which (e.g. Bestbuy) points to a large number of objects of a single category (e.g. Offers). The most prevalent objects are types, virtually all of which (e.g. foaf:person) are the destination of rdf:type predicates.

    Download: Most frequent 100,000 Subjects and Counts

    Download: Most frequent 100,000 Objects and Counts

    Download: All 95,228 Predicates and Counts

    Here we show the top 20 in each table:

  • Download: the 218 most frequent links in a graphical display
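As a rough illustration of how statistics like those listed above could be computed from integer triples, here is a minimal Python sketch. The function name `basic_stats` and its return layout are hypothetical, and the density is assumed to be the number of links divided by the square of the number of nodes (subjects and objects combined), which is our reading of the units links/node².

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Triple = Tuple[int, int, int]


def basic_stats(triples: Iterable[Triple], top_k: int = 20) -> Dict[str, object]:
    """Count subject, predicate, and object frequencies and the graph density."""
    subjects: Counter = Counter()
    predicates: Counter = Counter()
    objects: Counter = Counter()
    links = 0
    for s, p, o in triples:
        subjects[s] += 1      # out-degree of the subject
        predicates[p] += 1
        objects[o] += 1       # in-degree of the object
        links += 1
    # Assumption: nodes are all terms appearing as a subject or an object.
    nodes = len(set(subjects) | set(objects))
    return {
        "unique_subjects": len(subjects),
        "unique_predicates": len(predicates),
        "unique_objects": len(objects),
        "density": links / nodes ** 2 if nodes else 0.0,
        "top_subjects": subjects.most_common(top_k),
        "top_predicates": predicates.most_common(top_k),
        "top_objects": objects.most_common(top_k),
    }


stats = basic_stats([(0, 1, 2), (0, 1, 3), (4, 1, 2)], top_k=1)
```

`Counter.most_common(k)` returns the k most frequent entries with their counts, which corresponds to the "top 20" tables and the downloadable count lists.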


You can view this image of the top 218 links.

