I feel strongly in favor of one-line-per-fact.
Large RDF data sets have validity problems, and the difficulty of convincing
publishers that this matters indicates that this situation will continue.
I’ve thought a bit about the problem of the “streaming converter from Turtle to
N-Triples”. It’s true that this can be done in a streaming manner most of the time, but Turtle allows arbitrarily deep nesting, so the parser’s stack can grow without bound and you can’t say, strictly, that memory consumption is bounded.
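To make the unbounded-stack point concrete, here’s a tiny sketch (the URIs are placeholders, not from any real data set) that generates Turtle with arbitrarily deep blank-node nesting; a parser has to hold one open frame per `[` until the matching `]` arrives:

```python
def nested_turtle(depth):
    # Each '[' opens an anonymous blank node; a Turtle parser must keep a
    # frame on its stack until the matching ']' arrives, so memory grows
    # linearly with nesting depth, and the grammar puts no bound on it.
    head = "<http://example.org/s> <http://example.org/p> "
    return head + "[ <http://example.org/p> " * depth + "[]" + " ]" * depth + " .\n"

print(nested_turtle(2))
```

Nothing stops a publisher from emitting `nested_turtle(10**6)`, which is why strictly bounded-memory streaming conversion isn’t possible in the general case.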
It’s also very unclear to me how exactly to work around broken records and restart the
parser in the general case. It’s not hard to mock up examples where a simple recovery
mechanism works, but I dread the thought of developing one for commercial use where I’d
probably be playing whack-a-mole for edge cases for years.
There was a gap of quite a few years in the late ’90s when there weren’t usable
open-source web browsers because a practical web browser had to: (1) read broken markup,
and (2) render it exactly the same as Netscape 3. Commercial operations can get things
like this done by burning out programmers, who finally show up at a standup meeting one
day, smash their laptop and stomp out. It’s not so easy in the open source world where
you’re forced to use carrots and not sticks.
As for compression vs. internal format, I also have some thoughts, because for every product I’ve made in the last few years I tried a few different packaging methods before releasing the final version.
Gzip eats up a lot of the ‘markup bloat’ in N-Triples because recently used IRIs and prefixes will be in the dictionary. The minus is that the dictionary isn’t very big, so its contents are themselves bloated; there isn’t much entropy there, but the same markup bloat gets repeated hundreds of times, whereas putting the prefixes in a hash table might take more like 1,000 bytes total to represent them. When you prefix-compress RDF and then gzip it, you’ve got the advantage that the dictionary contains more entropy than it would otherwise. Even though gzip isn’t cutting out as much markup bloat, it is compressing against a better model of the document, so you get better results.
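A rough way to see the effect, with a made-up namespace and synthetic identifiers (nothing here is real data): subjects recur about 500 lines apart, which falls inside gzip’s 32 KB window for the prefix-compressed form but outside it for the full-IRI form:

```python
import gzip
import random

random.seed(0)
NS = "http://rdf.example.com/ns/"

# 500 subjects recur round-robin every 500 lines; objects are random ids.
subjects = [f"m.{random.randrange(16**6):06x}" for _ in range(500)]
pairs = [(subjects[i % 500], f"m.{random.randrange(16**6):06x}")
         for i in range(20000)]

# Plain N-Triples: full IRI markup repeated on every line (~114 bytes/line).
ntriples = "".join(f"<{NS}{s}> <{NS}similarTo> <{NS}{o}> .\n" for s, o in pairs)

# Prefix-compressed, Turtle-style: the shared prefix stated once (~39 bytes/line).
compact = f"@prefix ns: <{NS}> .\n" + \
          "".join(f"ns:{s} ns:similarTo ns:{o} .\n" for s, o in pairs)

raw_gz = len(gzip.compress(ntriples.encode()))
pfx_gz = len(gzip.compress(compact.encode()))
print(len(ntriples), raw_gz, len(compact), pfx_gz)
```

On this synthetic sample the prefix-compressed input gzips smaller; the margin on real data depends on how identifiers recur relative to the window size.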
As has been pointed out, sorting helps. Sorting in ?s ?p ?o . order helps partly because the sorting itself removes entropy (there are N! possible unsorted files and only one sorted one) and partly because the dictionary gets set up to roll together common ?s and ?s ?p prefixes the way Turtle does.
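A sketch of the sorting effect on synthetic data (the URIs and sizes here are invented): the same facts, shuffled vs. sorted in ?s ?p ?o order, then gzipped:

```python
import gzip
import random

random.seed(0)
# 2000 hypothetical subjects with 10 facts each: in shuffled order the
# repeats of a subject are typically far outside gzip's 32 KB window,
# but sorted in ?s ?p ?o order they sit on adjacent lines.
lines = [f"<http://ex.org/s{s}> <http://ex.org/p{p}> \"{random.randrange(10**9)}\" .\n"
         for s in range(2000) for p in range(10)]
random.shuffle(lines)

shuf_gz = len(gzip.compress("".join(lines).encode()))
sort_gz = len(gzip.compress("".join(sorted(lines)).encode()))
print(shuf_gz, sort_gz)
```

The sorted file compresses better because every repeated ?s and ?s ?p run lands next to its previous occurrence, exactly where the dictionary can exploit it.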
Bzip’s ability to work like a Markov chain with the element of chance taken out usually makes it more effective at compression than gzip, but I’ve noticed some exceptions. In
the original :BaseKB products, all of the nodes looked like
<http://rdf.basekb.com/ns/m.112az>
I found my ?s ?p ?o sorted data compressed better with gzip than bzip, and perhaps the
structure of the identifiers had something to do with it.
A big advantage of bzip is that the block-based nature of the compression means that blocks can be compressed and decompressed in parallel (pbzip2 is a drop-in replacement for bzip2), so the possible top speed of decompressing bzip data is in principle unlimited, even though bzip is a more expensive algorithm. Hadoop 1.1.0+ can even automatically decompress a bzip2 file and split the result into separate mappers.
Generally system performance is better if you read data out of pre-split gzip, but it is
just so easy to load a big bz2 in HDFS and point a lot of transistors at it.
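The parallelism rests on the fact that independently compressed bzip2 streams can simply be concatenated and still decompress as one file, which is essentially what pbzip2 does; Python’s bz2 module can demonstrate the property (the triples here are synthetic):

```python
import bz2

# Synthetic triples split into four chunks; each chunk is compressed
# independently, as pbzip2 would do in parallel worker threads.
chunks = [f"<http://ex.org/s{i}> <http://ex.org/p> \"{i}\" .\n".encode()
          for i in range(1000)]
parts = [bz2.compress(b"".join(chunks[i:i + 250])) for i in range(0, 1000, 250)]
concatenated = b"".join(parts)

# A standard decompressor accepts the concatenated streams as a single file.
assert bz2.decompress(concatenated) == b"".join(chunks)
```

Since each part is a self-contained stream, decompression can likewise be farmed out, one part per core.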
I am very much against blank nodes for ‘wiki-ish’ data that is shared between systems.
The fact that Freebase reifies “blank nodes” as CVTs means that we can talk about them on
the outside, reason about them, and then name them in order to interact with them on the
live Freebase system. By their nature, blank nodes defy the “anyone, anything, anywhere”
concept because they can’t be referred to. In the case of OWL that’s a feature not a bug
because you can really close the world because nobody can add anything to a lisp-list
without introducing a new node. Outside tiny tiny T-Boxes (say SUMO size), internal DSLs
like SPIN, or expressing algebraic sorts of things (i.e. describe the mixed eigenstates
of quarks in some Hadron), the mainstream of linked data doesn’t use them.
Personally I’d like to see the data published in Quad form and have the reification data
expressed in the context field. As much as possible, the things in the (?s ?p ?o) fields
should make sense as facts. Ideally you could reuse one ?c node for a lot of facts, such as when a number of them came in one transaction. You could ask for the ?c fields (show me
all estimates for the population of Berlin from 2000 to the present and who made them) or
you could go through the file of facts and pick the ?c’s that provide the point of view
that you need the system to have.
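As a sketch of how that query might look over quads (all names here are hypothetical, and the population figures are illustrative): each fact carries a ?c node for the transaction it arrived in, and the provenance hangs off the ?c node:

```python
# Hypothetical quads (?s ?p ?o ?c). One ?c node is shared by the facts that
# arrived in the same transaction; statements about the ?c node itself
# carry the provenance.
quads = [
    ("ex:Berlin", "ex:population", "3382169", "ex:tx1"),
    ("ex:Berlin", "ex:population", "3669491", "ex:tx2"),
    ("ex:tx1", "ex:estimatedBy", "ex:Census2000", "ex:meta"),
    ("ex:tx1", "ex:year", "2000", "ex:meta"),
    ("ex:tx2", "ex:estimatedBy", "ex:Census2019", "ex:meta"),
    ("ex:tx2", "ex:year", "2019", "ex:meta"),
]

# "Show me all estimates for the population of Berlin and who made them":
estimates = {}
for s, p, o, c in quads:
    if s == "ex:Berlin" and p == "ex:population":
        who = next(o2 for s2, p2, o2, _ in quads
                   if s2 == c and p2 == "ex:estimatedBy")
        estimates[o] = who

print(estimates)
```

The (?s ?p ?o) part still reads as a plain fact on its own; filtering on ?c is how you pick the point of view you want the system to adopt.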