I feel strongly about one-line-per-fact. Large RDF data sets have validity problems, and
the difficulty of convincing publishers that this matters suggests the
situation will continue.
I’ve thought a bit about the problem of the “streaming
converter from Turtle to N-Triples”. It’s true that this can be done in a
streaming manner most of the time, but Turtle’s nested syntax means the parser
keeps a stack that can get arbitrarily deep, so you can’t say, strictly, that
memory consumption is bounded.
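
For example, the bracketed blank-node syntax nests, and each open bracket is another frame the parser has to hold on to; nothing in the grammar bounds how deep a generator can take this (the data below is an invented example):

    @prefix ex: <http://example.com/> .
    # Each [ opens state that can't be flushed as N-Triples
    # until the matching ] arrives.
    ex:a ex:p [ ex:q [ ex:q [ ex:q [ ex:q ex:b ] ] ] ] .
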
It’s also very unclear to me how exactly to work around
broken records and restart the parser in the general case. It’s not hard
to mock up examples where a simple recovery mechanism works, but I dread
the thought of developing one for commercial use where I’d probably be playing
whack-a-mole with edge cases for years.
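
This is exactly what one-line-per-fact buys you: with N-Triples the “recovery mechanism” is just dropping the bad line and moving on. A minimal sketch, assuming rdflib is installed and using a made-up file name:

    from rdflib import Graph

    good, bad = 0, 0
    with open("dump.nt", encoding="utf-8") as f:   # "dump.nt" is a placeholder
        for line in f:
            if not line.strip() or line.lstrip().startswith("#"):
                continue                  # blank lines and comments
            try:
                Graph().parse(data=line, format="nt")
                good += 1                 # keep or forward the fact
            except Exception:
                bad += 1                  # skip just this one record
    print(f"kept {good} facts, dropped {bad} broken lines")
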
There was a gap of quite a few years in the late ’90s
when there weren’t usable open-source web browsers because a practical web
browser had to: (1) read broken markup, and (2) render it exactly
the same as Netscape 3. Commercial operations can get things like this
done by burning out programmers, who finally show up at a standup meeting
one day, smash their laptops and stomp out. It’s not so easy in the
open source world where you’re forced to use carrots and not sticks.
As far as compression vs. the inner format goes, I also have some
thoughts, because for every product I’ve made in the last few years I’ve
tried a few different packaging methods before settling on the final
release.
Gzip eats up a lot of the ‘markup bloat’ in N-Triples
because recently used IRIs and prefixes will be in the dictionary. The
downside is that the dictionary isn’t very big, so the contents of the
dictionary itself are bloated; there isn’t much entropy there, yet
the same markup bloat gets repeated hundreds of times, whereas if you just put the
prefixes in a hash table it might take more like 1000 bytes total to represent
them. When you prefix-compress RDF and then gzip it, you’ve got the
advantage that the dictionary contains more entropy than it would
otherwise. Even though gzip isn’t cutting out as much markup bloat,
it is compressing against a better model of the document, so you get better
results.
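
A rough way to see this is to gzip the same facts written out with full IRIs and written with prefixes, and compare the sizes. The sketch below uses invented IRIs, and the exact numbers will depend on the data:

    import gzip

    # Invented IRIs, purely for a size comparison.
    full = b"".join(
        b"<http://example.com/resource/item%d> "
        b"<http://example.com/ontology/linksTo> "
        b"<http://example.com/resource/item%d> .\n" % (i, i + 1)
        for i in range(10000)
    )
    prefixed = (
        b"@prefix ex: <http://example.com/resource/> .\n"
        b"@prefix ont: <http://example.com/ontology/> .\n"
        + b"".join(
            b"ex:item%d ont:linksTo ex:item%d .\n" % (i, i + 1)
            for i in range(10000)
        )
    )
    for name, data in [("full IRIs", full), ("prefixed", prefixed)]:
        print(name, len(data), "->", len(gzip.compress(data)))
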
As has been pointed out, sorting helps. If
you sort in ?s ?p ?o order it helps, partly because the sorting
itself removes entropy (there are N! possible orderings of the file but only one sorted
one) and partly because the dictionary is being set up to roll together common ?s and
?s ?p prefixes the way Turtle does.
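
The same sort of quick experiment shows the effect; the triples below are invented, but shuffled and sorted versions of the same lines gzip quite differently:

    import gzip, random

    # 20,000 invented triples with heavily repeated subjects and predicates.
    lines = [
        '<http://example.com/s%d> <http://example.com/p%d> "%d" .'
        % (i % 500, i % 7, i)
        for i in range(20000)
    ]
    random.shuffle(lines)
    shuffled = "\n".join(lines).encode()
    sorted_ = "\n".join(sorted(lines)).encode()
    print("shuffled:", len(gzip.compress(shuffled)))
    print("sorted:  ", len(gzip.compress(sorted_)))
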
Bzip’s ability to work like a Markov chain with the
element of chance taken out usually makes it more effective at compression than gzip,
but I’ve noticed some exceptions. In the original :BaseKB
products, where all of the nodes had identifiers with the same machine-generated structure,
I found my ?s ?p ?o sorted data compressed better with gzip than
bzip, and perhaps the structure of the identifiers had something to do
with it.
A big advantage of bzip is that the block-based nature of the
compression means that blocks can be compressed and decompressed in parallel
(pbzip2 is a drop-in replacement for bzip2), so the possible top
speed of decompressing bzip data is in principle unlimited, even though
bzip is a more expensive algorithm. Hadoop, as of version 1.1.0, can even
automatically decompress a bzip2 file and split the result into separate
mappers. Generally system performance is better if you read data out of
pre-split gzip files, but it is just so easy to load a big bz2 into HDFS and point
a lot of transistors at it.
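
Pre-splitting isn’t much work either. A sketch of the kind of thing I mean, with a made-up file name and an arbitrary chunk size; each part is a valid N-Triples file on its own, so readers can work on the parts independently:

    import gzip, itertools

    CHUNK = 1_000_000  # lines per part; arbitrary, tune for your cluster
    with open("dump.nt", encoding="utf-8") as f:   # "dump.nt" is a placeholder
        for part in itertools.count():
            lines = list(itertools.islice(f, CHUNK))
            if not lines:
                break
            with gzip.open("dump-part-%05d.nt.gz" % part, "wt",
                           encoding="utf-8") as out:
                out.writelines(lines)
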
I am very much against blank nodes for ‘wiki-ish’ data that is shared
between systems. The fact that Freebase reifies “blank nodes” as CVTs
means that we can talk about them on the outside, reason about them,
and then name them in order to interact with them on the live Freebase
system. By their nature, blank nodes defy the “anyone, anything,
anywhere” concept because they can’t be referred to. In the case of OWL
that’s a feature, not a bug: you can really close the world, because nobody
can add anything to a lisp-list (an RDF collection) without introducing a new node. Outside of
tiny, tiny T-Boxes (say SUMO size), internal DSLs like SPIN, or
expressing algebraic sorts of things (e.g., describing the mixed eigenstates of
the quarks in some hadron), the mainstream of linked data doesn’t use
them.
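
The lisp-list case looks like this in Turtle, where the ( … ) collection expands into a chain of blank nodes linked by rdf:first and rdf:rest, so a third party has no node it can point at to extend the list (the class and members here are invented):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix ex:  <http://example.com/> .
    # Expands to _:b1 rdf:first ex:Apple ; rdf:rest _:b2 . and so on.
    ex:Fruit owl:oneOf ( ex:Apple ex:Orange ex:Banana ) .
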
Personally I’d like to see the data published in Quad form and have the
reification data expressed in the context field. As much as
possible, the things in the (?s ?p ?o) fields should make sense as
facts. Ideally you could reuse one ?c node for a lot of facts, such
as when a number of them came in as one transaction. You could query against the ?c
fields (show me all estimates for the population of Berlin from 2000 to the
present and who made them) or you could go through the file of facts and pick
the ?c’s that provide the point of view that you need the system to have.
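
In N-Quads that could look something like the sketch below, where several facts share one ?c node and the statements about where they came from hang off that node in an ordinary graph; all of the IRIs and numbers here are invented for illustration:

    # Invented example data.
    <http://example.com/Berlin> <http://example.com/population> "3431675" <http://example.com/tx/2010-01> .
    <http://example.com/Berlin> <http://example.com/areaKm2> "891.8" <http://example.com/tx/2010-01> .
    <http://example.com/Berlin> <http://example.com/population> "3644826" <http://example.com/tx/2019-03> .
    <http://example.com/tx/2010-01> <http://purl.org/dc/terms/source> <http://example.com/AmtFuerStatistik> <http://example.com/meta> .
    <http://example.com/tx/2019-03> <http://purl.org/dc/terms/source> <http://example.com/AmtFuerStatistik> <http://example.com/meta> .
    <http://example.com/tx/2019-03> <http://purl.org/dc/terms/created> "2019-03-01" <http://example.com/meta> .
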