On 07.02.20 14:32, Guillaume Lederrey wrote:
> Keeping all of Wikidata in a single graph is most probably not going
> to work long term. We have not found examples of public SPARQL
> endpoints with > 10 B triples and there is probably a good reason for
> that. We will probably need to split the graphs at some point. We
> don't know how yet (that's why we loaded the dumps into Hadoop, that
> might give us some more insight). We might expose a subgraph with only
> truthy statements. Or have language specific graphs, with only
> language specific labels. Or something completely different.
I have not looked in detail at query runtimes or at how Blazegraph
indexing works internally, but I have noticed that queries involving
SPARQL property paths (and especially joins of those) often take a long
time to run. At the same time, I recently discovered that if we only
store which entity is connected to which other entity (without storing
the actual statement details, such as the property, qualifiers or
ranks), the whole set takes up only about 2 GB compressed with
Zstandard (I represented each connection as <32-bit int source entity>
<32-bit int destination entity>). Of course that discards a lot of
important information, but it made me wonder whether something could be
done to evaluate queries more efficiently, given the relatively strict
schema that the RDF representation of Wikidata adheres to (since it is
generated from a more structured form, Statements). For example,
Blazegraph doesn't know about the relationship between wdt:Pxxx and
p:Pxxx, or about patterns like p:Pxxx/ps:Pxxx.
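
To make the packed representation concrete, here is a rough sketch
(Python; all names are made up, it assumes an entity-to-integer mapping
already exists, and it uses the "zstandard" package):

    import struct
    import zstandard as zstd  # pip install zstandard

    def pack_edges(edges, entity_to_int):
        """Pack (source, destination) entity pairs as consecutive 32-bit ints."""
        buf = bytearray()
        for src, dst in edges:
            buf += struct.pack("<II", entity_to_int[src], entity_to_int[dst])
        return bytes(buf)

    # Toy mapping; the real one would cover all entity IDs.
    entity_to_int = {"Q42": 0, "Q5": 1, "Q1": 2}
    edges = [("Q42", "Q5"), ("Q5", "Q1")]

    packed = pack_edges(edges, entity_to_int)      # 8 bytes per connection
    compressed = zstd.ZstdCompressor(level=19).compress(packed)

That is 8 bytes per connection before compression; Zstandard then
brings the full dump down to the roughly 2 GB mentioned above.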
Another, somewhat related idea: perhaps it would be possible to keep
the SPARQL interface as the frontend, but use a more efficient, split
representation of the graph in the backend? I'm not sure how different
that would be from the indexing Blazegraph already does, though.
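
Purely as an illustration of what I have in mind (made-up names again;
a real backend would of course be far more involved): with one
adjacency map per property, built from packed pairs like the ones
above, a path such as wdt:P279* becomes a plain graph traversal rather
than a generic triple join:

    from collections import deque

    # adj["P279"] maps an entity's int ID to the int IDs it points to
    # via P279 (subclass of), reconstructed from the packed pairs above.
    def closure(adj_for_prop, start):
        """Entities reachable from `start` via zero or more edges (e.g. wdt:P279*)."""
        seen, queue = {start}, deque([start])
        while queue:
            for nxt in adj_for_prop.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    # e.g. all (transitive) superclasses of Q42:
    # closure(adj["P279"], entity_to_int["Q42"])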
Regards,
Benno
PS: apologies to Guillaume if you receive this mail twice, I clicked
the wrong button when replying.