I hope that splitting the Wikidata dump into smaller, more functional chunks is something the Wikidata project considers.

It's probably less about splitting the dumps up and more about starting to split the main Wikidata namespace into more discrete areas, because without that the full Wikidata graph is hard to partition and the dumps are hard to split along any functional lines. For example, the latest Wikidata news was "The sixty-three millionth item, about a protein, is created." (yay!) - but there are lots and lots of proteins. If someone is mirroring Wikidata locally to speed up their queries for, say, an astronomy use case, having to download, store, and process a huge collection of triples about proteins only makes their life harder. Maybe some of these specialized collections should go into their own namespace, like "wikidata-proteins" or "wikidata-biology". The project could publish guidelines about how "notable" an item has to be before it gets moved into "wikidata-core". Hemoglobin - yeah, that probably belongs in "wikidata-core". "MGG_03181-t26_1", aka Q63000000 (a protein found in the rice blast fungus) - well, maybe that's not quite notable enough just yet, but it is certainly still valuable to some subset of the community.
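
To make "more functional chunks" concrete: even with today's monolithic dump you can carve out a subset yourself by streaming it and keeping only the items whose "instance of" (P31) values you care about. Here's a rough sketch in Python - the file names and the Q523 ("star") filter are just placeholders for an astronomy-flavored slice, and it writes plain JSON Lines rather than the dump's single-array format:

    # Rough sketch: stream the full JSON dump (one entity per line inside a
    # big JSON array) and keep only entities whose P31 ("instance of")
    # matches a class of interest. Q523 ("star") and the file names are
    # placeholders for an astronomy-flavored subset.
    import bz2
    import json

    KEEP_CLASSES = {"Q523"}  # item IDs to filter on (example: star)

    with bz2.open("wikidata-all.json.bz2", "rt", encoding="utf-8") as src, \
            open("astronomy-subset.ndjson", "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip the array brackets wrapping the dump
            entity = json.loads(line)
            statements = entity.get("claims", {}).get("P31", [])
            classes = {
                s["mainsnak"]["datavalue"]["value"]["id"]
                for s in statements
                if s["mainsnak"].get("snaktype") == "value"
            }
            if classes & KEEP_CLASSES:
                dst.write(json.dumps(entity) + "\n")  # one entity per line

Of course you still have to download and stream all 37GB once to do that, which is exactly why it would be so much nicer if the project published these slices (or separate namespaces) itself.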

Federated queries mean this isn't much harder to manage from a usability standpoint. If my local graph query processor/database knows it has large chunks of Wikidata mirrored locally, it doesn't need to make remote federated SPARQL calls to wikidata.org's WDQS to resolve my query - but if it stumbles across a graph item it needs to follow back across the network to wikidata.org, it can.
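
To illustrate that fallback: in SPARQL 1.1 you spell it out with a SERVICE clause (a smarter planner could decide this on its own). A minimal sketch in Python using plain HTTP against a hypothetical local endpoint - the localhost URL is a placeholder for whatever your own triple store exposes, and Q523/P2214 (star/parallax) are just example IDs:

    # Minimal federated-query sketch: the triple patterns outside SERVICE are
    # answered by the local mirror, and the SERVICE block reaches back over
    # the network to wikidata.org's WDQS for whatever the mirror doesn't hold.
    # The local endpoint URL and the astronomy query are placeholder examples.
    import requests

    LOCAL_ENDPOINT = "http://localhost:9999/sparql"  # placeholder

    QUERY = """
    PREFIX wd:   <http://www.wikidata.org/entity/>
    PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?star ?label ?parallax WHERE {
      # Resolved against the locally mirrored astronomy subset.
      ?star wdt:P31 wd:Q523 ;       # instance of: star
            wdt:P2214 ?parallax .   # parallax
      # Anything we didn't mirror gets fetched from wikidata.org.
      SERVICE <https://query.wikidata.org/sparql> {
        ?star rdfs:label ?label .
        FILTER(LANG(?label) = "en")
      }
    }
    LIMIT 10
    """

    resp = requests.get(
        LOCAL_ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["label"]["value"], row["parallax"]["value"])

The nice part is that the query shape doesn't change as more of the data lands locally; only where the SERVICE boundary sits does.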

And wikidata.org could and should still strive to manage as many entities in its knowledge base as possible, and to load as many of these different datasets as it can into the local graph database that feeds the WDQS, potentially even knowledge bases that don't originate from wikidata.org. That way, federated queries that previously would have had to make network calls can instead be folded into the local query plan and hopefully run much faster.

-Erik
 

On Fri, May 3, 2019 at 9:50 AM Darren Cook <darren@dcook.org> wrote:
> Wikidata grows like mad. This is something we all experience in the really bad
> response times we are suffering. It is so bad that people are asked what kind of
> updates they are running because it makes a difference in the lag times there are.
>
> Given that Wikidata is growing like a weed, ...

As I've delved deeper into Wikidata I get the feeling it is being
developed with the assumption of infinite resources, and no strong
guidelines about exactly what the scope is (i.e. where you draw the
line between what belongs in Wikidata and what does not).

This (and concerns about it being open to data vandalism) has
personally made me back off a bit. I'd originally planned to have
Wikidata be the primary data source, but I'm now leaning towards
keeping data tables and graphs outside it, with scheduled scripts to
import into and export from Wikidata.

> For the technical guys, consider our growth and plan for at least one year.

The 37GB (json, bz2) data dump file (it was already 33GB, twice the
size of the English Wikipedia dump, when I grabbed it last November)
is unwieldy. And, as no incremental changes are being published, it
is hard to create a mirror.

Can that dump file be split up in some functional way, I wonder?

Darren


--
Darren Cook, Software Researcher/Developer

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata