2017-10-31 14:56 GMT+01:00 Laura Morales lauretas@mail.com:
- I have downloaded it and I'm trying to use it, but the HDT tools (e.g. query) require building an index before I can use the HDT file. I've tried to create the index, but I ran out of memory again (even though the index is smaller than the .hdt file itself). So any Wikidata dump should contain both the .hdt file and the .hdt.index file, unless there is another way to generate the index on commodity hardware.
I've just loaded the provided HDT file on a big machine (32 GiB wasn't enough to build the index, but ten times that is more than enough), so here are a few interesting metrics:
- the index alone is ~14 GiB uncompressed, ~9 GiB gzipped and ~6.5 GiB xzipped;
- once loaded in hdtSearch, Wikidata uses ~36 GiB of virtual memory;
- right after index generation, that includes ~16 GiB of anonymous memory (with no memory pressure, that's ~26 GiB resident)…
- …but after a reload, the index is memory-mapped as well, so it only includes ~400 MiB of anonymous memory (and a mere ~1.2 GiB resident).
Looks like a good candidate for commodity hardware, indeed. It loads in less than one second on a 32 GiB machine. I'll try to run a few queries to see how it behaves.
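For what it's worth, the same mapped access works programmatically. Here is a minimal, untested sketch using the hdt-java library instead of the C++ hdtSearch tool (the file name wikidata.hdt is a placeholder, and I haven't checked whether the index hdt-java generates is byte-compatible with the one hdtSearch writes): mapIndexedHDT() memory-maps the .hdt plus its index, building the index on first use if the side file is missing, and empty strings in search() act as wildcards.

    import org.rdfhdt.hdt.hdt.HDT;
    import org.rdfhdt.hdt.hdt.HDTManager;
    import org.rdfhdt.hdt.triples.IteratorTripleString;
    import org.rdfhdt.hdt.triples.TripleString;

    public class HdtQueryExample {
        public static void main(String[] args) throws Exception {
            // Memory-map the HDT file and its index; the index is generated
            // on the first call if the side file is missing (the expensive step).
            HDT hdt = HDTManager.mapIndexedHDT("wikidata.hdt", null);
            try {
                // Empty strings are wildcards: here, every triple whose
                // predicate is rdf:type, with subject and object unbound.
                IteratorTripleString it = hdt.search(
                        "", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "");
                long shown = 0;
                while (it.hasNext() && shown < 10) {
                    TripleString ts = it.next();
                    System.out.println(ts);
                    shown++;
                }
            } finally {
                hdt.close();
            }
        }
    }

Since the file is mapped rather than loaded, the resident footprint should stay close to the figures above, and subsequent runs skip the index-generation step entirely.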
FWIW, my use case is very similar to yours, as I'd like to run queries that are too long for the public SPARQL endpoint and can't dedicate a powerful machine to this full time (Blazegraph runs fine with 32 GiB, though; it just takes a while to index, and updates can't keep pace with the rate of changes on wikidata.org).