Hello everyone,
I'd like to ask if Wikidata could please offer an HDT [1] dump along with the already available Turtle dump [2]. HDT is a binary format for storing RDF data, which is pretty useful because it can be queried from the command line, it can be used as a Jena/Fuseki source, and it also uses orders of magnitude less space to store the same data. The problem is that it's very impractical to generate an HDT file, because the current implementation requires a lot of RAM to convert a file; for Wikidata it will probably require a machine with 100-200 GB of RAM. This is unfeasible for me because I don't have such a machine, but if you have one to share, I can help set up the rdf2hdt software required to convert the Wikidata Turtle dump to HDT.
Thank you.
[1] http://www.rdfhdt.org/ [2] https://dumps.wikimedia.org/wikidatawiki/entities/
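For concreteness, this is roughly what the conversion step looks like with rdf2hdt from the hdt-cpp tools; the flag names are assumptions (the -i index flag is mentioned later in the thread), so check them against your local build:

```bash
# Rough sketch of the Turtle-to-HDT conversion with rdf2hdt from hdt-cpp.
# Flag names (-f for the input format, -i to also build the side-car index)
# are assumptions; verify against your build. The dictionary is built in RAM,
# which is why a machine with on the order of 100-200 GB of RAM is expected.
gunzip latest-all.ttl.gz
rdf2hdt -f turtle -i latest-all.ttl wikidata.hdt
```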
If HDT remains unfeasible, would it be an idea to place the Blazegraph journal file online? Yes, people would need to use Blazegraph if they want to access the file and query it, but it could be an extra next to the Turtle dump?
How would a blazegraph journal file be better than a Turtle dump? Maybe it's smaller in size? Simpler to use?
You can mount the jnl file directly in Blazegraph, so loading and indexing are not needed anymore.
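For anyone wondering what "mounting" the journal would look like in practice, a minimal sketch, assuming a published wikidata.jnl and a stock Blazegraph jar. The property key is Blazegraph's standard journal-file setting; the heap size and the system property are assumptions, and (as noted later in the thread) the journal only works with matching WDQS vocabulary/settings.

```bash
# Sketch: point a stock Blazegraph instance at a pre-built journal file instead
# of bulk-loading from Turtle. Paths and heap size are placeholders; the journal
# must have been created with compatible vocabulary/settings (see the WDQS caveat below).
cat > RWStore.properties <<'EOF'
com.bigdata.journal.AbstractJournal.file=/data/wikidata.jnl
EOF
java -server -Xmx8g -Dbigdata.propertyFile=RWStore.properties -jar blazegraph.jar
```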
Laura, Wouter, thank you, I did not know about HDT.
I found and share this resource: http://www.rdfhdt.org/datasets/
there is also a Wikidata dump in HDT
I am new to it: is it possible to store a weighted adjacency matrix as an HDT instead of an RDF?
Something like a list of entities for each entity, or even better a list of tuples for each entity, so that a tuple could be generalised with properties.
Here is an example with one property, 'weight', where an entity 'x1' is associated with a list of other entities, including itself: x1 = [(w1, x1) ... (wn, xn)]
On Fri, Oct 27, 2017 at 6:13 PM, Laura Morales lauretas@mail.com wrote:
You can mount the jnl file directly in Blazegraph, so loading and indexing are not needed anymore.
How much larger would this be compared to the Turtle file?
2017-10-27 18:51 GMT+02:00 Luigi Assom itsawesome.yes@gmail.com:
I found and share this resource: http://www.rdfhdt.org/datasets/
there is also Wikidata dump in HDT
The link to the Wikidata dump seems dead, unfortunately :'(
… but there's a file on the server: http://gaia.infor.uva.es/hdt/wikidata-20170313-all-BETA.hdt.gz (i.e. the link was missing the ".gz")
I will look into the size of the jnl file, but shouldn't that be located where Blazegraph is running for the SPARQL endpoint, or is this a special flavour? I was also thinking of looking into a GitLab runner which could occasionally generate an HDT file from the ttl dump, if our server can handle it, but for this an md5sum file would be preferable. Or would a timestamp be sufficient?
Jasper
Hey guys,
I don't know if you already knew about it, but you can use KBox for Wikidata, DBpedia, Freebase, Lodstats...
And yes, you can also use it to merge your graph with one of those....
https://github.com/AKSW/KBox#how-can-i-query-multi-bases
cheers, <emarx>
Hello emarx, many thanks for sharing KBox. Very interesting project! One question: how do you deal with different versions of a KB, like the case here of the Wikidata dump? Do you fetch their repo at some regular interval? Also, to avoid your users having to re-create the models, you could pre-load "models" from the LOV catalog.
Cheers, Ghislain
Also, to avoid your users having to re-create the models, you could pre-load "models" from the LOV catalog.
The LOV RDF dump, on the other hand, is broken. Or at least it still was the last time I checked. And I don't mean broken in the Wikidata sense, that is, with some wrong types; I mean broken in the sense that it doesn't validate at all (some triples are broken).
Hi Laura, thanks for reporting that. I remember one issue that I added here: https://github.com/pyvandenbussche/lov/issues/66
Please shout out or flag an issue on GitHub! That will help with the quality of the different datasets published out there 😊
Best, Ghislain
Sent from Mail for Windows 10
Thanks for reporting that. I remember one issue that I added here: https://github.com/pyvandenbussche/lov/issues/66
Yup, still broken! I've tried just now.
@Laura: you mean this list, http://lov.okfn.org/lov.nq.gz ? I can download it!!
Which one? Please send me the URL and I can fix it!!
Best, Ghislain
Sent from Mail for Windows 10
@Laura: you mean this list, http://lov.okfn.org/lov.nq.gz ? I can download it!!
Which one? Please send me the URL and I can fix it!!
Yes you can download it, but the nq file is broken. It doesn't validate because some URIs contain white spaces, and some triples have an empty subject (i.e. <>).
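If anyone wants to check this for themselves, a quick sketch using Jena's riot (the same validation tool mentioned later in this thread); the URL and file names are just the ones from the messages above:

```bash
# Download the LOV dump and let Jena's riot report the invalid triples
# (URIs containing white space, empty <> subjects, etc.).
curl -L -o lov.nq.gz http://lov.okfn.org/lov.nq.gz
gunzip lov.nq.gz
riot --validate lov.nq
```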
Hoi Ghislain,
On Sat, Oct 28, 2017 at 9:54 AM, Ghislain ATEMEZING ghislain.atemezing@gmail.com wrote:
Hello emarx, many thanks for sharing KBox. Very interesting project!
thanks
One question: how do you deal with different versions of a KB, like the case here of the Wikidata dump?
KBox works with so-called KNS (Knowledge Name Service) servers, so any dataset publisher can have their own KNS. Each dataset has its own KN (Knowledge Name) that is distributed over the KNS. E.g. the Wikidata dump is https://www.wikidata.org/20160801.
Do you fetch their repo at some regular interval?
No, the idea is that each organization will have its own KNS, so users can add the KNS that they want. Currently all datasets available in the KBox KNS are served by the KBox team. You can check all of them at kbox.tech, or using the command line (https://github.com/AKSW/KBox#how-can-i-list-available-knowledge-bases).
Also, to avoid your users having to re-create the models, you could pre-load "models" from the LOV catalog.
We plan to share all LOD datasets in KBox; we are currently discussing this with the W3C, and DBpedia might have its own KNS soon. Regarding the LOV catalog, you can help by simply asking them to publish their catalog in KBox.
best, <emarx/> http://emarx.org
Hoi Laura,
Thanks for the opportunity to clarify it. KBox is an alternative to other existing architectures for publishing KBs, such as SPARQL endpoints (e.g. LDFragments, Virtuoso) and dump files. I should add that you can do federated queries with KBox as easily as you can with SPARQL endpoints. Here is an example:
https://github.com/AKSW/KBox#how-can-i-query-multi-bases
You can use KBox either through the Java API or from the command prompt.
best, <emarx/> http://emarx.org
On Sat, Oct 28, 2017 at 1:16 PM, Laura Morales lauretas@mail.com wrote:
No, the idea is that each organization will have its own KNS, so users can add the KNS that they want.
How would this compare with a traditional SPARQL endpoint + "federated queries", or with "linked fragments"?
KBox is an alternative to other existing architectures for publishing KBs, such as SPARQL endpoints (e.g. LDFragments, Virtuoso) and dump files. I should add that you can do federated queries with KBox as easily as you can with SPARQL endpoints.
OK, but I still fail to see the value of this. Why would I want to use it rather than just start a Fuseki endpoint, or use linked fragments?
I agree that KBox is not suited to all scenarios; rather, it fits users that query a KG frequently and do not want to spend time downloading and indexing dump files. KBox bridges this cumbersome task and shifts query execution to the client, so there are no scalability issues. BTW, if you want to work with JavaScript you can also simply start a local endpoint:
https://github.com/AKSW/KBox/blob/master/README.md#starting-a-sparql-endpoin...
Hi!
Publishing the jnl file for Blazegraph may not be as useful as one would think, because the jnl file is specific to a particular vocabulary and certain other settings; i.e., unless you run the same version of the WDQS code (which customizes some of these), you won't be able to use the same file. Of course, since the WDQS code is open source, that may be good enough, so in general publishing such a file may be possible.
Currently, it's about 300 GB uncompressed. No idea how much compressed. Loading it takes a couple of days on a reasonably powerful machine, more on labs ones (I haven't tried to load the full dump on labs for a while, since labs VMs are too weak for that).
In general, I'd say it takes about 100 MB per million triples: less if the triples use repeated URIs, probably more if they contain a ton of text data.
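For reference, building your own journal with the right vocabulary and settings is done with the scripts shipped in the WDQS service distribution rather than by reusing someone else's jnl. A rough sketch, where the script names come from the WDQS documentation but the exact flags and paths are assumptions:

```bash
# Rough sketch of producing a local journal with the WDQS tooling (script names
# from the wikidata-query-rdf service distribution; flags and paths are assumptions).
curl -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
./munge.sh -f latest-all.ttl.gz -d data/split   # normalize the dump into loadable chunks
./runBlazegraph.sh &                            # Blazegraph started with the WDQS vocabulary/settings
./loadData.sh -n wdq -d data/split              # bulk-load the chunks; this is the multi-day step
```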
Hi, +1 to not sharing the jnl file! I agree with Stas that it doesn't seem a best practice to publish a journal file specific to a given RDF store (here Blazegraph). Regarding the size of that jnl file, I remember having one project with almost 500M for 1 billion triples (~1/2 the on-disk size of the dataset).
Best, Ghislain
Sent from Mail for Windows 10
Javier D. Fernández of the HDT team was very quick to fix the link :-)
One can contact them for support either on their forum or by email¹, as they are willing to help the Wikidata community make use of HDT.
Best regards,
is it possible to store a weighted adjacency matrix as an HDT instead of an RDF?
Something like a list of entities for each entity, or even better a list of tuples for each entity, so that a tuple could be generalised with properties.
Sorry, I don't know; you would have to ask the devs. As far as I understand it's a triplestore, and that should be it...
Dear Laura, others,
If somebody points me to the RDF data dump of Wikidata I can deliver an HDT version of it, no problem. (Given the current cost of memory I do not believe that the memory consumption for HDT creation is a blocker.)
--- Cheers, Wouter Beek.
Email: wouter@triply.cc WWW: http://triply.cc Tel: +31647674624
@Wouter: see here: https://dumps.wikimedia.org/wikidatawiki/entities/ ? Nice idea, Laura.
Ghislain
Hi Ghislain,
@Wouter: See here https://dumps.wikimedia.org/wikidatawiki/entities/ ?
Thanks for the pointer! I'm downloading from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.
The Content-Type header for that URI seems incorrect to me: it says `application/octet-stream`, but the file actually contains `text/turtle`. (For specifying the compression mechanism the Content-Encoding header should be used.)
The first part of the Turtle data stream seems to contain syntax errors for some of the XSD decimal literals. The first one appears on line 13,291:
```text/turtle
<http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> <http://wikiba.se/ontology-beta#geoPrecision> "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> .
```
Notice that scientific notation is not allowed in the lexical form of decimals according to XML Schema Part 2: Datatypes https://www.w3.org/TR/xmlschema11-2/#decimal. (It is allowed in floats and doubles.) Is this a known issue or should I report this somewhere?
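For a quick reproduction, this sketch feeds that single triple to Jena's riot (mentioned further down in the thread) with literal checking enabled; the exact warning text depends on the Jena version, and the file name is just a placeholder:

```bash
# Reproduce the datatype issue: xsd:decimal does not allow exponent notation,
# so riot's literal checking should warn about this triple; retyping the value
# as xsd:double makes it clean.
cat > geoprecision.nt <<'EOF'
<http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> <http://wikiba.se/ontology-beta#geoPrecision> "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> .
EOF
riot --validate geoprecision.nt
```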
--- Cheers!, Wouter.
Email: wouter@triply.cc WWW: http://triply.cc Tel: +31647674624
The first part of the Turtle data stream seems to contain syntax errors for some of the XSD decimal literals. The first one appears on line 13,291:
Notice that scientific notation is not allowed in the lexical form of decimals according to XML Schema Part 2: Datatypes (https://www.w3.org/TR/xmlschema11-2/#decimal). (It is allowed in floats and doubles.) Is this a known issue or should I report this somewhere?
I wouldn't call these "syntax" errors, just "logical/type" errors. It would be great if these could be fixed by changing the type from decimal to float/double. On the other hand, I've never seen any medium or large dataset without this kind of error, so I would personally treat these as warnings at worst.
@Wouter, when you build the HDT file, could you please also generate the .hdt.index file? With rdf2hdt, this should be activated with the -i flag. Thank you again!
Hi!
I wouldn't call these "syntax" errors, just "logical/type" errors. It would be great if these could be fixed by changing the type from decimal to float/double. On the other hand, I've never seen any medium or large dataset without this kind of error, so I would personally treat these as warnings at worst.
Float/double are range-limited and have limited precision. Decimals are not. Whether that matters for geo precision needs to be checked, but we could be hitting the limits of precision pretty quickly.
Hi, @Wouter: as Stas said, you might want to report that error. I don't agree with Laura's attempt to downplay that "syntax error". It's also about quality ;)
Many thanks in advance! @Laura: do you have a different rdf2hdt program, or the one from the HDT project's GitHub?
Best, Ghislain
Sent from my iPhone, may include typos
@Wouter: as Stas said, you might want to report that error. I don't agree with Laura's attempt to downplay that "syntax error". It's also about quality ;)
Don't get me wrong, I am all in favor of data quality! :) So if this can be fixed, all the better! The thing is that I've seen so many datasets with this kind of type error that by now I pretty much live with them, and I'm OK with these warnings (the triple is not broken after all, it just doesn't follow the standard).
@Laura: do you have a different rdf2hdt program, or the one from the HDT project's GitHub?
I just use https://github.com/rdfhdt/hdt-cpp compiled from the master branch. To verify data, I use riot (a command-line tool from the Apache Jena package) like this: `riot --validate file.nt`.
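For anyone wanting to reproduce that setup, a sketch of building the hdt-cpp tools from the master branch; the autotools workflow, the package names and the output path are assumptions based on the project's README, so adjust for your system:

```bash
# Sketch: build the hdt-cpp command-line tools (rdf2hdt, hdtSearch, hdtInfo) from master.
# Package names and build steps are assumptions based on the hdt-cpp README.
sudo apt-get install build-essential autoconf libtool pkg-config libserd-dev zlib1g-dev
git clone https://github.com/rdfhdt/hdt-cpp.git
cd hdt-cpp
./autogen.sh
./configure
make -j"$(nproc)"
# the resulting binaries end up under libhdt/tools/
```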
Hi!
The first part of the Turtle data stream seems to contain syntax errors for some of the XSD decimal literals. The first one appears on line 13,291:
<http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> <http://wikiba.se/ontology-beta#geoPrecision> "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> .
Could you submit a phabricator task (phabricator.wikimedia.org) about this? If it's against the standard it certainly should not be encoded like that.
I've added https://phabricator.wikimedia.org/T179228 to handle this. geoPrecision is a float value and assigning it the decimal type is a mistake. I'll review the other properties to see if we don't have more of these. Thanks for bringing it to my attention!
@Wouter
Thanks for the pointer! I'm downloading from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz now.
Any luck so far?
@Laura: I suspect Wouter wants to know whether he should "ignore" the previous errors and produce a somewhat imperfect dump (just for you), or wait for Stas' feedback. Btw, why don't you use the older version on the HDT website?
@Laura: I suspect Wouter wants to know whether he should "ignore" the previous errors and produce a somewhat imperfect dump (just for you), or wait for Stas' feedback.
OK. I wonder though if it would be possible to set up a regular HDT dump alongside the existing regular dumps. Looking at the dumps page, https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new dump is generated about once a week. So if an HDT dump could be added to the schedule, it would show up with the next dump and then with all future dumps. Right now even the Turtle dump contains the bad triples, so adding an HDT file now would not introduce more inconsistencies. The problem will be fixed automatically in future dumps once the Turtle is fixed (because the HDT is generated from the .ttl file anyway).
Btw, why don't you use the older version on the HDT website?
1. I have downloaded it and I'm trying to use it, but the HDT tools (e.g. query) require building an index before the HDT file can be used. I've tried to create the index, but I ran out of memory again (even though the index is smaller than the .hdt file itself). So any Wikidata dump should contain both the .hdt file and the .hdt.index file, unless there is another way to generate the index on commodity hardware (see the sketch after this list).
2. because it's 1 year old :)
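For context, a sketch of how that side-car index is usually produced with the hdt-cpp tools: it is written the first time a query tool memory-maps the file (or up front via rdf2hdt -i, as mentioned earlier), and that first load is exactly the step that runs out of memory on commodity hardware. Tool names are from hdt-cpp; the index file suffix can vary between versions.

```bash
# Sketch: the .hdt.index file is produced as a side effect of the first load by
# the hdt-cpp query tools (or up front with `rdf2hdt -i`). This first-load step
# is what exhausts RAM on commodity hardware, hence the request to ship the index.
hdtInfo wikidata.hdt     # prints header metadata; does not need the index
hdtSearch wikidata.hdt   # builds wikidata.hdt.index* on first run, then opens a search prompt
```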
Interesting use case, Laura! Your UC is rather "special" :) Let me try to understand... You are a "data consumer" with the following needs: the latest version of the data, quick access to the data, and no use of the current access channels offered by the publisher (endpoint, ttl dumps, LDFragments). However, you ask for a binary format (HDT), but you don't have enough memory to set up your own environment/endpoint. For that reason, you are asking the publisher to support both .hdt and .hdt.index files.
Do you think there are many users with your current UC?
I feel like you are misrepresenting my request, and possibly trying to offend me as well.
My "UC" as you call it, is simply that I would like to have a local copy of wikidata, and query it using SPARQL. Everything that I've tried so far doesn't seem to work on commodity hardware since the database is so large. But HDT could work. So I asked if a HDT dump could, please, be added to other dumps that are periodically generated by wikidata. I also told you already that *I AM* trying to use the 1 year old dump, but in order to use the HDT tools I'm told that I *MUST* generate some other index first which unfortunately I can't generate for the same reasons that I can convert the Turtle to HDT. So what I was trying to say is, that if wikidata were to add any HDT dump, this dump should contain both the .hdt file and .hdt.index in order to be useful. That's about it, and it's not just about me. Anybody who wants to have a local copy of wikidata could benefit from this, since setting up a .hdt file seems much easier than a Turtle dump. And I don't understand why you're trying to blame me for this?
If you are part of the wikidata dev team, I'd greatly appreciate a "can/can't" or "don't care" response rather than playing the passive-aggressive game that you displayed in your last email.
Doh, what's wrong with asking for support for one's own use case ("UC")?
I think it is a totally legit question to ask, and that's why this thread exists.
Also, I do support helping people access data that would be hard to process on "common" hardware. Especially in the case of open data: it exists so that someone can take it and build on it - amazing if you can prototype locally, right?
I don't like the use case where a data-scientist-or-IT person shows their work to other data-scientists-or-IT people looking for emotional support or praise. I've seen that, not here, and I hope that attitude indeed stays out of here.
I do like it when the work of a data-scientist-or-IT person ignites someone else's creativity - someone who is completely external - so that they say: hey, your work is cool and I want to use it for... my use case! That's how ideas go around and help other people build complexity on top of them, without constructing unnecessary borders.
About a local version of compressed, indexed RDF - I think that if it were available, more people probably would use it.
Hola, Please don’t get me wrong and don’t give any interpretation based on my question. Since the beginning of this thread, I am also trying to push the use of HDT here. For example, I was the one contacting HDT gurus to fix the dataset error on Twitter and so on...
Sorry if Laura or any one thought I was giving “some lessons here “. I don’t have a super computer either nor a member of Wikidata team. Just a “data consumer” as many here ..
Best, Ghislain
Sent from my iPhone, may include typos
I've just loaded the provided hdt file on a big machine (32 GiB wasn't enough to build the index but ten times this is more than enough), so here are a few interesting metrics:
- the index alone is ~14 GiB uncompressed, ~9 GiB gzipped and ~6.5 GiB xzipped;
- once loaded in hdtSearch, Wikidata uses ~36 GiB of virtual memory;
- right after index generation, it includes ~16 GiB of anonymous memory (with no memory pressure, that's ~26 GiB resident)…
- …but after a reload, the index is memory mapped as well, so it only includes ~400 MiB of anonymous memory (and a mere ~1.2 GiB resident).
Looks like a good candidate for commodity hardware, indeed. It loads in less than one second on a 32 GiB machine. I'll try to run a few queries to see how it behaves.
FWIW, my use case is very similar to yours, as I'd like to run queries that are too long for the public SPARQL endpoint and can't dedicate a powerful machine to this full time (Blazegraph runs fine with 32 GiB, though — it just takes a while to index, and updating is not as fast as the changes happening on wikidata.org).
I've just loaded the provided hdt file on a big machine (32 GiB wasn't enough to build the index but ten times this is more than enough)
Could you please share a bit about your setup? Do you have a machine with 320 GB of RAM? Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd be interested to read your results on this too. Thank you!
I'll try to run a few queries to see how it behaves.
I don't think there is a command-line tool to parse SPARQL queries, so you probably have to set up a Fuseki endpoint which uses HDT as a data source.
It's a machine with 378 GiB of RAM and 64 threads running Scientific Linux 7.2, that we use mainly for benchmarks.
Building the index was really all about memory because the CPUs have actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared to those of my regular workstation, which was unable to build it.
Could you please also try to convert wikidata.ttl to hdt using "rdf2hdt"? I'd be interested to read your results on this too.
As I'm also looking for up-to-date results, I plan to do it with the latest Turtle dump as soon as I have a time slot for it; I'll let you know about the outcome.
I'll try to run a few queries to see how it behaves.
I don't think there is a command-line tool to parse SPARQL queries, so you probably have to setup a Fuseki endpoint which uses HDT as a data source.
You're right. The limited query language of hdtSearch is closer to grep than to SPARQL.
Thank you for pointing out Fuseki, I'll have a look at it.
It's a machine with 378 GiB of RAM and 64 threads running Scientific Linux 7.2, that we use mainly for benchmarks.
Building the index was really all about memory because the CPUs have actually a lower per-thread performance (2.30 GHz vs 3.5 GHz) compared to those of my regular workstation, which was unable to build it.
If your regular workstation was using more CPU, I guess it was because of swapping. Thanks for the statistics; it means a "commodity" CPU could handle this fine, the bottleneck is RAM. I wonder how expensive it is to buy a machine like yours... it sounds like it's in the $30K-$50K range?
You're right. The limited query language of hdtSearch is closer to grep than to SPARQL.
Thank you for pointing out Fuseki, I'll have a look at it.
I think a SPARQL command-line tool could exist, but AFAICT it doesn't (yet?). Anyway, I have already successfully set up Fuseki with an HDT backend, although my HDT files are all small. Feel free to drop me an email if you need any help setting up Fuseki.
We are actually planning to buy a new barebone server and they are around €2,500, with barely any memory. I will check later with sales; 16 GB RAM sticks are around 200 euros at most, so below 10K should be sufficient?
Hi!
OK. I wonder though, if it would be possible to setup a regular HDT dump alongside the already regular dumps. Looking at the dumps page, https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a new dump is generated once a week more or less. So if a HDT dump could
True, the dumps run weekly. The "more or less" situation arises only if one of the dumps fails (either due to a bug or some sort of external force majeure).
Please take me out of these conversations.
Hello,
I am currently downloading the latest ttl file, on a 250 GB RAM machine. I will see if that is sufficient to run the conversion; otherwise we have another busy one with around 310 GB. For querying I use the Jena query engine. I have created a module called HDTQuery, located at http://download.systemsbiology.nl/sapp/, which is a simple program, still under development, that should be able to use the full power of SPARQL and be more advanced than grep… ;)
If this all works out I will check with our department whether we can set up, if it is still needed, a weekly cron job to convert the TTL file. But as it is growing rapidly we might run into memory issues later?
I am currently downloading the latest ttl file, on a 250 GB RAM machine. I will see if that is sufficient to run the conversion; otherwise we have another busy one with around 310 GB.
Thank you!
For querying I use the Jena query engine. I have created a module called HDTQuery, located at http://download.systemsbiology.nl/sapp/, which is a simple program, still under development, that should be able to use the full power of SPARQL and be more advanced than grep… ;)
Does this tool allow querying HDT files from the command line, with SPARQL, and without the need to set up a Fuseki endpoint?
If this all works out I will see with our department if we can set up if it is still needed a weekly cron job to convert the TTL file. But as it is growing rapidly we might run into memory issues later?
Thank you!
Yes, you just run it and you should get sufficient help output, and if not… I am more than happy to polish the code…
java -jar /Users/jasperkoehorst/Downloads/HDTQuery.jar
The following option is required: -query
Usage: <main class> [options]
  Options:
    --help
    -debug    Debug mode (default: false)
    -e        SPARQL endpoint
    -f        Output format, csv / tsv (default: csv)
    -i        HDT input file(s) for querying (comma separated)
    -o        Query result file
  * -query    SPARQL Query or FILE containing the query to execute

* required parameter
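Based on the options listed above, an invocation would presumably look something like the following; the file names are placeholders:

```bash
# Hypothetical HDTQuery invocation assembled from the options above; file names are placeholders.
java -jar HDTQuery.jar \
  -i wikidata.hdt \
  -query query.rq \
  -f tsv \
  -o results.tsv
```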
Laura Morales wrote on 01.11.2017 at 08:59:
For querying I use the Jena query engine. I have created a module called HDTQuery, located at http://download.systemsbiology.nl/sapp/, which is a simple program, still under development, that should be able to use the full power of SPARQL and be more advanced than grep… ;)
Does this tool allow querying HDT files from the command line, with SPARQL, and without the need to set up a Fuseki endpoint?
There is also a command line tool called hdtsparql in the hdt-java distribution that allows exactly this. It used to support only SELECT queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK queries too. There are some limitations, for example only CSV output is supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE. But it works fine, at least for my use cases, and is often more convenient than firing up Fuseki-HDT. It requires both the hdt file and the corresponding index file.
Code here: https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rd...
The tool is in the hdt-jena package (not hdt-java-cli where the other command line tools reside), since it uses parts of Jena (e.g. ARQ). There is a wrapper script called hdtsparql.sh for executing it with the proper Java environment.
Typical usage (example from hdt-java README):
# Execute SPARQL Query against the file.
$ ./hdtsparql.sh ../hdt-java/data/test.hdt "SELECT ?s ?p ?o WHERE { ?s ?p ?o . }"
-Osma
There is also a command line tool called hdtsparql in the hdt-java distribution that allows exactly this. It used to support only SELECT queries, but I've enhanced it to support CONSTRUCT, DESCRIBE and ASK queries too. There are some limitations, for example only CSV output is supported for SELECT and N-Triples for CONSTRUCT and DESCRIBE.
Thank you for sharing.
The tool is in the hdt-jena package (not hdt-java-cli where the other command line tools reside), since it uses parts of Jena (e.g. ARQ). There is a wrapper script called hdtsparql.sh for executing it with the proper Java environment.
Does this tool work nicely with large HDT files such as wikidata? Or does it need to load the whole graph+index into memory?
I haven't tested it with huge datasets like Wikidata. But for the moderately sized (40M triples) data that I use it for, it runs pretty fast and without using lots of memory, so I think it just memory-maps the hdt and index files and reads only what it needs to answer the query.
-Osma
Hello list,
a very kind person from this list has generated the .hdt.index file for me, using the 1-year-old Wikidata HDT file available at the rdfhdt website. So I was finally able to set up a working local endpoint using HDT+Fuseki. Setup was easy, launch time (for Fuseki) was also quick (a few seconds); the only change I made was to replace -Xmx1024m with -Xmx4g in the Fuseki startup script (btw I'm not very proficient in Java, so I hope this is the correct way).
I've run some queries too. Simple select or traversal queries seem fast to me (I haven't measured them but the response is almost immediate); other queries such as "select distinct ?class where { [] a ?class }" take several seconds or a few minutes to complete, which kinda tells me the HDT indexes don't work well on all queries. But otherwise, for simple queries it works perfectly! At least I'm able to query the dataset!
In conclusion, I think this more or less gives some positive feedback for using HDT on a "commodity computer", which means it can be very useful for people like me who want to use the dataset locally but can't set up a full-blown server. If others want to try as well, they can offer more (hopefully positive) feedback. For all of this, I wholeheartedly plead with any Wikidata dev to please consider scheduling an HDT dump (.hdt + .hdt.index) along with the other regular dumps that are created weekly.
Thank you!!
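For anyone who wants to replicate this, a rough sketch of a Fuseki configuration backed by an HDT file via the hdt-jena bindings; the hdt: vocabulary, the assembler class name and the paths follow the example shipped with hdt-java as far as I recall it, so treat them as assumptions and compare with the hdt-java README before relying on them.

```bash
# Rough sketch of a Fuseki service reading the HDT file through hdt-jena.
# Vocabulary, class name and paths are assumptions modeled on the hdt-java example
# config; verify them against your hdt-java version.
cat > wikidata-hdt.ttl <<'EOF'
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix hdt:    <http://www.rdfhdt.org/fuseki#> .

[] rdf:type fuseki:Server ;
   ja:loadClass "org.rdfhdt.hdtjena.HDTGraphAssembler" .

<#service> rdf:type fuseki:Service ;
    fuseki:name         "wikidata" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:dataset      <#dataset> .

<#dataset> rdf:type ja:RDFDataset ;
    ja:defaultGraph <#graph> .

<#graph> rdf:type hdt:HDTGraph ;
    hdt:fileName "/data/wikidata.hdt" .
EOF
JVM_ARGS=-Xmx4g ./fuseki-server --conf=wikidata-hdt.ttl   # the -Xmx bump mirrors the change described above
```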
Hi Laura,
Thank you for sharing your experience! I think your example really shows the power - and limitations - of HDT technology for querying very large RDF data sets. While I don't currently have any use case for a local, queryable Wikidata dump, I can easily see that it could be very useful for doing e.g. resource-intensive, analytic queries. Having access to a recent hdt+index dump of Wikidata would make it very easy to start doing that. So I second your plea.
-Osma
Thank you for this feedback, Laura.
Is the hdt index you got available somewhere on the cloud?
Cheers
I am uploading the index file temporarily to:
http://fungen.wur.nl/~jasperk/WikiData/
Jasper
Thank you very much, Jasper !
Thank you for this feedback, Laura. Is the hdt index you got available somewhere on the cloud?
Unfortunately it's not. It was a private link that was temporarily shared with me by email. I guess I could re-upload the file somewhere else myself, but my uplink is really slow (1Mbps).
I’ve created a Phabricator task (https://phabricator.wikimedia.org/T179681) for providing a HDT dump, let’s see if someone else (ideally from the ops team) responds to it. (I’m not familiar with the systems we currently use for the dumps, so I can’t say if they have enough resources for this.)
Cheers, Lucas
Thank you Lucas!
Hi everyone,
I'm afraid the current implementation of HDT is not ready to handle more than 4 billion triples, as it is limited to 32-bit indexes. I've opened an issue upstream: https://github.com/rdfhdt/hdt-cpp/issues/135
Until this is addressed, don't waste your time trying to convert the entire Wikidata to HDT: it can't work.
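As a side note, if you already have an HDT file, the hdt-cpp tools include hdtInfo, which prints the header metadata (including the triple count), so you can check how close a file is to that limit. The file name below is just an example:

# Print HDT header metadata (number of triples, dictionary sizes, ...)
$ hdtInfo wikidata.hdt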
Hi Jeremie, Thanks for this info. In the meantime, what about making chunks of 3.5 billion triples (or any size less than 4 billion) and a script to convert the dataset? Would that be possible?
Best, Ghislain
That seems possible to me, but I wonder whether cutting the dataset into independent clusters isn't a bigger undertaking than making HDT handle bigger datasets (I'm not saying it is, I've really no idea).
Best regards,
Hello,
a new dump of Wikidata in HDT (with index) is available at http://www.rdfhdt.org/datasets/. You will see how huge Wikidata has become compared to other datasets: it contains about twice the limit of 4B triples discussed above.
In this regard, what is, in 2018, the most user-friendly way to use this format?
BR,
Ettore
a new dump of Wikidata in HDT (with index) is available[http://www.rdfhdt.org/datasets/].
Thank you very much! Keep it up! Out of curiosity, what computer did you use for this? IIRC it required >512GB of RAM to function.
You will see how Wikidata has become huge compared to other datasets. it contains about twice the limit of 4B triples discussed above.
There is a 64-bit version of HDT that doesn't have this limitation of 4B triples.
In this regard, what is in 2018 the most user friendly way to use this format?
Speaking for me at least, Fuseki with a HDT store. But I know there are also some CLI tools from the HDT folks.
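For completeness, the CLI route with the hdt-cpp tools looks roughly like the sketch below (tool name as in hdt-cpp; the interactive behaviour may differ slightly between versions, so check the tool's help):

# Open an HDT file and query it interactively with triple patterns,
# where ? acts as a wildcard (e.g. "? ? ?" lists everything); type exit to quit
$ hdtSearch wikidata.hdt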
what computer did you use for this? IIRC it required >512GB of RAM to function.
Hello Laura,
Sorry for my confusing message, I am not at all a member of the HDT team. But according to its creator https://twitter.com/ciutti/status/1046849607114936320, 100 GB "with an optimized code" could be enough to produce an HDT like that.
You shouldn't have to keep anything in RAM to HDT-ize something: you could build the dictionary by sorting on disk, and also do the joins that look up everything against the dictionary by sorting.
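A minimal sketch of that idea, purely to illustrate the point (not how rdf2hdt is actually implemented): GNU sort is an external sort, so it spills to disk and keeps RAM bounded. terms.txt is assumed to hold one RDF term per line; robust term extraction from real Turtle is left out because literals make naive splitting unreliable.

# Build a sorted, de-duplicated term dictionary with bounded memory
LC_ALL=C sort -u -S 2G -T /scratch terms.txt > dict_sorted.txt
# Assign integer IDs to the dictionary entries
nl -ba -nln dict_sorted.txt > dictionary.tsv
# Triples could then be mapped to IDs column by column with sorted joins (join(1))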
Yes, but somebody has to write the code for it :) My understanding is that they keep everything in memory because it was simpler to develop. The problem is that graphs can become really huge, so this approach clearly doesn't scale too well.
100 GB "with an optimized code" could be enough to produce an HDT like that.
The current software definitely cannot handle Wikidata with 100GB of RAM. It was tried before and it failed. I'm glad to see that new code will be released to handle large files. After skimming that paper, it looks like they split the RDF source into multiple files and "cat" them into a single HDT file. 100GB is still a pretty large footprint, but I'm so glad that they're working on this. A 128GB server is *way* more affordable than one with 512GB or 1TB!
I can't wait to try the new code myself.
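If that release follows the split-and-merge approach described above, the workflow would presumably look something like the sketch below. rdf2hdt.sh and hdtCat.sh are the tool names used in the hdt-java distribution, but the options, suffixes and chunk size here are guesses, so check the tools' help output; the split options assume GNU coreutils.

# Split an N-Triples dump into chunks, convert each chunk, then merge the HDTs
split -l 500000000 -d --additional-suffix=.nt wikidata.nt chunk_
for f in chunk_*.nt; do rdf2hdt.sh -rdftype ntriples "$f" "${f%.nt}.hdt"; done
hdtCat.sh chunk_00.hdt chunk_01.hdt merged.hdt   # merge two at a time, repeating as needed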
How many triples does Wikidata have? The old dump from rdfhdt seems to have about 2 billion, which would mean Wikidata doubled its number of triples in less than a year?
The Wikidata Query Service currently holds some 3.8 billion triples – you can see the numbers on Grafana [1]. But WDQS “munges” the dump before importing it – for instance, it merges wdata:… into wd:… and drops `a wikibase:Item` and `a wikibase:Statement` types; see [2] for details – so the triple count in the un-munged dump will be somewhat larger than the triple count in WDQS.
Cheers, Lucas
[1]: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?panelId=7&... [2]: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_d...
Hi list,
I'm sorry, I was under the impression that I had already shared this resource with you earlier, but I haven't...
On 7 Nov I created an HDT file based on the then current download link from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
You can download this HDT file and its index from the following locations:
- http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt (~45GB)
- http://lod-a-lot.lod.labs.vu.nl/data/wikidata.hdt.index.v1-1 (~28GB)
You may need to compile with 64bit support, because there are more than 2B triples (https://github.com/rdfhdt/hdt-cpp/tree/develop-64). (To be exact, there are 4,579,973,187 triples in this file.)
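For reference, building that branch of hdt-cpp looks roughly like this; the steps are the usual autotools sequence but may differ between versions (and dependencies such as serd/zlib are not shown), so the repository README is authoritative:

# Build the 64-bit branch of hdt-cpp (steps may vary; see the repo README)
git clone -b develop-64 https://github.com/rdfhdt/hdt-cpp.git
cd hdt-cpp
./autogen.sh && ./configure && make -j4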
PS: If this resource turns out to be useful to the community we can offer an updated HDT file at a to be determined interval.
--- Cheers, Wouter Beek.
Email: wouter@triply.cc WWW: http://triply.cc Tel: +31647674624
On Tue, Nov 7, 2017 at 6:31 PM, Laura Morales lauretas@mail.com wrote:
drops `a wikibase:Item` and `a wikibase:Statement` types
off topic but... why drop `a wikibase:Item`? Without this it seems impossible to retrieve a list of items.
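One hedged workaround, since the munged dump still keeps the entity IRIs themselves: filter on the IRI prefix instead of the dropped type triple. The query below is only a sketch run through hdtsparql.sh with an assumed local file name, and it would be slow over the full dataset:

# Approximate a list of items by their IRI prefix (illustrative, likely slow)
$ ./hdtsparql.sh wikidata.hdt "SELECT DISTINCT ?item WHERE { ?item ?p ?o . FILTER(STRSTARTS(STR(?item), 'http://www.wikidata.org/entity/Q')) } LIMIT 100"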
* T H A N K Y O U *
Thank you very very much Wouter!! This is great! Out of curiosity, could you please share some info about the machine that you used to generate these files? In particular I mean hardware info, such as the model names of the mobo/CPU/RAM/disks. Also, how long it took to generate these files would be interesting to know.
PS: If this resource turns out to be useful to the community we can offer an updated HDT file at a to be determined interval.
This would be fantastic! Wikidata dumps about once a week, so I think even a new HDT file every 1-2 months would be awesome. Related to this however... why not use the Laundromat for this? There are several datasets that are very large, and rdf2hdt is really expensive to run. Maybe you could schedule regular jobs for several graphs (wikidata, dbpedia, wordnet, linkedgeodata, government data, ...) and make them available at the Laundromat?
Hi Wikidata community,
Somebody pointed me to the following issue: https://phabricator.wikimedia.org/T179681 Unfortunately I'm not able to log in to Phabricator, so I cannot edit the issue directly. I'm sending this email instead.
The issue seems to be stalled because it is not possible to create HDT files that contain more than 2B triples. However, this is possible in a specific 64-bit branch, which is how I created the downloadable version I sent a few days ago. As indicated, I can create these files for the community if there is a use case.
--- Cheers, Wouter.
Email: wouter@triply.cc WWW: http://triply.cc Tel: +31647674624
Hi!
Thank you, I've updated the task with references to your comments.
2017-11-07 17:09 GMT+01:00 Laura Morales lauretas@mail.com:
How many triples does wikidata have? The old dump from rdfhdt seem to have about 2 billion, which means wikidata doubled the number of triples in less than a year?
A naive grep | wc -l on the last Turtle dump gives me an estimate of 4.65 billion triples.
Looking at https://tools.wmflabs.org/wikidata-todo/stats.php it seems that Wikidata is indeed more than twice as big as only six months ago.
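For anyone wanting to reproduce such a rough count: the one-liner below is only one plausible way to do it (not necessarily the exact command used), and it over- or under-counts because of prefix declarations, abbreviated Turtle syntax and multi-line literals.

# Rough triple estimate: count lines that end a statement or object (approximate!)
zcat latest-all.ttl.gz | grep -c -E '[.;,][[:space:]]*$'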
Dear Laura, others,
If somebody points me to the RDF data dump of Wikidata I can deliver an HDT version for it, no problem. (Given the current cost of memory, I do not believe that the memory consumption for HDT creation is a blocker.)
This would be awesome! Thanks Wouter. To the best of my knowledge, the most up to date dump is this one [1]. Let me know if you need any help with anything. Thank you again!
[1] https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
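For the record, the conversion itself is a single rdf2hdt invocation along the lines of the sketch below; the exact options and memory behaviour depend on the hdt-cpp version, and the RAM requirement is precisely the problem discussed in this thread.

# Decompress the dump and convert it to HDT (memory-hungry with current rdf2hdt)
gunzip -k latest-all.ttl.gz
rdf2hdt latest-all.ttl wikidata.hdt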