Hi,

+1 to not sharing the jnl file!

I agree with Stas that publishing a journal file specific to one RDF store (here, Blazegraph) doesn’t seem to be a best practice.

Regarding the size of that jnl file, I remember one project where it was almost 500M for 1 billion triples (roughly half the on-disk size of the dataset).

 

Best,

Ghislain

 

 


 

From: Stas Malyshev
Sent: Saturday, 28 October 2017 08:42
To: Discussion list for the Wikidata project.; Jasper Koehorst
Subject: Re: [Wikidata] Wikidata HDT dump

 

Hi!

 

> I will look into the size of the jnl file but should that not be
> located where the blazegraph is running from the sparql endpoint or
> is this a special flavour? Was also thinking of looking into a gitlab
> runner which occasionally could generate a HDT file from the ttl dump
> if our server can handle it but for this an md5 sum file would be
> preferable or should a timestamp be sufficient?

 

Publishing a jnl file for Blazegraph may not be as useful as one would
think, because a jnl file is tied to a specific vocabulary and certain
other settings, i.e., unless you run the same version of the WDQS code
(which customizes some of these), you won't be able to use the file.
Of course, since the WDQS code is open source, that may be good
enough, so in general publishing such a file may be possible.

 

Currently, it's about 300G uncompressed. No idea how much it is
compressed. Loading it takes a couple of days on a reasonably powerful
machine, more on labs ones (I haven't tried to load the full dump on
labs for a while, since labs VMs are too weak for that).

 

In general, I'd say it takes about 100M per million triples. Less if
the triples use repeated URIs, probably more if they contain a ton of
text data.
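
To make that rule of thumb concrete, here is a minimal sketch of the
arithmetic. The function names are hypothetical, "M" and "G" are assumed
to mean decimal megabytes and gigabytes, and the result is only a
ballpark figure, not a measurement of any real journal.

```python
# Rule of thumb from this thread: roughly 100 MB of journal per million
# triples. This is an estimate only; actual size depends on URI reuse
# and on how much text data the triples carry.
MB_PER_MILLION_TRIPLES = 100.0


def estimate_journal_size_gb(triple_count, mb_per_million=MB_PER_MILLION_TRIPLES):
    """Rough journal size in GB (assuming 1 GB = 1000 MB)."""
    if triple_count < 0:
        raise ValueError("triple_count must be non-negative")
    return triple_count / 1_000_000 * mb_per_million / 1000


def estimate_triples_for_size_gb(size_gb, mb_per_million=MB_PER_MILLION_TRIPLES):
    """Inverse estimate: how many triples a journal of size_gb might hold."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return size_gb * 1000 / mb_per_million * 1_000_000


if __name__ == "__main__":
    # One billion triples under the 100 MB per million assumption.
    print(estimate_journal_size_gb(1_000_000_000), "GB")
    # Triple count implied by a 300 GB journal under the same assumption.
    print(int(estimate_triples_for_size_gb(300)), "triples")
```

Under these assumptions a billion triples comes out at around 100G, and
a 300G journal corresponds to something on the order of three billion
triples.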

 

--

Stas Malyshev

smalyshev@wikimedia.org

 

_______________________________________________

Wikidata mailing list

Wikidata@lists.wikimedia.org

https://lists.wikimedia.org/mailman/listinfo/wikidata