Hey all,
I've released version 2.0 of JSON Dump Reader, the PHP library for reading from Wikidata JSON dumps. This version has been tested with the latest dumps and is optimized for PHP 7.x users.
You can find usage and installation instructions at https://github.com/JeroenDeDauw/JsonDumpReader#jsondumpreader
Cheers
-- Jeroen De Dauw | https://entropywins.wtf | https://keybase.io/jeroendedauw Software Crafter | Speaker | Student | Strategist | Contributor to Wikimedia and Open Source ~=[,,_,,]:3
Nice!
I was recently doing something similar in Rust, but with a slightly different idea in mind. If you can implement it in PHP, that would be awesome. Here is the idea:
I have very little free space on my laptop, so I cannot really download the dump. What I wanted was to read the dump directly from the server, decompress it as a stream and parse each line separately (more or less what your tool does, minus downloading the dump first). The problem with this approach is that downloading the dump takes a while, and in that time a network hiccup can cause a failure somewhere in the middle. Then I need to restart the process and download everything again, which wastes bandwidth and CPU time. What if the tool stored its position and, in case of failure, resumed the download from that offset? That would make this package just awesome!
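A minimal sketch of the resume idea, for illustration only. It is not part of JSON Dump Reader; `readLinesWithResume` and `httpOpener` are hypothetical names. It assumes the total size is known up front (for example from a Content-Length header), that the server honours HTTP Range requests, and that the bytes served are the uncompressed, line-based dump. For a compressed dump, the same offset tracking would have to happen below the decompression layer.

```php
<?php
declare(strict_types=1);

/**
 * Reads a line-based stream to the end, reopening it at the last fully
 * processed byte offset whenever it ends early.
 *
 * @param callable $openAt    fn(int $offset): resource|false, a stream starting at $offset
 * @param int      $totalSize expected size in bytes
 * @param callable $onLine    fn(string $line): void, called once per complete line
 * @return int the number of bytes processed
 */
function readLinesWithResume(callable $openAt, int $totalSize, callable $onLine, int $maxRetries = 5): int
{
    $offset = 0;         // bytes of complete lines already passed to $onLine
    $failedAttempts = 0; // consecutive attempts that made no progress

    while ($offset < $totalSize) {
        $startedAt = $offset;
        $stream = $openAt($offset);

        while (is_resource($stream) && ($line = fgets($stream)) !== false) {
            $endsAtEof = $offset + strlen($line) === $totalSize;
            if (substr($line, -1) !== "\n" && !$endsAtEof) {
                break; // the stream ended mid-line: drop the fragment and reconnect
            }
            $offset += strlen($line);
            $onLine(rtrim($line, "\r\n"));
        }
        if (is_resource($stream)) {
            fclose($stream);
        }

        $failedAttempts = $offset === $startedAt ? $failedAttempts + 1 : 0;
        if ($failedAttempts > $maxRetries) {
            throw new RuntimeException("No progress at byte offset {$offset}");
        }
    }

    return $offset;
}

/**
 * Hypothetical opener for HTTP(S): requests the bytes from $offset onward.
 * Assumes the server answers Range requests with partial content.
 */
function httpOpener(string $url): callable
{
    return function (int $offset) use ($url) {
        $context = stream_context_create([
            'http' => ['header' => "Range: bytes={$offset}-"],
        ]);
        return fopen($url, 'r', false, $context);
    };
}
```

Only complete lines advance the offset, so a line cut off by a dropped connection is fetched again in full after reconnecting rather than being parsed as broken JSON.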
(In reply to Jeroen De Dauw's message of 14 August 2018 at 22:26.)
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Hey Aleksey,
The library allows you to access the position of the DumpReader and to resume from a stored position.
In the docs: https://github.com/JeroenDeDauw/JsonDumpReader#resume-reading-from-a-previous-position
PHP interface: https://github.com/JeroenDeDauw/JsonDumpReader/blob/master/src/SeekableDumpReader.php
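The idea of storing a position and resuming from it can be illustrated with plain PHP file functions. This is a minimal sketch, not the library's interface (see the links above for that): `processLinesFrom` is a hypothetical helper, and it assumes an extracted (uncompressed) dump with one JSON document per line.

```php
<?php
declare(strict_types=1);

/**
 * Processes up to $limit lines starting at byte $position and returns the
 * byte position to continue from next time.
 */
function processLinesFrom(string $path, int $position, int $limit, callable $onLine): int
{
    $handle = fopen($path, 'r');
    fseek($handle, $position); // jump to where the previous run stopped

    for ($i = 0; $i < $limit && ($line = fgets($handle)) !== false; $i++) {
        $onLine(rtrim($line, "\r\n"));
    }

    $position = ftell($handle); // byte offset just after the last line read
    fclose($handle);

    return $position;
}
```

The returned position can be written to disk after each batch, so an aborted run can pick up from the last saved offset instead of starting over.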
This functionality is used by Replicator, a CLI tool built on top of the JSON Dump Reader library.
Replicator: https://github.com/JeroenDeDauw/Replicator#replicator
Aborting and resuming imports with Replicator: https://github.com/JeroenDeDauw/Replicator#importing-extracted-json-dumps
Neither the library nor the CLI tool supports streaming dumps (unless that somehow magically ends up working). I'm happy to review pull requests with additions or enhancements.
Note that Replicator supports import from the Wikidata web API, including automatic fetching of dependencies. This works if you want to get a specific set of entities or just a few thousand for testing purposes. If you want all entities from Wikidata, this approach is of course not viable.
https://github.com/JeroenDeDauw/Replicator#importing-from-the-wikidataorg-api
Cheers
-- Jeroen De Dauw
I've looked at the code, and it seems that using a URL instead of a file path will just work, but the connection-failure use case won't be handled.
I have an idea for how it could be implemented using stream_wrapper_register(), but I don't have time to implement it, sorry.
Example of a class registered as a stream wrapper: https://secure.php.net/manual/en/stream.streamwrapper.example-1.php
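A rough sketch of what such a wrapper might look like, under stated assumptions. The `resumable://` scheme, the `ResumableStreamWrapper` class and its static `$opener` and `$expectedSize` settings are all hypothetical, and this has not been tested with JSON Dump Reader. The opener is expected to return a stream that starts at the given byte offset (over HTTP, for instance, by sending a Range header).

```php
<?php
declare(strict_types=1);

/**
 * Read-only stream wrapper that reconnects at the current byte offset when
 * the underlying stream ends before the expected size has been read.
 */
final class ResumableStreamWrapper
{
    /** @var callable fn(string $target, int $offset): resource|false (hypothetical hook) */
    public static $opener;

    /** @var int Expected total size in bytes, e.g. taken from Content-Length */
    public static $expectedSize = 0;

    /** @var resource|null Set by PHP when a stream context is passed to fopen() */
    public $context;

    private $target = '';
    private $inner;
    private $offset = 0;

    public function stream_open(string $path, string $mode, int $options, ?string &$openedPath): bool
    {
        $this->target = substr($path, strlen('resumable://'));
        $this->inner = (self::$opener)($this->target, 0);
        return is_resource($this->inner);
    }

    public function stream_read(int $count)
    {
        $data = is_resource($this->inner) ? fread($this->inner, $count) : '';
        if (($data === false || $data === '') && $this->offset < self::$expectedSize) {
            // Ended early: reopen at the current offset and try once more.
            $this->closeInner();
            $this->inner = (self::$opener)($this->target, $this->offset);
            $data = is_resource($this->inner) ? fread($this->inner, $count) : '';
        }
        $data = $data === false ? '' : $data;
        $this->offset += strlen($data);
        return $data;
    }

    public function stream_eof(): bool
    {
        return $this->offset >= self::$expectedSize;
    }

    public function stream_close(): void
    {
        $this->closeInner();
    }

    private function closeInner(): void
    {
        if (is_resource($this->inner)) {
            fclose($this->inner);
        }
    }
}

stream_wrapper_register('resumable', ResumableStreamWrapper::class);
```

Because the wrapper hands out one continuous byte stream, anything layered on top of it sees no gap at the reconnect. A real implementation would also need retry limits and back-off, since this sketch reconnects only once per read.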
(In reply to Jeroen De Dauw's message of 15 August 2018 at 23:23.)