On Wed, Nov 25, 2020 at 1:22 PM Daniel Garijo dgarijo@isi.edu wrote:
Hello,
I am writing this message because I am analyzing the Wikidata JSON dumps available in the Internet Archive and I have found there are no dumps available after Feb 8th, 2019 (see https://archive.org/details/wikimediadownloads?and%5B%5D=%22Wikidata%20entit...). I know the latest dumps are available at https://dumps.wikimedia.org/wikidatawiki/entities/, but unfortunately they only cover the last few months.
Which dump files exactly are you looking for? Dumps like
https://dumps.wikimedia.org/wikidatawiki/entities/20201116/wikidata-20201116...
which can also be found on https://dumps.wikimedia.org/other/wikidata/ as 20201116.json.gz?
[...] Does anyone on this list know where some of these missing Wikidata dumps may be found? If anyone has pointers to a server where they can be downloaded, I would highly appreciate it.
If you are looking for these dumps, I have about 8 TB stored on external disks. Transferring these over the network might be difficult, however. Please contact me off-list if you need any of these dumps; maybe we can arrange something.
I'm curious, what are you trying to do with all of these files? Processing all of them must take months. My processor usually picks up the dump on Wednesday and takes 80 hours to comb through it. But my processor is written in Perl; something in C or Rust might be a lot faster...
regards, Gerhard Gonter
Gerhard,
I'm curious what you mean by "processing" and "comb through". Can you describe how your processing works and what system or database the output gets loaded into? Perhaps you have your scripts publicly available on something like GitHub?
It would also be nice to know a bit more about what you are doing. Thanks in advance!
Thad https://www.linkedin.com/in/thadguidry/
On Wed, Nov 25, 2020 at 4:41 PM Thad Guidry thadguidry@gmail.com wrote:
Gerhard,
I'm curious what you mean by "processing" and "comb through". Can you describe how your processing works and what system or database the output gets loaded into?
I'm doing embarrassingly little with the data so far, and there is no regular database involved. The processor mainly looks for properties I defined beforehand and transcribes the relevant information into TSV files; that's the "comb through" part of my mail. The processor reads each line of the decompressed dump stream (each line represents one Wikidata item), looks for those properties, and also writes each item, individually compressed, into output files which are later indexed so that items can be accessed directly if I want to look at one of them later. That's about all.
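To illustrate the idea, a minimal sketch of such a line-by-line pass could look like the following in Python (this is not the Perl code from the repository linked below; the file name and property IDs are just placeholder examples, and the per-item compressed store and its index are left out):

    #!/usr/bin/env python3
    # Sketch: stream a Wikidata JSON dump and transcribe selected properties into a TSV file.
    import gzip
    import json
    import csv

    DUMP = "wikidata-20201116-all.json.gz"   # assumed: a local copy of one dump
    PROPS = ("P227", "P214")                 # example properties to look for

    with gzip.open(DUMP, "rt", encoding="utf-8") as dump, \
         open("properties.tsv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(("item",) + PROPS)
        for line in dump:
            line = line.rstrip().rstrip(",")
            if line in ("[", "]", ""):       # the dump is one large JSON array, one item per line
                continue
            item = json.loads(line)
            claims = item.get("claims", {})
            row = [item["id"]]
            for prop in PROPS:
                values = []
                for statement in claims.get(prop, []):
                    snak = statement.get("mainsnak", {})
                    if snak.get("snaktype") == "value":
                        values.append(str(snak["datavalue"]["value"]))
                row.append("|".join(values))
            if any(row[1:]):                 # keep only items that have at least one of the properties
                writer.writerow(row)

Decompression and JSON parsing dominate the runtime, which is why a single pass over the full dump takes so long regardless of what is extracted.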
Perhaps you have your scripts publicly available on something like GitHub?
Yes, it is available from GitHub: https://github.com/gonter/wikidata-dump-processor
It would also be nice to know a bit more about what you are doing. Thanks in advance!
Mainly I'm looking for items with GND identifiers and related identifiers such as VIAF, ORCID, etc. However, this data is currently not used anywhere, but maybe I'll put it to use later.
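For reference, the Wikidata property IDs for those identifiers would be something like the following (shown here only for illustration):

    # Wikidata property IDs for the identifiers mentioned above
    IDENTIFIER_PROPS = {
        "P227": "GND ID",
        "P214": "VIAF ID",
        "P496": "ORCID iD",
    }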
regards, Gerhard Gonter