Hello,
I am writing this message because I am analyzing the Wikidata JSON dumps available in the Internet Archive and I have found there are no dumps available after Feb 8th, 2019 (see https://archive.org/details/wikimediadownloads?and%5B%5D=%22Wikidata%20entity%20dumps%22). I know the latest dumps are available at https://dumps.wikimedia.org/wikidatawiki/entities/, but unfortunately they only cover the last few months.
I also noticed some gaps in the years for which JSON dumps are available. For example, there are no JSON dumps between the end of February 2017 and August 21st, 2017, or between August 21st, 2017 and November 16th, 2017.
Another strange finding is that while there are some entries for the dumps in the Internet Archive between March 19th, 2018 and Nov 26th, 2018 (e.g., https://archive.org/details/wikibase-wikidatawiki-20181104), none of them contain a JSON dump. That's another gap of more than 8 months.
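Gaps like the ones described above can be located mechanically once the list of archived dump dates is at hand. The following is only a minimal sketch: the function name `find_gaps`, the 14-day threshold, and the sample dates are assumptions made for illustration and are not an actual listing from the Internet Archive.

```python
from datetime import date, timedelta


def find_gaps(dump_dates, max_interval=timedelta(days=14)):
    """Return (earlier, later, days) for consecutive dumps that lie
    further apart than max_interval."""
    ordered = sorted(set(dump_dates))
    return [
        (earlier, later, (later - earlier).days)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_interval
    ]


# Hypothetical dump dates, for illustration only.
archived = [
    date(2017, 8, 7),
    date(2017, 8, 14),
    date(2017, 8, 21),
    date(2017, 11, 16),
    date(2017, 11, 20),
]

for start, end, days in find_gaps(archived):
    print(f"no dump between {start} and {end} ({days} days)")
```

The threshold is a judgment call: dumps that are normally weekly would suggest a value slightly above seven days, while a larger value only reports the longer outages.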
Does anyone on this list know where some of these missing Wikidata dumps may be found? If anyone has pointers to a server where they can be downloaded, I would highly appreciate it.
Thanks in advance, Daniel
Hi Daniel,
I am the one managing the archival process and indeed, it was around end-2018 when the archival process just died (you can see the status here: https://dumps.wmflabs.org/status.php).
The current status is that the software behind the archival process is being reworked and will come with features that I will announce once it is ready. Archival of the Wikidata JSON dumps will resume next week, so unfortunately all dumps from end-2018 until around October 2020 will be lost (unless someone has a copy somewhere). As for the dumps in 2017, there were other issues that caused the archival process to stall as well (you can see the list of available and archived dumps here: https://dumps.wmflabs.org/wikidata.txt).
I sincerely apologize for the lost information. The new version that I'm currently working on will be much better and more robust in handling failures.
Warmest regards, Hydriz
On Wed, 25 Nov 2020 at 20:22, Daniel Garijo dgarijo@isi.edu wrote:
[...]
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Hi Hydriz,
Thanks for your answer. This is quite unfortunate news. I look forward to the updated service.
Best,
Daniel
On 11/25/2020 5:29 AM, Hydriz Scholz wrote:
[...]
On Wed, Nov 25, 2020 at 1:22 PM Daniel Garijo dgarijo@isi.edu wrote:
Hello,
I am writing this message because I am analyzing the Wikidata JSON dumps available in the Internet Archive and I have found there are no dumps available after Feb 8th, 2019 (see https://archive.org/details/wikimediadownloads?and%5B%5D=%22Wikidata%20entit...). I know the latest dumps are available at https://dumps.wikimedia.org/wikidatawiki/entities/, but unfortunately they only cover the last few months.
Which dump files exactly are you looking for? Dumps like
https://dumps.wikimedia.org/wikidatawiki/entities/20201116/wikidata-20201116...
which can also be found on https://dumps.wikimedia.org/other/wikidata/ as 20201116.json.gz ?
[...] Does anyone on this list know where some of these missing Wikidata dumps may be found? If anyone has pointers to a server where they can be downloaded, I would highly appreciate it.
If you are looking for these dumps, I have about 8 TB stored on external disks. Transferring these over the network might be difficult, however. Please contact me off-list if you need any of these dumps; maybe we can arrange something.
I'm curious, what are you trying to do with all of these files? Processing all of them must take months. My processor usually picks up the dump on Wednesday and takes 80 hours to comb through it. But my processor is written in Perl, something in C or Rust might be a lot faster...
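For readers wondering what such a pass over a dump can look like, here is a minimal streaming sketch in Python. It is not the Perl processor mentioned above. It assumes the dump is a gzip-compressed JSON array with one entity per line, each line ending in a comma, and the names `iter_entities` and `count_by_type` as well as the file name in the usage comment are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_entities(path: str) -> Iterator[dict]:
    """Yield one entity per line without loading the whole dump into memory.

    Assumes a JSON array with one entity per line: the "[" and "]" lines
    are skipped and the trailing comma of each entity line is removed.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)


def count_by_type(path: str) -> Dict[str, int]:
    """Count entities per value of their "type" field."""
    counts: Dict[str, int] = {}
    for entity in iter_entities(path):
        kind = entity.get("type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Usage (hypothetical local file name):
# print(count_by_type("20201116.json.gz"))
```

Reading line by line keeps memory use flat regardless of dump size; the running time is then dominated by decompression and JSON parsing.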
regards, Gerhard Gonter
Gerhard,
I'm curious what you mean by "processing" and "comb through". Can you describe how your processing works and what system or database the output gets loaded into? Perhaps you have your scripts publicly available on something like GitHub?
It would also be nice to know a bit more about what you are doing. Thanks in advance!
Thad https://www.linkedin.com/in/thadguidry/
On Wed, Nov 25, 2020 at 9:14 AM Gerhard Gonter ggonter@gmail.com wrote:
[...]
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Thanks Gerhard, I will be touching base off-list.
Those JSON dumps are precisely what I am looking for. We have been developing a toolkit that can process one of them in 12 hours (at least in the tests I have done with 2020 dumps). I will be happy to share more information with you (or anyone who is interested).
Best,
Daniel
On 11/25/2020 7:13 AM, Gerhard Gonter wrote:
[...]