Hello!
I am getting some unexpected messages, so I tried the following:
curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-ar... | bzip2 -d | tail
and got this:
bzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted? *Possible* reason follows.
bzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted. You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
      <parentid>1227967782</parentid>
      <timestamp>2023-12-07T00:22:05Z</timestamp>
      <contributor>
        <username>Renamerr</username>
        <id>2883061</id>
      </contributor>
      <comment>/* wbsetdescription-add:1|uk */ бактеріальний білок, наявний у Listeria monocytogenes EGD-e, [[:toollabs:quickstatements/#/batch/218434|batch #218434]]</comment>
      <model>wikibase-item</model>
      <format>application/json</format>
The first part is the error message, which I also see when running my PHP script from within the Toolforge cloud (PHP 7.4, because with the installed PHP 8.2 the XMLReader class simply core dumps, see T352886). The second part is the output of the "tail" command.
Just as a cross-check: I have no such problem with curl -s https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-meta-current.x... | bzip2 -d | tail
No error and the last line is "</mediawiki>"
Cheers, Wolfgang
Hello Wolfgang,
I am trying to repro your issue. The file is ~140 GB, so doing a `bzcat` takes a long while. I will get back to you with the result.
For now, here is the SHA-1 hash of that file, so that you can compare it against your local copy and see whether it was corrupted in flight:
$ sha1sum wikidatawiki-20240101-pages-articles-multistream.xml.bz2
1be753ba90e0390c8b65f9b80b08015922da12f1
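If you keep a local copy, the check could look roughly like this (a sketch only; I am assuming the per-run checksum list wikidatawiki-20240101-sha1sums.txt exists and that your copy uses the dated file name that sha1sum -c expects):
# sketch: fetch the per-run checksum list (file name assumed) and verify the local copy
curl -sO https://dumps.wikimedia.org/wikidatawiki/20240101/wikidatawiki-20240101-sha1sums.txt
grep 'pages-articles-multistream\.xml\.bz2$' wikidatawiki-20240101-sha1sums.txt | sha1sum -c -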
The file I received was fine and the sha1sum matches that of wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in the posting of Xabriel Collazo Mojica:
--- 8< ---
$ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
1be753ba90e0390c8b65f9b80b08015922da12f1  wikidatawiki-latest-pages-articles-multistream.xml.bz2
--- >8 ---
bunzip2 did not report any problem; however, my first attempt to decompress ended with a full disk after more than 2.3 TB of XML.
The second attempt
--- 8< ---
$ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
  wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
--- >8 ---
resulted in a nice XML fragment which ends with
--- 8< ---
  <page>
    <title>Q124069752</title>
    <ns>0</ns>
    <id>118244259</id>
    <revision>
      <id>2042727399</id>
      <parentid>2042727216</parentid>
      <timestamp>2024-01-01T20:37:28Z</timestamp>
      <contributor>
        <username>Kalepom</username>
        <id>1900170</id>
      </contributor>
      <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
      <model>wikibase-item</model>
      <format>application/json</format>
      <text bytes="2535" xml:space="preserve">...</text>
      <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
    </revision>
  </page>
</mediawiki>
--- >8 ---
So I assume your curl did not return the full 142 GB of wikidatawiki-latest-pages-articles-multistream.xml.bz2.
P.S.: I'll start a new bunzip2 to a larger scratch disk just to find out how big this XML file really is.
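If the scratch disk turns out to be too small again, simply counting the decompressed bytes should also answer that question, without writing anything to disk (untested here, but straightforward):
--- 8< ---
# count the decompressed size without needing any scratch space
bunzip2 -c wikidatawiki-latest-pages-articles-multistream.xml.bz2 | wc -c
--- >8 ---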
regards, Gerhard
Gerhard: Thanks for the extra checks!
Wolfgang: I can confirm Gerhard's findings. The file appears correct, and ends with the right footer.
I would hazard a guess that your bz2 unzip app does not handle multistream files in an appropriate way, Wurgl. The multistream files consist of several bzip2-compressed files concatenated together; see https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps for details. Try downloading the entire file via curl, and then look into the question of the bzip app issues separately. Maybe it will turn out that you are encountering some other problem. But first, see if you can download the entire file and get its hash to check out.
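For what it's worth, the multistream file is normally used together with its companion index file (…-multistream-index.txt.bz2, lines of the form offset:pageid:title); a rough, untested sketch for pulling out and inspecting a single stream would be:
# rough, untested sketch; the two offsets are made-up placeholders that would
# come from two consecutive entries of the ...-multistream-index.txt.bz2 file
OFFSET=600323
NEXT_OFFSET=1190078
tail -c +$((OFFSET + 1)) wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
  | head -c $((NEXT_OFFSET - OFFSET)) | bzip2 -dc | head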
Ariel
Hello Ariel!
It is not "my bzip2", it is bzip2 on tools-sgebastion-11 in the toolserver-cloud … well, actually one of the servers which are used, when I start a script within the kubernetes environment there (with php 7.4) When you have an account there, you can look at: /data/project/persondata/dumps/wikidata_sitelinks.sh
The relevant line is this one: curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-ar... | bzip2 -d | php ~/dumps/wikidata_sitelinks.php
Yes, I double-checked it on my machine at home and the same type of error happened.
Wolfgang
On Wed, Jan 10, 2024 at 6:19 PM Wurgl heisewurgl@gmail.com wrote:
The relevant line is this one: curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-ar... | bzip2 -d | php ~/dumps/wikidata_sitelinks.php
Yes, I double-checked it on my machine at home and the same type of error happened.
Well, we now know that the xml.bz2 file itself is OK. The usual way to debug this would be to perform each step of the above pipe in isolation, which I more or less did. The xml.bz2 file arrived OK, but I used wget for that, and that job alone ran for about 12 hours to retrieve the ~150 GB file. Also, bunzip2 worked for me, as mentioned in an earlier posting, and I found the expected closing tag "</mediawiki>" in the last line. So at least my bunzip2 (version 1.0.6, 6-Sept-2010) seems to be OK, or at least OK with that file.
As I already mentioned, from the messages in your original mail I can only venture a guess, which is that your curl -s simply did not retrieve the full file. Try omitting the -s for a test.
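If you want to take the pipe apart completely, an untested sketch could look like this (the URL is assumed to be the full path to the multistream file, which is cut off in our mails):
--- 8< ---
# untested sketch: run each stage of the pipe on its own
# (URL assumed; it is truncated in the mails above)
URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
curl -o wikidatawiki-latest-pages-articles-multistream.xml.bz2 "$URL"    # no -s, so errors and progress stay visible
sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2           # compare with the hash posted earlier
bzip2 -dc wikidatawiki-latest-pages-articles-multistream.xml.bz2 | php ~/dumps/wikidata_sitelinks.php
--- >8 ---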
regards, Gerhard
Btw, just out of curiosity, is wikidata_sitelinks.php available somewhere? What is it supposed to do?
regards, Gerhard
Hello Gerhard!
It is just used to build a database for checking de-wikipedia commons/commonscat links … A long time ago someone asked for it: https://de.wikipedia.org/wiki/Benutzer:Wurgl/Probleme_Commons
Wolfgang
Thanks for the link to your Wikipedia page, but can I also find the PHP program itself somewhere? I now know that it focuses on two properties, namely P935 (Commons gallery) and P373 (Commons category), but what it does with them is not described.
regards, Gerhard
Okay,
yesterday evening I did the following:
I started this script:
##
#!/bin/bash
curl https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-ar... | bzip2 -d | tail -200
##
With this command:
tools.persondata@tools-sgebastion-11:~$ toolforge jobs run --command /data/project/persondata/spielwiese/curltest.sh --image php7.4 -o /data/project/persondata/logs/curltest.out -e /data/project/persondata/logs/curltest.err startcurltest
The errorfile curltest.err looks like:
##
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | head -2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | tail -20
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:21 42:51:38  755k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:22 42:51:37  787k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:23 42:51:36  770k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:24 42:51:35  764k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:25 42:51:35  727k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  708k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  698k
curl: (18) transfer closed with 118232009816 bytes remaining to read
bzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted? *Possible* reason follows.
bzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted. You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files. ##
The stdout-file curltest.out looks like:
##
tools.persondata@tools-sgebastion-11:~$ tail -3 /data/project/persondata/logs/curltest.out
      <sha1>s3raizvae6sd42yw49j2gy63ecyqclk</sha1>
    </revision>
  </page>
##
Something does not like me very much :-( Maybe some timeout? Maybe some transfer limitation? Maybe something else.
Wolfgang
Gerhard said that for him the download job ran for about 12 hours. It seems the connection was closed. I wouldn't be surprised if this is running into a similar problem to https://phabricator.wikimedia.org/T351876
With such a long download time, it isn't that strange that there could be connection errors (still something to look into, though; Toolforge-to-prod shouldn't be suffering from that).
wget (used by Gerhard) retries automatically; perhaps curl doesn't, and is thus more susceptible to these errors.
Try changing your job to wget -O - https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-ar... | bzip2 -d | tail
Hello!
wget was the tool I was using in the jsub environment, but wget is not available any more in Kubernetes (with toolforge jobs start …) :-(
$ webservice php7.4 shell
tools.persondata@shell-1705135256:~$ wget
bash: wget: command not found
Wolfgang
I would probably open a task to have wget available in the Kubernetes cluster, and another, low-priority one, for investigating why the connection gets dropped between Toolforge and dumps.w.o
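In the meantime, a curl-only workaround could be a resume loop along these lines (an untested sketch; the URL is assumed to be the full path to the multistream file, which is truncated above):
# untested sketch: keep resuming the download until curl finishes cleanly,
# then decompress from the local copy (URL assumed)
URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
OUT=wikidatawiki-latest-pages-articles-multistream.xml.bz2
until curl --fail -C - -o "$OUT" "$URL"; do
    echo "transfer interrupted, resuming ..." >&2
    sleep 60
done
bzip2 -dc "$OUT" | tail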
On Thu, Jan 11, 2024 at 8:26 AM Wurgl heisewurgl@gmail.com wrote:
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  698k
curl: (18) transfer closed with 118232009816 bytes remaining to read
There you have it: curl only got 22% (31.6 GB of 141 GB); 118 GB are missing.
Something does not like me very much :-( Maybe some timeout? Maybe some transfer limitation? Maybe something else.
On my end, the job to uncompress the xml.bz2 file finished without a problem a few days ago. I could try to run the PHP script, but I do not have access to that or to the environment you mentioned.
regards, Gerhard Gonter