Okay, yesterday evening I did the following:

I started this script
##
#!/bin/bash
curl https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail -200
##

With this command:
tools.persondata@tools-sgebastion-11:~$ toolforge jobs run --command /data/project/persondata/spielwiese/curltest.sh  --image php7.4 -o /data/project/persondata/logs/curltest.out -e /data/project/persondata/logs/curltest.err startcurltest

The error file curltest.err looks like this:
##
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | head -2
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | tail -20
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:21 42:51:38  755k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:22 42:51:37  787k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:23 42:51:36  770k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:24 42:51:35  764k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:25 42:51:35  727k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  708k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  698k
curl: (18) transfer closed with 118232009816 bytes remaining to read

bzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
##

The stdout file curltest.out looks like this:
##
tools.persondata@tools-sgebastion-11:~$ tail -3 /data/project/persondata/logs/curltest.out
      <sha1>s3raizvae6sd42yw49j2gy63ecyqclk</sha1>
    </revision>
  </page>
##

Something does not like me very much :-( Maybe a timeout? Maybe some transfer limitation? Maybe something else entirely.
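If the connection is simply being cut off, one possible workaround would be to download to a file with curl's retry/resume options and only decompress once the whole file is on disk. A minimal sketch, untested; the output filename, retry counts and sleep interval are my assumptions:
##
#!/bin/bash
# Download to a file, resuming after interruptions, instead of piping the
# live transfer straight into bzip2.
URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
OUT=/data/project/persondata/spielwiese/dump.xml.bz2   # assumed filename

# -C - resumes a partial download; --retry handles transient failures.
until curl -C - --retry 5 --retry-delay 60 -o "$OUT" "$URL"; do
    echo "transfer interrupted, retrying ..." >&2
    sleep 60
done

# Decompress only after the complete file is on disk.
bzip2 -dc "$OUT" | tail -200
##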

Wolfgang


On Wed, Jan 10, 2024 at 4:29 PM Ariel Glenn WMF <ariel@wikimedia.org> wrote:
I would hazard a guess that your bzip2 app does not handle multistream files in an appropriate way, Wurgl. The multistream files consist of several bzip2-compressed files concatenated together; see https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps for details. Try downloading the entire file via curl first, and then look into any bzip2 issues separately. Maybe it will turn out that you are encountering some other problem. But first, see whether you can download the entire file and get its hash to check out.
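A minimal sketch of that check (the expected hash is the one Gerhard reports further down in this thread for the 20240101 run; the local filename is made up):
##
#!/bin/bash
# Download the whole dump, then verify its sha1 before decompressing anything.
URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
EXPECTED=1be753ba90e0390c8b65f9b80b08015922da12f1   # value from Gerhard's mail below

curl -o dump.xml.bz2 "$URL"
ACTUAL=$(sha1sum dump.xml.bz2 | awk '{print $1}')
if [ "$ACTUAL" = "$EXPECTED" ]; then
    echo "sha1 OK"
else
    echo "sha1 mismatch: got $ACTUAL, expected $EXPECTED" >&2
    exit 1
fi
##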

Ariel

On Wed, Jan 10, 2024 at 5:15 PM Xabriel Collazo Mojica <xcollazo@wikimedia.org> wrote:
Gerhard: Thanks for the extra checks!

Wolfgang: I can confirm Gerhard's findings. The file appears correct, and ends with the right footer.

On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggonter@gmail.com> wrote:
On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewurgl@gmail.com> wrote:
>
> Hello!
>
> I am getting some unexpected messages, so I tried the following:
>
> curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
>
> and got this:
>
> bzip2: Compressed file ends unexpectedly;
>         perhaps it is corrupted?  *Possible* reason follows.
> bzip2: Inappropriate ioctl for device
>         Input file = (stdin), output file = (stdout)
>
> It is possible that the compressed file(s) have become corrupted.

The file I received was fine and its sha1sum matches that of
wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
the posting of Xabriel Collazo Mojica:

--- 8< ---
$ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
1be753ba90e0390c8b65f9b80b08015922da12f1
wikidatawiki-latest-pages-articles-multistream.xml.bz2
--- >8 ---

bunzip2 did not report any problem; however, my first attempt to
decompress ended with a full disk after more than 2.3 TB of XML.

The second attempt
--- 8< ---
$ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
    | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
  wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
--- >8 ---

resulted in a nice XML fragment which ends with:
--- 8< ---
  <page>
    <title>Q124069752</title>
    <ns>0</ns>
    <id>118244259</id>
    <revision>
      <id>2042727399</id>
      <parentid>2042727216</parentid>
      <timestamp>2024-01-01T20:37:28Z</timestamp>
      <contributor>
        <username>Kalepom</username>
        <id>1900170</id>
      </contributor>
      <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]:
[[Q16506931]]</comment>
      <model>wikibase-item</model>
      <format>application/json</format>
      <text bytes="2535" xml:space="preserve">...</text>
      <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
    </revision>
  </page>
</mediawiki>
--- >8 ---

So I assume your curl did not return the full 142 GB of
wikidatawiki-latest-pages-articles-multistream.xml.bz2.
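A quick way to see how much actually arrived would be to compare the size the server advertises with the size on disk, assuming the server answers a HEAD request with a Content-Length header:
--- 8< ---
$ curl -sI https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | grep -i '^content-length'
$ stat -c %s wikidatawiki-latest-pages-articles-multistream.xml.bz2
--- >8 ---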

P.S.: I'll start a new bunzip2 run to a larger scratch disk just to find
out how big this XML file really is.
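A disk-free alternative, if only the byte count is of interest, would be to pipe the output straight into wc (a sketch, not tried on a file of this size):
--- 8< ---
$ bunzip2 -c wikidatawiki-latest-pages-articles-multistream.xml.bz2 | wc -c
--- >8 ---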

regards, Gerhard


--
Xabriel J. Collazo Mojica (he/him)
Sr Software Engineer
Wikimedia Foundation
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-leave@lists.wikimedia.org