Gerhard: Thanks for the extra checks!

Wolfgang: I can confirm Gerhard's findings. The file appears correct and ends with the right footer.

On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggonter@gmail.com> wrote:
On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewurgl@gmail.com> wrote:
>
> Hello!
>
> I am having some unexpected messages, so I tried the following:
>
> curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
>
> and got this:
>
> bzip2: Compressed file ends unexpectedly;
>         perhaps it is corrupted?  *Possible* reason follows.
> bzip2: Inappropriate ioctl for device
>         Input file = (stdin), output file = (stdout)
>
> It is possible that the compressed file(s) have become corrupted.

The file I received was fine, and its sha1sum matches that of
wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
the posting of Xabriel Collazo Mojica:

--- 8< ---
$ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
1be753ba90e0390c8b65f9b80b08015922da12f1
wikidatawiki-latest-pages-articles-multistream.xml.bz2
--- >8 ---

bunzip2 did not report any problem; however, my first attempt to
decompress ended with a full disk after more than 2.3 TB of XML.

The second attempt
--- 8< ---
$ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
    | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
  wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
--- >8 ---

resulted in a nice XML fragment, which ends with
--- 8< ---
  <page>
    <title>Q124069752</title>
    <ns>0</ns>
    <id>118244259</id>
    <revision>
      <id>2042727399</id>
      <parentid>2042727216</parentid>
      <timestamp>2024-01-01T20:37:28Z</timestamp>
      <contributor>
        <username>Kalepom</username>
        <id>1900170</id>
      </contributor>
      <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
      <model>wikibase-item</model>
      <format>application/json</format>
      <text bytes="2535" xml:space="preserve">...</text>
      <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
    </revision>
  </page>
</mediawiki>
--- >8 ---
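
The footer check above can also be done without keeping any of the
decompressed output. This is only a sketch: ends_with_footer is a
hypothetical helper name, and it assumes the closing root tag sits alone
on the last line, as it does in the fragment shown.

```shell
#!/bin/sh
# Sketch: check that a .bz2 dump decompresses to a stream whose last line
# is the closing root tag. ends_with_footer is a hypothetical helper name.

ends_with_footer() {
    # $1 = path to the .bz2 file.
    # bzip2 -dc writes the decompressed data to stdout; tail keeps only
    # the final line, so nothing is stored on disk.
    last=$(bzip2 -dc "$1" | tail -n 1)
    [ "$last" = "</mediawiki>" ]
}

# Usage (not run here):
#   ends_with_footer wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
#       && echo "footer present"
```

Note that this still reads and decompresses the whole file, so for a dump
of this size it takes a long time.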

So I assume your curl did not return the full 142 GB of
wikidatawiki-latest-pages-articles-multistream.xml.bz2.
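
A truncated transfer is easier to rule out if the dump is saved to disk
first and compared with the published checksum before it is decompressed.
The following is only a sketch under a few assumptions: fetch_dump and
verify_sha1 are hypothetical helper names, and the SHA-1 in the usage
comment is the one quoted above for the 20240101 dump, so it will stop
matching once "latest" points at a newer run.

```shell
#!/bin/sh
# Sketch: download with resume support, then verify the SHA-1 digest.
# fetch_dump and verify_sha1 are hypothetical helper names.

fetch_dump() {
    # $1 = URL of the dump.
    # -f: fail on HTTP errors, -L: follow redirects,
    # -C -: resume a partial download, -O: keep the remote file name.
    curl -f -L -C - -O "$1"
}

verify_sha1() {
    # $1 = local file, $2 = expected SHA-1 hex digest.
    # Returns 0 only if the digests match.
    actual=$(sha1sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Usage (not run here):
#   fetch_dump https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
#   verify_sha1 wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
#       1be753ba90e0390c8b65f9b80b08015922da12f1 && echo "checksum OK"
```

Unlike piping curl straight into bzip2, an interrupted download can be
resumed by running fetch_dump again instead of starting over.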

P.S.: I'll start a new bunzip2 run to a larger scratch disk, just to find
out how big this XML file really is.
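
If only the size matters, the decompressed bytes can be counted without
storing them, which avoids the full-disk problem. A minimal sketch,
assuming bzip2 and wc are available; decompressed_size is a hypothetical
helper name.

```shell
#!/bin/sh
# Sketch: report the decompressed size of a .bz2 file in bytes without
# writing the decompressed data to disk.
# decompressed_size is a hypothetical helper name.

decompressed_size() {
    # $1 = path to the .bz2 file.
    # bzip2 -dc streams the decompressed data to stdout; wc -c counts it.
    bzip2 -dc "$1" | wc -c
}

# Usage (not run here):
#   decompressed_size wikidatawiki-latest-pages-articles-multistream.xml.bz2
```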

regards, Gerhard
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-leave@lists.wikimedia.org


--
Xabriel J. Collazo Mojica (he/him)
Sr Software Engineer
Wikimedia Foundation