On Fri, Jan 5, 2024 at 5:03 PM Wurgl heisewurgl@gmail.com wrote:
Hello!
I am getting some unexpected messages, so I tried the following:
curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-ar... | bzip2 -d | tail
and got this:
bzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
The file I received was fine, and its sha1sum matches that of wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in the posting of Xabriel Collazo Mojica:
--- 8< ---
$ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
1be753ba90e0390c8b65f9b80b08015922da12f1  wikidatawiki-latest-pages-articles-multistream.xml.bz2
--- >8 ---
bunzip2 did not report any problem; however, my first attempt to decompress ended with a full disk after more than 2.3 TB of XML.
The second attempt
--- 8< ---
$ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
  wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
--- >8 ---
resulted in a nice XML fragment which ends with
--- 8< ---
  <page>
    <title>Q124069752</title>
    <ns>0</ns>
    <id>118244259</id>
    <revision>
      <id>2042727399</id>
      <parentid>2042727216</parentid>
      <timestamp>2024-01-01T20:37:28Z</timestamp>
      <contributor>
        <username>Kalepom</username>
        <id>1900170</id>
      </contributor>
      <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
      <model>wikibase-item</model>
      <format>application/json</format>
      <text bytes="2535" xml:space="preserve">...</text>
      <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
    </revision>
  </page>
</mediawiki>
--- >8 ---
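The same check can be reduced to a yes/no answer. What follows is only a minimal sketch: dump_is_complete is a hypothetical helper name, and it assumes that a complete dump ends with a line containing the closing </mediawiki> tag, as in the fragment above. It still has to decompress the whole file, so on the 142 GB archive it will take a long time.

```shell
# Hypothetical helper: succeeds (exit status 0) if the last line of the
# decompressed stream contains the closing root tag of the XML dump.
# A truncated archive never reaches that tag, so grep fails instead.
dump_is_complete() {
    bunzip2 -c "$1" 2>/dev/null | tail -n 1 | grep -q '</mediawiki>'
}

# Example call (commented out; the file name is the one used above):
# dump_is_complete wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
#     && echo "complete" || echo "truncated or corrupted"
```

Nothing is written to disk here, so the 2.3 TB problem from the first attempt does not come back.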
So I assume your curl did not return the full 142 GB of wikidatawiki-latest-pages-articles-multistream.xml.bz2.
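One way to rule this out is to download to a file first and compare its checksum before piping anything into bzip2. The following is a rough sketch, not a tested recipe for the dump server: verify_download is a hypothetical helper, and DUMP_URL in the comments stands for the full download URL, which I do not repeat here.

```shell
# Hypothetical helper: compare a file's SHA-1 with an expected value.
# Prints OK and returns 0 on a match, prints MISMATCH and returns 1 otherwise.
verify_download() {
    file=$1
    expected_sha1=$2
    actual_sha1=$(sha1sum "$file" | cut -d' ' -f1)
    if [ "$actual_sha1" = "$expected_sha1" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Possible usage (commented out; DUMP_URL is a placeholder you must set):
#   -f  fail on HTTP errors, -L  follow redirects,
#   -C - resume a partial download instead of starting over
# curl -fL -C - -o dump.xml.bz2 "$DUMP_URL"
# verify_download dump.xml.bz2 1be753ba90e0390c8b65f9b80b08015922da12f1
```

The checksum in the usage comment is the one I got for the 20240101 file; a later "latest" dump will have a different one.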
P.S.: I'll start a new bunzip2 to a larger scratch disk just to find out how big this XML file really is.
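If only the size is of interest, a scratch disk is not strictly needed: the decompressed stream can be counted instead of stored. A minimal sketch, where decompressed_size is a hypothetical helper name:

```shell
# Hypothetical helper: print the number of bytes in the decompressed
# stream without writing it anywhere. tr strips the padding that some
# wc implementations put in front of the number.
decompressed_size() {
    bunzip2 -c "$1" | wc -c | tr -d ' '
}

# Example call (commented out):
# decompressed_size wikidatawiki-latest-pages-articles-multistream.xml.bz2
```

The run time is the same as a full bunzip2, only the disk usage is avoided.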
regards, Gerhard