On Wed, Jan 10, 2024 at 6:19 PM Wurgl heisewurgl@gmail.com wrote:
The relevant line is this one: curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-ar... | bzip2 -d | php ~/dumps/wikidata_sitelinks.php
Yes, I double-checked it on my machine at home and the same type of error happened.
Well, we now know that the xml.bz2 file itself is ok. The usual way to debug this would be to run each step of the above pipe in isolation, which I more or less did. The xml.bz2 file arrived intact, but I used wget for that, and that job alone ran for about 12 hours to retrieve the ~150 GB file. bunzip2 also worked for me, as mentioned in an earlier posting, and I found the expected closing tag "</mediawiki>" in the last line. So my bunzip2 (Version 1.0.6, 6-Sept-2010) seems to be ok, or at least ok with that file.
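The two checks on the downloaded file described above could be sketched roughly as follows. The function names check_bz2 and check_closing_tag are made up for illustration, and the sketch assumes the dump has already been saved to a local file:

```shell
#!/bin/sh
# Sketch only: check_bz2 and check_closing_tag are hypothetical helper names.
# Both expect the path of an already downloaded .bz2 file as $1.

# Test the integrity of the compressed file without writing any output.
check_bz2() {
    bzip2 -t "$1"
}

# Decompress to stdout and check that the last line holds the closing tag.
# This reads the whole stream, so it is slow on a very large dump.
check_closing_tag() {
    bzip2 -dc "$1" | tail -n 1 | grep -q '</mediawiki>'
}
```

Both functions return 0 on success, so they can be chained, for example `check_bz2 dump.xml.bz2 && check_closing_tag dump.xml.bz2`, where dump.xml.bz2 stands for whatever local name the file was saved under.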
As I already mentioned, I can only venture a guess from the messages in your original mail: your curl -s simply did not retrieve the full file. Try omitting the -s for a test.
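One possible way to make a failed or partial download visible, as a sketch rather than a fix for the original script: keep curl quiet with -s but add -S so that errors are still printed, add -f so that HTTP errors count as failures, and look at curl's exit code. The helper name fetch_checked is made up for illustration, and the URL is passed in as an argument because the full dump URL is elided above.

```shell
#!/bin/sh
# Sketch only: fetch_checked is a hypothetical helper name.
# $1 = URL to download, $2 = local output file.
fetch_checked() {
    # -s: no progress meter, -S: still print errors,
    # -f: treat HTTP error responses (4xx/5xx) as failures.
    curl -sS -f -o "$2" "$1"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # Exit code 18 means only part of the file was transferred.
        echo "curl failed with exit code $rc" >&2
    fi
    return "$rc"
}
```

Downloading to a file first, instead of piping straight into bzip2, keeps curl's exit code available; in a plain sh pipeline only the status of the last command is reported.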
regards, Gerhard