Actually, it seems quite a few cryptocurrency-related articles are missing: Litecoin, Ethereum, Namecoin, Dogecoin, CryptoNote, etc. Most of these articles transclude Template:Cryptocurrencies, but I'm not sure why they wouldn't appear in the XML dump.
I checked my local copy of enwiki-20160407-pages-articles.xml.bz2 and all those articles are present. For reference, my copy was downloaded on 4-27, has a size of 12,878,552,649 bytes and an md5 of cff68321a17392fbb3322b34b61b0402. Also, the following grep command found the page:
grep "<title>Litecoin</title>" enwiki-latest-pages-articles.xml
<title>Litecoin</title>
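If decompressing the whole dump just to grep it is inconvenient, the same check can be sketched in Python by streaming the .bz2 file directly. This is only a rough sketch: the function name find_titles is made up for illustration, and it assumes each <title> element sits alone on its own line, as in the grep output above.

```python
import bz2
from xml.sax.saxutils import escape


def find_titles(dump_path, wanted):
    """Return the subset of `wanted` page titles present in a .xml.bz2 dump.

    Assumption: each <title>...</title> element is on its own line.
    """
    # Titles are XML-escaped in the dump (e.g. "&" is stored as "&amp;").
    targets = {f"<title>{escape(t)}</title>": t for t in wanted}
    found = set()
    # Stream the compressed file line by line; nothing is written to disk.
    with bz2.open(dump_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            stripped = line.strip()
            if stripped in targets:
                found.add(targets[stripped])
                if len(found) == len(targets):
                    break  # every requested title was seen, stop early
    return found
```

Calling find_titles with the path to the dump and a list such as ["Litecoin", "Ethereum", "Namecoin"] returns the set of titles that were actually found, so anything missing from the result is missing from the dump.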
Did you download your dump earlier in April? There was a known bad version that was truncated by about 2 GB (to somewhere around 10.8 GB). It was missing most of the pages in the Module namespace, and may have been missing these as well. See: https://lists.wikimedia.org/pipermail/xmldatadumps-l/2016-April/001301.html
If yours is 10.8 GB, then you should redownload the latest version. The modified time hasn't changed, but the file has been updated. See: https://phabricator.wikimedia.org/T133416#2252100
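To compare a local .bz2 file against the size and md5 quoted above, something like the following could be used. It is a minimal sketch: dump_fingerprint and matches_known_good are hypothetical helper names, and the two constants are simply the values reported for the 4-27 copy in this message.

```python
import hashlib
import os

# Values reported in this thread for a good enwiki-20160407 copy.
KNOWN_GOOD_SIZE = 12_878_552_649
KNOWN_GOOD_MD5 = "cff68321a17392fbb3322b34b61b0402"


def dump_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex) for the file at `path`."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so a multi-GB dump never sits in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return os.path.getsize(path), md5.hexdigest()


def matches_known_good(path):
    """True if the file has the size and md5 quoted in the thread."""
    return dump_fingerprint(path) == (KNOWN_GOOD_SIZE, KNOWN_GOOD_MD5)
```

The size check alone is usually enough to spot the truncated version, since it is cheap and the bad file was roughly 2 GB smaller; the md5 only needs to be computed when the sizes agree.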
Hope this helps.
On Sun, May 1, 2016 at 2:00 AM, Marcus Truscello marcus.truscello@gmail.com wrote:
Actually, it seems quite a few cryptocurrency-related articles are missing: Litecoin, Ethereum, Namecoin, Dogecoin, CryptoNote, etc. Most of these articles transclude Template:Cryptocurrencies, but I'm not sure why they wouldn't appear in the XML dump.
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
This looks to be what happened. When I downloaded the file, it was the "latest" version at the time. I've since decompressed it, so I don't have the original md5sum, but the timestamp on the file I have is from before the Phabricator issue was opened, so there's a good chance this is one of the "bad" dumps.
Thanks much for your help!
On Sun, May 1, 2016 at 10:07 PM, gnosygnu gnosygnu@gmail.com wrote:
I checked my local copy of enwiki-20160407-pages-articles.xml.bz2 and all those articles are present. For reference, my copy was downloaded on 4-27, has a size of 12,878,552,649 bytes and an md5 of cff68321a17392fbb3322b34b61b0402. Also, the following grep command found the page:
grep "<title>Litecoin</title>" enwiki-latest-pages-articles.xml
<title>Litecoin</title>
Did you download your dump earlier in April? There was a known bad version that was truncated by about 2 GB (to somewhere around 10.8 GB). It was missing most of the pages in the Module namespace, and may have been missing these as well. See: https://lists.wikimedia.org/pipermail/xmldatadumps-l/2016-April/001301.html
If yours is 10.8 GB, then you should redownload the latest version. The modified time hasn't changed, but the file has been updated. See: https://phabricator.wikimedia.org/T133416#2252100
Hope this helps.
FYI: The Phabricator task will be closed: https://phabricator.wikimedia.org/T133416 . The April dump looks good to me. Also, the issue is probably moot as the May dump has completed, and it too looks fine: https://dumps.wikimedia.org/enwiki/20160501/
Thanks Ariel for taking care of it!
On Mon, May 2, 2016 at 12:21 PM, Marcus Truscello marcus.truscello@gmail.com wrote:
This looks to be what happened. When I downloaded the file, it was the "latest" version at the time. I've since decompressed it, so I don't have the original md5sum, but the timestamp on the file I have is from before the Phabricator issue was opened, so there's a good chance this is one of the "bad" dumps.
Thanks much for your help!