Multiple pages are missing from the enwiki pages-articles-multistream dumps from 20180201 and 20180220.
Page id 88444, "Phosphor", doesn't appear in the index or in the data stream. The same is true for TARDIS, Psalm 132, and many others.
Why would the dump be partial?
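For anyone who wants to reproduce the check, here is a minimal sketch in Python. It assumes the multistream index is a bz2-compressed text file with one `offset:page_id:title` entry per line; the function names and the path argument are hypothetical, not part of any dump tooling.

```python
import bz2


def load_index_ids(index_path):
    """Return the set of page ids listed in a multistream index file.

    Assumption: each line has the form ``offset:page_id:title``.
    """
    ids = set()
    with bz2.open(index_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Titles may themselves contain colons, so split at most twice.
            _offset, page_id, _title = line.split(":", 2)
            ids.add(int(page_id))
    return ids


def missing_ids(index_path, expected_ids):
    """Return the expected page ids that are absent from the index, sorted."""
    return sorted(set(expected_ids) - load_index_ids(index_path))
```

Calling `missing_ids(path_to_index, [88444])` would return `[88444]` if that page id is absent from the index, and an empty list otherwise.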
It turns out that this happens for exactly 27 pages, those at the end of each enwiki-20180220-stub-articlesXX.xml.gz file. Tracking here: https://phabricator.wikimedia.org/T188388
Ariel
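To see which page sits at the end of a given stub file, something like the following sketch could be used. It assumes the stub file is gzip-compressed MediaWiki export XML in which each `<page>` element has direct `<title>` and `<id>` children; the function name and path argument are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix of the form '{uri}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def last_page_in_stub(stub_path):
    """Return (page_id, title) of the final <page> in a stub file, or None."""
    last = None
    with gzip.open(stub_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            page_id = title = None
            # Only direct children are inspected, so the revision's own
            # <id> (nested inside <revision>) is not mistaken for the page id.
            for child in elem:
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "id":
                    page_id = int(child.text)
            last = (page_id, title)
            elem.clear()  # release the children of pages already seen
    return last
```

The returned page id can then be looked up in the multistream index to confirm whether that page made it into the pages-articles output.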
(In reply to the message from Ryan Hitchman hitchmanr@gmail.com of Tue, Feb 27, 2018 at 10:45 AM, above.)
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Thanks for the quick fix! I'll verify it too with the next run.
I discovered this while building a link graph directly from the pages-articles dump, when I found more broken links (links whose target articles are missing) than expected.
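The broken-link check can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code actually used: the link graph is taken to be a dict mapping each page title to the titles it links to, and redirects, namespaces, and title normalization are ignored. The function name is hypothetical.

```python
def dangling_links(link_graph):
    """Map each link target that has no page of its own to the pages linking to it.

    ``link_graph`` maps a page title to an iterable of linked titles.
    """
    known = set(link_graph)
    missing = {}
    for source, targets in link_graph.items():
        for target in targets:
            if target not in known:
                missing.setdefault(target, set()).add(source)
    # Sort the sources so the result is stable across runs.
    return {target: sorted(sources) for target, sources in missing.items()}
```

With a toy graph such as `{"Doctor Who": ["TARDIS"], "Television": ["Doctor Who"]}`, the only dangling target is `"TARDIS"`, reported as linked from `"Doctor Who"`.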
(In reply to the message from Ariel Glenn WMF ariel@wikimedia.org of Tue, Feb 27, 2018 at 4:10 AM, above.)