Hi,
As far as I understand, pages in an XML dump are in the order of their original creation. This does not correspond to the page ID: if a page gets a new ID after being deleted and restored, after another page is renamed to that title, or for any similar reason, the order still remains the original one. But this sort key itself is not stored. In other words, a dump is not sorted by any key one could find in the dump, and it behaves as an unsorted structure.
Is this true? Can I use any non-linear (e.g. binary) search in a dump?
Not that this is off-topic here, but you will probably find more knowledgeable people and a quicker response on the specialized list: https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
In reply to Bináris (wikiposta@gmail.com), Mon, Sep 3, 2018 at 3:06 PM, on the Wikitech-l mailing list (https://lists.wikimedia.org/mailman/listinfo/wikitech-l).
If I read the code in WikiExporter.php correctly, dumps are currently ordered by page ID.
However, I would not consider this a guarantee. I'd recommend assuming that the contents of a dump are in no particular order, and that the order is subject to change without notice.
-- daniel
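To illustrate the advice above, here is a minimal Python sketch that looks up a page with a plain linear scan, so it does not depend on the dump being ordered, and that can also check whether one particular dump happens to be in ascending page ID order. It is only a sketch under assumptions: the helper names (local_name, iter_pages, find_page, is_sorted_by_id) and the tiny SAMPLE document are made up for illustration, and the element layout (page, title, id) is assumed from the usual MediaWiki export format rather than taken from this thread.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, title) for each <page>, in dump order, streaming."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Only direct children are read, so the revision's own <id>
        # is not confused with the page's <id>.
        fields = {local_name(child.tag): child.text for child in elem}
        yield int(fields["id"]), fields["title"]
        # Free the page's children; the emptied element itself stays
        # attached to the root.
        elem.clear()


def find_page(source, wanted_title):
    """Linear scan: makes no assumption about the order of pages."""
    for page_id, title in iter_pages(source):
        if title == wanted_title:
            return page_id
    return None


def is_sorted_by_id(source):
    """Check whether page IDs are ascending in this particular dump."""
    previous = None
    for page_id, _title in iter_pages(source):
        if previous is not None and page_id < previous:
            return False
        previous = page_id
    return True


# Made-up sample, deliberately NOT in page ID order.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><ns>0</ns><id>12</id>
    <revision><id>900</id><text>a</text></revision></page>
  <page><title>Beta</title><ns>0</ns><id>7</id>
    <revision><id>901</id><text>b</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(find_page(io.BytesIO(SAMPLE), "Beta"))   # 7
    print(is_sorted_by_id(io.BytesIO(SAMPLE)))     # False
```

If is_sorted_by_id returns True for a given dump, that only describes that one file; as noted above, the ordering is not a guarantee, so a binary search built on it could break with a later dump.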
In reply to Bináris's message of 03.09.2018, 15:05.