Basically, the XML dumps have two IDs: page_id and revision_id.
The page_id points to the article. In this case, 14640471 is the page_id for Mars (https://en.wikipedia.org/wiki/Mars).
The revision_id points to the latest revision of the article at the time the dump was generated. For Mars, the latest revision_id is 699008434, which was generated on 2016-01-09 (https://en.wikipedia.org/w/index.php?title=Mars&oldid=699008434). Note that a new revision_id is generated every time a page is edited.
So, to answer your question, the IDs never change. 14640471 will always point to Mars, while 699008434 points to the 2016-01-09 revision for Mars.
That said, different dumps will list different revision_ids, because an article may have been edited in between. If Mars is edited tomorrow and the English Wikipedia dump is generated afterwards, that dump will list Mars with a new revision_id (something higher than 699008434). However, that dump will still show Mars with a page_id of 14640471. You're probably better off using the page_id.
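To make the two IDs concrete, here is a minimal Python sketch that reads them out of a dump-style `<page>` element. The XML fragment is hand-written for illustration (real dumps declare an export namespace and carry many more fields), and `read_ids` and `local_name` are hypothetical helper names, not part of any dump tooling.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment shaped like one <page> entry in a dump.
# Assumption: only the fields needed here are shown.
SAMPLE = """
<page>
  <title>Mars</title>
  <ns>0</ns>
  <id>14640471</id>
  <revision>
    <id>699008434</id>
  </revision>
</page>
"""


def local_name(tag):
    # Drop a "{namespace}" prefix if the element has one.
    return tag.rsplit("}", 1)[-1]


def read_ids(page):
    """Return (title, page_id, revision_id) for one <page> element."""
    title = page_id = revision_id = None
    for child in page:
        name = local_name(child.tag)
        if name == "title":
            title = child.text
        elif name == "id":
            # Direct child <id> of <page> is the page_id.
            page_id = int(child.text)
        elif name == "revision":
            # Direct child <id> of <revision> is the revision_id.
            for sub in child:
                if local_name(sub.tag) == "id":
                    revision_id = int(sub.text)
                    break
    return title, page_id, revision_id


title, page_id, revision_id = read_ids(ET.fromstring(SAMPLE))
print(title, page_id, revision_id)
```

The page_id is the value to store as the key; the revision_id only tells you which version of the text that particular dump captured.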
Finally, you can also use the MediaWiki API to get a view similar to the dump. For example: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&title...
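As a rough sketch of that kind of API lookup in Python: the code below builds a query keyed on the page_id (using the `pageids` parameter, which is my choice here and not necessarily what the truncated URL above used) and pulls the revision ID out of a response. The sample response is hand-written in the shape the JSON format returns, not captured from a live call, and `build_query` and `latest_revision_id` are hypothetical helper names.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_query(page_id):
    """Build a URL asking for the current revision ID of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "pageids": page_id,
        "rvprop": "ids",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def latest_revision_id(response, page_id):
    """Extract the revision ID from a decoded JSON response, or None."""
    page = response.get("query", {}).get("pages", {}).get(str(page_id), {})
    revisions = page.get("revisions")
    if not revisions:
        return None  # page missing, or no revisions returned
    return revisions[0]["revid"]


# Hand-written sample shaped like an API response (assumption, not live data).
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "14640471": {
                "pageid": 14640471,
                "ns": 0,
                "title": "Mars",
                "revisions": [{"revid": 699008434}],
            }
        }
    }
}

print(build_query(14640471))
print(latest_revision_id(SAMPLE_RESPONSE, 14640471))
```

To run it against the live site, fetch the URL with any HTTP client and pass the decoded JSON to `latest_revision_id`.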
Hope this helps.
On Mon, Jan 11, 2016 at 5:09 AM, Luigi Assom luigi.assom@gmail.com wrote:
yep, same here!
Also, another question about the consistency of _IDs over time. I was working with an old version of the Wikipedia dump and testing some data models I built on it, using a few topics as pivots. My data might be corrupted on my side, but just to be sure: are article _IDs *persistent* over time, or are they subject to change?
Could an ID change because of a fallback or merge in an article's history? E.g. the test article "Mars" first pointed to _ID = "4285430" and later changed to "14640471".
I need to ensure _IDs will persist. Thank you!
*P.S. sorry for cross posting - I've replied from wrong email - could you please delete the other message and keep only this email address? thank you! *
On Mon, Jan 11, 2016 at 6:22 AM, Tilman Bayer tbayer@wikimedia.org wrote:
On Sun, Jan 10, 2016 at 4:05 PM, Bernardo Sulzbach < mafagafogigante@gmail.com> wrote:
On Sun, Jan 10, 2016 at 9:55 PM, Neil Harris neil@tonal.clara.co.uk wrote:
Hello! I've noticed that no enwiki dump seems to have been generated so far this month. Is this by design, or has there been some sort of dump failure?
Does anyone know when the next enwiki dump might happen?
I would also be interested.
-- Bernardo Sulzbach
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
CCing the Xmldatadumps mailing list https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l, where someone has already posted https://lists.wikimedia.org/pipermail/xmldatadumps-l/2016-January/001214.html about what might be the same issue.
-- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
-- *Luigi Assom* Founder & CEO @ XDiscovery - Crazy on Human Knowledge *Corporate* www.xdiscovery.com *Mobile App for knowledge Discovery* APP STORE http://tiny.cc/LearnDiscoveryApp | PR http://tiny.cc/app_Mindmap_Wikipedia | WEB http://www.learndiscovery.com/
T +39 349 3033334 | +1 415 707 9684
On 2016-01-11 22:06, gnosygnu wrote:
So, to answer your question, the IDs never change. 14640471 will always point to Mars, while 699008434 points to the 2016-01-09 revision for Mars.
While it's unlikely/rare, I think the page id can change when a page is deleted and re-created, and maybe some other cases. MediaWiki tries to keep it constant (for example, I think it's preserved after deletion and undeletion), but it's not always possible. It should be fine to use to track pages across renames, though, at least most of the time.
That said, different dumps will have different revision_ids, because an article may be updated. If Mars gets updated tomorrow, and the English Wikipedia dump is generated afterwards, then that dump will list Mars with a new revision_id (something higher than 699008434).
Please don't assume that revision IDs are increasing. Weird things can happen with import, export and page history merges :)
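One way to act on that caveat, sketched in Python: when deciding which of several revisions is the most recent, compare timestamps instead of ID magnitude. The revision records below are made up for illustration (an imported revision that received a high ID but carries an old timestamp), `newest_revision` is a hypothetical helper, and the sketch assumes the timestamps themselves are trustworthy for your purposes.

```python
from datetime import datetime

# Made-up revisions for one page. The second entry imitates an imported
# revision: it was assigned a higher ID but is older in time.
REVISIONS = [
    {"revid": 120, "timestamp": "2016-01-05T10:00:00Z"},
    {"revid": 300, "timestamp": "2015-12-01T09:00:00Z"},
    {"revid": 95, "timestamp": "2015-11-20T18:30:00Z"},
]


def parse_timestamp(value):
    # Timestamps in dumps and API output use this ISO 8601 UTC form.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def newest_revision(revisions):
    """Pick the most recent revision by timestamp, not by ID."""
    return max(revisions, key=lambda rev: parse_timestamp(rev["timestamp"]))


by_time = newest_revision(REVISIONS)
by_id = max(REVISIONS, key=lambda rev: rev["revid"])
print(by_time["revid"], by_id["revid"])
```

In this sample the two orderings disagree, which is exactly the case a "higher ID means newer" shortcut would get wrong.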
On Mon, Jan 11, 2016 at 10:37 PM, Bartosz Dziewoński matma.rex@gmail.com wrote:
On 2016-01-11 22:06, gnosygnu wrote:
So, to answer your question, the IDs never change. 14640471 will always point to Mars, while 699008434 points to the 2016-01-09 revision for Mars.
While it's unlikely/rare, I think the page id can change when a page is deleted and re-created, and maybe some other cases. MediaWiki tries to keep it constant (for example, I think it's preserved after deletion and undeletion), but it's not always possible.
The patch to preserve IDs over undeletion was merged today (so don't expect IDs to be unchanging in older dumps). Also, pages can be split and joined through partial undeletion of revisions, in which case it is hard to tell what staying constant even means. You can swap the ID of any two pages, for example, without any changes in their text or history, with the right sequence of deletions, undeletions and page moves. Also, when a page is moved, the ID and the title are disassociated (which is probably what you'd want in that case).
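Given these caveats, a data model built on dumps can at least detect when the ID-to-title mapping shifts between two dumps. The Python sketch below is one possible check, not an established tool: `diff_snapshots` is a hypothetical helper, the snapshots are small hand-made dictionaries, the "Mars" IDs are taken from the question earlier in this thread without verifying that history, and the "Phobos" entry is invented. It also assumes titles are unique within each snapshot.

```python
def diff_snapshots(old, new):
    """Compare two {page_id: title} mappings taken from different dumps.

    Returns (moved, reassigned):
      moved      -- same page_id, different title (e.g. a page move)
      reassigned -- same title, different page_id (e.g. delete and re-create)
    """
    moved = {
        pid: (old[pid], new[pid])
        for pid in old.keys() & new.keys()
        if old[pid] != new[pid]
    }
    old_by_title = {title: pid for pid, title in old.items()}
    new_by_title = {title: pid for pid, title in new.items()}
    reassigned = {
        title: (old_by_title[title], new_by_title[title])
        for title in old_by_title.keys() & new_by_title.keys()
        if old_by_title[title] != new_by_title[title]
    }
    return moved, reassigned


# Hand-made snapshots (assumptions for illustration only).
OLD_DUMP = {4285430: "Mars", 111: "Phobos"}
NEW_DUMP = {14640471: "Mars", 111: "Phobos (moon)"}

moved, reassigned = diff_snapshots(OLD_DUMP, NEW_DUMP)
print(moved)
print(reassigned)
```

Entries in `reassigned` are the ones to review by hand, since the text behind a title may or may not be the same article after a deletion, undeletion, or history merge.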