I am working with some enwiki-{YYYYMMDD}-stub-meta-history.xml dumps and wanted to get clarification on how certain fields of the articles can change:
1. What action will make an article get a new pageId? Is it only move/rename, a redirect, or a deletion and recreation, or are there other ways this could happen? Can any of these changes be detected from the stub-meta-history.xml files?
2. Is it possible for just one particular revision of an article to be deleted, maybe due to a copyright violation? If so, is just the content of the revision deleted or would this include all the data associated with it, so that the revision would not even appear in the stub-meta-history.xml file?
3. Are pageIds recycled? If a page is deleted, could its id number be used for a completely new page in the future?
Thanks, Jeff
2009/11/10 Jeff Kubina jeff.kubina@gmail.com:
I am working with some enwiki-{YYYYMMDD}-stub-meta-history.xml dumps and wanted to get clarification on how certain fields of the articles can change:
- What action will make an article get a new pageId? Is it
only move/rename, a redirect, or a deletion and recreation, or are there other ways this could happen? Can any of these changes be detected from the stub-meta-history.xml files?
When a page is moved, it'll change its name but keep its pageid. A redirect will be created at the old name with a new pageid.
- Is it possible for just one particular revision of an article to be
deleted, maybe due to a copyright violation? If so, is just the content of the revision deleted or would this include all the data associated with it, so that the revision would not even appear in the stub-meta-history.xml file?
Yes. In this case, any trace of the revision ever having existed is gone from the dumps, AFAIK.
- Are pageIds recycled? If a page is deleted, could its id number be used
for a completely new page in the future?
No, pageids are never recycled.
Roan Kattouw (Catrope)
On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina jeff.kubina@gmail.com wrote:
- What action will make an article get a new pageId? Is it
only move/rename, a redirect, or a deletion and recreation, or are there other ways this could happen? Can any of these changes be detected from the stub-meta-history.xml files?
Normal deletion/undeletion, moving, or similar things will not create a new page_id. However, there are a couple of things to be aware of:
1) In the old days, deleting an article and recreating it would assign it a new page_id. This hasn't been true for several years.
2) It's still possible to get revisions associated with a different page_id than they were originally written for, by deleting a page, moving another page over it, and undeleting one or more revisions.
- Is it possible for just one particular revision of an article to be
deleted, maybe due to a copyright violation? If so, is just the content of the revision deleted or would this include all the data associated with it, so that the revision would not even appear in the stub-meta-history.xml file?
Yes, an individual revision can be deleted. There are at least three different ways to do this, last I checked. I would expect that the old ways (oversight, and delete+selective undelete) would leave no traces at all in the dump, while the new way (rev_deleted) might only suppress certain fields. I'm not sure offhand, though.
- Are pageIds recycled? If a page is deleted, could its id number be used
for a completely new page in the future?
No. page_ids are handed out in strictly increasing order.
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e20911100759s1ba211b0k6ef6cb076449be37@mail.gmail.com...
On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina jeff.kubina@gmail.com wrote:
- Is it possible for just one particular revision of an article to be
deleted, maybe due to a copyright violation? If so, is just the content of the revision deleted or would this include all the data associated with it, so that the revision would not even appear in the stub-meta-history.xml file?
Yes, an individual revision can be deleted. There are at least three different ways to do this, last I checked. I would expect that the old ways (oversight, and delete+selective undelete) would leave no traces at all in the dump, while the new way (rev_deleted) might only suppress certain fields. I'm not sure offhand, though.
IIRC, any revision that has any of the rev_deleted bitfields set will be excluded from dumps. Don't quote me on that....
--HM
Thanks for the help, but I'm still a bit confused about this case: in enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of 6, as shown below. But in enwiki-20090914-stub-meta-history.xml it has an id of 23741520 (http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa), with only the last edit history entry. So what happened? Is this an example of a delete, then a restore with a new id? Why are the older revisions missing? Or does a restore only restore the latest revision?
XML from enwiki-20090714-stub-meta-history.xml for AmericanSamoa:

  <page>
    <title>AmericanSamoa</title>
    <id>6</id>
    <redirect />
    <revision>
      <id>233188</id>
      <timestamp>2001-01-19T01:12:51Z</timestamp>
      <contributor>
        <ip>office.bomis.com</ip>
      </contributor>
      <comment>*</comment>
      <text id="233188" />
    </revision>
    <revision>
      <id>15898942</id>
      <timestamp>2002-02-25T15:43:11Z</timestamp>
      <contributor>
        <ip>Conversion script</ip>
      </contributor>
      <minor/>
      <comment>Automated conversion</comment>
      <text id="15898942" />
    </revision>
    <revision>
      <id>18063795</id>
      <timestamp>2005-07-03T11:14:17Z</timestamp>
      <contributor>
        <username>Docu</username>
        <id>8029</id>
      </contributor>
      <minor/>
      <comment>adding to cur_id=6 {{R from CamelCase}}</comment>
      <text id="18058393" />
    </revision>
    <revision>
      <id>133180191</id>
      <timestamp>2007-05-24T14:41:33Z</timestamp>
      <contributor>
        <username>Ngaiklin</username>
        <id>4477979</id>
      </contributor>
      <minor/>
      <comment>Robot: Automated text replacement (-[[(.*?[:||])*?(.+?)]] +\g<2>)</comment>
      <text id="132462505" />
    </revision>
    <revision>
      <id>133452270</id>
      <timestamp>2007-05-25T17:12:06Z</timestamp>
      <contributor>
        <username>Gurch</username>
        <id>241822</id>
      </contributor>
      <minor/>
      <comment>Revert edit(s) by [[Special:Contributions/Ngaiklin|Ngaiklin]] to last version by [[Special:Contributions/Docu|Docu]]</comment>
      <text id="132732979" />
    </revision>
  </page>
Thanks, Jeff
On Tue, Nov 10, 2009 at 2:48 PM, Jeff Kubina jeff.kubina@gmail.com wrote:
Thanks for the help, but I'm still a bit confused about this case: in enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of 6, as shown below. But in enwiki-20090914-stub-meta-history.xml it has an id of 23741520 (http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa), with only the last edit history entry. So what happened? Is this an example of a delete, then a restore with a new id? Why are the older revisions missing? Or does a restore only restore the latest revision?
I assume the Page ID answer lies with whatever the hell Graham87 was doing here in July:
http://en.wikipedia.org/w/index.php?title=Special:Log&page=AmericanSamoa
Also, if you use a URL GET, such as you have above, it only gives the most recent revision. You can uncheck the "Include only the current revision" box at Special:Export if you want to get additional revisions from the online form.
-Robert Rohde
Jeff Kubina wrote:
Thanks for the help, but I'm still a bit confused about this case: in enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of 6, as shown below. But in enwiki-20090914-stub-meta-history.xml it has an id of 23741520 (http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa), with only the last edit history entry. So what happened? Is this an example of a delete, then a restore with a new id? Why are the older revisions missing? Or does a restore only restore the latest revision?
See http://en.wikipedia.org/w/index.php?title=Special:Log&page=AmericanSamoa
There was quite a bit of deletion, move, and undeletion trickery to move the first revision in the XML (the one from office.bomis.com) to the history of American_Samoa. http://en.wikipedia.org/w/index.php?title=American_Samoa&oldid=233188
It seems the AmericanSamoa page was recreated, and so got a new page id, during that.
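For anyone wanting to spot this kind of change between two dumps, here is a rough Python sketch. It streams each stub dump, builds a title-to-page-id map, and reports titles whose id differs. The helper names (local, page_ids, changed_ids) and the "Other" page are made up for illustration, and the two inline samples are cut down to just the title and id fields; real dumps are far larger and wrap their tags in an XML namespace, which is why the tag prefix is stripped.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def page_ids(source):
    """Stream a stub dump and return {title: page_id}."""
    ids = {}
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = page_id = None
        # Only direct children of <page>: revision ids are nested deeper.
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = int(child.text)
        ids[title] = page_id
        elem.clear()  # drop the page's children once we are done with it
    return ids

def changed_ids(old, new):
    """Titles present in both maps whose page id differs."""
    return {t: (old[t], new[t]) for t in old.keys() & new.keys() if old[t] != new[t]}

# Cut-down stand-ins for the 20090714 and 20090914 dumps (the "Other" page is hypothetical).
JULY = (b"<mediawiki><page><title>AmericanSamoa</title><id>6</id></page>"
        b"<page><title>Other</title><id>10</id></page></mediawiki>")
SEPT = (b"<mediawiki><page><title>AmericanSamoa</title><id>23741520</id></page>"
        b"<page><title>Other</title><id>10</id></page></mediawiki>")

old_ids = page_ids(io.BytesIO(JULY))
new_ids = page_ids(io.BytesIO(SEPT))
print(changed_ids(old_ids, new_ids))  # {'AmericanSamoa': (6, 23741520)}
```

A changed id only says that the title now points at a different page row. It does not say whether that came from a move, a recreation, or a history merge; the deletion and move logs are still needed for that.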
There's another id oddity on that page: that office edit is from January 2001 and has id 233188, yet the diff links (wrongly) list as its previous revision one from July 2002 with a revid of 205006. It is listed as previous because 205006 < 233188. The older revision has the newer revid because originally only the current versions of articles were imported from UseModWiki (those tagged as coming from Conversion script). Older edits like this one were imported later, after the 205006 edit had been made.
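The practical upshot for anyone processing the dumps is that revision ids are not a safe proxy for chronological order. A small Python sketch of ordering by timestamp and flagging the places where id order disagrees follows. The function names are made up, and the exact July 2002 timestamp is a placeholder, since only the month is known from the above.

```python
from datetime import datetime

def parse_ts(ts):
    """Parse a dump timestamp such as 2001-01-19T01:12:51Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def by_time(revs):
    """Sort (rev_id, timestamp) pairs chronologically, ignoring rev_id."""
    return sorted(revs, key=lambda r: parse_ts(r[1]))

def id_inversions(revs):
    """Chronologically adjacent pairs where the earlier edit has the larger id."""
    ordered = by_time(revs)
    return [(a[0], b[0]) for a, b in zip(ordered, ordered[1:]) if a[0] > b[0]]

revs = [
    (205006, "2002-07-01T00:00:00Z"),  # placeholder day and time within July 2002
    (233188, "2001-01-19T01:12:51Z"),  # the office.bomis.com edit, imported later
]

print([r[0] for r in by_time(revs)])  # [233188, 205006]
print(id_inversions(revs))            # [(233188, 205006)]
```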
On Tue, Nov 10, 2009 at 1:15 PM, Happy-melon happy-melon@live.com wrote:
"Aryeh Gregor" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e20911100759s1ba211b0k6ef6cb076449be37@mail.gmail.com...
On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina jeff.kubina@gmail.com wrote:
- Is it possible for just one particular revision of an article to be
deleted, maybe due to a copyright violation? If so, is just the content of the revision deleted or would this include all the data associated with it, so that the revision would not even appear in the stub-meta-history.xml file?
Yes, an individual revision can be deleted. There are at least three different ways to do this, last I checked. I would expect that the old ways (oversight, and delete+selective undelete) would leave no traces at all in the dump, while the new way (rev_deleted) might only suppress certain fields. I'm not sure offhand, though.
IIRC, any revision that has any of the rev_deleted bitfields set will be excluded from dumps. Don't quote me on that....
I'm not sure what the criteria actually are, but I recall encountering a dump entry where the editor's name had been suppressed (missing in the revision) but where the revision text itself was present. (I had an analysis script choke on this, since up to that time I had assumed every revision would have valid contributor information attached to it.)
-Robert Rohde
--- On Wed, 11/11/09, Robert Rohde rarohde@gmail.com wrote:
I'm not sure what the criteria actually are, but I recall encountering a dump entry where the editor's name had been suppressed (missing in the revision) but where the revision text itself was present. (I had an analysis script choke on this, since up to that time I had assumed every revision would have valid contributor information attached to it.)
Yes, actually that case forced updates to some parsers like mine, since they had not been written to expect empty fields on revisions (especially the rev_user field).
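As a rough illustration of the defensive parsing this calls for, here is a minimal Python sketch that returns the editor name when one is present and None otherwise, instead of assuming every revision carries contributor data. The function name is made up, and the second sample revision (an empty contributor element) is an assumption about what such an entry might look like, not a copy from a real dump.

```python
import xml.etree.ElementTree as ET

def contributor_name(revision):
    """Return the username or IP of a <revision>, or None if it is absent."""
    contributor = revision.find("contributor")
    if contributor is None:
        return None
    for tag in ("username", "ip"):
        node = contributor.find(tag)
        # Compare against None explicitly: childless elements are falsy.
        if node is not None and node.text:
            return node.text
    return None

SAMPLE = """
<page>
  <revision><id>18063795</id>
    <contributor><username>Docu</username><id>8029</id></contributor>
  </revision>
  <revision><id>1</id>
    <contributor />
  </revision>
  <revision><id>233188</id>
    <contributor><ip>office.bomis.com</ip></contributor>
  </revision>
</page>
"""

names = [contributor_name(rev) for rev in ET.fromstring(SAMPLE).iter("revision")]
print(names)  # ['Docu', None, 'office.bomis.com']
```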
F --
-Robert Rohde
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Happy-melon wrote:
Yes, an individual revision can be deleted. There are at least three different ways to do this, last I checked. I would expect that the old ways (oversight, and delete+selective undelete) would leave no traces at all in the dump, while the new way (rev_deleted) might only suppress certain fields. I'm not sure offhand, though.
IIRC, any revision that has any of the rev_deleted bitfields set will be excluded from dumps. Don't quote me on that....
--HM
They will appear with a deleted="deleted" attribute, so the content of the suppressed fields isn't available, but that of the other fields is.
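A quick Python sketch of how a parser might pick up on that attribute: it lists which child elements of a revision are flagged. The function name and the sample revision are made up for illustration, and which elements can carry the attribute (contributor, comment, text) is an assumption here.

```python
import xml.etree.ElementTree as ET

def suppressed_fields(revision):
    """Names of the direct children of <revision> marked deleted="deleted"."""
    return [child.tag for child in revision if child.get("deleted") == "deleted"]

# Hypothetical revision with the editor suppressed but the rest intact.
SAMPLE = """
<revision>
  <id>1</id>
  <timestamp>2009-11-10T00:00:00Z</timestamp>
  <contributor deleted="deleted" />
  <comment>some edit summary</comment>
  <text id="1" />
</revision>
"""

revision = ET.fromstring(SAMPLE)
print(suppressed_fields(revision))  # ['contributor']
```

Checking this list before reading a field lets a script tell "suppressed" apart from "malformed dump entry".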