[repost, sorry if it ends up duplicated]
Hi
It seems to me that there are some inconsistencies between at least the
page and revision tables in the 20060303 enwiki dump.
The first problematic page would be page_id 12, Anarchism (sorry for the
raw mysql formatting):
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random       | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+-------------------+----------------+-------------+----------+
|      12 |              0 | Anarchism  |                   |         5252 |                0 |           0 | 0.786172332974311 | 20060303031540 |    41982999 |    67537 |
which indicates that the latest revision is 41982999 (the page_latest
column). But there is no row with rev_id=41982999 in the revision table.
(This can be verified by grepping for 41982999 directly in
enwiki-20060303-pages-articles.xml.bz2 and in
enwiki-20060303-page.sql.gz.)
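
For anyone who wants to repeat the check without unpacking the dumps,
here is a rough Python sketch of the same grep. The function names
(open_dump, count_lines_containing) are my own, and it is a plain
substring match, so it will also count lines where 41982999 happens to
appear inside a longer number or in article text.

```python
import bz2
import gzip


def open_dump(path):
    """Open a .bz2 or .gz dump (or a plain file) for streaming binary reads."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def count_lines_containing(path, needle):
    """Count the lines of a dump that contain `needle` as a substring."""
    needle_bytes = needle.encode("ascii")
    hits = 0
    with open_dump(path) as dump:
        # Iterating the file object decompresses on the fly, line by line.
        for line in dump:
            if needle_bytes in line:
                hits += 1
    return hits


# Intended use (the dump files must be in the current directory):
#   count_lines_containing("enwiki-20060303-pages-articles.xml.bz2", "41982999")
#   count_lines_containing("enwiki-20060303-page.sql.gz", "41982999")
```

A count of zero for the XML dump together with a non-zero count for the
page table dump would match what I am seeing.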
Now:
- Am I missing something here?
- It might be that the latest revision changed between the dumps of
those 2 tables (i.e. the page was edited in between).
- The result is empty pages (i.e. showing the usual stub text) for ~5%
of the pages. That seems huge, but I don't see where the problem lies.
- Is it a temporary problem? I don't recall getting so many empty
articles with earlier dumps.
- Is there a simple way to fix it? If no better idea emerges, I will
try to fix the page_latest column in the page table by doing a lookup on
rev_page in the revision table. Is that right?
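
To make the proposed fix concrete, here is a minimal sketch of it, run
against an in-memory SQLite database rather than the real MySQL import.
Only the column names (page_id, page_latest, rev_id, rev_page) come from
the dump; the two-column table definitions, the helper names, and the
sample rows other than page 12 / 41982999 are made up for illustration.
It also assumes that the highest rev_id for a page is its newest
revision, which I have not verified.

```python
import sqlite3

# Pages whose page_latest has no matching row in the revision table.
DANGLING_SQL = """
SELECT p.page_id, p.page_latest
FROM page AS p
LEFT JOIN revision AS r ON r.rev_id = p.page_latest
WHERE r.rev_id IS NULL
ORDER BY p.page_id
"""

# Point those pages at the highest rev_id found through rev_page.
# Pages with no revision rows at all are left untouched.
REPAIR_SQL = """
UPDATE page
SET page_latest = (SELECT MAX(rev_id) FROM revision
                   WHERE rev_page = page.page_id)
WHERE NOT EXISTS (SELECT 1 FROM revision WHERE rev_id = page.page_latest)
  AND EXISTS (SELECT 1 FROM revision WHERE rev_page = page.page_id)
"""


def find_dangling(conn):
    """Return (page_id, page_latest) pairs that point at a missing revision."""
    return conn.execute(DANGLING_SQL).fetchall()


def repair(conn):
    """Rewrite dangling page_latest values; return the number of rows changed."""
    cur = conn.execute(REPAIR_SQL)
    conn.commit()
    return cur.rowcount


def make_demo_db():
    """Build a toy database; every value except 12 / 41982999 is invented."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER)")
    conn.execute(
        "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
    # Page 12 points at a revision that is absent; page 25 is consistent.
    conn.executemany("INSERT INTO page VALUES (?, ?)",
                     [(12, 41982999), (25, 100)])
    # 41900000 is a hypothetical older revision of page 12.
    conn.executemany("INSERT INTO revision VALUES (?, ?)",
                     [(41900000, 12), (100, 25)])
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print("dangling before:", find_dangling(db))
    print("rows repaired:", repair(db))
    print("dangling after:", find_dangling(db))
```

The same two statements should carry over to MySQL with little change,
but I have not tried them there. Note that this only hides the mismatch:
the page would then show an older revision, not the one page_latest
originally referred to.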
Thanks
--
Colin