[repost, sorry if it ends up duplicated]
Hi
It seems to me that there are some inconsistencies between at least the page and revision tables in the 20060303 enwiki dump.
The first problematic page would be page_id 12, Anarchism (sorry for the raw mysql formatting):

| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random       | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+-------------------+----------------+-------------+----------+
|      12 |              0 | Anarchism  |                   |         5252 |                0 |           0 | 0.786172332974311 | 20060303031540 |    41982999 |    67537 |

which indicates a revision # 41982999.
But there is no line with rev_id=41982999 in the revision table.
(this can be verified by grepping for 41982999 directly in enwiki-20060303-pages-articles.xml.bz2 and in enwiki-20060303-page.sql.gz)
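
To put a number on how many pages are affected, something like the following could be used. It is only a minimal sketch: it uses Python's sqlite3 with toy in-memory tables rather than the real MySQL import, keeps only the columns relevant here, and every row except page 12 and its page_latest value is made up for illustration. The helper names (find_dangling_pages, build_check_db) are hypothetical.

```python
import sqlite3


def find_dangling_pages(conn):
    """Return (page_id, page_latest) for pages whose page_latest
    does not match any rev_id in the revision table."""
    return conn.execute(
        """
        SELECT p.page_id, p.page_latest
        FROM page AS p
        LEFT JOIN revision AS r ON r.rev_id = p.page_latest
        WHERE r.rev_id IS NULL
        ORDER BY p.page_id
        """
    ).fetchall()


def build_check_db():
    """Toy stand-in for the imported dump (assumption: reduced schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_title TEXT,
            page_latest INTEGER
        );
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER
        );
        -- page 12 is the case from the dump; its latest revision is missing
        INSERT INTO page VALUES (12, 'Anarchism', 41982999);
        -- made-up rows: an older revision of page 12 and a consistent page
        INSERT INTO revision VALUES (41900000, 12);
        INSERT INTO page VALUES (25, 'Example', 100);
        INSERT INTO revision VALUES (100, 25);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_check_db()
    dangling = find_dangling_pages(conn)
    total = conn.execute("SELECT COUNT(*) FROM page").fetchone()[0]
    print(dangling)
    print(f"{len(dangling)} of {total} pages have a dangling page_latest")
```

The same LEFT JOIN should work unchanged on the real page and revision tables in MySQL, though I have not run it against the full dump.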
Now:
- am I missing something here?
- it might be that the latest revision changed between the dumps of those 2 tables (the page was edited in between)
- it results in empty pages (i.e. with the usual stub text) for ~5% of the pages (that seems huge, but I don't see where the problem lies)
- is it a temporary problem? (I don't recall getting so many empty articles with earlier dumps)
- is there a simple way to fix it? (if no better idea emerges, I will try to fix the page_latest column in the page table by doing a lookup on rev_page in the revision table; is that right?)
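
The fallback fix from the last point could look roughly like this. Again a sketch under assumptions, not a tested procedure: it uses Python's sqlite3 with toy tables instead of the real MySQL import, it assumes that the highest rev_id for a page is its most recent revision, and all rows except page 12 and its page_latest value are made up. The helper names (repair_page_latest, build_repair_db) are hypothetical.

```python
import sqlite3


def repair_page_latest(conn):
    """Point each dangling page_latest at the highest rev_id found for
    that page via rev_page. Pages with no revisions at all are left
    untouched. Returns the number of rows updated."""
    cur = conn.execute(
        """
        UPDATE page
        SET page_latest = (
            SELECT MAX(rev_id) FROM revision
            WHERE rev_page = page.page_id
        )
        WHERE NOT EXISTS (
            SELECT 1 FROM revision WHERE rev_id = page.page_latest
        )
        AND EXISTS (
            SELECT 1 FROM revision WHERE rev_page = page.page_id
        )
        """
    )
    conn.commit()
    return cur.rowcount


def build_repair_db():
    """Toy stand-in for the imported dump (assumption: reduced schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_title TEXT,
            page_latest INTEGER
        );
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER
        );
        -- page 12: page_latest points at a revision that is not present
        INSERT INTO page VALUES (12, 'Anarchism', 41982999);
        -- made-up older revisions of page 12
        INSERT INTO revision VALUES (41800000, 12);
        INSERT INTO revision VALUES (41900000, 12);
        -- made-up page with no revisions at all; cannot be repaired
        INSERT INTO page VALUES (99, 'Orphan', 555);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_repair_db()
    print("rows updated:", repair_page_latest(conn))
    print(conn.execute(
        "SELECT page_id, page_latest FROM page ORDER BY page_id"
    ).fetchall())
```

Note that this only repoints page_latest; other columns such as page_len and page_touched would still describe the missing revision.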
Thanks
wikitech-l@lists.wikimedia.org