Hi,
there seems to be a problem with the dumps or with the way in which we
interpret the dumps.
Currently, daily dumps come with a maxrevid.txt file that is supposed to
give the largest revision number in the dump. For example, the daily dump
of 1st Aug 2013 has max id 62860640 [1]. I guess that's true.
The wda scripts (that also create the statistics and digested dumps we
publish) use this number to figure out if a daily is still relevant or
if it is already contained in the latest full dump. For this, we need to
get the maximal revision id of the full dump. We do this by reading the
file site_stats.sql.gz, where we look for the line starting with INSERT
INTO `site_stats` and take a revision number from there (third number in
the insert tuple). For example, for the dump of 27 July 2013, this
number is 63069374 [2].
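To make the lookup concrete, here is a rough sketch of what we do, not the actual wda code. The function names are hypothetical, and the sample line is made up apart from its third value; I am also assuming the tuple contains only plain numbers, so that splitting on commas is safe.

```python
import gzip
import re


def third_value_from_site_stats(sql_text):
    """Return the third number of the first tuple in the site_stats
    INSERT line, or None if no such line is found."""
    for line in sql_text.splitlines():
        if line.startswith("INSERT INTO `site_stats`"):
            match = re.search(r"VALUES\s*\(([^)]*)\)", line)
            if match is None:
                return None
            # Assumption: numeric fields only, so a plain split is enough.
            fields = match.group(1).split(",")
            if len(fields) < 3:
                return None
            return int(fields[2])
    return None


def read_site_stats_dump(path):
    """Read a gzipped site_stats SQL dump and extract the third value."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return third_value_from_site_stats(handle.read())


# Illustrative line: only the third value comes from the 27 July 2013 dump,
# the other fields are placeholders.
sample = "INSERT INTO `site_stats` VALUES (1,0,63069374,0,0,0,0,0);"
print(third_value_from_site_stats(sample))
```

For a real dump one would call read_site_stats_dump() with the local path of the downloaded site_stats.sql.gz file.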
There is a problem here, since the maximal revision in the dumps of 27
July 2013 is not actually that high (the history dump of that date is
incomplete, but the current revs dump is done and has max rev 61983867
[3]). Thus, our scripts wrongly treat several days of dailies as already
contained in the full dump and ignore them.
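The effect can be checked with the three numbers above. This is only a sketch: daily_is_relevant is a hypothetical name, and I am assuming the check is a strict "greater than" comparison of the two revision ids.

```python
def daily_is_relevant(daily_max_rev, full_dump_max_rev):
    """A daily is still needed if it contains revisions newer than
    everything in the full dump (assumed: strict comparison)."""
    return daily_max_rev > full_dump_max_rev


daily_max = 62860640         # maxrevid.txt of the 1 Aug 2013 daily [1]
site_stats_value = 63069374  # number taken from site_stats, 27 July 2013 [2]
current_revs_max = 61983867  # max rev of the current revs dump [3]

# With the site_stats number the daily looks obsolete ...
print(daily_is_relevant(daily_max, site_stats_value))   # False
# ... but against the actual max revision it is still needed.
print(daily_is_relevant(daily_max, current_revs_max))   # True
```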
Before I go and work on this, my question is whether this is an error in
our script (i.e., the number we take from site_stats is not supposed to
be the max revision) or an error in the dumps (i.e., site_stats was
exported wrongly).
Cheers,
Markus
[1]
http://dumps.wikimedia.org/other/incr/wikidatawiki/20130801/maxrevid.txt
[2]
http://dumps.wikimedia.org/wikidatawiki/20130727/wikidatawiki-20130727-site…
[3] This can be seen in the comments for the dump at
http://dumps.wikimedia.org/wikidatawiki/20130727/
--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529
http://korrekt.org/