Hi,
I am working on a MW instance where we batch load a ton of data from
an external database. We have about 50,000 records, each of which
becomes a MW page. So far so good (we use pywikipedia to load the
data). Now it comes to the issue of consistency...
A) The external database may change independently of the MW.
B) The MW may change independently of the external database.
Ignoring problems related to B (for the moment), we can propose the
following solution to problem A;
1) As each 'record page' is loaded into the MW (for the first time),
it is put into a special category ('Category:External DB record page'
for example).
2) We can list all 'record pages' using a direct query over the MW
database (or using pywikipedia?)
3) We can delete all the pages in our list from the MW.
4) We can load all the pages again from the external database.
The above is clearly inefficient. Even if we only run an 'update' job
every month - its a lot of needless work. So, I want to propose the
following;
1) After loading as before, we update 'cl_timestamp' in the
categorylinks table to the value of the 'last modified' field of the
data in the external database.
2) We can list all 'record pages' using a direct query over the MW
database, including the 'last modified' timestamp (which is now
'cl_timestamp').
3) We can synchronize pages that have changed based on newer 'last
modified' timestamps in the external database (and additionally delete
/ add entries as necessary).
The latter method has the advantage of only requiring incremental
updates, which will be much faster in our case (slow data turnover).
Question 1) Will tinkering with 'cl_timestamp' have any adverse
effects? The effect on DPL 'order=categoryadd' is actually desired in
our case.
Question 2) Any way to do this without direct database access? For
example, could we write a template to 'update categorized date' given
a page name and a category name? Does pywikipedia have functions to
interact with special pages? Can you point me at a similar template?
I don't want to put the 'last modified' field in the page, because it
would involve downloading and parsing all the content. Rather I would
like to query one field of the DB. Using 'cl_timestamp' in this case
is just a hack that seems to fit. Of course I could just make a new
table in the database to store this data, or keep an accurate list of
the state of the MW locally. However, none of these solutions seems
quite satisfactory.
Any feedback on the above would be most welcome.
All the best,
Dan.