Hi,
I am working on a MW instance where we batch-load a ton of data from an external database. We have about 50,000 records, each of which becomes a MW page. So far so good (we use pywikipedia to load the data). Now comes the issue of consistency...
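For context, the load step itself is roughly the following (a simplified sketch of what we do; the field names 'title' and 'description' and the edit comment are placeholders for our real record structure):

# Simplified sketch of the batch load with the pywikipedia framework.
# The record field names are placeholders.
import wikipedia

def record_to_wikitext(rec):
    # Placeholder: the real code builds the full page wikitext here.
    return "'''%s'''\n\n%s" % (rec['title'], rec['description'])

def load_records(records):
    site = wikipedia.getSite()
    for rec in records:
        page = wikipedia.Page(site, rec['title'])
        page.put(record_to_wikitext(rec),
                 comment='Batch load from external DB')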
A) The external database may change independently of the MW.
B) The MW may change independently of the external database.
Ignoring problems related to B (for the moment), we can propose the following solution to problem A:
1) As each 'record page' is loaded into the MW (for the first time), it is put into a special category ('Category:External DB record page', for example).
2) We can list all 'record pages' using a direct query over the MW database (or using pywikipedia?) - see the sketch after this list.
3) We can delete all the pages in our list from the MW.
4) We can load all the pages again from the external database.
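For steps 2 and 3, I imagine something like the following (a rough sketch; the connection details are placeholders, and I'm assuming pywikipedia's Page.delete() accepts a reason and can be told not to prompt):

# Rough sketch of steps 2 and 3: list all tagged 'record pages' with a
# direct query, then delete them with pywikipedia (the bot account
# needs delete rights). Connection details are placeholders.
import MySQLdb
import wikipedia

def list_record_pages():
    db = MySQLdb.connect(host='localhost', user='wikiuser',
                         passwd='secret', db='wikidb')
    cur = db.cursor()
    cur.execute("""
        SELECT page_title
        FROM page JOIN categorylinks ON cl_from = page_id
        WHERE cl_to = %s AND page_namespace = 0
    """, ('External_DB_record_page',))
    return [row[0] for row in cur.fetchall()]

def delete_record_pages(titles):
    site = wikipedia.getSite()
    for title in titles:
        page = wikipedia.Page(site, title)
        page.delete(reason='Refreshing from external DB', prompt=False)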
The above is clearly inefficient. Even if we only run an 'update' job every month, it's a lot of needless work. So I want to propose the following:
1) After loading as before, we update 'cl_timestamp' in the categorylinks table to the value of the 'last modified' field of the data in the external database (sketched below).
2) We can list all 'record pages' using a direct query over the MW database, including the 'last modified' timestamp (which is now 'cl_timestamp').
3) We can synchronize pages that have changed, based on newer 'last modified' timestamps in the external database (and additionally delete / add entries as necessary).
The latter method has the advantage of only requiring incremental updates, which will be much faster in our case (slow data turnover).
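Concretely, here is the kind of thing I have in mind for the steps above (a rough sketch; cl_timestamp is a MySQL TIMESTAMP column, so the external 'last modified' values would have to be converted to a comparable form, and titles are in their DB form with underscores):

# Sketch of the incremental sync: stamp cl_timestamp with the external
# 'last modified' value at load time, then only re-load pages whose
# external record is newer than the stamp.
import MySQLdb

CATEGORY = 'External_DB_record_page'

def stamp(cur, title, last_modified):
    # last_modified: e.g. '2007-09-14 12:00:00' from the external DB;
    # title: DB form, with underscores.
    cur.execute("""
        UPDATE categorylinks JOIN page ON cl_from = page_id
        SET cl_timestamp = %s
        WHERE page_namespace = 0 AND page_title = %s AND cl_to = %s
    """, (last_modified, title, CATEGORY))

def wiki_state(cur):
    # title -> 'last modified' as currently recorded in cl_timestamp.
    cur.execute("""
        SELECT page_title, cl_timestamp
        FROM page JOIN categorylinks ON cl_from = page_id
        WHERE cl_to = %s
    """, (CATEGORY,))
    return dict(cur.fetchall())

def stale_titles(cur, external_state):
    # external_state: title -> 'last modified' from the external DB,
    # converted to the same type as cl_timestamp returns (datetimes).
    wiki = wiki_state(cur)
    return [t for t, ts in external_state.items()
            if t not in wiki or ts > wiki[t]]

The nice thing is that an update run then only needs one query against the MW database and one against the external database, and only the stale pages get re-loaded.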
Question 1) Will tinkering with 'cl_timestamp' have any adverse effects? The effect on DPL 'order=categoryadd' is actually desired in our case.
Question 2) Is there any way to do this without direct database access? For example, could we write a template to 'update categorized date' given a page name and a category name? Does pywikipedia have functions to interact with special pages? Can you point me at a similar template?
I don't want to put the 'last modified' field in the page itself, because it would involve downloading and parsing all the content. Rather, I would like to query one field of the DB. Using 'cl_timestamp' in this case is just a hack that seems to fit. Of course I could just make a new table in the database to store this data, or keep an accurate list of the state of the MW locally. However, none of these solutions seems quite satisfactory.
Any feedback on the above would be most welcome.
All the best,
Dan.
Should I create a 'version page' for every 'record page' to store this data?
I would. I recently did a batch import from an external database (from another wiki), not using pywikipedia but by inserting directly into the database. The database schema information at
http://www.mediawiki.org/wiki/Manual:Database_layout
was useful.
For a basic import, inserts into just three tables were needed:
* page: page title etc.
* revision: metadata about the version
* text (pagecontent with PostgreSQL): content of the version
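Roughly, the per-page import went along these lines (a simplified sketch rather than the actual script; it uses the MySQL table and column names from the manual page above, so on PostgreSQL the text table is pagecontent and the ID handling differs a little; the user name and edit comment are just placeholders, and column lists vary between MediaWiki versions):

# Simplified sketch of a minimal page import: one row each in page,
# revision and text, linked together. Several optional columns are
# left at their defaults.
import random
import time
import MySQLdb

def insert_page(db, title, wikitext):
    cur = db.cursor()
    now = time.strftime('%Y%m%d%H%M%S', time.gmtime())  # MW timestamp

    # 1. The page row (page_latest is filled in at the end).
    cur.execute("""
        INSERT INTO page (page_namespace, page_title, page_restrictions,
                          page_is_redirect, page_is_new, page_random,
                          page_touched, page_latest, page_len)
        VALUES (0, %s, '', 0, 1, %s, %s, 0, %s)
    """, (title, random.random(), now, len(wikitext)))
    page_id = cur.lastrowid

    # 2. The text row holding the actual wikitext.
    cur.execute("INSERT INTO text (old_text, old_flags) VALUES (%s, 'utf-8')",
                (wikitext,))
    text_id = cur.lastrowid

    # 3. The revision row tying page and text together.
    cur.execute("""
        INSERT INTO revision (rev_page, rev_text_id, rev_comment,
                              rev_user, rev_user_text, rev_timestamp,
                              rev_minor_edit, rev_deleted)
        VALUES (%s, %s, 'Batch import', 0, 'ImportScript', %s, 0, 0)
    """, (page_id, text_id, now))
    rev_id = cur.lastrowid

    # 4. Point the page at its latest (and only) revision.
    cur.execute("UPDATE page SET page_latest = %s WHERE page_id = %s",
                (rev_id, page_id))
    db.commit()

Running maintenance/rebuildall.php afterwards is probably worth it, to fill in the link tables and the search index.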
Search didn't work immediately, but did after a while (I'm still not quite sure whether you need to edit each page once for it to be indexed; I haven't tested it, but it seems to be a bit better than that).
Oliver