On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere <dvanliere(a)gmail.com> wrote:
Yes, manually matching is fairly simple but in the worst case you need
to iterate over n-1 talk pages (where n is the total number of talk
pages of a Wikipedia) to find the talk page that belongs to a user
page when using the dump files. Hence, if the dump file contained,
for each article, a tag with the talk page id, it would significantly
reduce the processing time.
You're expected to build indexes for things like this. If you import
the data into MySQL, for instance, you can just do a join (since
MediaWiki has good indexes by default). If you're writing data
analysis code manually for some reason, load the data into an on-disk
B-tree, and then your worst case is logarithmic. Without indexes,
pretty much any operation on the data is going to take linear time.
(In fact, so is lookup by page id, unless you're just doing a binary
search on the dump file and assuming it's in id order . . .)
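
To make the join concrete, here is a minimal sketch in Python using
SQLite rather than MySQL, purely so it runs without any setup. The
column names and the (page_namespace, page_title) index mirror
MediaWiki's page table, and namespaces 2 and 3 are User and User talk.
The sample rows and the helper names build_db and match_user_talk are
made up for illustration; a real import from a dump would replace
SAMPLE_PAGES.

```python
import sqlite3

# Hypothetical sample rows: (page_id, page_namespace, page_title).
# In MediaWiki, namespace 2 is User and namespace 3 is User talk.
SAMPLE_PAGES = [
    (1, 0, "Amsterdam"),
    (2, 2, "Alice"),
    (3, 3, "Alice"),
    (4, 2, "Bob"),    # user page with no talk page
    (5, 3, "Carol"),  # talk page with no user page
]


def build_db(rows):
    """Load page rows into an in-memory SQLite table and index them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        " page_id INTEGER PRIMARY KEY,"
        " page_namespace INTEGER NOT NULL,"
        " page_title TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", rows)
    # Same shape as MediaWiki's name_title index: this is what turns
    # each talk-page lookup into a B-tree search instead of a scan.
    conn.execute(
        "CREATE UNIQUE INDEX name_title ON page (page_namespace, page_title)"
    )
    return conn


def match_user_talk(conn):
    """Return (user_page_id, talk_page_id, title) for each matched pair."""
    return conn.execute(
        "SELECT u.page_id, t.page_id, u.page_title"
        " FROM page AS u"
        " JOIN page AS t"
        "   ON t.page_namespace = 3 AND t.page_title = u.page_title"
        " WHERE u.page_namespace = 2"
        " ORDER BY u.page_id"
    ).fetchall()


if __name__ == "__main__":
    conn = build_db(SAMPLE_PAGES)
    for user_id, talk_id, title in match_user_talk(conn):
        print(user_id, talk_id, title)
```

Building the index is a one-time cost over the whole table; after that
each user page finds its talk page with a single index lookup, so you
never iterate over all the talk pages.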
If you don't want to set up a database yourself, you might want to
look into getting a toolserver account, if you don't have one. This
would give you read access to a live replica of Wikipedia's database,
which of course has all these indexes.