On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere dvanliere@gmail.com wrote:
Yes, matching manually is fairly simple, but in the worst case you need to iterate over n-1 talk pages (where n is the total number of talk pages in a Wikipedia) to find the talk page that belongs to a user page when using the dump files. Hence, if the dump file contained, for each article, a tag with its talk page id, it would significantly reduce the processing time.
You're expected to build indexes for things like this. If you import the data into MySQL, for instance, you can just do a join (since MediaWiki has good indexes by default). If you're writing data analysis code manually for some reason, load the data into an on-disk B-tree, and then your worst case is logarithmic. Without indexes, pretty much any operation on the data is going to take linear time. (In fact, so is lookup by page id, unless you're just doing a binary search on the dump file and assuming it's in id order . . .)
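Rough sketch of what I mean, in Python with sqlite3 (whose indexes are on-disk B-trees). The table layout, function names, and example rows here are made up for illustration, not taken from your setup; the namespace pairing (talk namespace = subject namespace + 1, same title) is the standard MediaWiki convention.

    # Sketch: load (page_id, namespace, title) tuples parsed from the dump
    # into SQLite, index them, then pair each subject page with its talk
    # page via a self-join instead of a linear scan.
    import sqlite3

    def build_index(pages, db_path="pages.db"):
        """pages: iterable of (page_id, namespace, title) tuples from the dump."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS page "
                    "(page_id INTEGER, page_namespace INTEGER, page_title TEXT)")
        con.executemany("INSERT INTO page VALUES (?, ?, ?)", pages)
        # B-tree index: lookup by (namespace, title) is now logarithmic, not linear.
        con.execute("CREATE INDEX IF NOT EXISTS name_title "
                    "ON page (page_namespace, page_title)")
        con.commit()
        return con

    def subject_talk_pairs(con, subject_ns=2):  # 2 = User, 3 = User talk
        return con.execute(
            "SELECT s.page_id, t.page_id, s.page_title "
            "FROM page AS s JOIN page AS t "
            "ON t.page_title = s.page_title "
            "AND t.page_namespace = s.page_namespace + 1 "
            "WHERE s.page_namespace = ?", (subject_ns,)).fetchall()

    # A few hand-made rows instead of a real dump:
    con = build_index([(10, 2, "Example_user"), (11, 3, "Example_user"),
                       (12, 0, "Some_article"), (13, 1, "Some_article")])
    print(subject_talk_pairs(con))     # [(10, 11, 'Example_user')]
    print(subject_talk_pairs(con, 0))  # [(12, 13, 'Some_article')]

With a full import into MySQL you get the same effect directly from MediaWiki's existing (namespace, title) index on the page table, so the join is cheap.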
If you don't want to set up a database yourself, you might want to look into getting a toolserver account, if you don't have one. This would allow you read access to a live replica of Wikipedia's database, which of course has all these indexes.