On 9 January 2011 02:05, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere <dvanliere@gmail.com> wrote:
Yes, manually matching is fairly simple, but in the worst case you need to iterate over n-1 talk pages (where n is the total number of talk pages on a Wikipedia) to find the talk page that belongs to a user page when using the dump files. Hence, if the dump file contained, for each article, a tag with the talk page id, it would significantly reduce the processing time.
You're expected to build indexes for things like this. If you import the data into MySQL, for instance, you can just do a join (since MediaWiki has good indexes by default). If you're writing data analysis code manually for some reason, load the data into an on-disk B-tree, and then your worst case is logarithmic. Without indexes, pretty much any operation on the data is going to take linear time. (In fact, so is lookup by page id, unless you're just doing a binary search on the dump file and assuming it's in id order . . .)
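To make that concrete, here is a minimal sketch in Python using SQLite rather than MySQL (the standard sqlite3 module keeps its tables and indexes in B-trees); the table and column names loosely follow MediaWiki's page table, namespaces 2 and 3 are User and User talk, and the sample rows are hypothetical stand-ins for data parsed out of a dump:

import sqlite3

# Pages parsed from the dump, stored as (page_id, namespace, title).
conn = sqlite3.connect("pages.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS page (
        page_id        INTEGER PRIMARY KEY,
        page_namespace INTEGER NOT NULL,
        page_title     TEXT    NOT NULL
    )
""")
# The index makes each title lookup a B-tree probe instead of a scan.
conn.execute("""
    CREATE INDEX IF NOT EXISTS ns_title ON page (page_namespace, page_title)
""")

# Hypothetical sample rows standing in for parsed dump data.
conn.executemany(
    "INSERT OR REPLACE INTO page VALUES (?, ?, ?)",
    [
        (101, 2, "Example_user"),   # User:Example_user
        (102, 3, "Example_user"),   # User_talk:Example_user
    ],
)

# A user page and its talk page share the same title, so one indexed
# self-join matches them all at once.
for user_id, talk_id, title in conn.execute("""
    SELECT u.page_id, t.page_id, u.page_title
    FROM page AS u
    JOIN page AS t
      ON t.page_title = u.page_title
    WHERE u.page_namespace = 2
      AND t.page_namespace = 3
"""):
    print(title, user_id, talk_id)

With the (page_namespace, page_title) index in place, each talk-page lookup costs a logarithmic index probe rather than a pass over the whole dump.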
If you don't want to set up a database yourself, you might want to look into getting a toolserver account, if you don't have one. This would give you read access to a live replica of Wikipedia's database, which of course has all these indexes.
You don't even have to use a B-tree if that's beyond you. I just sort the titles and then use a binary search on them. Plenty fast even in Perl and JavaScript.
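A rough Python equivalent of that sort-and-search approach, using the standard bisect module (the titles below are hypothetical stand-ins for titles read from a dump):

import bisect

# Hypothetical titles pulled from a dump; sort them once up front.
talk_titles = sorted([
    "User_talk:Alice",
    "User_talk:Bob",
    "User_talk:Carol",
])

def has_talk_page(user_title):
    """Binary search the sorted talk-page titles: O(log n) per lookup."""
    target = user_title.replace("User:", "User_talk:", 1)
    i = bisect.bisect_left(talk_titles, target)
    return i < len(talk_titles) and talk_titles[i] == target

print(has_talk_page("User:Bob"))    # True
print(has_talk_page("User:Dave"))   # False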
Andrew Dunbar (hippietrail)