On 9 January 2011 02:05, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere <dvanliere@gmail.com> wrote:
Yes, manually matching is fairly simple, but in the worst case you need to iterate over n-1 talk pages (where n is the total number of talk pages on a Wikipedia) to find the talk page that belongs to a user page when using the dump files. Hence, if the dump file contained, for each article, a tag with the talk page id, it would significantly reduce the processing time.
You're expected to build indexes for things like this. If you import the data into MySQL, for instance, you can just do a join (since MediaWiki has good indexes by default). If you're writing data analysis code manually for some reason, load the data into an on-disk B-tree, and then your worst case is logarithmic. Without indexes, pretty much any operation on the data is going to take linear time. (In fact, so is lookup by page id, unless you're just doing a binary search on the dump file and assuming it's in id order . . .)
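To make that concrete, here is a minimal sketch in Python using SQLite rather than MySQL (the standard sqlite3 module keeps its tables and indexes in B-trees); the table and column names loosely follow MediaWiki's page table, namespaces 2 and 3 are User and User talk, and the sample rows are hypothetical stand-ins for data parsed out of a dump:

import sqlite3

# Pages parsed from the dump, stored as (page_id, namespace, title).
conn = sqlite3.connect("pages.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS page (
        page_id        INTEGER PRIMARY KEY,
        page_namespace INTEGER NOT NULL,
        page_title     TEXT    NOT NULL
    )
""")
# The index makes each title lookup a B-tree probe instead of a scan.
conn.execute("""
    CREATE INDEX IF NOT EXISTS ns_title ON page (page_namespace, page_title)
""")

# Hypothetical sample rows standing in for parsed dump data.
conn.executemany(
    "INSERT OR REPLACE INTO page VALUES (?, ?, ?)",
    [
        (101, 2, "Example_user"),   # User:Example_user
        (102, 3, "Example_user"),   # User_talk:Example_user
    ],
)

# A user page and its talk page share the same title, so one indexed
# self-join matches them all at once.
for user_id, talk_id, title in conn.execute("""
    SELECT u.page_id, t.page_id, u.page_title
    FROM page AS u
    JOIN page AS t
      ON t.page_title = u.page_title
    WHERE u.page_namespace = 2
      AND t.page_namespace = 3
"""):
    print(title, user_id, talk_id)

With the (page_namespace, page_title) index in place, each talk-page lookup costs a logarithmic index probe rather than a pass over the whole dump.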
If you don't want to set up a database yourself, you might want to look into getting a toolserver account, if you don't have one. This would give you read access to a live replica of Wikipedia's database, which of course has all these indexes.
You don't even have to use a B-tree if that's beyond you. I just sort the titles and then use a binary search on them. Plenty fast even in Perl and JavaScript.
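A rough Python equivalent of that sort-and-search approach, using the standard bisect module (the titles below are hypothetical stand-ins for titles read from a dump):

import bisect

# Hypothetical titles pulled from a dump; sort them once up front.
talk_titles = sorted([
    "User_talk:Alice",
    "User_talk:Bob",
    "User_talk:Carol",
])

def has_talk_page(user_title):
    """Binary search the sorted talk-page titles: O(log n) per lookup."""
    target = user_title.replace("User:", "User_talk:", 1)
    i = bisect.bisect_left(talk_titles, target)
    return i < len(talk_titles) and talk_titles[i] == target

print(has_talk_page("User:Bob"))    # True
print(has_talk_page("User:Dave"))   # False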
Andrew Dunbar (hippietrail)