Matching main namespace articles with associated talk page

Diederik van Liere

8 Jan 2011

10:29 a.m.

Dear devs,

I am wondering whether the MediaWiki database contains a foreign-key relationship between a main-namespace article and its associated talk page (if present). Having this information would greatly simplify analytic projects that monitor editor behaviour and study revert behaviour (among other topics).

Currently, I am matching these two sets of pages manually by title. I have two questions:

1) If this foreign key does not exist, would it be worthwhile to create it?
2) If this foreign key does exist, what would it take to expose it in the XML dumps?

Best regards,

Diederik

John

8 Jan 2011

10:32 a.m.

It's just a matter of matching page titles: if there is a page in namespace 0 and a page in namespace 1 (article and article talk) with the same title, they go together. It's fairly simple.

John

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Bryan Tong Minh

10:39 a.m.

On Sat, Jan 8, 2011 at 5:32 PM, John phoenixoverride@gmail.com wrote:

It's just a matter of matching page titles: if there is a page in namespace 0 and a page in namespace 1 (article and article talk) with the same title, they go together. It's fairly simple.

To expand John's comment, the talk page is always the page with the same title, but with a namespace number 1 higher.

Bryan
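
The rule Bryan describes (same title, namespace number one higher) can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the `(page_id, namespace, title)` tuples and the function name `pair_with_talk` are hypothetical, and the titles are assumed to be stored without their namespace prefix.

```python
from typing import Dict, Iterable, Optional, Tuple

PageKey = Tuple[int, str]  # (namespace number, title without namespace prefix)


def pair_with_talk(pages: Iterable[Tuple[int, int, str]]) -> Dict[int, Optional[int]]:
    """Map each main-namespace page ID to its talk page ID (or None).

    `pages` yields hypothetical (page_id, namespace, title) tuples.
    """
    pages = list(pages)
    # First pass: index every page by its (namespace, title) pair.
    by_key: Dict[PageKey, int] = {(ns, title): page_id for page_id, ns, title in pages}
    # Second pass: the talk page has the same title and namespace + 1.
    return {
        page_id: by_key.get((ns + 1, title))
        for page_id, ns, title in pages
        if ns == 0
    }


sample = [
    (10, 0, "Amsterdam"),
    (11, 1, "Amsterdam"),   # talk page of "Amsterdam"
    (12, 0, "Rotterdam"),   # no talk page
    (13, 2, "Amsterdam"),   # a page in another namespace, ignored here
]
pairs = pair_with_talk(sample)
```

Because the dictionary lookup takes constant time on average, matching all articles needs two passes over the page list rather than a scan of every talk page per article.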

Diederik van Liere

11:34 a.m.

Yes, manual matching is fairly simple, but when using the dump files you need, in the worst case, to iterate over n-1 talk pages (where n is the total number of talk pages of a Wikipedia) to find the talk page that belongs to a user page. Hence, if the dump file contained, for each article, a tag with the talk page ID, it would significantly reduce the processing time.

Diederik

Conrad Irwin

12:52 p.m.

The talk page's ID may change over time (as may the content page's ID, causing the talk page to "change" association) when pages are deleted and undeleted or moved. You should use the (namespace, page title) pair as the persistent unique identifier for associating talk pages with content pages. It's pretty easy to add an index on those columns, allowing you (at the cost of a bit of storage space) to look up pages in O(log n) time.

Conrad
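
As a rough sketch of Conrad's suggestion, the snippet below indexes the (namespace, title) pair and joins the table against itself. It uses Python's built-in `sqlite3` as a stand-in for whatever database the dump is loaded into, which is an assumption made for self-containment. The column names follow MediaWiki's `page` table; the index name and the sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real import
conn.execute(
    "CREATE TABLE page ("
    " page_id INTEGER PRIMARY KEY,"
    " page_namespace INTEGER NOT NULL,"
    " page_title TEXT NOT NULL)"
)
# Index on the (namespace, title) pair so each lookup is logarithmic.
conn.execute("CREATE UNIQUE INDEX name_title ON page (page_namespace, page_title)")

conn.executemany(
    "INSERT INTO page (page_id, page_namespace, page_title) VALUES (?, ?, ?)",
    [(10, 0, "Amsterdam"), (11, 1, "Amsterdam"), (12, 0, "Rotterdam")],
)

# Self-join: an article's talk page has the same title and namespace + 1.
# LEFT JOIN keeps articles that have no talk page (talk ID is NULL).
rows = conn.execute(
    "SELECT a.page_title, a.page_id, t.page_id"
    " FROM page AS a"
    " LEFT JOIN page AS t"
    "   ON t.page_namespace = a.page_namespace + 1"
    "  AND t.page_title = a.page_title"
    " WHERE a.page_namespace = 0"
    " ORDER BY a.page_title"
).fetchall()
```

Keying on (namespace, title) rather than on page IDs keeps the association valid across the deletions, undeletions, and moves mentioned above.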

Aryeh Gregor

6:05 p.m.

On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere dvanliere@gmail.com wrote:

Yes, manual matching is fairly simple, but when using the dump files you need, in the worst case, to iterate over n-1 talk pages (where n is the total number of talk pages of a Wikipedia) to find the talk page that belongs to a user page. Hence, if the dump file contained, for each article, a tag with the talk page ID, it would significantly reduce the processing time.

You're expected to build indexes for things like this. If you import the data into MySQL, for instance, you can just do a join (since MediaWiki has good indexes by default). If you're writing data analysis code manually for some reason, load the data into an on-disk B-tree, and then your worst case is logarithmic. Without indexes, pretty much any operation on the data is going to take linear time. (In fact, so is lookup by page id, unless you're just doing a binary search on the dump file and assuming it's in id order . . .)

If you don't want to set up a database yourself, you might want to look into getting a toolserver account, if you don't have one. This would allow you read access to a live replica of Wikipedia's database, which of course has all these indexes.

Andrew Dunbar

9 Jan 2011

7:34 a.m.

You don't even have to use a B-tree if that's beyond you. I just sort the titles and then run a binary search on them. That is plenty fast, even in Perl and JavaScript.

Andrew Dunbar (hippietrail)
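
The sort-then-binary-search approach Andrew mentions might look like the following Python sketch (he names Perl and JavaScript; Python is used here only for illustration). The helper names `build_index` and `find_page_id` and the sample tuples are hypothetical.

```python
import bisect
from typing import Iterable, List, Optional, Tuple

IndexEntry = Tuple[int, str, int]  # (namespace, title, page_id)


def build_index(pages: Iterable[Tuple[int, int, str]]) -> List[IndexEntry]:
    """Sort hypothetical (page_id, namespace, title) tuples by (namespace, title)."""
    return sorted((ns, title, page_id) for page_id, ns, title in pages)


def find_page_id(index: List[IndexEntry], ns: int, title: str) -> Optional[int]:
    """Binary-search the sorted index for an exact (namespace, title) match."""
    # The 2-tuple (ns, title) sorts just before any (ns, title, page_id) entry,
    # so bisect_left lands on the matching entry if one exists.
    i = bisect.bisect_left(index, (ns, title))
    if i < len(index) and index[i][:2] == (ns, title):
        return index[i][2]
    return None


index = build_index([
    (12, 0, "Rotterdam"),
    (11, 1, "Amsterdam"),
    (10, 0, "Amsterdam"),
])
talk_id = find_page_id(index, 1, "Amsterdam")      # talk page exists
missing = find_page_id(index, 1, "Rotterdam")      # no talk page
```

Sorting costs O(n log n) once; every lookup afterwards is O(log n), with no database required.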

Ashar Voultoiz

8 Jan 2011

3:59 p.m.

On 08/01/11 17:29, Diederik van Liere wrote:

I am wondering whether the MediaWiki database contains a foreign-key relationship between a main-namespace article and its associated talk page (if present).

We do not have any foreign keys in the database schema. Constraints are handled at the application level (that is, by MediaWiki).

Having this information would greatly simplify analytic projects that monitor editor behaviour and study revert behaviour (among other topics).

Currently, I am matching these two sets of pages manually by title.

That is how you have to do it: find the page with the same title in the associated talk namespace. Core namespaces are numbered between 0 and 99; by convention, the odd-numbered ones are the talk namespaces.

I have two questions:

If this foreign key does not exist, would it be worthwhile to create it?

I do not think we want to add database constraints. It is probably a good thing in principle, but I am almost sure it would break the software in a lot of different and "interesting" ways.

If this foreign key does exist, what would it take to expose this in the XML dumps?

Maybe we can change the XML format. Adding a new field indicating whether a talk page exists for the given page should be trivial.

-- Ashar Voultoiz
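
The numbering convention Ashar describes (even namespaces hold content, odd ones are the matching talk namespaces) can be captured in a few small helpers. This is a minimal sketch with hypothetical function names; it also assumes that negative namespace numbers, such as the one used for special pages, have no talk counterpart.

```python
def is_talk_namespace(ns: int) -> bool:
    """Odd, non-negative namespace numbers are talk namespaces by convention."""
    # The ns >= 0 guard matters: in Python, -1 % 2 == 1.
    return ns >= 0 and ns % 2 == 1


def talk_namespace(ns: int) -> int:
    """Return the talk namespace that belongs to namespace `ns`."""
    if ns < 0:
        # Assumption: negative (virtual) namespaces have no talk pages.
        raise ValueError(f"namespace {ns} has no talk namespace")
    return ns if is_talk_namespace(ns) else ns + 1


def subject_namespace(ns: int) -> int:
    """Return the content namespace that belongs to namespace `ns`."""
    return ns - 1 if is_talk_namespace(ns) else ns
```

For example, `talk_namespace(0)` gives 1 (article talk), and `subject_namespace(3)` gives 2.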
