Dear devs,
I am wondering whether the MediaWiki database contains a foreign key relationship between a main-namespace article and its associated talk page (if present). Having this information would greatly simplify analytic projects that monitor editor behaviour and study revert behaviour (among other topics).
Currently, I am manually matching these two sets of pages by matching titles. I have two questions: 1) If this foreign key does not exist, would it be worthwhile to create it? 2) If this foreign key does exist, what would it take to expose it in the XML dumps?
Best regards,
Diederik
It's just a matter of matching page titles: if there is a page in namespace 0 and a page in namespace 1 (article and article talk) with the same title, they go together. It's fairly simple.
John
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Sat, Jan 8, 2011 at 5:32 PM, John phoenixoverride@gmail.com wrote:
It's just a matter of matching page titles: if there is a page in namespace 0 and a page in namespace 1 (article and article talk) with the same title, they go together. It's fairly simple.
To expand John's comment, the talk page is always the page with the same title, but with a namespace number 1 higher.
Bryan
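A minimal sketch of that rule in Python, assuming each page is available as a (page_id, namespace, title) tuple. That layout is a hypothetical in-memory form, not the dump format itself, and the function name pair_with_talk_pages is illustrative only. Keying a dictionary on (namespace, title) finds every talk page in one pass, without rescanning the list for each article.

```python
def pair_with_talk_pages(pages):
    """Map each subject page ID to its talk page ID (or None if absent).

    `pages` is an iterable of (page_id, namespace, title) tuples.
    Assumes even, non-negative namespaces are subject namespaces and
    that the talk namespace is the subject namespace plus one.
    """
    by_key = {(ns, title): page_id for page_id, ns, title in pages}
    pairs = {}
    for (ns, title), page_id in by_key.items():
        if ns >= 0 and ns % 2 == 0:
            # Same title, namespace number one higher.
            pairs[page_id] = by_key.get((ns + 1, title))
    return pairs


# Made-up sample data for illustration.
sample_pages = [
    (10, 0, "Amsterdam"),
    (11, 1, "Amsterdam"),
    (12, 0, "Berlin"),      # no talk page yet
    (13, 2, "Example"),
    (14, 3, "Example"),
]
pairs = pair_with_talk_pages(sample_pages)
```

Building the dictionary is linear in the number of pages, and each lookup is a hash probe, so the whole pairing takes a single pass.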
Yes, manual matching is fairly simple, but in the worst case you need to iterate over n-1 talk pages (where n is the total number of talk pages of a Wikipedia) to find the talk page that belongs to a user page when using the dump files. Hence, if the dump file contained a tag with the talk page ID for each article, it would significantly reduce the processing time.
Diederik
The talk page's ID may change over time (as may the content page's ID, causing the talk page to "change" association) when pages are deleted and undeleted, or moved. You should use the (namespace, page title) pair as the persistent unique identifier for associating talk pages with content pages. It's pretty easy to add an index on those columns, allowing you (at the cost of a bit of storage space) to look up pages in O(log n) time.
Conrad
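As a rough illustration of the index idea, here is a sketch using Python's built-in sqlite3 module with an in-memory database standing in for a real import. The column names follow MediaWiki's page table (page_id, page_namespace, page_title). The sample rows, the index name, and the helper find_talk_page_id are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    " page_id INTEGER PRIMARY KEY,"
    " page_namespace INTEGER NOT NULL,"
    " page_title TEXT NOT NULL)"
)
# Made-up sample rows.
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(10, 0, "Amsterdam"), (11, 1, "Amsterdam"), (12, 0, "Berlin")],
)
# Index on the (namespace, title) pair so lookups need no full scan.
conn.execute(
    "CREATE UNIQUE INDEX name_title ON page (page_namespace, page_title)"
)


def find_talk_page_id(conn, namespace, title):
    """Return the ID of the talk page for (namespace, title), or None."""
    row = conn.execute(
        "SELECT page_id FROM page"
        " WHERE page_namespace = ? AND page_title = ?",
        (namespace + 1, title),
    ).fetchone()
    return row[0] if row else None
```

Because the lookup goes through the (namespace, title) pair rather than a stored talk page ID, it stays valid when IDs change after a delete, undelete, or move.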
On Sat, Jan 8, 2011 at 12:34 PM, Diederik van Liere dvanliere@gmail.com wrote:
Yes, manually matching is fairly simple but in the worst case you need to iterate over n-1 talk pages (where n is the total number of talk pages of a Wikipedia) to find the talk page that belongs to a user page when using the dump files. Hence, if the dump file would contain for each article a tag with talk page id then it would significantly reduce the processing time.
You're expected to build indexes for things like this. If you import the data into MySQL, for instance, you can just do a join (since MediaWiki has good indexes by default). If you're writing data analysis code manually for some reason, load the data into an on-disk B-tree, and then your worst case is logarithmic. Without indexes, pretty much any operation on the data is going to take linear time. (In fact, so is lookup by page id, unless you're just doing a binary search on the dump file and assuming it's in id order . . .)
If you don't want to set up a database yourself, you might want to look into getting a toolserver account, if you don't have one. This would allow you read access to a live replica of Wikipedia's database, which of course has all these indexes.
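The join mentioned above might look roughly like the following sketch. It uses Python's sqlite3 module as a stand-in for MySQL, with a table shaped like MediaWiki's page table. The sample rows, the index name, and the function name articles_with_talk are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    " page_id INTEGER PRIMARY KEY,"
    " page_namespace INTEGER NOT NULL,"
    " page_title TEXT NOT NULL)"
)
conn.execute(
    "CREATE UNIQUE INDEX name_title ON page (page_namespace, page_title)"
)
# Made-up sample rows: Berlin has no talk page.
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, 0, "Amsterdam"), (2, 1, "Amsterdam"), (3, 0, "Berlin")],
)


def articles_with_talk(conn):
    """Return (article_id, title, talk_id) rows; talk_id is None if missing."""
    return conn.execute(
        """
        SELECT a.page_id, a.page_title, t.page_id
        FROM page AS a
        LEFT JOIN page AS t
          ON t.page_namespace = a.page_namespace + 1
         AND t.page_title = a.page_title
        WHERE a.page_namespace = 0
        ORDER BY a.page_id
        """
    ).fetchall()


rows = articles_with_talk(conn)
```

A LEFT JOIN is used here so that articles without a talk page still appear in the result, with an empty talk ID.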
On 9 January 2011 02:05, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
You're expected to build indexes for things like this. If you import the data into MySQL, for instance, you can just do a join (since MediaWiki has good indexes by default). If you're writing data analysis code manually for some reason, load the data into an on-disk B-tree, and then your worst case is logarithmic. Without indexes, pretty much any operation on the data is going to take linear time. (In fact, so is lookup by page id, unless you're just doing a binary search on the dump file and assuming it's in id order . . .)
If you don't want to set up a database yourself, you might want to look into getting a toolserver account, if you don't have one. This would allow you read access to a live replica of Wikipedia's database, which of course has all these indexes.
You don't even have to use a B-tree if that's beyond you. I just sort the titles and then use a binary search on them. Plenty fast even in Perl and JavaScript.
Andrew Dunbar (hippietrail)
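A sketch of the sort-then-binary-search approach, written in Python with the standard bisect module rather than Perl or JavaScript. The (page_id, namespace, title) tuples and the function names are hypothetical and chosen only for this example.

```python
from bisect import bisect_left


def build_sorted_keys(pages):
    """Sort pages by (namespace, title), carrying the page ID along."""
    return sorted((ns, title, page_id) for page_id, ns, title in pages)


def lookup_page_id(sorted_keys, namespace, title):
    """Binary-search for (namespace, title); return the page ID or None."""
    # A 2-tuple sorts just before any 3-tuple that starts with it.
    i = bisect_left(sorted_keys, (namespace, title))
    if i < len(sorted_keys) and sorted_keys[i][:2] == (namespace, title):
        return sorted_keys[i][2]
    return None


# Made-up sample data for illustration.
sample_pages = [
    (12, 0, "Berlin"),
    (11, 1, "Amsterdam"),
    (10, 0, "Amsterdam"),
]
keys = build_sorted_keys(sample_pages)
```

Sorting costs O(n log n) once, and each lookup afterwards is O(log n). The talk page for an article is then lookup_page_id(keys, 1, title).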
On 08/01/11 17:29, Diederik van Liere wrote:
I am wondering whether the Mediawiki db contains a foreignkey relationship between a main namespace article and the associated talk page (if present).
We do not have any foreign keys in the database schema. Constraints are handled at the application level (read: MediaWiki).
Having this information would greatly simplify analytic projects to monitor editor behaviour and understanding revert behaviour (among other topics).
Currently, I am manually matching these two sets of pages by matching titles.
That is how you have to do it: find a page with the same title in the associated talk namespace. Core namespaces are given a number between 0 and 99; by convention, the odd ones are the talk namespaces.
I have two questions:
- If this foreignkey does not exist, would it be worthwhile to create it?
I do not think we want to add database constraints. It is probably a good thing but I am almost sure it will break the software in a lot of different and "interesting" ways.
- If this foreignkey does exist, what would it take to expose this in the XML dumps?
Maybe we can change the XML format. Adding a new field indicating there is a talk page for the given page should be trivial.