I've been wondering about adding some kind of parser to the python wikipedia project such that it knows how to handle transwiki links as well as trans-interwiki links. This email contains my thoughts on the matter; please feel free to correct me, add to them, or elaborate on anything further. I felt it necessary to send this email so that we could all agree on several points, which will help me or someone else pin down the problem properly.
----
I've come to make some assumptions:
- Each wikilink will typically not use more than 2-3 colons; anything above that is usually redundant.
Example: [[q:fr:w:en:Test]] on the English Wikipedia does process properly; following such a link will take you to the English Wikiquote, then to the French Wikiquote, then back to the English Wikipedia.
Problem: Is there a simple and easy way to test for a namespace?
Discussion: Not easily... At least, I think so.
- The last part of the wikilink is ALWAYS the article title.
----
Most language codes are two characters long, but there are some exceptions: als, ang, arc, ast, cho, chr, chy, csb, fur, haw, jbo, mus, nah, nds, roa-rup, simple, tlh, tokipona, tpi, tum, zh-min-nan, zh-cn, zh-tw, minnan & zh-cfr
I noticed in the family file there is another one: bug. There may be more on that list that I'm unaware of... It's likely safe to assume the same language codes apply across the other project families, even if those wikis may not exist yet. If they don't, they will likely exist in the future. A quick check might look like the sketch below.
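Just to make that concrete, something like this (the KNOWN_CODES table and the looks_like_language_code() name are made up for illustration; in practice the data should come from the family files, not be hard-coded):

    # Known non-two-letter codes, mirroring the list above.
    KNOWN_CODES = set([
        'als', 'ang', 'arc', 'ast', 'cho', 'chr', 'chy', 'csb', 'fur',
        'haw', 'jbo', 'mus', 'nah', 'nds', 'roa-rup', 'simple', 'tlh',
        'tokipona', 'tpi', 'tum', 'zh-min-nan', 'zh-cn', 'zh-tw',
        'minnan', 'zh-cfr', 'bug',
    ])

    def looks_like_language_code(prefix):
        # Most codes are exactly two ASCII letters; the rest are in the
        # exception list above.
        prefix = prefix.strip().lower()
        return prefix in KNOWN_CODES or (len(prefix) == 2 and prefix.isalpha())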
----
Okay, here is a list of cases, based on colon count:
1 colon:
1) Article namespace with no leading character in front
2) Interwiki link
3) Namespace preceding the colon
2 colons:
1a) Interwiki link + Namespace (untranslated, in English)
1b) Interwiki link + Namespace (translated into the interwiki link's language)
2) Transwiki link + Interwiki link
3) Transwiki link + Namespace (may have to consider different names for the "Project" namespace)
3 colons:
1) Transwiki link + Interwiki link + Namespace (translated/untranslated)
2) Interwiki link + Transwiki link + Interwiki link (stupid, but possible)
3) Interwiki link + Transwiki link + Namespace link (stupid, but possible)
4) Transwiki link + Interwiki link + Interwiki link (stupid, but possible)
5) Transwiki link x3 (stupid, but possible)
6) Transwiki link + Transwiki link + Namespace link (stupid, but possible)
4+ colons: Any combination of the above
Possible solution(s):
- Create a function specifically to identify transwiki links
- Create a function specifically to identify interwiki links, based on the transwiki link information
- Create a function specifically to identify namespace links, based on the transwiki and interwiki link information
- Develop a class that uses the information from http://meta.wikimedia.org/wiki/Interwiki_map
- Develop a class that handles conversion only for the currently available families -- ignore the rest
A rough sketch of the first three functions follows this list.
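Something along these lines (all of the names and both lookup tables are hypothetical placeholders; the real data would come from the interwiki map and the family files, and namespace names differ per language, so the namespace table would have to be per-site):

    # Hypothetical lookup tables for illustration only.
    TRANSWIKI_PREFIXES = set(['w', 'q', 'wikt', 'b', 's', 'm', 'meta'])
    NAMESPACE_NAMES = set(['talk', 'user', 'user talk', 'wikipedia',
                           'image', 'category', 'template', 'help'])

    def is_transwiki(prefix):
        return prefix.strip().lower() in TRANSWIKI_PREFIXES

    def is_interwiki(prefix):
        # Reuses looks_like_language_code() from the sketch further up.
        return looks_like_language_code(prefix)

    def is_namespace(prefix):
        return prefix.strip().lower() in NAMESPACE_NAMES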
----
If we split anything between '[[' and ']]' using ':' as the separator, we know the following to be true:
If the list has a size of 1, then the link contains no interwiki links, no category links, and no transwiki links. We also know that [0] is the name of the article.
No matter how the split turns out, index -1 will always point to the name of the article.
Now, the matter is: In what order should we proceed?
Should we scan forwards or backwards?
In what order should we look for links?
1) Transwiki, interwiki, namespace
2) Namespace, interwiki, transwiki
3) Interwiki, namespace, transwiki
etc. One possible forward scan is sketched below.
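Here is what a forward scan might look like, reusing the hypothetical helpers from the earlier sketch (classify_link() is a made-up name, and the transwiki-before-interwiki-before-namespace order is just one of the choices listed above):

    def classify_link(body):
        # body is the text between '[[' and ']]'. Scan forwards, peeling
        # off known prefixes; whatever is left is the article title.
        parts = body.split(':')
        prefixes = []
        for i, part in enumerate(parts[:-1]):
            if is_transwiki(part):
                prefixes.append(('transwiki', part))
            elif is_interwiki(part):
                prefixes.append(('interwiki', part))
            elif is_namespace(part):
                prefixes.append(('namespace', part))
            else:
                # Unknown prefix: this colon belongs to the title itself.
                return prefixes, ':'.join(parts[i:])
        return prefixes, parts[-1]

    # classify_link('q:fr:w:en:Test') ->
    #   ([('transwiki', 'q'), ('interwiki', 'fr'),
    #     ('transwiki', 'w'), ('interwiki', 'en')], 'Test')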
----
One thing is for certain: the regular expression for this will be extremely long. If we do manage to resolve this, then our parser for wikilinks should be able to handle anything we throw at it, and it would make any related bugs regarding linkedPages() and getRedirectPage() easier to fix. One thing I have a problem with is that getRedirectPage() returns a string object rather than a Page object. But it is obvious why it should return a string: the target could fall into any of the situations I've described above.
The principal reason I'm concerned about this matter is that I'm in the process of developing a notification bot. Unfortunately, I've run into several user pages whose owners have, in their wisdom, decided to redirect them to either a different project or a different language, and sometimes a combination of both. So I've been thinking of a way to properly parse the information from getRedirectPage() such that I can pass the correct parameters to the Site class; a rough sketch follows.
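Roughly what I have in mind, reusing classify_link() from above (the function name is made up, and the "last prefix wins" rule is my inference from how the chained [[q:fr:w:en:Test]] example resolves):

    def site_params_from_redirect(target):
        # target is the string returned by getRedirectPage(). The last
        # project prefix and the last language prefix determine the
        # site the chain finally lands on.
        prefixes, title = classify_link(target)
        family, lang = None, None
        for kind, part in prefixes:
            if kind == 'transwiki':
                family = part
            elif kind == 'interwiki':
                lang = part
        return family, lang, title

    # site_params_from_redirect('q:fr:w:en:Test') -> ('w', 'en', 'Test')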
Thoughts, anyone?
As I did some statistical analysis of dump files, I can provide you a small update which may be valuable when implementing title parsing. It is not enough to "split anything between '[[' and ']]' using the ':'" - first of all you need to split on '|' and remove any information after the first '|'. After retrieving an article title, you must also check whether it contains '#', as a section name may appear after the title. Note that '#' may also appear as part of a numeric character reference, like '&#12522;' for 'リ', so those must not be mistaken for section separators. In code, the cleanup might look roughly like the sketch below.
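Just a sketch; clean_link_target() is a made-up name for this cleanup step:

    import re

    # Numeric character references like '&#12522;' (リ) contain a '#'
    # that must not be mistaken for a section separator.
    CHAR_REF = re.compile(r'&#\d+;')

    def clean_link_target(body):
        # 1) Everything after the first '|' is display text, not target.
        target = body.split('|', 1)[0]
        # 2) Hide '#' inside character references, then split off the
        #    section name at the first remaining '#'.
        hidden = CHAR_REF.sub(lambda m: m.group(0).replace('#', '\x00'),
                              target)
        hidden = hidden.split('#', 1)[0]
        return hidden.replace('\x00', '#').strip()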
Hope this will help you in any way, Andrius
It looks like you're working within the existing regular expression and string methods. My citations bot has somewhat complex needs, so I've been looking at some more powerful approaches to parsing. At the moment I think mxTextTools looks useful, with parsing.py being less applicable. However, as in my case it will be useful to parse much of the WikiSyntax... has someone already looked at whether MediaWiki's parser can be accessed by the pywikipedia tools?
We could allow the site to parse for us, but that would involve getting http://en.wikipedia.org/wiki/fr:User:AllyUnion instead of http://en.wikipedia.org/w/index.php?title=fr%3AUser%3AUnion&action=e... (the edit URL doesn't follow the redirect).
Alternatively, we could parse the relevant parts of the MediaWiki files and get our information from that... For concreteness, the two URL styles are sketched below.
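A sketch only; quote_title() is a crude stand-in for proper URL escaping:

    # Fetching the first URL lets the server resolve the prefix chain
    # for us; the second (action=edit) returns the raw wikitext without
    # following the redirect.
    def quote_title(title):
        return title.replace(' ', '_').replace(':', '%3A')

    def render_url(host, title):
        return 'http://%s/wiki/%s' % (host, title.replace(' ', '_'))

    def edit_url(host, title):
        return ('http://%s/w/index.php?title=%s&action=edit'
                % (host, quote_title(title)))

    # edit_url('en.wikipedia.org', 'fr:User:AllyUnion') ->
    # 'http://en.wikipedia.org/w/index.php?title=fr%3AUser%3AAllyUnion&action=edit'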
---- Jason Y. Lee