I've been wondering about some kind of parser to add to the python wikipedia project so that it knows how to handle transwiki links as well as trans-interwiki links. This email contains my thoughts on the matter; please feel free to correct me, add to them, or elaborate further. I felt it necessary to send this email so that we can all agree on several points, which will help me or someone else pin down the problem properly.
----
I've come to realize some assumptions:
* A wikilink will not typically use more than 2-3 colons; anything above that is usually redundant.
Example: [[q:fr:w:en:Test]] on the English Wikipedia does process properly, and following such a link will take you to the English Wikiquote, then to the French Wikiquote, then back to the English Wikipedia.
Problem: Is there a simple and easy way to test for a namespace?
Discussion: Not easily, at least as far as I can tell.
* The last part of the wikilink is ALWAYS the article title.
----
Most language codes are two characters long, but there are some exceptions: als, ang, arc, ast, cho, chr, chy, csb, fur, haw, jbo, mus, nah, nds, roa-rup, simple, tlh, tokipona, tpi, tum, zh-min-nan, zh-cn, zh-tw, minnan, and zh-cfr.
I noticed in the family file there is another one: bug. There may be more on that list that I'm unaware of... It's likely safe to assume the same set of languages for the other projects, even if some of those wikis may not exist yet. If they don't, they will likely exist in the future.
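A minimal sketch of how a prefix could be tested against these codes. The set holds only the exceptions named above (it is not exhaustive), the two-letter rule is just the heuristic from this email, and `is_language_code` is a hypothetical helper, not something that exists in the framework:

```python
# Codes longer than two letters, copied from the list above (not exhaustive).
LONG_LANGUAGE_CODES = {
    "als", "ang", "arc", "ast", "bug", "cho", "chr", "chy", "csb", "fur",
    "haw", "jbo", "mus", "nah", "nds", "roa-rup", "simple", "tlh",
    "tokipona", "tpi", "tum", "zh-min-nan", "zh-cn", "zh-tw", "minnan",
    "zh-cfr",
}


def is_language_code(prefix):
    """Guess whether a link prefix is a language code (heuristic only)."""
    code = prefix.strip().lower()
    if code in LONG_LANGUAGE_CODES:
        return True
    # Assumption: any two-letter ASCII prefix is treated as a language code.
    return len(code) == 2 and code.isascii() and code.isalpha()
```

A real implementation would probably read the codes from the family files instead of guessing from the length.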
----
Okay, here is a list of cases, based on colon count:
1 colon:
  1) Article namespace with no leading character in front
  2) Interwiki link
  3) Namespace preceding the colon
2 colons:
  1a) Interwiki link + Namespace (untranslated in English)
  1b) Interwiki link + Namespace (translated in the Interwiki link's language)
  2) Transwiki link + Interwiki link
  3) Transwiki link + Namespace (may have to consider different names for the "Project" namespace)
3 colons:
  1) Transwiki link + Interwiki link + Namespace (translated/untranslated)
  2) Interwiki link + Transwiki link + Interwiki link (stupid, but possible)
  3) Interwiki link + Transwiki link + Namespace link (stupid, but possible)
  4) Transwiki link + Interwiki link + Interwiki link (stupid, but possible)
  5) Transwiki link X3 (stupid, but possible)
  6) Transwiki link + Transwiki link + Namespace link (stupid, but possible)
4+ colons: Any combination above
Possible solution(s):
* Create a function specifically to determine transwiki links
* Create a function specifically to determine interwiki links based on transwiki link information
* Create a function specifically to determine namespace links based on transwiki and interwiki link information
* Develop a class that uses the information from: http://meta.wikimedia.org/wiki/Interwiki_map
* Develop a class for conversion only for the currently available families -- ignore the rest
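As a rough illustration of the first three ideas, the three checks could be folded into one classifier for a single prefix. The tables are tiny hand-written samples I made up for the sketch (a real version would be built from the interwiki map and the family files), and `classify_prefix` is a hypothetical name:

```python
# Small illustrative tables only; not the real interwiki map.
TRANSWIKI_PREFIXES = {"w": "wikipedia", "q": "wikiquote", "wikt": "wiktionary"}
LANGUAGE_CODES = {"en", "fr", "de", "simple", "zh-min-nan"}
NAMESPACES = {"user", "talk", "category", "template", "image"}


def classify_prefix(prefix):
    """Return 'transwiki', 'interwiki', 'namespace' or 'unknown'."""
    key = prefix.strip().lower()
    if key in TRANSWIKI_PREFIXES:
        return "transwiki"
    if key in LANGUAGE_CODES:
        return "interwiki"
    # Namespaces may be written with underscores instead of spaces.
    if key.replace("_", " ") in NAMESPACES:
        return "namespace"
    return "unknown"
```

The order of the checks here is arbitrary. Translated namespace names (case 1b above) would need a per-language table, which this sketch does not attempt.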
----
If we split anything between '[[' and ']]' using the ':' as the separator, we know the following to be true:
If the list has a size of 1, then it has no interwiki links, no category links, and no transwiki links. We also know that [0] is the name of the article.
No matter how the split turns out, index -1 will always point to the name of the article.
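The split itself could look something like this. It is only a sketch: `split_wikilink` is a hypothetical helper, and dropping a "|label" part is my own assumption, since piped links are not discussed above:

```python
import re

# Non-greedy match of whatever sits between '[[' and ']]'.
WIKILINK = re.compile(r"\[\[(.*?)\]\]")


def split_wikilink(text):
    """Return the colon-separated parts of the first [[...]] link in text."""
    match = WIKILINK.search(text)
    if match is None:
        return []
    # Assumption: anything after a pipe is display text, not the target.
    target = match.group(1).split("|", 1)[0]
    return [part.strip() for part in target.split(":")]
```

For example, `split_wikilink("[[q:fr:w:en:Test]]")` gives `['q', 'fr', 'w', 'en', 'Test']`, with the title at index -1. One caveat: a title that itself contains a colon gets split as well, in which case index -1 only holds the tail of the title.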
Now, the matter is: In what order should we proceed?
Should we scan forwards or backwards?
In what order should we look for links?
  1) Transwiki, interwiki, namespace
  2) namespace, interwiki, transwiki
  3) interwiki, namespace, transwiki
  etc.
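To make the question concrete, here is one possible answer sketched as a forward scan that tries transwiki, then interwiki, then namespace at each position. It is not a claim that this is the right order. The lookup tables are tiny made-up samples and `resolve_link` is a hypothetical name:

```python
# Illustrative samples only; not the real interwiki map.
TRANSWIKI_PREFIXES = {"w": "wikipedia", "q": "wikiquote"}
LANGUAGE_CODES = {"en", "fr", "de"}
NAMESPACES = {"user", "talk", "category"}


def resolve_link(parts, family="wikipedia", lang="en"):
    """Scan prefixes left to right; return (family, lang, namespace, title)."""
    namespace = None
    index = 0
    last = len(parts) - 1  # the final part is always kept as title text
    while index < last:
        key = parts[index].strip().lower()
        if key in TRANSWIKI_PREFIXES:
            family = TRANSWIKI_PREFIXES[key]
        elif key in LANGUAGE_CODES:
            lang = key
        elif key in NAMESPACES:
            namespace = parts[index]
            index += 1
            break  # nothing after a namespace is treated as a prefix
        else:
            break  # unknown prefix: assume the title itself contains a colon
        index += 1
    title = ":".join(parts[index:])
    return family, lang, namespace, title
```

Scanning forwards mirrors how the link is followed hop by hop: `resolve_link(['q', 'fr', 'w', 'en', 'Test'])` ends up at `('wikipedia', 'en', None, 'Test')`, i.e. back on the English Wikipedia as in the example earlier. A backward scan would have to guess where the title begins before it knows which wiki's namespace names apply.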
----
One thing is for certain: the regular expression for this will be extremely long. If we do manage to resolve this, then our parser for wikilinks should be able to handle anything we throw at it, which would make any related bugs in linkedPages() and getRedirectPage() easier to fix. One thing I have a problem with is that getRedirectPage() returns a string object rather than a Page object. But it is obvious why it should return a string object: the target could be in any of the situations I've described above.
The principal reason I'm concerned about this matter is that I'm in the process of developing a Notification bot. Unfortunately, I've run into several users who have, in their wisdom, decided to redirect their user pages to a different project, a different language, or sometimes a combination of both. So I've been thinking of a way to properly parse the information from getRedirectPage() so that I can pass the correct parameters to the Site class.
Thoughts, anyone?