I've been thinking about adding a parser to the Python Wikipedia project
that knows how to handle transwiki links as well as trans-interwiki links.
This email contains my thoughts on the matter; please feel free to correct
me, add to it, or elaborate on anything.
I felt it necessary to send this email so that we can agree on several
points, which will help me or someone else pin down the problem properly.
----
I've come to realize some assumptions:
* A wikilink will not typically use more than 2-3 colons; anything beyond
that is usually redundant.
Example: [[q:fr:w:en:Test]] on the English Wikipedia does process properly,
and following such a link will take you to the English Wikiquote then to
the French Wikiquote then back to the English Wikipedia.
Problem: Is there a simple and easy way to test for a namespace?
Discussion: Not easily... At least, I think so.
* The last part of the wikilink is ALWAYS the article title.
----
Most language codes are two characters long, but there are some exceptions:
als, ang, arc, ast, cho, chr, chy, csb, fur, haw, jbo, mus, nah, nds, roa-rup,
simple, tlh, tokipona, tpi, tum, zh-min-nan, zh-cn, zh-tw, minnan,
& zh-cfr
I noticed in the family file there is another one: bug
There may be more on that list that I'm unaware of... It's likely safe to
assume the same languages for the other project families, even though some
may not exist yet. If they don't, they will likely exist in the future.
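A minimal sketch of how such a language-code check might look. The names
LONG_CODES and is_language_code are hypothetical (not part of the framework),
the exception set is simply copied from the list above, and treating any two
lowercase letters as a language code is an assumption that does not verify
the wiki actually exists:

```python
import re

# Codes longer than two characters, copied from the list in this email.
LONG_CODES = {
    "als", "ang", "arc", "ast", "bug", "cho", "chr", "chy", "csb", "fur",
    "haw", "jbo", "mus", "nah", "nds", "roa-rup", "simple", "tlh",
    "tokipona", "tpi", "tum", "zh-min-nan", "zh-cn", "zh-tw", "minnan",
    "zh-cfr",
}

# Assumption: every two-letter lowercase prefix is a language code.
TWO_LETTER = re.compile(r"[a-z]{2}")


def is_language_code(prefix):
    """Return True if prefix looks like a language code."""
    code = prefix.strip().lower()
    return bool(TWO_LETTER.fullmatch(code)) or code in LONG_CODES
```

A single-letter prefix such as "q" is rejected here, which leaves it free to
be tested as a transwiki prefix instead.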
----
Okay, here is a list of cases, based on colon count:
1 colon:
1) Article namespace with no leading character in front
2) Interwiki link
3) Namespace preceding the colon
2 colons:
1a) Interwiki link + Namespace (untranslated in English)
1b) Interwiki link + Namespace (translated in the Interwiki link's language)
2) Transwiki link + Interwiki link
3) Transwiki link + Namespace (may have to consider different names for
the "Project" namespace)
3 colons:
1) Transwiki link + Interwiki link + Namespace (translated/untranslated)
2) Interwiki link + Transwiki link + Interwiki link (stupid, but possible)
3) Interwiki link + Transwiki link + Namespace link (stupid, but possible)
4) Transwiki link + Interwiki link + Interwiki link (stupid, but possible)
5) Transwiki link X3 (stupid, but possible)
6) Transwiki link + Transwiki link + Namespace link (stupid, but possible)
4+ colons:
Any combination above
Possible solution(s):
* Create a function specifically to determine transwiki links
* Create a function specifically to determine interwiki links based on
transwiki link information
* Create a function specifically to determine namespace links based on
transwiki and interwiki link information
* Develop a class that uses the information from:
http://meta.wikimedia.org/wiki/Interwiki_map
* Develop a class for conversion only for the current available families
-- ignore the rest
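The first three ideas above could be sketched roughly as follows. The tables
are tiny illustrative subsets that I typed in by hand, and every name here
(TRANSWIKI, LANGUAGES, NAMESPACES, classify, and the is_* helpers) is
hypothetical; a real version would presumably fill the tables from the
Interwiki map and the family files:

```python
# Illustrative subsets only, not the real tables.
TRANSWIKI = {"w": "wikipedia", "q": "wikiquote", "b": "wikibooks",
             "s": "wikisource", "wikt": "wiktionary", "m": "meta"}
LANGUAGES = {"en", "fr", "de", "simple", "zh-min-nan"}
# English namespace names only; translated names would need one table
# per language.
NAMESPACES = {"talk", "user", "user talk", "template", "category", "image"}


def is_transwiki(prefix):
    return prefix.strip().lower() in TRANSWIKI


def is_interwiki(prefix):
    return prefix.strip().lower() in LANGUAGES


def is_namespace(prefix):
    # Assumption: comparing lowercased names is good enough here.
    return prefix.strip().lower() in NAMESPACES


def classify(prefix):
    """Return 'transwiki', 'interwiki', 'namespace' or 'unknown'."""
    if is_transwiki(prefix):
        return "transwiki"
    if is_interwiki(prefix):
        return "interwiki"
    if is_namespace(prefix):
        return "namespace"
    return "unknown"
```

The order of the checks only matters if the same prefix appears in more than
one table, which is not the case for the subsets shown.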
----
If we split anything between '[[' and ']]' using the ':' as the separator,
we know the following to be true:
If the list has a size of 1, then the link has no interwiki links, no
category links, and no transwiki links. We also know that [0] is the name
of the article.
No matter what the situation of the split, index of -1 will always point to
the name of the article.
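The split described above can be sketched like this. The function name
split_wikilink is hypothetical, and the code follows the assumption that the
last part is always the title, so a title that itself contains a colon would
be split as well:

```python
def split_wikilink(text):
    """Split '[[a:b:Title]]' into (list of prefixes, title)."""
    inner = text.strip()
    if inner.startswith("[[") and inner.endswith("]]"):
        inner = inner[2:-2]
    parts = inner.split(":")
    # Index -1 is the article title; everything before it is a prefix.
    return parts[:-1], parts[-1]


prefixes, title = split_wikilink("[[q:fr:w:en:Test]]")
print(prefixes, title)
```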
Now, the matter is: In what order should we proceed?
Should we scan forwards or backwards?
In what order should we look for links?
1) Transwiki, interwiki, namespace
2) namespace, interwiki, transwiki
3) interwiki, namespace, transwiki
etc.
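One possible answer, as a rough sketch only: scan forwards, and at each
prefix try transwiki, then interwiki, then namespace, carrying the "current"
family and language along. The function resolve and the three small tables
are hypothetical. Two assumptions are baked in: a transwiki prefix keeps the
current language, and scanning stops at the first namespace or unrecognised
prefix, with everything after that point treated as the title:

```python
# Illustrative subsets only, not the real tables.
TRANSWIKI = {"w": "wikipedia", "q": "wikiquote", "wikt": "wiktionary"}
LANGUAGES = {"en", "fr", "de"}
NAMESPACES = {"user", "talk", "template", "category"}


def resolve(link, family="wikipedia", lang="en"):
    """Scan prefixes left to right, starting from the wiki the link is on.

    Returns a tuple (family, lang, namespace, title).
    """
    inner = link
    if inner.startswith("[[") and inner.endswith("]]"):
        inner = inner[2:-2]
    parts = inner.split(":")
    namespace = None
    index = 0
    while index < len(parts) - 1:
        key = parts[index].strip().lower()
        if key in TRANSWIKI:
            family = TRANSWIKI[key]       # switch project, keep language
        elif key in LANGUAGES:
            lang = key                    # switch language, keep project
        elif key in NAMESPACES:
            namespace = parts[index].strip()
            index += 1
            break                         # the rest is the title
        else:
            break                         # unknown prefix: part of the title
        index += 1
    title = ":".join(parts[index:]).strip()
    return family, lang, namespace, title


print(resolve("[[q:fr:w:en:Test]]"))
```

With these tables, the example from earlier in this email ends up back on
the English Wikipedia, and the returned family and language are the kind of
values one would want to hand to the Site class.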
----
One thing is for certain: the regular expression for this will be
extremely long. If we do manage to resolve this, then our parser for
wikilinks should be able to handle anything we throw at it, and that would
make any bugs related to linkedPages() and getRedirectPage() easier
to fix. One thing I have a problem with is that getRedirectPage()
returns a string object rather than a Page object. But it is clear why
it returns a string object: the target could be any one of the
situations I've described above.
The principal reason I'm concerned about this matter is that I'm
in the process of developing a Notification bot. Unfortunately, I've run
into several users who have, in their wisdom, decided to redirect their
user pages to a different project, a different language, or sometimes
a combination of both. So I've been thinking of a way to properly
parse the information from getRedirectPage() such that I can pass the correct
parameters to the Site class.
Thoughts, anyone?
--
Jason Y. Lee
AKA AllyUnion
I'd like to request a -namespace option for the interwiki.py script. I
see that the allpages() function in wikipedia.py supports this, but I
don't know Python well enough (or at all) to add it myself. This
would be useful, e.g., to add or update interwiki links on templates and
categories.