Summary in English follows below. This is from the German list.
Timwi wrote:
[[Heilige[s|n] Römische[s|n] Reich[|es] Deutscher Nation]] or [[Heiliges^n Römisches^n Reich ^es Deutscher Nation]] or something like that.
Or simply [[Heiligen Römischen Reiches Deutscher Nation]] and a bit of "fuzzy matching" against the list of existing articles?
Such an algorithm could also be useful to the Swedish and Danish Wikipedias. Here is a proposal:
1. If a bracket link has no direct match (there is no article "Heiligen Römischen Reiches..."),
2. And if the bracket link consists of three or more words,
3. Replace the last two letters of each word in the link with ".*" (or SQL "%"). ("Heiligen Römischen Reiches" becomes the regexp "Heilig.* Römisch.* Reich.*" or SQL "Heilig% Römisch% Reich%".)
4. If the search pattern matches exactly _one_ article title, link to that article automatically.
SUMMARY IN ENGLISH:
The German language has a problem with making wiki links from phrases where word endings need to change to make the link text fit in a sentence, something like "calf -> calves", but on a much greater scale. For example, an article heading might be "Heiliges Römisches Reich Deutscher Nation" (the Holy Roman Empire of the German Nation). In a typical English phrase you can simply write
This was a typical property of the [[Holy Roman Empire of ...]]
where the article heading appears unmodified as the link text. But the German text would have to be:
Das war eine typische Eigenschaft des [[Heiliges Römisches Reich Deutscher Nation | Heiligen Römischen Reiches Deutscher Nation]]
(the changed endings are the final "-en", "-en" and "-es" of the first three words of the link text)
In German, these different word endings are never (?) longer than the last two characters of a word, which made me suggest the following algorithm, from which I think the Swedish and Danish Wikipedia could also benefit:
1. When a bracket link doesn't have a direct match,
2. And the bracket link consists of three words or more,
3. Replace with ".*" or SQL "%" the last two characters of each word in the link text.
4. If this search pattern matches exactly *one* article heading, make a link directly to that article.
This would make it possible to write [[Heiligen Römischen Reiches Deutscher Nation]] without the pipe character and the real form. The WikiToHtml conversion would not find a direct match (1), but since the link contains more than two words (2), it would search for "Heilig% Römisch% Reich% Deutsch% Nati%" (3) and thus find the correct article to link to (4).
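The four steps can be sketched in a few lines of Python. This is only an illustration of the proposal, not the wiki software's actual code: the function names fuzzy_pattern and resolve_link and the in-memory set of titles are assumptions made for the example, and a regexp stands in for the SQL "%" search.

```python
import re

def fuzzy_pattern(link_text):
    # Step 3: drop the last two characters of every word and append ".*".
    return " ".join(re.escape(word[:-2]) + ".*" for word in link_text.split())

def resolve_link(link_text, titles):
    # Step 1: a direct match always wins.
    if link_text in titles:
        return link_text
    # Step 2: only links of three or more words are eligible.
    if len(link_text.split()) < 3:
        return None
    # Step 4: link only if exactly one existing title matches the pattern.
    pattern = re.compile(fuzzy_pattern(link_text))
    matches = [title for title in titles if pattern.fullmatch(title)]
    return matches[0] if len(matches) == 1 else None

# Hypothetical list of existing article titles.
titles = {"Heiliges Römisches Reich Deutscher Nation", "Englischer Kanal"}
print(resolve_link("Heiligen Römischen Reiches Deutscher Nation", titles))
# prints: Heiliges Römisches Reich Deutscher Nation
```

Because ".*" (like SQL "%") also matches spaces, the pattern can match titles with extra words; a stricter variant could use "\S*" instead.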
On Fri, Jul 18, 2003 at 09:51:30PM +0200, Lars Aronsson wrote:
When a bracket link doesn't have a direct match,
And the bracket link consists of three words or more,
Why three? Why not apply this to the [[Englischer Kanal]], too?
- Replace with ".*" or SQL "%" the last two characters of each word
in the link text.
- If this search pattern matches exactly *one* article heading,
make a link directly to that article.
4. If the edit-link is hit, automatically propose to create a #REDIRECT to the already existing page.
Apache's mod_speling automatically fixes spelling errors, and sometimes does not do exactly what one might have expected. So I would prefer to have a human judge whether [[Haus der Börse]] and [[Hausse der Börse]] are really articles covering the same topic.
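A rough sketch of that variant, in Python: nothing is linked automatically, and the fuzzy search only produces redirect text that a human can accept or reject when following the edit link. The function names and the set of titles are made up for the example; only the "#REDIRECT [[...]]" syntax is the wiki's own.

```python
import re

def candidate_titles(link_text, titles):
    # Same fuzzy rule as in the proposal under discussion: ignore the last
    # two characters of every word in the link text.
    pattern = re.compile(
        " ".join(re.escape(word[:-2]) + ".*" for word in link_text.split())
    )
    return sorted(t for t in titles if t != link_text and pattern.fullmatch(t))

def proposed_redirects(link_text, titles):
    # Each candidate is only offered as prefilled text for the new page;
    # a human decides whether the two titles really cover the same topic.
    return ["#REDIRECT [[%s]]" % t for t in candidate_titles(link_text, titles)]

# Hypothetical list of existing article titles.
titles = {"Haus der Börse", "Heiliges Römisches Reich Deutscher Nation"}
print(proposed_redirects("Hausse der Börse", titles))
# prints: ['#REDIRECT [[Haus der Börse]]']
```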
Regards,
JeLuF
Earlier, I wrote:
or simply [[Heiligen Römischen Reiches Deutscher Nation]] and a bit of "fuzzy matching" against the list of existing articles? [...] In German, these different word endings are never (?) longer than the last two characters of a word, which made me suggest the following algorithm, from which I think the Swedish and Danish Wikipedia could also benefit:
My suggested algorithm is far too simple and leads to too many false positives. I've tried it (on susning.nu) and disabled it again. A working algorithm probably must recognize typical endings, which makes it language specific (-er, -es in German; -a, -t in Swedish). Existing "stemming" algorithms might be worth trying. I'm not going any deeper into this.
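To make the idea of recognizing typical endings concrete, here is a minimal Python sketch. It is not a real stemming algorithm and has not been tried on any wiki. The ending lists are assumptions: "-er", "-es", "-a" and "-t" are the ones named above, "-en" is added only so that the running example works, and the minimum stem length of three is an arbitrary guard.

```python
# Hypothetical per-language ending lists (see the assumptions above).
ENDINGS = {
    "de": ("en", "er", "es"),  # "en" is an addition made for this example
    "sv": ("a", "t"),
}

def strip_ending(word, lang):
    # Remove the longest listed ending, but keep at least three characters
    # so that short words such as "der" are left alone.
    for ending in sorted(ENDINGS[lang], key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word

def link_key(text, lang):
    # Compare texts word by word after stripping the endings.
    return tuple(strip_ending(word, lang) for word in text.split())

def same_article(link_text, title, lang):
    return link_key(link_text, lang) == link_key(title, lang)

print(same_article("Heiligen Römischen Reiches Deutscher Nation",
                   "Heiliges Römisches Reich Deutscher Nation", "de"))  # True
print(same_article("Jack Mark Doe", "Jane Mary Dew", "de"))  # False
```

Only words that differ in a listed ending are treated as equal, so unrelated names no longer collide, at the price of maintaining one list per language.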
Lars Aronsson wrote:
In German, these different word endings are never (?) longer than the last two characters of a word, which made me suggest the following algorithm, from which I think the Swedish and Danish Wikipedia could also benefit:
As you've already pointed out in your other mail, this will return too many false positives (e.g. Jack Mark Doe -> Jane Mary Dew; those are just fabricated names, but you get the idea).
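The false positive is easy to reproduce. The short Python check below assumes the two-character rule is implemented as a regexp; the helper name fuzzy_pattern is made up for the example.

```python
import re

def fuzzy_pattern(link_text):
    # Replace the last two characters of every word with ".*".
    return " ".join(re.escape(word[:-2]) + ".*" for word in link_text.split())

pattern = re.compile(fuzzy_pattern("Jack Mark Doe"))
print(pattern.pattern)                          # Ja.* Ma.* D.*
print(bool(pattern.fullmatch("Jane Mary Dew")))  # True
```

If "Jane Mary Dew" happened to be the only matching article, the link would silently point to the wrong person.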
I'm also opposed to making it language-specific. That'd be too complex.
My suggestion on the mailing list was to introduce new mark-up that makes it quicker to type an equivalent of [[Heiliges Römisches Reich Deutscher Nation|Heiligen Römischen Reiches Deutscher Nation]]: either
[[Heilige[s|n] Römische[s|n] Reich[|es] Deutscher Nation]]
or
[[Heiliges^n Römisches^n Reich ^es Deutscher Nation]]
I prefer the latter.
I know this looks a little bit complex at first, but I'm sure you'll have figured it out in no time (esp. Germans who need it all the time). Even if you can't get used to it or figure it out, that doesn't matter much, because 1) When you read the source text, you don't really need to know *exactly* how the mark-up works, you can still easily guess where the link goes to. 2) When you *edit* the source text, you can always use the original pipe notation.
A case where I've always wanted this in English:
[[socialism^t]]
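For illustration, both notations could be expanded into a (target, link text) pair roughly as follows. This is a Python sketch, not an implementation proposal. The bracket rule is read directly off the example; the caret rule is inferred from the three examples in this mail (a caret directly after a word replaces as many trailing characters as the suffix is long, a caret after a space appends) and is an assumption. The name expand_link is made up.

```python
import re

BRACKET = re.compile(r"\[([^|\]]*)\|([^\]]*)\]")  # word[target|shown]
CARET = re.compile(r"(\w+)( ?)\^(\w+)")           # word^suffix or word ^suffix

def expand_link(inner):
    """Return (article title, displayed text) for the text inside [[...]]."""
    if "[" in inner:
        # Bracket notation: the first alternative builds the title,
        # the second builds the displayed text.
        return BRACKET.sub(r"\1", inner), BRACKET.sub(r"\2", inner)

    def shown(match):
        word, space, suffix = match.groups()
        if space:
            # Assumed rule: "Reich ^es" appends the suffix -> "Reiches".
            return word + suffix
        # Assumed rule: "Heiliges^n" overwrites the tail -> "Heiligen".
        return word[: -len(suffix)] + suffix

    return CARET.sub(r"\1", inner), CARET.sub(shown, inner)

print(expand_link("Heiliges^n Römisches^n Reich ^es Deutscher Nation"))
print(expand_link("socialism^t"))  # ('socialism', 'socialist')
```

Both expansions yield the same pair as the pipe notation, so the rendered page would be unchanged.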
Greetings, Timwi
wikitech-l@lists.wikimedia.org