[Wiktionary-l] Wiktionary size, format, long tail of languages

Lars Aronsson lars at aronsson.se
Thu Nov 22 04:43:00 UTC 2007


There's a list of Wiktionaries by raw size at
http://meta.wikimedia.org/wiki/Wiktionary#List_of_Wiktionaries

Do all Wiktionaries follow the same format, with one wiki article 
per word, containing sections for language / part of speech / 
aspects and then numbered lists for meanings?  E.g.

[[Snow]]
==English==
===Noun===
# The frozen, crystalline state of water
# A shade of white
# Random electrical noise
====Derived terms====
====Translations====
===Verb===
# Weather when snow is falling
# Bluff draw in poker
====Derived terms====
====Translations====

Or is there any Wiktionary that breaks this pattern?  Does this 
pattern have a name?  What do you call it when/if some Wiktionary 
breaks this pattern?
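
To make the pattern concrete, here is a minimal Python sketch that reads one entry laid out as above into a nested structure (language, then part of speech, then the numbered meanings). The function name parse_entry and the choice to ignore level-4 sections are assumptions made for illustration only; this is not how any Wiktionary software handles entries.

```python
import re

# A heading line: the same number of "=" signs on both sides of the title.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$")


def parse_entry(wikitext):
    """Return {language: {part_of_speech: [meanings]}} for one entry.

    Hypothetical helper: assumes level-2 headings are languages,
    level-3 headings are parts of speech, and "#" lines are meanings.
    """
    entry = {}
    language = None
    pos = None
    collecting = False  # True only directly under a level-3 heading
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if level == 2:
                language, pos, collecting = title, None, False
                entry[language] = {}
            elif level == 3 and language is not None:
                pos, collecting = title, True
                entry[language][pos] = []
            else:
                # Deeper sections (Derived terms, Translations) are skipped.
                collecting = False
        elif collecting and line.startswith("#"):
            entry[language][pos].append(line.lstrip("#").strip())
    return entry


snow = """[[Snow]]
==English==
===Noun===
# The frozen, crystalline state of water
# A shade of white
# Random electrical noise
====Derived terms====
====Translations====
===Verb===
# Weather when snow is falling
# Bluff draw in poker
====Derived terms====
====Translations====
"""

entry = parse_entry(snow)
for part_of_speech, meanings in entry["English"].items():
    print(part_of_speech, len(meanings))
```

An entry that breaks the pattern (for example, meanings placed directly under the language heading) would simply yield empty lists here, which is one crude way to spot deviations.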

How did we end up with disambiguation pages on Wikipedia, strictly 
keeping one page per meaning of a word, but not on Wiktionary?  
Is that because Wiktionary spun off before disambiguation pages 
were invented on Wikipedia, and the news never spread to 
Wiktionary?  Or is it because the Oxford English Dictionary 
differs from Encyclopaedia Britannica in this respect, and we want 
to keep the best practice?  Or why?  One could say that all 
meanings of "snow" are the same word (by etymology), and should 
logically be on one page.  But this is not true of "pen" 
(etymologies 1--4), nor does it explain why foreign words with the 
same spelling are kept on the same page (Norwegian "pen" meaning 
"fine").  Has there been a discussion about this, and where can 
it be found?  I found something from December 2002, 
http://en.wiktionary.org/wiki/Wiktionary_talk:Entry_layout_explained/archive_2002 
But the voice of reason, Imran, left the project a year later. 
Another discussion took place in December 2005, 
http://en.wiktionary.org/wiki/Wiktionary:Beer_parlour_archive/October-December_05#Basic_flaw_in_Wiktionary--What_is_a_.27word.27.3F.3F 
(It appears to be a December issue, so I apologize for bringing it 
up a few weeks early this year.)

In the English Wiktionary, what percentage of words are in 
English?  And is the "long tail" of foreign languages similar 
across all Wiktionaries?  Is there any major Wiktionary that has a 
higher concentration of words in its own language?

If the above pattern holds, a simple count of all level-2 headings 
in the database dump could give the answer.  For example, in the 
dump of the Swedish Wiktionary, which has 46,500 articles and is 
the 13th largest, these level-2 headings appear most frequently:

   2510 ==Svenska==              Swedish
   1847 ==Tvärspråkligt==        Translingual
    625 ==Engelska==             English
    343 ==Historik==             Etymology
    267 ==Tyska==                German
    245 ==Danska==               Danish
    230 ==Norska==               Norwegian
    217 ==Spanska==              Spanish
    217 ==Franska==              French
    192 ==Italienska==           Italian
    184 ==Nederländska==         Dutch
    169 ==Finska==               Finnish
    152 ==Polska==               Polish
    135 ==Serbiska==             Serbian
    122 ==Rumänska==             Romanian
    116 ==Interlingua==          Interlingua
    109 ==Ungerska==             Hungarian
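
A count like the one above could be produced along the following lines. This is only a sketch, not the command actually used for the figures above: it assumes the dump is an uncompressed MediaWiki XML export whose wikitext sits in <text> elements, and the names count_level2_headings and iter_page_texts are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Matches a level-2 heading such as "==Svenska==" but not "===Noun===".
LEVEL2 = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*(?<!=)==[ \t]*$", re.MULTILINE)


def count_level2_headings(texts):
    """Count level-2 heading titles over an iterable of wikitext strings."""
    counts = Counter()
    for text in texts:
        counts.update(LEVEL2.findall(text))
    return counts


def iter_page_texts(dump_path):
    """Yield the wikitext of every <text> element in an XML dump.

    Assumption: dump_path points to an uncompressed XML export file.
    """
    for _event, elem in ET.iterparse(dump_path):
        # Tags carry an XML namespace, so compare only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            yield elem.text or ""
        elif local_name == "page":
            elem.clear()  # release the finished page to save memory


# Small in-memory demonstration with two made-up entries.
pages = [
    "==English==\n===Noun===\n# The frozen, crystalline state of water\n",
    "==English==\n===Noun===\n# A writing tool\n==Norwegian==\n===Adjective===\n# fine\n",
]
counts = count_level2_headings(pages)
for heading, n in counts.most_common():
    print(f"{n:7d} =={heading}==")
```

For a real dump one would call count_level2_headings(iter_page_texts(path)) with the path of the downloaded file. The count only measures what it appears to measure if every Wiktionary really uses level-2 headings for languages, which is exactly the open question.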




-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se



