Wiktionary-l September 2013

wiktionary-l@lists.wikimedia.org

4 participants
5 discussions

Re: [Wiktionary-l] [Wikitech-l] Listing missing words of wiktionnaries

by Lars Aronsson

On 07/23/2013 11:23 AM, Mathieu Stumpf wrote:
> Here is what I would like to do: generate reports which give, for a
> given language, a list of words which are used on the web, with a
> number evaluating their occurrences, but which are not in a given
> Wiktionary.
>
> How would you recommend implementing that within the Wikimedia
> infrastructure?

Some years back, I undertook to add entries for Swedish words in the English Wiktionary. You can follow my diary at http://en.wiktionary.org/wiki/User:LA2

Among the things I did was to extract a list of all Swedish words that already had entries. The best way was to use CatScan to list entries in categories for Swedish words. Even if there is a page called "men", this doesn't mean the Swedish word "men" has an entry, because it could be the English word "men" that is in that page.

Then I extracted all words from some known texts, e.g. novels, the Bible, government reports, and the Swedish Wikipedia, counting the number of occurrences of each word.

Case significance is a bit tricky. There should not be an entry for lower-case stockholm, so you can't just convert everything to lower case. But if a word is capitalized only because it begins a sentence, it should not have a capitalized entry. Another tricky issue is abbreviations, which should keep the period, for example "i.e." rather than "i" and "e". But the period that ends a sentence should be removed. When splitting a text into words, I decided to keep all periods and initial capital letters, even if this leads to some false words.

When you have word frequency statistics for a text, and a list of existing entries from Wiktionary, you can compute the coverage, and I wrote a little script for this. I found that English Wiktionary already had Swedish entries covering 72% of the words in the Bible, and when I started to add entries for the most common of the missing words, I was able to increase this to 87% in just a single month (September 2010).

Many of the common words that were missing when I started were adverbs such as "thereof", "herein", which occur frequently in any text but are not very exciting to write entries about. This statistics-based approach gave me a reason to add those entries.

It is interesting to contrast a given text to a given dictionary in this way. The Swedish entries in the English Wiktionary form a different dictionary than the Swedish entries in the German or Danish Wiktionary. The kinds of words found in the Bible are different from those found in Wikipedia or in legal texts. There is not a single, universal text corpus that we can aim to cover. Google has released its ngram dataset. I'm not sure if it covers Swedish, but even if it does, it must differ from the corpus frequencies published by the Swedish Academy.

It is relatively easy to extract a list of existing entries from Wiktionary. But preparing a given text corpus for frequency and coverage analysis takes more work.

--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
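
The frequency and coverage computation described in the message above can be sketched roughly as follows. This is an illustration only, not Lars Aronsson's script, which is not shown in the thread. The function names (`tokenize`, `word_frequencies`, `coverage`, `missing_words`) are hypothetical, the tokenizer is a simple whitespace split that keeps periods and initial capitals as the message describes, and coverage is assumed to be weighted by the number of occurrences (the message does not say whether the 72% figure counts occurrences or distinct words).

```python
from collections import Counter

# Punctuation stripped from both ends of a token. The period is deliberately
# absent, so "i.e." survives intact (and so does a sentence-final "nej.",
# which is one of the "false words" the message accepts).
STRIP_CHARS = ",;:!?\"'()[]{}«»“”"


def tokenize(text):
    """Split text on whitespace, keeping periods and initial capitals."""
    tokens = []
    for raw in text.split():
        token = raw.strip(STRIP_CHARS)
        # Keep only tokens that contain at least one letter.
        if any(ch.isalpha() for ch in token):
            tokens.append(token)
    return tokens


def word_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


def coverage(freqs, entries):
    """Share of all word occurrences whose word has an entry.

    Assumption: coverage is weighted by occurrence count.
    """
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    covered = sum(n for word, n in freqs.items() if word in entries)
    return covered / total


def missing_words(freqs, entries, limit=None):
    """Words without an entry, most frequent first (ties sorted by word)."""
    missing = [(word, n) for word, n in freqs.items() if word not in entries]
    missing.sort(key=lambda item: (-item[1], item[0]))
    return missing[:limit]


if __name__ == "__main__":
    # Hypothetical sample data: a tiny text and a tiny set of existing entries.
    freqs = word_frequencies("men han kom, men hon gick")
    entries = {"men", "han"}
    print(coverage(freqs, entries))       # 0.5
    print(missing_words(freqs, entries))  # candidates for new entries
```

In practice the `entries` set would come from a list of pages in the Swedish-word categories of the target Wiktionary, since a page title alone does not show which language sections the page contains.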

10 years, 4 months

new word

by Alfred Langer

I submit a new word for your consideration: dogophile, a lover of dogs.

10 years, 7 months

Fwd: [Wikimedia-l] It's time to reclaim the community logo

by Federico Leva (Nemo)

FYI

Nemo

P.S.: You can check whether the WMF protects the logo of your project by seeing if it's listed as "registered trademark" on <https://wikimediafoundation.org/wiki/Wikimedia_trademarks>.

-------- Original Message --------
Subject: [Wikimedia-l] It's time to reclaim the community logo
Date: Sat, 21 Sep 2013 12:16:16 +0200
From: Tomasz W. Kozlowski

Hello community,

this is to inform you that in response to the trademarking of the Wikimedia community logo[1], created in 2006 by Artur “WarX” Fijałkowski, which was discussed on this mailing list[2] as well as on Meta[3] back in March, a small group of community members, namely Artur, myself, Federico Leva (Nemo) and John Vandenberg, have initiated a formal process of opposition against the registration of the trademark by the Foundation in order to *reclaim the logo* for unrestricted use by the community.

We appreciate the Foundation’s protection of the other trademarks they have registered so far, including the logos of Wikipedia, Wikisource and some other sister projects. In the case of the community logo, however, it is our belief that the Foundation’s actions are exactly opposite to what the community logo stands for and contradict the purpose behind its very existence.

We would like to make it clear that it is not our intention to damage anyone; our actions are a challenge against what we perceive as a unilateral declaration of ownership of an asset that has always belonged to the wider community, and not to one or another organisation that is part of the movement.

By formally opposing the registration of the trademark we hope to ensure the history of this logo is not disregarded, and we wish to protect the community against unnecessary bureaucracy and, to use another quote, let “groups who do not purport to represent the WMF”[4] continue to freely associate with a logo that has been part of their identity for so long.

We also want to note that this is in no way a legal action against the Foundation, but a simple notice of opposition against the registration of the logo in the European Union. If we assume good faith, we can only be confident that the WMF, having now a formal occasion, will withdraw its registration of the logo rather than continue using movement resources to force the community into lengthy, expensive proceedings.

We invite all community members interested in this issue to express their opinions at: https://meta.wikimedia.org/wiki/Talk:Community_Logo/Reclaim_the_Logo

If any of you would like to help us in any way (covering the costs of the opposition, promoting the discussion, etc.), please feel free to contact us off-list.

Artur Fijalkowski (WarX)
Tomasz Kozlowski (odder)
Federico Leva (Nemo)
John Vandenberg (jayvdb)

== References ==
* [1] https://meta.wikimedia.org/wiki/File:Wikimedia_Community_Logo.svg
* [2] https://lists.wikimedia.org/pipermail/wikimedia-l/2013-March/124715.html
* [3] https://meta.wikimedia.org/wiki/Talk:Community_Logo
* [4] http://lists.wikimedia.org/pipermail/wikimedia-l/2013-March/124730.html

10 years, 7 months

Re: [Wiktionary-l] [Wikitech-l] CirrusSearch on mediawiki.org

by Federico Leva (Nemo)

Nice! As for next steps, what about using Wiktionary as the next "pioneering project" for the new CirrusSearch (first opt-in and then default)? It exists in most languages (we really need to see how the new search works in different languages), it's one of the projects most affected by the new features (e.g. expanded templates indexing), and at the same time it allows full-project deploys without impacting some 95% of our traffic.

Nemo

10 years, 7 months

Fwd: [Wikidata-l] Wikidata/Wiktionary and other projects

by Gerard Meijssen

Hoi,

This is quite relevant ... They want our comments. As far as I am concerned, I fail to appreciate "lemon". But I would really like to hear your take on it.

Thanks,
GerardM

---------- Forwarded message ----------
From: Saskia Warzecha <saskia.warzecha(a)wikimedia.de>
Date: 3 September 2013 11:59
Subject: [Wikidata-l] Wikidata/Wiktionary and other projects
To: wikidata-l(a)lists.wikimedia.org

Hello,

for those who are interested, this is the link to the Wikidata/Wiktionary proposals and comparison of other structures:

https://www.wikidata.org/wiki/Wikidata:Comparison_of_Projects_and_Proposals…

Kind regards,
Saskia

2013/7/16 Saskia Warzecha <saskia.warzecha(a)wikimedia.de>
> Hi,
>
> I'm Saskia and I wanted to introduce myself. I started yesterday as an
> intern at Wikidata in Berlin.
>
> I am currently finishing my studies in Computational Linguistics (B.Sc.) at
> the University of Potsdam and will commence a M.Sc. in Vienna, Austria,
> this fall.
>
> My task at Wikidata is to analyze the proposals for Wiktionary in Wikidata
> [0],[1], to compare them with similar work (OmegaWiki, WordNet, etc.), and to
> help in finding the best solution for Wiktionary.
>
> I'm staying for two months and am looking forward to your questions and
> inputs.
>
> Best regards,
> Saskia
>
> [0] https://www.wikidata.org/wiki/Wikidata:Wiktionary
> [1] https://www.wikidata.org/wiki/Wikidata:Wiktionary_%28alternative_proposal%29

10 years, 7 months
