On 07/23/2013 11:23 AM, Mathieu Stumpf wrote:
> Here is what I would like to do: generate reports which give, for a
> given language, a list of words which are used on the web, each with a
> count of its occurrences, but which are not in a given
> How would you recommend implementing that within the Wikimedia
Some years back, I undertook to add entries for
Swedish words in the English Wiktionary. You can
follow my diary at http://en.wiktionary.org/wiki/User:LA2
Among the things I did was to extract a list of all
Swedish words that already had entries. The best
way was to use CatScan to list entries in categories
for Swedish words. Even if there is a page called
"men", this doesn't mean the Swedish word "men"
has an entry, because it could be the English word
"men" that is in that page.
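A minimal sketch of collecting such a per-language list, assuming the standard MediaWiki API (list=categorymembers) rather than CatScan. The category name "Category:Swedish lemmas", the User-Agent string and the helper names are assumptions for illustration only, not what was used at the time.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wiktionary.org/w/api.php"


def http_fetch(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "coverage-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def category_titles(category, fetch=http_fetch):
    """Return the set of main-namespace page titles in one category.

    `fetch` is injectable so the paging logic can be tried offline.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": "0",
        "cmlimit": "500",
        "format": "json",
    }
    titles = set()
    while True:
        data = fetch(params)
        titles.update(m["title"] for m in data["query"]["categorymembers"])
        cont = data.get("continue")
        if not cont:
            return titles
        # The API hands back the parameters needed for the next page.
        params = {**params, **cont}
```

Calling category_titles("Category:Swedish lemmas") would return titles that have a Swedish section, which is the point: a title such as "men" only counts if it is in the Swedish category.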
Then I extracted all words from some known texts,
e.g. novels, the Bible, government reports, and the
Swedish Wikipedia, counting the number of
occurrences of each word. Case significance is
a bit tricky. There should not be an entry for
lower-case stockholm, so you can't just convert
everything to lower case. But a word that is
capitalized only because it begins a sentence
should not have a capitalized entry. Another tricky issue is
abbreviations, which should keep the period,
for example "i.e." rather than "i" and "e". But
the period that ends a sentence should be removed.
When splitting a text into words, I decided to keep
all periods and initial capital letters, even if this
leads to some false words.
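The splitting rule can be sketched as follows. This is only an illustration, not the script used at the time: the TOKEN pattern and count_words are hypothetical names, and hyphens, apostrophes and digits are simply treated as separators.

```python
import re
from collections import Counter

# A run of letters, optionally continued by ".letters" groups, plus one
# optional trailing period. Case is left untouched.
TOKEN = re.compile(r"[^\W\d_]+(?:\.[^\W\d_]+)*\.?")


def count_words(text):
    """Count tokens, keeping periods and initial capitals as they appear."""
    return Counter(TOKEN.findall(text))


counts = count_words("Han bor i Stockholm, t.ex. i city. Men han reser.")
```

Here "t.ex." stays whole and "Stockholm" keeps its capital, but "city." and "Men" are counted with the sentence-final period and the sentence-initial capital; those are the false words accepted by this rule.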
When you have word frequency statistics for a text,
and a list of existing entries from Wiktionary, you
can compute the coverage, and I wrote a little
script for this. I found that English Wiktionary already
had Swedish entries covering 72% of the words in the
Bible, and when I started to add entries for the most
common of the missing words, I was able to increase
this to 87% in just a single month (September 2010).
Many of the common words that were missing when
I started were adverbs such as "thereof", "herein",
which occur frequently in any text but are not very
exciting to write entries about. This statistics-based
approach gave me a reason to add those entries.
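The coverage computation is short enough to sketch. The version below is a guess at the idea, not the original script: it assumes coverage is weighted by occurrences rather than by distinct words, and the function name and the sample numbers are made up.

```python
from collections import Counter


def coverage(freq, entries):
    """Return (covered share of all occurrences, missing words by frequency).

    freq    -- mapping from word to number of occurrences in the text
    entries -- set of words that already have an entry
    """
    total = sum(freq.values())
    if total == 0:
        return 0.0, []
    covered = sum(n for word, n in freq.items() if word in entries)
    missing = sorted(
        ((word, n) for word, n in freq.items() if word not in entries),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return covered / total, missing


# Invented sample data, only to show the shape of the result.
freq = Counter({"och": 50, "men": 30, "därav": 15, "häri": 5})
ratio, missing = coverage(freq, {"och", "men"})
```

The missing list is sorted with the most frequent words first, so working from the top raises the coverage figure fastest.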
It is interesting to contrast a given text to a given
dictionary in this way. The Swedish entries in the
English Wiktionary form a different dictionary than the
Swedish entries in the German or Danish Wiktionary.
The kinds of words found in the Bible are different
from those found in Wikipedia or in legal texts.
There is not a single, universal text corpus that we
can aim to cover. Google has released its ngram
dataset. I'm not sure if it covers Swedish, but even
if it does, it must differ from the corpus frequencies
published by the Swedish Academy.
It is relatively easy to extract a list of existing entries
from Wiktionary. But to prepare a given text corpus
for frequency and coverage analysis needs more
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
[Apologies for X-posting]
We are pleased to announce the release of the Java-based Wiktionary Library (JWKTL) 1.0.0 - an application programming interface for Wiktionary.
Project homepage: http://code.google.com/p/jwktl/
== Overview ==
JWKTL (Java-based Wiktionary Library) is an application programming interface for the free multilingual online dictionary Wiktionary (http://www.wiktionary.org). JWKTL enables efficient and structured access to the information encoded in the English, the German, and the Russian Wiktionary language editions, including sense definitions, part of speech tags, etymology, example sentences, translations, semantic relations, and many other lexical information types. The Russian JWKTL parser is based on Wikokit (http://code.google.com/p/wikokit/).
Prior to being available as open source software, JWKTL has been a research project at the Ubiquitous Knowledge Processing (UKP) Lab of the Technische Universität Darmstadt, Germany. The following people have mainly contributed to this project: Yevgen Chebotar, Iryna Gurevych, Christian M. Meyer, Christof Müller, Lizhen Qu, Torsten Zesch.
== Publications ==
A detailed description of Wiktionary and JWKTL is available in our scientific articles:
* Christian M. Meyer and Iryna Gurevych: Wiktionary: A new rival for expert-built lexicons? Exploring the possibilities of collaborative lexicography, Chapter 13 in S. Granger & M. Paquot (Eds.): Electronic Lexicography, pp. 259-291, Oxford: Oxford University Press, November 2012. (http://www.ukp.tu-darmstadt.de/publications/details/?no_cache=1&tx_bibtex_p…)
* Christian M. Meyer and Iryna Gurevych: OntoWiktionary - Constructing an Ontology from the Collaborative Online Dictionary Wiktionary, chapter 6 in M. T. Pazienza and A. Stellato (Eds.): Semi-Automatic Ontology Development: Processes and Resources, pp. 131-161, Hershey, PA: IGI Global, February 2012. (http://www.ukp.tu-darmstadt.de/publications/details/?no_cache=1&tx_bibtex_p…)
* Torsten Zesch, Christof Müller, and Iryna Gurevych: Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary, in: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), pp. 1646-1652, May 2008. Marrakech, Morocco. (http://www.ukp.tu-darmstadt.de/publications/details/?no_cache=1&tx_bibtex_p…)
== License and Availability ==
The latest version of JWKTL is available via Maven Central. If you use Maven as your build tool, then you can add JWKTL as a dependency in your pom.xml file:
JWKTL is available as open source software under the Apache License 2.0 (ASL). The software thus comes "as is" without any warranty (see license text for more details). JWKTL makes use of Berkeley DB Java Edition 5.0.73 (Sleepycat License), Apache Ant 1.7.1 (ASL), Xerces 2.9.1 (ASL), JUnit 4.10 (CPL).
Some classes have been taken from the Wikokit project (available under multiple licenses, redistributed under the ASL license). See NOTICE.txt for further details.
== Contact ==
Please direct any questions or suggestions to
Group E-Mail: jwktl-users(a)googlegroups.com
Christian M. Meyer
Christian M. Meyer, M.Sc.
Ubiquitous Knowledge Processing (UKP Lab)
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
Phone [+49] (0)6151 16-5386, fax -5455, room S2/02/B113
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
I created a tool to extract translations from different editions of
Wiktionary. Right now it supports 39 different Wiktionaries. It only
extracts translations and ignores the rest.
Azerbaijani, Bulgarian, Catalan, Czech, Danish, Greek, English, Esperanto,
Spanish, Estonian, Basque, Finnish, French, Galician, Hebrew, Croatian,
Hungarian, Indonesian, Italian, Georgian, Latin, Lithuanian, Malagasy,
Dutch, Norwegian, Occitan, Polish, Portuguese, Romanian, Russian, Slovak,
Slovenian, Serbian, Swedish, Swahili, Turkish, Ukrainian, Vietnamese and
Adding a new Wiktionary is done via a configuration file.
Right now the beta version is available for download at:
Documentation is in progress; until then the README should be enough to get
Please test it and send me your feedback and bug reports.
To add a couple of comments to what Denny said: from my experience with
Wikisource, reaching out to international, loosely connected communities is
already a big challenge on its own. I would like to invite Wiktionary
contributors to take a look at the Individual Engagement Grant project
that Aubrey and I are doing for Wikisource, because it might make
sense for a group of involved Wiktionarians to start a similar initiative
for Wiktionary. The original application can be found here:
And the midterm report:
If anyone from the Wiktionary community wants to step forward, I would be
more than happy to share experiences and provide advice.
On Sat, Aug 10, 2013 at 3:30 AM, Denny Vrandečić <
> [Sorry for cross-posting]
> Yes, I agree that the OmegaWiki community should be involved in the
> discussions, and I pointed GerardM to our proposals and
> discussions, using him as a liaison. We also looked and keep looking at the
> OmegaWiki data model to see what we are missing.
> Our latest proposal is different from OmegaWiki in two major points:
> * our primary goal is to provide support for structured data in the
> Wiktionaries. We do not plan to be the main resource ourselves, where
> readers come to in order to look up something, we merely provide structured
> data that a Wiktionary may or may not use. This parallels the role
> Wikidata has with regard to Wikipedia. This also highlights the difference
> between Wikidata and OmegaWiki, since OmegaWiki's goal is "to create a
> dictionary of all words of all languages, including lexical, terminological
> and ontological information."
> * a smaller difference is the data model. Wikidata's latest proposal to
> support Wiktionary is centered around lexemes, and we do not assume that
> there is such a thing as a language-independent defined meaning. But no
> matter what model we end up with, it is important to ensure that the bulk
> of the data could freely flow between the projects, and even though we
> might disagree on this issue in the modeling, it is ensured that the
> exchange of data is widely possible.
> We tried to keep notes on the discussion we had today: <
> My major take-home message is that:
> * the proposal needs more visual elements, especially a mock-up or sketch
> of what it would look like and how it could be used on the Wiktionaries
> * there is no generally accepted place for a discussion that involves all
> Wiktionary projects. Still, my initial decision to have the discussion on
> the Wikidata wiki was not a good one, and it should and will be moved to
> Having said that, the current proposal for the data model of how to support
> Wiktionary with Wikidata seems to have garnered a lot of support so far. So
> this is what I will continue building upon. Further comments are very
> welcome. You can find it here:
> As said, it will be moved to Meta, as soon as the requested mockups and
> extensions are done.
> 2013/8/10 Samuel Klein <meta.sj(a)gmail.com>
> > Hello,
> > > On Fri, Aug 9, 2013 at 6:13 PM, JP Béland <lebo.beland(a)gmail.com>
> > >> I agree. We also need to include the Omegawiki community.
> > Agreed.
> > On Fri, Aug 9, 2013 at 12:22 PM, Laura Hale <laura(a)fanhistory.com>
> > > Why? The question of moving them into the WMF fold was pretty much no,
> > > because the project has an overlapping purpose with Wiktionary,
> > This is not actually the case.
> > There was overwhelming community support for adopting Omegawiki - at
> > least simply providing hosting. It stalled because the code needed a
> > security and style review, and Kip (the lead developer) was going to
> > put some time into that. The OW editors and dev were very interested
> > in finding a way forward that involved Wikidata and led to a combined
> > project with a single repository of terms, meanings, definitions and
> > translations.
> > Recap: The page describing the OmegaWiki project satisfies all of the
> > criteria for requesting WMF adoption.
> > * It is well-defined on Meta http://meta.wikimedia.org/wiki/Omegawiki
> > * It describes an interesting idea clearly aligned with expanding the
> > scope of free knowledge
> > * It is not a 'competing' project to Wiktionaries; it is an idea that
> > grew out of the Wiktionary community, has been developed for years
> > alongside it, and shares many active contributors and linguaphiles.
> > * It started an RfC which garnered 85% support for adoption.
> > http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki
> > Even if the current OW code is not used at all for a future Wiktionary
> > update -- and this idea was proposed and taken seriously by the OW
> > devs -- their community of contributors should be part of discussions
> > about how to solve the Wiktionary problem that they were the first to
> > dedicate themselves to.
> > Regards,
> > Sam.
> > _______________________________________________
> > Wikimedia-l mailing list
> > Wikimedia-l(a)lists.wikimedia.org
> > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > <mailto:email@example.com?subject=unsubscribe>
> Project director Wikidata
> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
> der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
> Körperschaften I Berlin, Steuernummer 27/681/51985.
Etiamsi omnes, ego non
If anyone at Wikimania is interested in participating in the talks
about the future support of Wiktionary in Wikidata, we will be having a
discussion about the several proposals.
Date : Saturday, 10 Aug, 11:30 am - 1:00 pm
Place: Y520 (block Y, 5th floor)
See you there,
Le 2013-08-09 13:04, Romaine Wiki a écrit :
> Are there many users from Wiktionary in Hong Kong? I do not think any
> of the Dutch users are; I can't say for the others.
> I think it would be essential that this subject is discussed inside
> the wider Wiktionary community. To me the group of users
> is too narrow. Also, a mailing list is not handy, as most of the users
> from Wiktionary do not read it. I think a Wikt-community wide
> discussion is needed.
I agree, and I think meta would be the most obvious channel for such
As said in the previous email, there's already [[Wiktionary future]]
which is waiting for contributions and discussion on meta. Anyway,
whatever the channel, it would be really important to make as
many contributors as possible aware of this initiative, so they can
provide relevant feedback specific to their needs.
> On Fri, 8/9/13, David Cuenca <dacuetu(a)gmail.com> wrote:
> Subject: [Wikidata-l] Meeting about the support of Wiktionary in
> To: wiktionary-l(a)lists.wikimedia.org, "Wikimania general list (open
> subscription)" <wikimania-l(a)lists.wikimedia.org>, "Discussion list
> the Wikidata project." <wikidata-l(a)lists.wikimedia.org>, "Wikimedia
> Mailing List" <wikimedia-l(a)lists.wikimedia.org>
> Date: Friday, August 9, 2013, 4:43 AM
> If anyone at Wikimania is interested in participating
> in the talks about the future support of Wiktionary in
> Wikidata, we will be having a discussion about the several
> Date : Saturday, 10 Aug, 11:30 am - 1:00 pm
> Place: Y520 (block Y, 5th floor)
> See you there,
> Wikidata-l mailing list