2009/10/16 Dennis During <dcduring(a)gmail.com>
Your best bet is likely to be to go to the Grease Pit at Wiktionary. Someone
had a similar request recently, I think, and seemed to get some help. This
list is rarely used.
On Fri, Oct 16, 2009 at 6:58 PM, Kelly Jones <kelly.terry.jones(a)gmail.com>
wrote:
I want a list of all English words + a brief definition of each [1].
I tried downloading enwiktionary-latest-pages-articles.xml.bz2, but
this is way too much: it includes foreign words, word roots/origins,
and a lot more that I don't need.
How can I extract just a word list w/ definitions from wiktionary?
I know about mthes and scowl, but wiktionary supersedes these, yes?
[1] I realize this isn't well-defined; I'll settle for an approximation.
There is no single, direct way to download just the English words and
definitions.
Wiktionary uses the same software as Wikipedia, which is designed for
encyclopedias, where an article needs only one big blob of text. A
dictionary has a structure which has no support in the software, so
instead we represent each entry as a big blob of text.
Now, it is possible to extract useful content from these blobs of text.
The English Wiktionary is divided into sections and subsections with
more-or-less standard formats. It is possible to write a program which
parses this format. But because it is not totally standard, some things are
easier to parse than others.
The list of English words is the easiest to extract because you just need to
find every page in the article namespace (not talk pages etc.) which contains
==English==
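
As a rough illustration of that first step, here is a minimal Python sketch
that streams through a dump and yields the titles of article-namespace pages
containing an ==English== heading. The function name, the list of namespace
prefixes, and the sample data are all my own assumptions, not part of any
Wiktionary tool. It also assumes the dump may or may not carry an <ns> element
per page, and falls back to looking at the title prefix when it does not.

```python
import io
import re
import xml.etree.ElementTree as ET

# A level-2 "English" heading on a line of its own (not ===English===).
ENGLISH_HEADING = re.compile(r"^==\s*English\s*==\s*$", re.MULTILINE)

# ASSUMPTION: an incomplete list of non-article title prefixes, used only
# when the dump has no <ns> element for the page.
NON_ARTICLE_PREFIXES = (
    "Talk:", "User:", "User talk:", "Wiktionary:", "Template:",
    "Category:", "Appendix:", "Help:", "Index:",
)


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def english_titles(stream):
    """Yield titles of article pages whose text has an ==English== section."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, ns, text = None, None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "text":
                text = child.text or ""
        if title is not None:
            if ns is not None:
                is_article = ns == "0"
            else:
                is_article = not title.startswith(NON_ARTICLE_PREFIXES)
            if is_article and ENGLISH_HEADING.search(text):
                yield title
        elem.clear()  # drop the page's contents to keep memory use down


# Tiny made-up sample standing in for the real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
<page><title>tree</title><revision><text>==English==
===Noun===
# A large plant.</text></revision></page>
<page><title>arbre</title><revision><text>==French==
===Noun===
# tree</text></revision></page>
<page><title>Talk:tree</title><revision><text>==English==
discussion</text></revision></page>
</mediawiki>"""

print(list(english_titles(io.BytesIO(SAMPLE))))  # ['tree']
```

For the real file you would pass something like
bz2.open("enwiktionary-latest-pages-articles.xml.bz2", "rb") instead of the
sample, so the dump is decompressed as it is read.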
But you might need to ask yourself what you mean by "word", because
Wiktionary also contains many forms of the same word, including spelling
variations and compounding variations such as "treeline" vs "tree-line" vs
"tree line". Not only this, but it includes many, many inflected forms such
as "word" vs "words"; "look" vs "looks" vs "looked" vs "looking"; and "fast"
vs "faster" vs "fastest". It is not always easy to filter these out.
Worse, Wiktionary also includes "common misspellings" such as "alot", which
are also tricky to filter out.
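
One possible heuristic for the filtering problem above, as a sketch only: many
inflected-form and misspelling entries define themselves with a "form of"
template instead of a real definition. The template names listed below are an
assumption on my part and certainly incomplete, and is_form_of_only is a
made-up helper name. Entries that do not use such a template will slip through.

```python
import re

# ASSUMPTION: an illustrative, incomplete list of "form of" template names.
FORM_OF_TEMPLATES = (
    "plural of",
    "past of",
    "present participle of",
    "third-person singular of",
    "comparative of",
    "superlative of",
    "alternative spelling of",
    "misspelling of",
)

# Matches a definition that starts with one of the templates, e.g. "{{plural of|word}}".
FORM_OF = re.compile(
    r"\{\{\s*(?:" + "|".join(re.escape(t) for t in FORM_OF_TEMPLATES) + r")\s*\|",
    re.IGNORECASE,
)


def is_form_of_only(definitions):
    """Return True if every definition line is just a "form of" template.

    `definitions` is a list of raw wikitext definition strings for one entry.
    An entry with no definitions at all is not treated as a form-of entry.
    """
    cleaned = [d.strip() for d in definitions]
    return bool(cleaned) and all(FORM_OF.match(d) for d in cleaned)
```

With this, "words" (defined only as the plural of "word") would be dropped,
while "looks" would be kept if it also carried a real definition of its own.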
Words can have many definitions. There are both "homonyms" and "senses".
Homonyms are words of different origins which share a spelling, such as
"sewer" (that which sews) vs "sewer" (waste drainage pipes), and senses are
different meanings of the same word with the same origin, such as "chicken"
(domestic fowl) vs (coward).
So you have to decide what "a brief definition" of each means for you. Do
you want just the first definition for each entry, ignoring all the others?
Do you want all definitions lumped together? Or do you want all definitions
grouped by homonym?
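
To make those three choices concrete, here is a rough Python sketch that works
on the wikitext of a single page. It assumes the usual layout (a level-2
==English== heading, optional "Etymology N" headings for homonyms, and
definition lines starting with "#"); real entries deviate from this, and the
function names are mine. The definitions come back as raw wikitext, so links
and templates would still need cleaning up.

```python
import re


def english_section(wikitext):
    """Return the lines between ==English== and the next level-2 heading."""
    lines, inside = [], False
    for line in wikitext.splitlines():
        level2 = re.fullmatch(r"==([^=].*?)==", line.strip())
        if level2:
            inside = level2.group(1).strip() == "English"
            continue
        if inside:
            lines.append(line)
    return lines


def definitions_by_etymology(wikitext):
    """Group English definition lines under their etymology heading.

    Entries without numbered etymologies end up in one "Etymology" group.
    Example lines (#:), quotations (#*) and subsenses (##) are skipped.
    """
    groups, current = {}, "Etymology"
    for line in english_section(wikitext):
        stripped = line.strip()
        heading = re.fullmatch(r"=+\s*(Etymology(?: \d+)?)\s*=+", stripped)
        if heading:
            current = heading.group(1)
        elif re.match(r"#[^#:*]", stripped):
            groups.setdefault(current, []).append(stripped[1:].strip())
    return groups


def all_definitions(wikitext):
    """All English definitions lumped together, in page order."""
    groups = definitions_by_etymology(wikitext)
    return [d for defs in groups.values() for d in defs]


def first_definition(wikitext):
    """Just the first English definition, or None if there is none."""
    defs = all_definitions(wikitext)
    return defs[0] if defs else None
```

For a page like "sewer", definitions_by_etymology would give one list per
homonym, all_definitions would flatten them, and first_definition would keep
only the first one on the page.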
All that being said, I think it is a very fair expectation that the English
Wiktionary should make such lists available on a regular basis, much like the
Wikimedia Foundation makes the raw dump files available. Please feel free to
further this discussion here on the mailing list or in the "Grease Pit" on
the English Wiktionary.
Andrew Dunbar (hippietrail)
--
We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.
_______________________________________________
Wiktionary-l mailing list
Wiktionary-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
--
Dennis C. During
Cynolatry is tolerant so long as the dog is not denied an equal divinity
with the deities of other faiths. - Ambrose Bierce
http://en.wiktionary.org/wiki/cynolatry
--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net