2009/10/16 Dennis During dcduring@gmail.com
Your best bet is likely to be to go to the Grease Pit at Wiktionary. Someone had a similar request recently, I think, and seemed to get some help. This list is rarely used.
On Fri, Oct 16, 2009 at 6:58 PM, Kelly Jones <kelly.terry.jones@gmail.com> wrote:
I want a list of all English words + a brief definition of each [1].
I tried downloading enwiktionary-latest-pages-articles.xml.bz2, but this is way too much: it includes foreign words, word roots/origins, and a lot more that I don't need.
How can I extract just a word list w/ definitions from wiktionary?
I know about mthes and scowl, but wiktionary supersedes these, yes?
[1] I realize this isn't well-defined: I'll settle for an approximation
There is no one simple direct way to download just the English words and definitions.
Wiktionary uses the same software as Wikipedia, which is designed for encyclopedias, and an encyclopedia needs only one big blob of text per article. A dictionary has a structure which has no support in the software, so instead we represent it in a big blob of text.
Now, it is possible to extract useful content from these blobs of text.
Each entry in the English Wiktionary is divided into sections and subsections with more-or-less standard formats. It is possible to write a program which parses this format. But because it is not totally standard, some things are easier to parse than others.
The list of English words is the easiest to extract because you just need to find every page in the article namespace (not talk pages, etc.) which contains ==English==
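As a rough illustration of that step, here is a minimal Python sketch that streams a pages-articles dump and keeps only main-namespace pages with an ==English== heading. The function name english_pages is made up for this example. It also assumes each <page> carries an <ns> element, which may not be true of older dumps; for those you would have to filter on title prefixes such as "Talk:" instead.

```python
import re
import xml.etree.ElementTree as ET

# A level-2 "English" heading on a line of its own, e.g. "==English==".
ENGLISH_HEADING = re.compile(r"^==\s*English\s*==\s*$", re.MULTILINE)


def local_name(tag):
    """Strip the XML namespace: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def english_pages(stream):
    """Yield (title, wikitext) for main-namespace pages with an English section.

    `stream` is a filename or a binary file object containing dump XML.
    Assumption: pages carry an <ns> element; a page without one is
    treated as namespace 0.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for child in elem.iter():
            name = local_name(child.tag)
            # Keep the first occurrence only (the page title, not a nested one).
            if name in ("title", "ns", "text") and name not in fields:
                fields[name] = child.text or ""
        elem.clear()  # drop the page's contents to limit memory use
        if fields.get("ns", "0") != "0":
            continue  # skip talk pages, templates, and so on
        text = fields.get("text", "")
        if ENGLISH_HEADING.search(text):
            yield fields.get("title", ""), text
```

To run it over the real dump without unpacking it first, something like bz2.open("enwiktionary-latest-pages-articles.xml.bz2", "rb") from the standard library should give a suitable file object to pass in.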
But you might need to ask yourself what you mean by "word" because Wiktionary also contains many forms of the same word, including spelling variations and compounding variations such as "treeline" vs "tree-line" vs "tree line". Not only this, but it includes many, many inflected forms such as "word" vs "words"; "look" vs "looks" vs "looked" vs "looking"; and "fast" vs "faster" vs "fastest". It is not always easy to filter these out. Worse, Wiktionary also includes "common misspellings" such as "alot", which are also tricky to filter out.
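One possible heuristic, sketched in Python below, is to treat an entry as a mere form or misspelling when every one of its definition lines uses a "form of" style template. The template names in FORM_OF_TEMPLATES are an illustrative, incomplete guess at Wiktionary conventions, not an authoritative list, and the function names are invented for this example. Entries that do not use such templates at all will slip through.

```python
import re

# Illustrative, incomplete set of template names (an assumption; check
# the templates actually used in the dump you are processing).
FORM_OF_TEMPLATES = {
    "plural of",
    "present participle of",
    "past participle of",
    "comparative of",
    "superlative of",
    "inflection of",
    "misspelling of",
    "alternative form of",
    "alternative spelling of",
}

# Captures the name of each template call: "{{name|...}}" or "{{name}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def is_form_of(definition):
    """True if the definition line uses any template in FORM_OF_TEMPLATES."""
    names = TEMPLATE_NAME.findall(definition)
    return any(name.lower() in FORM_OF_TEMPLATES for name in names)


def is_lemma(definitions):
    """True if at least one definition is more than a pointer to another word."""
    return any(not is_form_of(d) for d in definitions)
```

With this, an entry like "words" whose only definition is a plural-of template would be dropped, while an entry that has both an inflected sense and an independent sense would be kept.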
Words can have many definitions. There are both "homonyms" and "senses". Homonyms are words of different origins which share a spelling, such as "sewer" (that which sews) vs "sewer" (waste drainage pipes), and senses are different meanings of the same word with the same origin, such as "chicken" (domestic fowl) vs "chicken" (coward).
So you have to decide what "a brief definition" of each means for you. Do you want just the first definition for each entry, ignoring all the others? Do you want all definitions lumped together? Or do you want all definitions grouped by homonym?
All that being said, I think it is a very fair expectation that the English Wiktionary should make such lists available on a regular basis, much like the Wikimedia Foundation makes the raw dump files available. Please feel free to further this discussion here on the mailing list or in the "Grease Pit" on the English Wiktionary.
Andrew Dunbar (hippietrail)
--
We're just a Bunch Of Regular Guys, a collective group that's trying to understand and assimilate technology. We feel that resistance to new ideas and technology is unwise and ultimately futile.
Wiktionary-l mailing list Wiktionary-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
-- Dennis C. During
Cynolatry is tolerant so long as the dog is not denied an equal divinity with the deities of other faiths. - Ambrose Bierce
http://en.wiktionary.org/wiki/cynolatry