Wiktionary parsing; multiple languages

Moutupsi Paul

3 Apr 2013

6:21 p.m.

Hi All,

Greetings,

I am a CS grad student from the Data Science Lab at Stony Brook (https://sites.google.com/site/datascienceslab/), and I am writing to request information about parsing multilingual Wiktionary data. Our lab has been using Wikipedia data for quite a while now, but we are really interested in taking advantage of the massive Wiktionary content, which we feel, after proper parsing, can become a rich multi-language corpus.

But the big hurdle is a parsing tool. We have tried a few Wiktionary parsing tools:

1. https://github.com/clbecker/perl-wiktionary-parser/

2. https://code.google.com/p/wikokit/wiki/GettingStartedWiktionaryParser

3. https://github.com/benreynwar/wiktionary-parser/tree/master/wiktionary_parser

4. http://www.ukp.tu-darmstadt.de/software/jwktl/

but none of them is available in a ready-to-use or easy-to-extend form for multiple languages. (I am currently trying to work with wikokit, parser 2 above.)

I would appreciate any advice, suggestions, or pointers to the best available Wiktionary parser. We are mainly looking to extract meanings, POS, examples, translations, etc. (more can never hurt).

Any help is appreciated. Kindly let me know if further information is needed.

Regards,

Moutupsi

Sebastian Hellmann

4 Apr

11:50 p.m.

Hi Moutupsi, there are actually some problems that can be better solved by a community than by software alone. It took quite some effort and three years, but we are now very close to really getting started.

For two days now, we have had a working minimal example for the Wiktionary2RDF subproject of DBpedia, so the community can really pick it up now. The main documentation is here: http://dbpedia.org/Wiktionary

Now that the software, the linked data, and the SPARQL hosting are working, we will try to find maintainers for each language. DBpedia already has a vast network for this: http://wiki.dbpedia.org/Internationalization

I think there will be configs + data for these languages quite soon: ko, sr, el, es with many more to follow. You are welcome to join in, try to produce the data you need and give back your results to the community.

There are two views on the software: one for people who just want to use it and create configs: https://github.com/dbpedia/dbpedia-wiktionary

and one for Scala/Java developers: https://github.com/dbpedia/extraction-framework/tree/master/wiktionary

Data can be found here: http://downloads.dbpedia.org/wiktionary/dumps/

I will write a blog post announcing this soon.

All the best, Sebastian

Wiktionary-l mailing list Wiktionary-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiktionary-l

-- Dipl. Inf. Sebastian Hellmann Department of Computer Science, University of Leipzig Projects: http://nlp2rdf.org , http://linguistics.okfn.org , http://dbpedia.org/Wiktionary , http://dbpedia.org Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann Research Group: http://aksw.org

Mathieu Stumpf

5 Apr

2:56 a.m.

On 2013-04-05 08:50, Sebastian Hellmann wrote:

Hi Moutupsi, there are actually some problems that can be better solved by a community than by software alone. It took quite some effort and three years, but we are now very close to really getting started.

I added the DBpedia Wiktionary entry to [1]. I wasn't aware of your effort, despite being really interested in the future of Wiktionary. Could you please read [1] and update it with your vision as a DBpedia contributor?

[1] https://meta.wikimedia.org/wiki/Wiktionary_future

-- Association Culture-Libre http://www.culture-libre.org/

Sebastian Hellmann

10:44 a.m.

Hi Mathieu,

This page is interesting, but seems very idealistic. I am not sure every language community would agree to use a common model. I also wonder whether this is possible at all and whether there is an overlap. Do you think it makes sense to edit that page? Normally, there is a lot of talk and planning, and nothing comes of it in the end.

Note that the good thing about Wiktionary is that you can add information freely without adhering to a preset structure.

DBpedia is already implementing adapters to load data from Wikidata. So once Wikidata is working for Wiktionary, we will have data from there and from the remaining wikisyntax, and we will merge them. DBpedia and Wikidata have a loose cooperation for a joint task in a Google Summer of Code proposal.

All the best, Sebastian

Amgine

11:48 a.m.

There are several Wiktionary proposals for GSoC. I'm aware of another for pronunciation recording from Wiktionary pages, and one to create a DICT-like API either as an extension to the MW API or as a special page extension.

Amgine

Sebastian Hellmann

3:25 p.m.

Hi Amgine, I think these are just ideas for now, and the students still have to break them down into proposals, right? Do you have links to these ideas?

Our Wiktionary related ideas for GSoC are here: http://wiki.dbpedia.org/gsoc2013/ideas#h254-12 http://wiki.dbpedia.org/gsoc2013/ideas#h254-13

-- Sebastian

Amgine

4:02 p.m.

One is being written up by the student. The other is a project idea still looking for a student.

Amgine

Mathieu Stumpf

8 Apr

1:43 a.m.

On 2013-04-05 20:48, Amgine wrote:

There are several Wiktionary proposals for GSoC. I'm aware of another for pronunciation recording from Wiktionary pages, and one to create a DICT-like API either as an extension to the MW API or as a special page extension.

Oh, great! If you have some relevant links, please share them on the meta page. :)

Mathieu Stumpf

1:41 a.m.

On 2013-04-05 19:44, Sebastian Hellmann wrote:

This page is interesting, but seems very idealistic. I am not sure every language community would agree to use a common model. I also wonder whether this is possible at all and whether there is an overlap. Do you think it makes sense to edit that page? Normally, there is a lot of talk and planning, and nothing comes of it in the end.

I would put it another way: this page tries to address the problem from a long-term perspective, but with real, concrete goals. Sure, you can't reach the one solution that will make everybody happy, but getting people to talk together about their specific issues and their expectations of Wiktionaries is a path that I think is worth exploring. To my mind, this should give us a better overview of the various kinds of linguistic knowledge people expect to find in Wiktionaries, and of how to improve the transfer of this knowledge between chapters.

As the page says, this is not a trivial problem, because it requires gathering a lot of linguistic expertise, as well as thinking about the UX we want to provide to end users and make easy for third parties.

Note that the good thing about Wiktionary is that you can add information freely without adhering to a preset structure.

Yes and no. Sure, apart from the wikisyntax, no specific structure is imposed on Wiktionary chapters. But in practice, you know that they did adopt a more or less rigid structure, because it was relevant. Now we are in a situation where each chapter has its own idiom of templates, which not only makes it harder to automate cross-chapter information transfer, but can also scare newcomers. This is a really serious issue: I know that, at least for the French chapter, we are losing would-be contributors because of the heavy use we make of templates. Don't get me wrong here, I'm not blaming the French Wiktionary community; to my mind it's an upstream issue.

You know that having more editors is one of our community goals, don't you? Well, to have more editors, we have to make the learning curve for participating as gentle as possible. That requires a good UX, and that requires a well-thought-out end-user interface/API integration. I have no doubt it will be really difficult to integrate the Visual Editor into the French Wiktionary, for example, because articles there rely heavily on templates and, as far as I know, the Visual Editor doesn't provide (yet?) any tool to structure information beyond section/bold/italic. But in the French Wiktionary, even sections are created using templates!

DBpedia is already implementing adapters to load data from Wikidata. So once Wikidata is working for Wiktionary, we will have data from there and from the remaining wikisyntax, and we will merge them. DBpedia and Wikidata have a loose cooperation for a joint task in a Google Summer of Code proposal.

Well, that's great; we need that work to be done too. Thank you for doing it.

Dimitris Kontokostas

5 Apr

12:05 a.m.

Hi Moutupsi,

You should definitely take a look at DBpedia Wiktionary (http://dbpedia.org/Wiktionary). It supports everything you want and can be easily configured for other languages.

Best, Dimitris

-- Dimitris Kontokostas Department of Computer Science, University of Leipzig Research Group: http://aksw.org Homepage: http://aksw.org/DimitrisKontokostas

Andrew Krizhanovsky

2:23 a.m.

DBpedia Wiktionary is a very interesting project!

Is it possible to get a list of synonyms for the first meaning of the noun "dog" now? http://en.wiktionary.org/wiki/dog#Synonyms

Best regards, Andrew Krizhanovsky.

Sebastian Hellmann

6:57 a.m.

Hi Andrew, actually the tools to solve this problem are in place: http://en.wiktionary.org/wiki/house#English-abode links to a sense, and the highlighting is there. Also, if you go to Editing Gadgets, you can enable "Enable definition editing options." to add glosses. This was created by Yair_rand, and it allows you to connect senses with the help of glosses such as "abode".

However, this has not seen any uptake in the Wiktionary community.

The idea is to have something like this (on http://en.wiktionary.org/wiki/house#English-establishment):

# {{senseid|en|establishment}}An [[establishment]], whether actual, as a pub, or virtual, as a website. Particularly restaurant, casino, or financial or trading company.
...
* {{sense|establishment}} [[shop]]
...
{{trans-top|an establishment}}
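To make the linking idea concrete, here is a minimal Python sketch of how the sense identifier and the words attached to a `{{sense|...}}` marker could be pulled out of wikitext like the snippet above. The regular expressions and the function names `sense_ids` and `relations_by_sense` are assumptions made for illustration; this is not the extraction logic of DBpedia Wiktionary or of the gadget, and real entries vary far more than this handles.

```python
import re

# Illustrative patterns only (assumptions, not the DBpedia Wiktionary parser).
# {{senseid|<lang>|<id>}} marks a definition line with a stable identifier.
SENSEID = re.compile(r"\{\{senseid\|([^|{}]+)\|([^|{}]+)\}\}")
# {{sense|<gloss>}} marks which sense a synonym line belongs to.
SENSE = re.compile(r"\{\{sense\|([^|{}]+)\}\}")
# [[target]] or [[target|label]]; only the link target is captured.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def sense_ids(wikitext):
    """Return (language, sense id) pairs for every senseid template."""
    return SENSEID.findall(wikitext)


def relations_by_sense(wikitext):
    """Map each {{sense|gloss}} marker to the wikilinks that follow it on the same line."""
    result = {}
    for line in wikitext.splitlines():
        match = SENSE.search(line)
        if match:
            links = WIKILINK.findall(line[match.end():])
            result.setdefault(match.group(1), []).extend(links)
    return result


# Sample text shortened from the example in the message.
sample = (
    "# {{senseid|en|establishment}}An [[establishment]], whether actual, "
    "as a pub, or virtual, as a website.\n"
    "* {{sense|establishment}} [[shop]]\n"
)
print(sense_ids(sample))
print(relations_by_sense(sample))
```

Joining the two results on the shared gloss ("establishment") is what would connect the synonym "shop" to one specific definition rather than to the whole entry.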

But these templates do not occur frequently. For senses, however, these seem to be available: http://wiktionary.dbpedia.org/resource/as_soon_as_possible-English-Adverb-1e...

Query (endpoint: http://wiktionary.dbpedia.org/sparql):

select * where { Graph ?g { ?s <http://wiktionary.dbpedia.org/terms/hasSynonym> ?o } } limit 100
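For anyone who wants to send this query from a script, the following Python sketch builds the request URL. It assumes only the standard SPARQL protocol convention of passing the query text in a `query` parameter. The helper names `synonym_query` and `request_url` are made up for this example, and the sketch deliberately stops short of sending the request, since whether the endpoint is reachable is not something it can guarantee.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and predicate IRI as given in the message above.
ENDPOINT = "http://wiktionary.dbpedia.org/sparql"
HAS_SYNONYM = "http://wiktionary.dbpedia.org/terms/hasSynonym"


def synonym_query(limit=100):
    """Return the synonym query, with the predicate written as an IRI in angle brackets."""
    return "SELECT * WHERE { GRAPH ?g { ?s <%s> ?o } } LIMIT %d" % (HAS_SYNONYM, limit)


def request_url(query, endpoint=ENDPOINT):
    """Encode the query text as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = request_url(synonym_query())
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == synonym_query())
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the result format is normally negotiated through the `Accept` header.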

All the best, Sebastian

Andrew Krizhanovsky

9:13 a.m.

Thanks, Sebastian, for the quick reply.

But these do not occur frequently. For senses these seem to be available however...

Can you count how many senses and synonyms were successfully extracted from the English Wiktionary and the Russian Wiktionary, i.e. how many senses and synonyms are now available in DBpedia Wiktionary?

It would be interesting to compare them with the number of senses and synonyms extracted from Wiktionaries by the wikokit parser; see http://code.google.com/p/wikokit/#Statistics

Best regards, Andrew.

Sebastian Hellmann

6 Apr

9:53 p.m.

Hi Andrew, some statistics are in here: http://svn.aksw.org/papers/2012/JIST_Wiktionary/public.pdf

I executed a SPARQL query on the store to do these statistics: http://downloads.dbpedia.org/wiktionary/stats_2013_04_06.csv

We tried to honor ELE[1] for extraction, so most likely, if a Wiktionary page deviates from ELE, the results will not be as good for it.

I assume you are familiar with SPARQL, because of your D2R mapping for wikokit. Here is the query:

    Select ?g ?p (count(?p) as ?count)
    where { Graph ?g { ?s ?p ?o } }
    group by ?p ?g
    order by desc(?g) desc(?count)

It takes too long to run over HTTP. If you are interested in more complex statistics and calculations, I can also give you better access to our service (maybe even ssh access).
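For anyone who wants to try the statistics query from a script, here is a minimal sketch that only builds the request URL and does not contact the server. It assumes the endpoint http://wiktionary.dbpedia.org/sparql mentioned elsewhere in this thread and the standard `query` parameter of the SPARQL protocol. The function name `build_query_url` is made up for illustration, and the count projection is written with parentheses as SPARQL 1.1 requires.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint quoted in this thread; whether it is still reachable is not checked here.
ENDPOINT = "http://wiktionary.dbpedia.org/sparql"

# Per-graph, per-predicate triple counts, in SPARQL 1.1 standard syntax.
PREDICATE_COUNTS = """
SELECT ?g ?p (COUNT(?p) AS ?count)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?p ?g
ORDER BY DESC(?g) DESC(?count)
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries `query` in the `query` parameter."""
    compact = " ".join(query.split())  # collapse newlines and repeated spaces
    return endpoint + "?" + urlencode({"query": compact})


url = build_query_url(ENDPOINT, PREDICATE_COUNTS)
```

The resulting URL could then be fetched with `urllib.request.urlopen`, ideally with an `Accept: application/sparql-results+json` header. Since the query is reported to be too slow over HTTP, adding a `LIMIT` or restricting it to one graph is probably wise before trying.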

All the best, Sebastian

[1] https://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained

On 05.04.2013 18:13, Andrew Krizhanovsky wrote:

Thanks, Sebastian, for the quick reply.

"But these do not occur frequently. For senses these seem to be available however..."

Can you count how many senses and synonyms were successfully extracted from the English Wiktionary and the Russian Wiktionary, i.e. how many senses and synonyms are now available in DBpedia Wiktionary?

It would be interesting to compare this with the number of senses and synonyms extracted from the Wiktionaries by the wikokit parser, see http://code.google.com/p/wikokit/#Statistics

Best regards, Andrew.

On Fri, Apr 5, 2013 at 5:57 PM, Sebastian Hellmann hellmann@informatik.uni-leipzig.de wrote:

Hi Andrew, actually the tools to solve this problem are in place: http://en.wiktionary.org/wiki/house#English-abode links to a sense, and the highlighting is there. Also, if you go to Editing Gadgets you can enable "Enable definition editing options." to add glosses. This was created by Yair_rand and it allows you to connect senses with the help of glosses such as "abode".

However, this has not received any uptake by the Wiktionary community.

The idea is to have something like this (on http://en.wiktionary.org/wiki/house#English-establishment):

    # {{senseid|en|establishment}}An [[establishment]], whether actual, as a pub, or virtual, as a website. Particularly restaurant, casino, or financial or trading company.
    ...
    {{sense|establishment}} [[shop]]
    ...
    {{trans-top|an establishment}}

But these do not occur frequently. For senses these seem to be available however: http://wiktionary.dbpedia.org/resource/as_soon_as_possible-English-Adverb-1e...

Query (at http://wiktionary.dbpedia.org/sparql):

    select * where { Graph ?g { ?s <http://wiktionary.dbpedia.org/terms/hasSynonym> ?o } } limit 100
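The gloss-based linking described above (a `{{senseid|en|...}}` on the definition line, a matching `{{sense|...}}` on the synonym line) can be sketched in a few lines of Python. This is only an illustration of the idea, not the DBpedia extractor. The function name `link_synonyms`, the `* ` list prefix and the extra "abode" line in the sample are assumptions made for the example.

```python
import re
from typing import Dict, List

# {{senseid|en|establishment}} marks a definition line with a sense id.
SENSEID = re.compile(r"\{\{senseid\|([^|}]+)\|([^|}]+)\}\}")
# {{sense|establishment}} tags a synonym line with the gloss it refers to.
SENSE = re.compile(r"\{\{sense\|([^|}]+)\}\}")
# [[shop]] or [[shop|shops]]: capture the link target only.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def link_synonyms(wikitext: str) -> Dict[str, List[str]]:
    """Map each sense id found on a definition line to the synonyms
    whose {{sense|...}} gloss equals that id."""
    synonyms: Dict[str, List[str]] = {}
    lines = wikitext.splitlines()
    # First pass: collect sense ids from definition lines ("# ...").
    for line in lines:
        found = SENSEID.search(line)
        if found and line.lstrip().startswith("#"):
            synonyms.setdefault(found.group(2), [])
    # Second pass: attach linked words from lines tagged with a known gloss.
    for line in lines:
        found = SENSE.search(line)
        if found and found.group(1) in synonyms:
            synonyms[found.group(1)].extend(LINK.findall(line[found.end():]))
    return synonyms


# Hand-written sample; the "abode" line has no matching senseid and is skipped.
SAMPLE = """\
# {{senseid|en|establishment}}An [[establishment]], whether actual, as a pub, or virtual, as a website.
* {{sense|establishment}} [[shop]]
* {{sense|abode}} [[home]]
"""

result = link_synonyms(SAMPLE)
```

Synonym lines whose gloss has no `senseid` counterpart are dropped here. That is exactly the situation on most pages today, so a real extractor would need a fallback such as matching by sense order.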

All the best, Sebastian


Andrew Krizhanovsky

7 Apr

12:46 a.m.

Thank you for the paper. I like the overview in this paper and the clear description of the difficulties of Wiktionary parsing.

At the beginning of wikokit development I considered using a finite-state machine to extract data, but it was too complex for me, and the formatting of Wiktionary data varies too much in kind and quality :) So I chose ordinary procedural programming with short pieces of regular expressions.

But your project proves that finite-state machines can be used in non-trivial situations. Great!
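To make "procedural programming with short pieces of regular expressions" concrete, here is a small sketch of that style: a plain loop over lines plus one short regular expression for wikitext headings. It is not taken from wikokit. The function name `split_sections` and the sample entry are invented for illustration.

```python
import re

# A heading line such as "==English==" or "====Synonyms====";
# the backreference \1 requires the same number of "=" on both sides.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def split_sections(wikitext):
    """Return a list of (level, title, body_lines) tuples in document order."""
    sections = []
    current = None
    for line in wikitext.splitlines():
        found = HEADING.match(line)
        if found:
            # Start a new section; the level is the number of "=" signs.
            current = (len(found.group(1)), found.group(2), [])
            sections.append(current)
        elif current is not None and line.strip():
            # Non-empty line under the most recent heading.
            current[2].append(line)
    return sections


# Hand-written sample entry, not copied from any Wiktionary page.
SAMPLE = """\
==English==
===Noun===
# First definition.
====Synonyms====
* [[hound]]
"""

sections = split_sections(SAMPLE)
```

Each later step (definitions, synonyms, translations) can then be a separate small function over the body lines of one section. This keeps the regular expressions short, at the cost of handling every layout deviation by hand.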

-- Andrew Krizhanovsky.


Lars Aronsson

4:54 a.m.

On 04/07/2013 09:46 AM, Andrew Krizhanovsky wrote:

Thank you for the paper. I like the overview in this paper and the clear description of Wiktionary parsing difficulties.

An issue that is related to Wiktionary parsing is the automatic creation of Wiktionary entries by bots.

I have used a bot to create inflection entries, but only for Swedish words in the English Wiktionary, and not for main entries with definitions. What attempts of that kind have been made, and what software or data structures have they used? Could that work be generalized and coordinated?

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

wiktionary-l@lists.wikimedia.org

participants (7)

Amgine
Andrew Krizhanovsky
Dimitris Kontokostas
Lars Aronsson
Mathieu Stumpf
Moutupsi Paul
Sebastian Hellmann