Hi,
I am a third-year computer science undergraduate pursuing my B.Tech degree at RCC Institute of Information Technology. I am proficient in Java, PHP and C#.
Among the project ideas on the GSoC 2013 ideas page, the one that seems most interesting to me is developing an Entity Suggester for Wikidata, and I would like to work on it.
I am passionate about data mining, big data and recommendation engines, so this idea naturally appeals to me. I have experience building music and people recommendation systems and have worked with Myrrix and Apache Mahout. I recently designed and implemented such a recommendation system and deployed it on a live production site where I am interning, to recommend Facebook users to each other based on their interests.
The problem is that the documentation for Wikidata and the Wikibase extension seems pretty daunting to me, since I have never set up or actually used a MediaWiki instance. (I am on my way to trying it out, following the instructions at http://www.mediawiki.org/wiki/Summer_of_Code_2013#Where_to_start.) I can easily build a recommendation system and create a web-service or REST-based API through which the engine can be trained with existing data and queried. This seems to be a collaborative filtering problem (people who bought x also bought y). It would help if I could get some guidance on where and how I need to integrate it with Wikidata. Also, some sample datasets (CSV files?) or schemas (just the column names and data types?) would help a lot for me to figure this out.
I have added this email as a comment on the bug report at https://bugzilla.wikimedia.org/show_bug.cgi?id=46555#c1.
Please ask me if you have any questions. :-)
Thanks, Nilesh
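To make the collaborative-filtering framing above a bit more concrete, here is a minimal, self-contained Java sketch of property co-occurrence counting ("items that have property x often also have property y"). The in-memory maps, the input layout and the property ids are only illustrative assumptions; a real implementation would more likely sit on top of an engine such as Myrrix or Mahout.

import java.util.*;
import java.util.stream.Collectors;

// Minimal property co-occurrence counter: for a partially entered item,
// rank candidate properties by how often they co-occur with the item's
// existing properties across the whole dataset. The input layout
// (one Set of property ids per item) is a simplifying assumption.
public class PropertyCooccurrence {

    // cooccur.get(p).get(q) = number of items that use both p and q
    private final Map<String, Map<String, Integer>> cooccur = new HashMap<>();

    public void train(Collection<Set<String>> itemsWithProperties) {
        for (Set<String> props : itemsWithProperties) {
            for (String p : props) {
                for (String q : props) {
                    if (p.equals(q)) continue;
                    cooccur.computeIfAbsent(p, k -> new HashMap<>())
                           .merge(q, 1, Integer::sum);
                }
            }
        }
    }

    // Suggest properties for an item that already uses the given ones.
    public List<String> suggest(Set<String> existing, int limit) {
        Map<String, Integer> scores = new HashMap<>();
        for (String p : existing) {
            cooccur.getOrDefault(p, Collections.emptyMap()).forEach((q, count) -> {
                if (!existing.contains(q)) {
                    scores.merge(q, count, Integer::sum);
                }
            });
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        PropertyCooccurrence cf = new PropertyCooccurrence();
        // Toy data: three "city"-like items; the property ids are only illustrative.
        cf.train(Arrays.asList(
                new HashSet<>(Arrays.asList("P17", "P1082", "P625")),
                new HashSet<>(Arrays.asList("P17", "P1082", "P421")),
                new HashSet<>(Arrays.asList("P17", "P625", "P421"))));
        // A new item where only "P17" has been set so far.
        System.out.println(cf.suggest(Collections.singleton("P17"), 3));
    }
}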
Hi Nilesh :)
This sounds excellent!
I think it is important that you try to set up a system where you can test what you're working on. If the documentation is not good enough for you to get this running, please let me know where you are stuck; then we need to improve the documentation there. That'll make it a lot easier for others following you :)
I assume you have already familiarized yourself with wikidata.org, browsed around and made a few edits? That should help you get a feeling for why the suggester is so important. http://meta.wikimedia.org/wiki/Wikidata/Notes/Data_model_primer is also important to understand for this project.
Let me know if you have more questions or get stuck.
Cheers Lydia
-- Lydia Pintscher - http://about.me/lydia.pintscher Community Communications for Technical Projects
You can get the data from here: http://dumps.wikimedia.org/wikidatawiki/20130417/
All items with all properties and their values are inside the dump. The questions would be, based on this data, could we make suggestions for:
- when I create a new statement, suggest a property, then suggest a value
- suggest qualifier properties, then suggest qualifier values (there is no data yet on qualifiers, but this would change soon)
- suggest properties for references, and values
Does this help?
Cheers, Denny
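For orientation, the three suggestion modes listed above can be pictured as one small interface over (item, property, value) statements extracted from the dump. This is only a possible shape; the class and method names are made up and are not part of Wikibase.

import java.util.List;

// One (item, property, value) statement as it might be extracted from the
// dump; qualifiers and references are left out for brevity. All names here
// are illustrative, not actual Wikibase classes.
final class Statement {
    final String itemId;      // e.g. "Q668"
    final String propertyId;  // e.g. "P30"
    final String value;       // serialized value: an item id, string, number, ...

    Statement(String itemId, String propertyId, String value) {
        this.itemId = itemId;
        this.propertyId = propertyId;
        this.value = value;
    }
}

// The three suggestion modes from the list above, written as one hypothetical API.
interface EntitySuggester {
    // "when I create a new statement, suggest a property, then suggest a value"
    List<String> suggestProperties(String itemId, int limit);
    List<String> suggestValues(String itemId, String propertyId, int limit);

    // the same two questions for qualifiers of an existing statement
    List<String> suggestQualifierProperties(Statement statement, int limit);
    List<String> suggestQualifierValues(Statement statement, String qualifierPropertyId, int limit);

    // and for references
    List<String> suggestReferenceProperties(Statement statement, int limit);
    List<String> suggestReferenceValues(Statement statement, String referencePropertyId, int limit);
}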
Hi Lydia, hello Denny,
Thank you so much, and apologies for my delayed response; I got caught up with university lab exams and assignments as the semester draws to an end.
Firstly, Lydia - thanks for the link; the Data model primer really helps. I've been browsing the pages on wikidata.org, and I now realize why this is so important (please see my question in the last paragraph): it'll reduce a lot of repeated labour. I'm currently setting up a MediaWiki instance and my development environment, and I'll ask quick questions if I need any help along the way.
Denny - that's just what I wanted to know, clean and crisp. Thanks! I'm browsing the XML dump for some familiar entries and comparing them to the pages on wikidata.org (e.g. India, http://www.wikidata.org/wiki/Q668). I'm getting the whole picture now.
I have a question: when someone creates a new statement, I can use collaborative filtering to suggest properties. For example, in the simplest terms: suppose there are X cities in the dataset, and the user is adding another city (writing 'city in Australia' as the short description). The system checks all the other cities, figures out the common properties and suggests them. Cool. But I can't come up with any exact ideas off the top of my head that can be used to suggest values for those properties. Suppose one of the recommended properties is "population" - how can I make the system guess its value? (Am I getting this right?) Do you have anything in mind regarding this? Please point me in the right direction. :)
Cheers, Nilesh
For your example, I'd say that indeed isn't really possible. But take for example a country: someone wants to add http://www.wikidata.org/wiki/Property:P30 to indicate which continent the country is on. Across all of Wikidata this property should have a very limited number of values. The same is true for things like the sex of a person. And then, for something a bit more advanced, there are things like the property father; the suggested values for this should be other items that are persons. http://www.wikidata.org/wiki/Wikidata:List_of_properties has the list of all current properties. I am sure you can find more such cases.
Hope that makes it clearer.
Cheers Lydia
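A minimal sketch of the simplest version of what is described above: for properties whose values come from a small set (continent, sex, ...), just rank the values already used with that property across the dataset. The input is assumed to have been extracted from the dump beforehand, and the ids in the example are only illustrative.

import java.util.*;
import java.util.stream.Collectors;

// "Property A is usually used with value X, Y or Z": count, per property,
// how often each value occurs across the dataset and suggest the most
// frequent ones. Works well for closed sets like continent or sex.
public class ValueFrequencySuggester {

    // propertyId -> (value -> number of statements using that value)
    private final Map<String, Map<String, Integer>> valueCounts = new HashMap<>();

    public void observe(String propertyId, String value) {
        valueCounts.computeIfAbsent(propertyId, k -> new HashMap<>())
                   .merge(value, 1, Integer::sum);
    }

    public List<String> suggestValues(String propertyId, int limit) {
        return valueCounts.getOrDefault(propertyId, Collections.emptyMap())
                .entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ValueFrequencySuggester s = new ValueFrequencySuggester();
        // Toy observations for P30 ("continent"); the ids are only illustrative.
        s.observe("P30", "Q46");  // Europe
        s.observe("P30", "Q46");
        s.observe("P30", "Q48");  // Asia
        System.out.println(s.suggestValues("P30", 2));  // [Q46, Q48]
    }
}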
Hi Lydia,
That helps a lot, and makes it way more interesting. Rather than being a one-size-fits-all solution, it seems to me that each property, or each type of property (e.g. different relationships), will need individual attention and different methods/metrics for recommendation.
The examples you gave - continents, sex, relations like father/son or uncle/aunt/spouse, or place-oriented properties like place of birth, country of citizenship, ethnic group, etc. - each type has a certain pattern to it (if a person was born in the US, the US should be one of the countries they are a citizen of; US census/ethnicity statistics could be used to predict ethnic group, and so on). I'm already starting to chalk out a few such patterns and how they can be used for recommendation. In my proposal, should I go into detail about these? Or should I just give a few examples and explain how the algorithms would work, to convey the idea?
Thanks, Nilesh
Give some examples and show how you'd handle them. You definitely don't need to cover all properties. What's important is giving an idea of how you'd tackle the problem. Give the reader the impression that you know what you are talking about and can handle the larger problem.
Also: don't make the system too intelligent - like having it know about US census data, for example. Keep it simple and stupid for now. Things like "property A is usually used with value X, Y or Z" should cover a lot already and are likely enough for most cases.
Cheers Lydia
Awesome. Got it.
I see what you mean, great, thank you. :)
Cheers, Nilesh
Hi Lydia,
I am currently drafting my proposal; I shall submit it within a few hours, once the initial version is complete.
I installed mediawiki-vagrant on my PC and it went quite smoothly. I could do all the usual things through the browser, and I logged into the MySQL server to examine the database schema.
I also began to clone the wikidata-vagrant repo (https://github.com/SilkeMeyer/wikidata-vagrant). But it seems that the 'git submodule update --init' part would take a long time - if I'm not mistaken, it's a huge download (on top of the vagrant up command, which alone takes around 1.25 hours to download everything). I wanted to clarify something before downloading it all.
Since the entity suggester will be working with Wikidata, it will obviously need access to the whole live dataset from the database (not the XML dump) to make its recommendations. I tried searching for database access APIs or high-level REST APIs for Wikidata, but couldn't figure out how to do that. Could you point me to the proper documentation?
Also, what is the best way to add a few .jar files to Wikidata and execute them with custom commands (nohup java blah.jar --blah blah, i.e. running as daemons)? I can of course set it up on my development box inside VirtualBox, but I want to know how to "integrate" it into the system so that any other user can download Vagrant and Wikidata and have the jars ready and running. What is the proper development workflow for this?
Thanks, Nilesh
One of the best examples of a MediaWiki extension interacting with a Java service is how Solr is used. Solr is still pretty new at Wikimedia, though. It is used with the GeoData extension, and the geodata API modules then query Solr.
I think Solr gets updated via a cronjob (solrupdate.php) which creates jobs in the job queue. Not 100% sure of the exact details.
I do not think direct access to the live database is very practical. In any case, I think the data (JSON blobs) would need to be indexed in some particular way to support what the entity selector needs to do.
http://www.mediawiki.org/wiki/Extension:GeoData
The Translate extension also uses Solr in some way, though I am not very familiar with the details.
On the operations side, puppet is used to configure everything. The puppet git repo is available to see how things are done.
https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=modul...
wikidata-vagrant is maintained on GitHub, though I think it might not work perfectly right now. Updating it is on our to-do list, and perhaps it could be moved to Gerrit. I do not know about integrating the jars, but it should be possible.
Cheers, Katie Filbert
[answering from this email, as I am not subscribed to wikitech-l on my wikimedia.de email]
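As a rough illustration of the extension-talks-to-a-Java-service split described above (similar in spirit to how GeoData queries Solr), here is a minimal sketch of a Java process exposing a suggestion endpoint over HTTP using only the JDK. The URL, parameter names and hard-coded response are assumptions for illustration; a real service would plug in the trained recommender and proper JSON handling.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal HTTP wrapper around a suggester, using only the JDK's built-in
// server. A real deployment would add proper request parsing, JSON
// serialization and service management (puppet, init scripts, ...).
public class SuggesterService {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/suggest/properties", exchange -> {
            // e.g. GET /suggest/properties?item=Q42 -- query parsing omitted here.
            String query = exchange.getRequestURI().getQuery();
            if (query == null) query = "";
            // Hard-coded dummy suggestions; a real handler would call the recommender.
            String body = "{\"query\":\"" + query + "\",\"suggestions\":[\"P31\",\"P21\"]}";
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        System.out.println("Suggester listening on http://localhost:8080/suggest/properties");
    }
}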
Thanks for the help, Katie. I'll look into how Solr has been integrated with the GeoData extension. About wikidata-vagrant, no problem - I'll install it by following this page: http://www.mediawiki.org/wiki/Extension:Wikibase.
You're right, raw DB access can be painful and I'd need to rewrite a lot of code. I'm considering two options:
i) Using the database-related code in the Wikidata extension (I'm studying the DataModel classes and how they interact with the database) to fetch what I need and feed it into the recommendation engine.
ii) Not accessing the DB at all. Instead, I can write map-reduce scripts to extract all the training data and everything I need for each item from the wikidatawiki data dump and feed it into the recommendation engine. I can use a cron job to download the latest data dump when it becomes available and run the scripts on it. I don't think it would be an issue even if the engine lags behind by the interval at which the dumps are generated, since recommendation is all about approximation anyway.
My request to the devs and the community: please discuss the pros and cons of each approach and suggest which one you think would be best, mainly in terms of performance. I personally feel that option (ii) would be cleaner.
Cheers, Nilesh
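A rough sketch of what the batch step in option (ii) above could look like: stream the pages XML dump with StAX and hand each entity's JSON blob to a (hypothetical) statement parser that feeds the training step. The element names follow the standard MediaWiki export format; the actual entity JSON layout and claim parsing are left out on purpose and would need to be checked against the real dump.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Streams a wikidatawiki pages dump and pulls out, per page, the title
// (e.g. "Q668") and the revision text, which for entity pages holds the
// JSON blob. The dump is assumed to be decompressed already; the actual
// JSON/claim parsing (e.g. a parseStatements(title, json) helper feeding
// the counting code above) is deliberately left out.
public class DumpExtractor {
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
            String title = null;
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT) continue;
                String tag = xml.getLocalName();
                if (tag.equals("title")) {
                    title = xml.getElementText();
                } else if (tag.equals("text")) {
                    String json = xml.getElementText();
                    // Hypothetical hook: feed (title, json) into the trainer here.
                    System.out.println(title + ": " + json.length() + " bytes of entity JSON");
                }
            }
        }
    }
}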
Hi everyone,
One more thing - should I create a new thread to discuss prototyping my project (the entity suggester), and to discuss any issues I'm facing along the way or ask for help? Or should I just stick to this old thread?
Cheers, Nilesh
On 05/04/2013 10:44 AM, Nilesh Chakraborty wrote:
One more thing - should I create a new thread to discuss prototyping my project (the entity suggester), and to discuss any issues I'm facing along the way or ask for help? Or should I just stick to this old thread?
Usually developers discuss features and report progress in the related bug reports, while documenting things in the appropriate wiki page(s). wikitech-l is good for heads-ups, announcements, or when you are really stuck and need help from the wider community.
Therefore in your case these seem to be good starting points:
Bug 46555 - Entity suggester for Wikidata https://bugzilla.wikimedia.org/show_bug.cgi?id=46555
https://www.mediawiki.org/wiki/User:Nilesh.c/Entity_Suggester