Re: [Wikitech-l] [GSoc2013] Interested in developing Entity Suggester

3 May 2013


On Fri, May 3, 2013 at 5:39 AM, Nilesh Chakraborty <nilesh@nileshc.com> wrote:
...
Hi Lydia,
I am currently drafting my proposal; I shall submit it within a few hours,
once the initial version is complete.
I installed mediawiki-vagrant on my PC and it went quite smoothly. I could
do all the usual things through the browser; I logged into the mysql server
to examine the database schema.
I also began to clone the wikidata-vagrant repo
<https://github.com/SilkeMeyer/wikidata-vagrant>.
But it seems that the 'git submodule update --init' part would take a long
time - if I'm not mistaken, it's a huge download (excluding the vagrant up
command, which alone takes around 1.25 hours to download everything). I
wanted to clarify something before downloading it all.
Since the entity suggester will be working with wikidata, it'll obviously
need to access the whole live dataset from the database (not the xml dump)
to make the recommendations. I tried searching for database access APIs or
high-level REST APIs for wikidata, but couldn't figure out how to do
that. Could you point me to the proper documentation?
One of the best examples of a MediaWiki extension interacting with a Java
service is how Solr is used.  Solr is still pretty new at Wikimedia,
though.  It is used with the GeoData extension and then Solr is used by
geodata api modules.
I think Solr gets updated via a cronjob (solrupdate.php) which creates jobs
in the job queue.  Not 100% sure of the exact details.
I do not think direct access to the live database is very practical. In
any case, I think the data (JSON blobs) would need to be indexed in some
particular way to support what the entity selector needs to do.
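To illustrate what "indexing the JSON blobs" could look like, here is a
minimal sketch in Python. It is not how Wikidata or Solr actually stores or
indexes entities: the blob shape (an "id" plus a "claims" mapping), the
entity ids, and the function name index_by_property are all assumptions made
up for this example.

```python
import json
from collections import defaultdict


def index_by_property(blobs):
    """Map each property name to the set of entity ids that use it.

    `blobs` is an iterable of JSON strings in a simplified, hypothetical
    shape: {"id": ..., "claims": {property: [values, ...]}}.
    """
    index = defaultdict(set)
    for blob in blobs:
        entity = json.loads(blob)
        for prop in entity["claims"]:
            index[prop].add(entity["id"])
    return index


# Hypothetical sample data, not real Wikidata entities.
blobs = [
    '{"id": "E1", "claims": {"place of birth": ["X"], "country of citizenship": ["Y"]}}',
    '{"id": "E2", "claims": {"place of birth": ["Z"]}}',
]

index = index_by_property(blobs)
print(sorted(index["place of birth"]))
```

An index like this answers "which entities use property P?" without
touching the live database on every request, which is the kind of lookup
a suggester would likely need.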
http://www.mediawiki.org/wiki/Extension:GeoData
The Translate extension also uses Solr in some way, though I am not very
familiar with the details.
On the operations side, puppet is used to configure everything.  The puppet
git repo is available to see how things are done.
https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=modul...
...
And also, what is the best way to add a few .jar files to wikidata and
execute them with custom commands (nohup java blah.jar --blah blah -->
running as daemons)? I can of course set it up on my development box inside
virtualbox - I want to know how to "integrate" it into the system so that
any other user can download vagrant and wikidata and have the jars all
ready and running? What is the proper development workflow for this?
wikidata-vagrant is maintained on GitHub, though I think it might not work
perfectly right now.  We need to update it (it's on our to-do list), and
perhaps it could be moved to Gerrit.  I do not know about integrating the
jars, but it should be possible.
Cheers,
Katie Filbert
[answering from this email, as I am not subscribed to wikitech-l on my
wikimedia.de email]
...
Thanks,
Nilesh
On Sun, Apr 28, 2013 at 3:01 AM, Nilesh Chakraborty <nilesh@nileshc.com> wrote:
Awesome. Got it.
I see what you mean, great, thank you. :)
Cheers,
Nilesh
On Apr 28, 2013 2:56 AM, "Lydia Pintscher" <lydia.pintscher@wikimedia.de> wrote:
On Sat, Apr 27, 2013 at 11:14 PM, Nilesh Chakraborty <nilesh@nileshc.com> wrote:
Hi Lydia,
That helps a lot, and makes it way more interesting. Rather than there
being a one-size-fits-all solution, it seems to me that each property or
each type of property (e.g. different relationships) will need individual
attention and different methods/metrics for recommendation.
The examples you gave, like continents, sex, relations like father/son,
uncle/aunt/spouse, or place-oriented properties like place of birth,
country of citizenship, ethnic group etc. - each type has a certain
pattern to it (if a person was born in the US, US should be one of the
countries he was a citizen of; US census/ethnicity statistics may be used
to predict ethnic group etc.). I'm already starting to chalk out a few
patterns and how they can be used for recommendation. In my proposal,
should I go into details regarding these? Or should I just give a few
examples and explain how the algorithms would work, to explain the idea?
Give some examples and how you'd handle them. You definitely don't
need to have it for all properties. What's important is giving an idea
about how you'd tackle the problem. Give the reader the impression
that you know what you are talking about and can handle the larger
problem.
Also: don't make the system too intelligent, for example by having it
know about US census data. Keep it simple and stupid for now. Things
like "property A is usually used with value X, Y or Z" should cover a
lot already and are likely enough for most cases.
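A minimal sketch in Python of the "property A is usually used with value
X, Y or Z" idea, assuming the existing statements are available as plain
(property, value) pairs. The function names build_value_counts and
suggest_values and the sample data are hypothetical, chosen only for this
illustration.

```python
from collections import Counter, defaultdict


def build_value_counts(statements):
    """Count how often each value appears with each property."""
    counts = defaultdict(Counter)
    for prop, value in statements:
        counts[prop][value] += 1
    return counts


def suggest_values(counts, prop, k=3):
    """Return up to k values most frequently used with `prop`."""
    return [value for value, _ in counts.get(prop, Counter()).most_common(k)]


# Hypothetical sample statements, not real Wikidata data.
statements = [
    ("continent", "Europe"), ("continent", "Europe"), ("continent", "Europe"),
    ("continent", "Asia"), ("continent", "Asia"),
    ("continent", "Africa"),
    ("sex", "male"), ("sex", "female"), ("sex", "male"),
]

counts = build_value_counts(statements)
print(suggest_values(counts, "continent", k=2))
```

This is plain frequency counting with no knowledge of what the properties
mean, which is about as "simple and stupid" as a suggester can be.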
Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Community Communications for Technical Projects
Wikimedia Deutschland e.V.
Obentrautstr. 72
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as a charitable
organization by the Finanzamt für Körperschaften I Berlin, tax number
27/681/51985.

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
A quest eternal, a life so small! So don't just play the guitar, build one.
You can also email me at contact@nileshc.com or visit my
website: http://www.nileshc.com/
-- 
@wikimediadc / @wikidata
