The architecture we have discussed with the team at
the HPI is a bit different
from what we designed for the GSoC project. The idea is to have a MediaWiki
extension that relies directly on the data in a MySQL table, and generates
suggestions from that. It does not care where the data comes from, so the
database table(s) serve as an interface between the "front" (MediaWiki)
and the "back" (data analysis) part. This has two advantages: 1) front and
back are decoupled and only have to agree on the structure and interpretation
of the data in the database (this is the current TODO). 2) No new services
need to be deployed in the public-facing subnet.
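To make the table-as-interface idea concrete, here is a minimal sketch of both sides talking through one table. The table name, columns, and the (property1, property2, probability) format are assumptions for illustration, not the agreed-upon schema (agreeing on that is the current TODO), and sqlite3 stands in for MySQL so the sketch is self-contained:

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions,
# not the format the front and back will actually agree on.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE suggester_pairs (
        property1   INTEGER NOT NULL,  -- property already present on the item
        property2   INTEGER NOT NULL,  -- property to suggest
        probability REAL    NOT NULL   -- confidence computed by the backend
    )
""")

# "Back" (data analysis) side: push pre-computed suggestion data.
conn.executemany(
    "INSERT INTO suggester_pairs VALUES (?, ?, ?)",
    [(31, 569, 0.9), (31, 570, 0.4), (571, 572, 0.8)],
)

# "Front" (MediaWiki extension) side: read suggestions for an item
# that already uses property 31, ordered by confidence.
rows = conn.execute(
    "SELECT property2, probability FROM suggester_pairs "
    "WHERE property1 = ? ORDER BY probability DESC",
    (31,),
).fetchall()
print(rows)  # → [(569, 0.9), (570, 0.4)]
```

Note that neither side calls into the other: the backend only writes rows, the extension only reads them, which is exactly the decoupling described above.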
This is great. Makes me feel like "why didn't I think about this!".
I think your expertise with data ingestion could help
the folks at the HPI
a bit. Also, the modular architecture allows for data analysis components to
be swapped out easily, and we would like to try and compare different
approaches for data analysis.
One based on Hadoop and/or Myrrix could well be an option,
though I'm not sure whether Myrrix would be very useful, since the actual
generation of suggestions from the pre-processed data would already be
handled by the MediaWiki extension.
I see. So in this case we don't need the real-time fetching of
suggestions from a Java web service. Rather, the backend part (the
data analysis component) will be something that parses the datasets,
performs analyses (collaborative filtering, something else, anything)
to generate data that'll be pushed directly to a MySQL database (in a
certain format that will be agreed upon by both the frontend API and
the data analysis module).
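As a toy illustration of the kind of analysis the backend could run, here is a simple property co-occurrence computation producing rows in a (property1, property2, probability) shape. The input data, property IDs, and output format are all made-up assumptions; a real run would use Mahout/Myrrix or whatever library we settle on:

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for the parsed dataset: for each item, the set of
# properties it uses (property IDs are invented for the example).
items = [
    {31, 569, 570},
    {31, 569},
    {31, 106},
    {279, 31},
]

# Count, for every pair of properties used on the same item, how often
# they appear together, and how often each property appears at all.
pair_counts = Counter()
prop_counts = Counter()
for props in items:
    prop_counts.update(props)
    for p1, p2 in combinations(sorted(props), 2):
        pair_counts[(p1, p2)] += 1
        pair_counts[(p2, p1)] += 1

# Turn counts into conditional probabilities P(p2 given p1): rows in the
# assumed (property1, property2, probability) format, ready to be pushed
# to the MySQL table by a separate loading step.
rows = sorted(
    (p1, p2, count / prop_counts[p1])
    for (p1, p2), count in pair_counts.items()
)
```

On this toy data, property 569 co-occurs with property 31 on 2 of the 4 items using 31, so the row (31, 569, 0.5) comes out; swapping this module for a different analysis only changes how the rows are computed, not the table they land in.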
You're right, Myrrix won't be needed to run as a service. We can still
use it though as a command line program to generate suggestions and
store them in the MySQL DB. (or use Mahout, or any machine learning
library that we decide on). Hadoop is just for making things faster.
Please let's have a proper discussion on this on IRC and get a bit of
planning done, get everyone on the same page, including what data
analysis methods we'd like to explore. This week I have some time on
my hands from 29th Nov to 1st Dec, and I'm free from 13th Dec onwards.
This is just an idea, I think you can best figure
things out among yourselves.
On 25.11.2013 17:01, Lydia Pintscher wrote:
I have the feeling it would be good to make an official introduction.
Nilesh has been working on the Wikidata entity suggester. There is now
a team of students who are working on the entity suggester to get it
finished and ready for production as part of their bachelor project.
It would be good if you could work together and coordinate on the
public wikidata-tech list. I'm sure with you all working together we
can provide the Wikidata community with the great entity suggester
they are waiting for.
Virginia and co: Are you still having issues with the data import?
Maybe Nilesh can help you with that as a first good step.