Hey Daniel!
The architecture we discussed with the team at HPI is a bit different from what we designed for the GSoC project. The idea is to have a MediaWiki extension that relies directly on the data in a MySQL table and generates suggestions from that. It does not care where the data comes from, so the database table(s) serve as an interface between the "front" (MediaWiki) part and the "back" (data analysis) part. This has two advantages: 1) front and back are decoupled and only have to agree on the structure and interpretation of the data in the database (this is the current TODO), and 2) no new services need to be deployed in the public-facing subnet.
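To make the "table as interface" idea concrete, here's a rough sketch. Everything here is a placeholder (table name, columns, property IDs) since the actual format is exactly the open TODO; sqlite3 just stands in for MySQL so the sketch is runnable:

```python
import sqlite3

# Hypothetical schema for the shared suggestions table -- the actual
# columns are what the front and back parts still have to agree on.
SCHEMA = """
CREATE TABLE entity_suggestions (
    entity_id    TEXT NOT NULL,   -- e.g. a property like 'P31'
    suggested_id TEXT NOT NULL,   -- entity/property to suggest next
    score        REAL NOT NULL,   -- confidence from the analysis step
    PRIMARY KEY (entity_id, suggested_id)
);
"""

def top_suggestions(conn, entity_id, limit=5):
    """All the MediaWiki side would do: read pre-computed rows."""
    return list(conn.execute(
        "SELECT suggested_id, score FROM entity_suggestions"
        " WHERE entity_id = ? ORDER BY score DESC LIMIT ?",
        (entity_id, limit)))

# sqlite3 stands in for MySQL here just to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO entity_suggestions VALUES (?, ?, ?)",
                 [("P31", "P279", 0.9), ("P31", "P373", 0.4)])
print(top_suggestions(conn, "P31"))
```

The backend only ever writes rows, the extension only ever reads them, so either side can be swapped out independently.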
This is great. Makes me feel like "why didn't I think of this!". Less coupling!
I think your expertise with data ingestion could help the folks at the HPI quite a bit. Also, the modular architecture allows for data analysis components to be swapped out easily, and we would like to try and compare different approaches for data analysis.
Brilliant.
One based on Hadoop and/or Myrrix could well be an option
though I'm not sure whether Myrrix would be very useful, since the actual generation of suggestions from the pre-processed data would already be covered.
I see. So in this case we don't need real-time fetching of suggestions from a Java web service. Rather, the backend part (the data analysis component) will be something that parses the datasets, performs analyses (collaborative filtering, or anything else) to generate data, and pushes that data directly to a MySQL database, in a format agreed upon by both the frontend API and the data analysis module.
You're right, Myrrix won't need to run as a service. We can still use it as a command-line program to generate suggestions and store them in the MySQL DB (or use Mahout, or any other machine learning library we decide on). Hadoop is just for making things faster.
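For what it's worth, the analysis step can start out very simple before we bring in Myrrix/Mahout. Here's a toy co-occurrence count standing in for real collaborative filtering; the property IDs, sample data, and the (entity_id, suggested_id, score) output format are all made up for illustration:

```python
from collections import Counter
from itertools import permutations

def cooccurrence_suggestions(item_property_sets):
    """Toy stand-in for the analysis component: for each property,
    score the properties that co-occur with it on the same items."""
    pair_counts = Counter()
    solo_counts = Counter()
    for props in item_property_sets:
        solo_counts.update(props)
        pair_counts.update(permutations(sorted(props), 2))
    # Rows in a hypothetical agreed format: (entity_id, suggested_id, score),
    # where score is the conditional frequency of b given a.
    return sorted((a, b, n / solo_counts[a])
                  for (a, b), n in pair_counts.items())

# Each set is the properties used on one item (made-up sample data).
items = [{"P31", "P279"}, {"P31", "P279", "P373"}, {"P31", "P373"}]
rows = cooccurrence_suggestions(items)
# These rows would then be bulk-inserted into the MySQL table.
```

Whatever tool we end up choosing, its job reduces to producing rows like these; the frontend never needs to know which one we used.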
Let's have a proper discussion about this on IRC and get some planning done, so everyone is on the same page, including which data analysis methods we'd like to explore. This week I have some time on my hands from 29 Nov to 1 Dec, and I'm free from 13 Dec (holidays!).
Cheers, Nilesh
This is just an idea; I think you can best figure things out among yourselves.
Cheers, Daniel
On 25.11.2013 17:01, Lydia Pintscher wrote:
Hey everyone,
I have the feeling it would be good to make an official introduction. Nilesh has been working on the Wikidata entity suggester. There is now a team of students working on the entity suggester to get it finished and ready for production as part of their bachelor project. It would be good if you could work together and coordinate on the public wikidata-tech list. I'm sure that with all of you working together we can provide the Wikidata community with the great entity suggester they are waiting for. Virginia and co: are you still having issues with the data import? Maybe Nilesh can help you with that as a good first step.
Cheers Lydia