Hi,
I spoke to Dario today about investigating uses for our Hadoop cluster.
This is an internal cluster, but it's mirrored on labs, so I'm posting to
the public list in case people are interested in the technology and in
hearing what we're up to.
The questions we need to answer are:
- What's an easy way to import lots of data from MySQL without killing
the source servers? We've used sqoop and drdee's sqoopy, but we think
these would hammer the prod servers too hard.
- drdee mentioned a way to pass a comment with SELECT statements to
make them lower priority; is this documented somewhere?
- Could we just stand up the MySQL backups and import them?
- Could we import from the xml dumps?
- Is there a way to do incremental importing once an initial load is
done?
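On the incremental question, one low-impact pattern (independent of sqoop, which also has `--incremental` flags) is to keep a watermark of the highest primary key already imported and pull small batches above it, pausing between batches so the source server isn't hammered. A minimal sketch, using sqlite3 as a stand-in for the MySQL source and a hypothetical `revision` table; names and batch sizes are illustrative, not our actual setup:

```python
import sqlite3
import time

def incremental_import(conn, last_value, batch_size=1000, pause=0.1):
    """Pull rows with rev_id > last_value in small batches.

    Returns the new watermark and the rows fetched. Sleeping between
    batches keeps the load on the source server low.
    """
    rows = []
    while True:
        batch = conn.execute(
            "SELECT rev_id, rev_page FROM revision "
            "WHERE rev_id > ? ORDER BY rev_id LIMIT ?",
            (last_value, batch_size),
        ).fetchall()
        if not batch:
            break
        rows.extend(batch)
        last_value = batch[-1][0]   # highest rev_id seen so far
        time.sleep(pause)           # throttle between batches

    return last_value, rows

# Toy source: sqlite3 standing in for a MySQL replica.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
conn.executemany("INSERT INTO revision VALUES (?, ?)",
                 [(i, i % 5) for i in range(1, 11)])

watermark, rows = incremental_import(conn, last_value=0, batch_size=4, pause=0)
print(watermark, len(rows))   # 10 10
```

Re-running with the stored watermark as `last_value` fetches only rows added since the last run, which is the whole point: after the initial load, each pass touches just the new data.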
Once we figure this out, the fun starts: what are some useful questions
we could answer once we have access to the core mediawiki db tables
across all projects?
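To seed that discussion, one concrete example of the kind of question the revision table could answer is "how many distinct editors were active per month?" A sketch against a toy table, assuming the standard mediawiki `revision` columns (`rev_user`, `rev_timestamp` in YYYYMMDDHHMMSS form) and using sqlite3 purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, "
             "rev_user INTEGER, rev_timestamp TEXT)")
# rev_timestamp uses mediawiki's YYYYMMDDHHMMSS format.
conn.executemany(
    "INSERT INTO revision (rev_user, rev_timestamp) VALUES (?, ?)",
    [(1, "20130101120000"), (2, "20130115090000"),
     (1, "20130201130000"), (3, "20130214100000"), (3, "20130220100000")],
)

# Distinct active editors per month, keyed on the YYYYMM prefix.
monthly = conn.execute(
    "SELECT substr(rev_timestamp, 1, 6) AS month, "
    "COUNT(DISTINCT rev_user) AS editors "
    "FROM revision GROUP BY month ORDER BY month"
).fetchall()
print(monthly)   # [('201301', 2), ('201302', 2)]
```

The same shape of query, run per project across the cluster, would give active-editor trends for every wiki at once.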