We also experimented with this; we have a tool called Sqoop to import the data from MySQL to Hadoop. We would need to define which tables to import, but that's not hard. I wrote a small tool called sqoopy that automatically maps MySQL column types to Hive column types, and now with the labsdb databases we can just import from those and not have to worry about PII.
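As a rough illustration of the kind of MySQL-to-Hive column type mapping mentioned above: this is not sqoopy's actual code, and the mapping table, the unsigned-integer promotion rule, and the `hive_type` helper are all assumptions made for the example.

```python
import re

# Hypothetical mapping from MySQL base types to Hive types.
# This is an illustrative assumption, not the table sqoopy uses.
MYSQL_TO_HIVE = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "mediumint": "INT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "decimal": "DECIMAL",
    "char": "STRING",
    "varchar": "STRING",
    "text": "STRING",
    "enum": "STRING",
    "date": "STRING",
    "datetime": "STRING",
    "timestamp": "STRING",
    "binary": "BINARY",
    "varbinary": "BINARY",
    "blob": "BINARY",
}

# Unsigned MySQL integers can exceed the range of the same-width signed
# Hive type, so this sketch promotes them to the next wider type.
UNSIGNED_PROMOTION = {
    "tinyint": "SMALLINT",
    "smallint": "INT",
    "int": "BIGINT",
}


def hive_type(mysql_type):
    """Return a Hive type name for a MySQL column type such as 'int(10) unsigned'."""
    normalized = mysql_type.strip().lower()
    match = re.match(r"[a-z]+", normalized)
    if match is None:
        return "STRING"  # fall back to STRING when the type cannot be parsed
    base = match.group(0)
    if "unsigned" in normalized and base in UNSIGNED_PROMOTION:
        return UNSIGNED_PROMOTION[base]
    # Unknown types also fall back to STRING.
    return MYSQL_TO_HIVE.get(base, "STRING")
```

Falling back to `STRING` for anything unrecognized keeps an import from failing on an unexpected column type, at the cost of losing type information for that column.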
Sounds like a good plan. I suspect we will need to import data from an uncensored source, though: Tool Labs removes much more than PII (for example, the archive table doesn't contain any PII and it's routinely used for internal data analysis).
Dario