On Thu, Sep 19, 2013 at 7:48 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
We also experimented with this; we have a tool called Sqoop to import the data from MySQL into Hadoop. We would need to define which tables to import, but that's not hard. I wrote a small tool called sqoopy that automatically maps MySQL column types to Hive column types, and now with the labsdb's we can just import from those databases and not have to worry about PII.
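To illustrate the kind of MySQL-to-Hive type mapping described above, here is a minimal Python sketch. It is not sqoopy's actual code: the mapping table, the names hive_type and hive_columns, and the STRING fallback are all assumptions made for illustration.

```python
import re

# Illustrative mapping only; NOT taken from sqoopy. Adjust to taste.
MYSQL_TO_HIVE = {
    "tinyint": "TINYINT",
    "smallint": "SMALLINT",
    "mediumint": "INT",
    "int": "INT",
    "bigint": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "decimal": "DECIMAL",
    "char": "STRING",
    "varchar": "STRING",
    "text": "STRING",
    "blob": "BINARY",
    "binary": "BINARY",
    "varbinary": "BINARY",
    "timestamp": "TIMESTAMP",
    "datetime": "TIMESTAMP",
}


def hive_type(mysql_type, default="STRING"):
    """Return a Hive type for a MySQL column type such as 'int(10) unsigned'.

    Only the leading type name is inspected; length, sign, and other
    modifiers are ignored. Unknown types fall back to `default`
    (an assumption of this sketch).
    """
    match = re.match(r"\s*([a-z]+)", mysql_type.lower())
    base = match.group(1) if match else ""
    return MYSQL_TO_HIVE.get(base, default)


def hive_columns(mysql_columns):
    """Map a list of (column_name, mysql_type) pairs to (column_name, hive_type)."""
    return [(name, hive_type(mysql_type)) for name, mysql_type in mysql_columns]
```

For example, hive_columns([("ar_id", "int(10) unsigned"), ("ar_title", "varbinary(255)")]) yields INT and BINARY for the two columns. One caveat a real tool would need to handle: an unsigned MySQL int can exceed the range of Hive's signed INT, so widening to BIGINT may be the safer choice there.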
Sounds like a good plan. I suspect we will need to import data from an uncensored source, though: Tool Labs removes much more than PII (for example, the archive table doesn't contain any PII and is routinely used for internal data analysis).
Ok, that's fine as well; we would just have to be a bit more careful about 'hammering' those prod slaves. An important question is how often we would need to import the data: daily, weekly, etc.
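One way to go easy on a production replica is to cap the number of parallel Sqoop mappers for each import. The following is only a sketch under stated assumptions: the helper name sqoop_import_args, the JDBC URL, and the username are hypothetical, and the default of a single mapper is an illustrative choice rather than a tested setting.

```python
def sqoop_import_args(jdbc_url, table, username, num_mappers=1):
    """Build the argument list for a `sqoop import` into Hive.

    `num_mappers` limits how many parallel connections Sqoop opens
    against the source database; keeping it low reduces load on the
    replica at the cost of a slower import.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--hive-import",
        "--num-mappers", str(num_mappers),
    ]


# Hypothetical host, database, and user, for illustration only.
args = sqoop_import_args("jdbc:mysql://db.example/enwiki", "archive", "research")
print(" ".join(args))
```

The resulting list could be passed to subprocess.run from a daily or weekly cron job, depending on the import frequency we settle on.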
Dario
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics