Hi,
I spoke to Dario today about investigating uses for our Hadoop cluster. This is an internal cluster but it's mirrored on labs so I'm posting to the public list in case people are interested in the technology and hearing what we're up to.
The questions we need to answer are:
- What's an easy way to import lots of data from MySQL without killing the source servers? We've used sqoop and drdee's sqoopy, but we think these would hammer the prod servers too hard.
- drdee mentioned a way to pass a comment with select statements to make them lower priority; is this documented somewhere?
- Could we just stand up the MySQL backups and import them?
- Could we import from the XML dumps?
- Is there a way to do incremental importing once an initial load is done?
Once we figure this out, the fun starts. What are some useful questions once we have access to the core mediawiki db tables across all projects?
What's the end goal? Just to have a complete copy that can be accessed from map reduce?
Maybe you should do periodic dumps from a prod slave to, e.g., CSV and then load that into HDFS after each run.
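A minimal sketch of that dump-and-load approach. The host name, credentials, database/table names, and HDFS paths below are all placeholders, not actual Wikimedia config:

```shell
# Dump a table from a non-rotation slave to a tab-separated file,
# then push it into HDFS for MapReduce access.
# Host, user, database, table, and paths are illustrative placeholders.
mysql -h db-slave.example -u analytics -p --batch --quick \
  -e "SELECT * FROM revision" enwiki > revision.tsv

# Copy the dump into HDFS (overwriting any previous run's output)
hadoop fs -mkdir -p /data/raw/enwiki
hadoop fs -put -f revision.tsv /data/raw/enwiki/
```

`--batch` gives tab-separated output and `--quick` streams rows instead of buffering the whole result set in memory, which keeps the load on the slave modest.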
On Thu, Oct 17, 2013 at 6:23 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
drdee mentioned a way to pass a comment with select statements to make them lower priority, is this documented somewhere?
I imagine you could either:
- take a slave out of rotation during a non-peak time of day
- use a Tampa slave in case the sites are using only eqiad hosts (would need to double-check that it's actually not in rotation; I guess take a host that's in https://noc.wikimedia.org/dbtree/ but not in https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php, and then double-check with springle)
- or both
Depends on how heavy your use would be. I guess the comment method may also work, but it couldn't take as much load as the other methods above. Maybe better to not risk degrading prod at all.
Could we just stand up the MySQL backups and import them?
Sure, but if hot spares are sitting idle then maybe easier to use them instead.
Could we import from the xml dumps?
I'm not sure how much might be missing from those (although I guess you could also get access to the private XML parts).
Is there a way to do incremental importing once an initial load is done?
I guess this may be related to the incremental dumping that was part of GSoC this year. Not sure what the status is there.
-Jeremy
I would be surprised if sqoop doesn't let you dial back the load so you don't crush the source DB. I've used it in production and it's been fine.
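For what it's worth, Sqoop's main knob for this is parallelism: `--num-mappers` controls how many concurrent connections it opens against the source, and `--fetch-size` controls how many rows it pulls per round trip. A sketch, with a placeholder connection string, credentials, and table:

```shell
# Sketch: throttle a Sqoop import so it doesn't crush the source DB.
# Host, database, user, table, and target dir are placeholders.
# --num-mappers 1 limits Sqoop to a single connection on the source;
# --fetch-size streams rows in small batches instead of one huge result set.
sqoop import \
  --connect jdbc:mysql://db-slave.example/enwiki \
  --username analytics -P \
  --table revision \
  --num-mappers 1 \
  --fetch-size 1000 \
  --target-dir /data/raw/enwiki/revision
```

With one mapper you lose import parallelism, so the trade-off is a slower import in exchange for a gentler load profile.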
There's also some interesting work done on sqoop 2 right now.
-Toby
On Oct 17, 2013, at 11:23 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
Hi,
I spoke to Dario today about investigating uses for our Hadoop cluster. This is an internal cluster but it's mirrored on labs so I'm posting to the public list in case people are interested in the technology and hearing what we're up to.
The questions we need to answer are:
- What's an easy way to import lots of data from MySQL without killing the source servers? We've used sqoop and drdee's sqoopy, but we think these would hammer the prod servers too hard.
- drdee mentioned a way to pass a comment with select statements to make them lower priority; is this documented somewhere?
- Could we just stand up the MySQL backups and import them?
- Could we import from the XML dumps?
- Is there a way to do incremental importing once an initial load is done?

Once we figure this out, the fun starts. What are some useful questions once we have access to the core mediawiki db tables across all projects?
On Thu, Oct 17, 2013 at 3:34 PM, Toby Negrin tnegrin@wikimedia.org wrote:
I would be surprised if sqoop doesn't let you dial back the load so you don't crush the source DB. I've used it in production and it's been fine.
There's also some interesting work done on sqoop 2 right now.
I read up a bit but couldn't immediately find any load-reducing strategies. You're right, though: they support incremental imports and lots of interesting things like integration with Oozie. We'll look more closely.
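The incremental import mode looks like this. Everything below (host, credentials, table, check column, last value, paths) is a placeholder sketch, assuming `rev_id` is a monotonically increasing key:

```shell
# Sketch: incremental append import. After the initial full load, only rows
# with rev_id greater than --last-value are pulled on each run.
# All connection details and values here are illustrative placeholders.
sqoop import \
  --connect jdbc:mysql://db-slave.example/enwiki \
  --username analytics -P \
  --table revision \
  --incremental append \
  --check-column rev_id \
  --last-value 350000000 \
  --target-dir /data/raw/enwiki/revision
```

Sqoop can also persist the last imported value for you via a saved job (`sqoop job --create ...`), so repeated runs pick up where the previous one left off instead of you tracking `--last-value` by hand.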