Hey guys,
Sorry for breaking the thread: I just subscribed, so this will probably break Mailman's threading headers.
This is very exciting news, and IA would love to have a copy! We're more interested in being a historical mirror (on our item infrastructure) than a live rsync/http/ftp mirror, but perhaps we can also work something out for mirroring the latest dumps. (How big are the last two or so?)
I suppose the next step is for Ariel and me to talk about technical procedures and details, et cetera, but I just wanted to subscribe to this mailing list and introduce myself.
Ariel, when you have a minute to chat, shoot me an email (or skype). I'm thinking we just pull things at whatever frequency you guys push out the data to your.org (which may or may not be scheduled yet) and throw them into new items on the cluster.
Others' thoughts are, of course, always welcome.
Thanks!
Alex Buie
Collections Group
Internet Archive, a registered California non-profit library
abuie@archive.org
Of course, right after I sent this, I got pointed here: https://meta.wikimedia.org/wiki/Mailing_lists#Using_digests
Sorry 'bout that, heh.
For space requirements etc. for the XML dumps, see http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps (these figures are pretty up to date).
For "dumps" of images, we have no such thing; this rsync mirror is the first thing out of the gate and we can't possibly generate multiple copies of it on different dates as we do for the xml dumps.
Come find me on irc (wikimedia-tech on freenode) or send me an email off list and we can talk through the technical end of things.
I'm already working on automating copying the XML archives over to you guys at 6-month intervals or so.
Ariel
Ok, great. I'm cycling right now but I'll /join when I get back.
As far as image "dumps" go, I was unclear; I just meant the rsync mirror. We'll talk more later!
Thanks again.
Alex
On 02/04/12 21:55, Ariel T. Glenn wrote:
For "dumps" of images, we have no such thing; this rsync mirror is the first thing out of the gate and we can't possibly generate multiple copies of it on different dates as we do for the xml dumps.
That's not too hard to do. You just copy the image tree with hardlinks, making a versioned snapshot; the next rsync will then only replace modified images (unless you added the --inplace parameter, in which case you presumably know what you're doing). You could also use --link-dest instead of manually building the hardlink copies.
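For what it's worth, here is a minimal sketch of that hardlink-snapshot idea, driving rsync from Python. The source URL and snapshot directory are hypothetical placeholders; the flags used (-a, --delete, --link-dest) are standard rsync options. Each dated directory ends up looking like a full copy of the tree while sharing storage for unchanged files with the previous snapshot.

# Minimal sketch: dated, hardlinked snapshots of an rsync module.
# The source module and local snapshot root below are made-up placeholders.
import datetime
import os
import subprocess

SOURCE = "rsync://mirror.example.org/wikimedia-images/"   # hypothetical module
SNAPSHOT_ROOT = "/data/commons-snapshots"                 # hypothetical local path

def make_snapshot():
    os.makedirs(SNAPSHOT_ROOT, exist_ok=True)
    today = datetime.date.today().isoformat()
    previous = sorted(d for d in os.listdir(SNAPSHOT_ROOT) if d != today)
    cmd = ["rsync", "-a", "--delete"]
    if previous:
        # Unchanged files become hardlinks into the newest existing snapshot,
        # so each additional snapshot only costs the space of changed files.
        cmd.append("--link-dest=" + os.path.join(SNAPSHOT_ROOT, previous[-1]))
    cmd += [SOURCE, os.path.join(SNAPSHOT_ROOT, today) + "/"]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    make_snapshot()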
I also have a little script that uses rsync's --list-only output to generate a list to feed to --files-from; it takes an arbitrary begin date, so I can pack the files into daily- or weekly-sized units, which would be less work for Ariel and the WMF.
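Not that script, but a minimal sketch of the same idea, assuming rsync's default --list-only line format (permissions, size, YYYY/MM/DD, HH:MM:SS, path); the source URL and output filename are arbitrary placeholders. The resulting list can then be handed to rsync via --files-from.

# Minimal sketch: build a --files-from list of files modified on/after a date,
# using the output of `rsync --list-only`.
import subprocess
import sys

def files_since(source, begin_date):
    """Yield paths whose listed mtime is on or after begin_date ('YYYY/MM/DD')."""
    listing = subprocess.run(
        ["rsync", "--recursive", "--list-only", source],
        check=True, capture_output=True, text=True,
    ).stdout
    for line in listing.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].startswith("-"):
            continue  # keep regular files only; skip directories, links, noise
        _perms, _size, date, _time, name = fields
        if date >= begin_date:  # YYYY/MM/DD compares correctly as a plain string
            yield name

if __name__ == "__main__":
    # e.g.  python files_since.py rsync://mirror.example.org/module/ 2012/04/01
    src, begin = sys.argv[1], sys.argv[2]
    with open("files-from.txt", "w") as out:  # feed this file to rsync --files-from
        for path in files_since(src, begin):
            out.write(path + "\n")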
Yes, the hardlink approach would work fine under the existing setup; what I don't know, and what needs to be figured out, is what we will do when images are moved into Swift, Real Soon Now.
Ariel
Does the rsync mirror allow downloading Commons images by date? I mean, in day-by-day packages. That is the method we wanted to use at WikiTeam to archive Wikimedia Commons.
Not as such. However, I intend to produce a list of images per project (some handwaving here) with the date of last upload, and excerpts from this could be used as input to rsync.
Ariel
No, not currently, but I'm working with the your.org guys to get a copy of the mirror mounted over NFS; then I should be able to combine that with the XML dumps for the images and process media from specific time periods.
I wonder how well python's lxml handles multigigabyte XML files... Guess we'll see :)
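To give a sense of it, here is a minimal sketch of streaming a dump that size with lxml's iterparse, clearing each finished subtree so memory stays bounded. The dump filename is hypothetical, and the element names assume the standard MediaWiki export schema (<page> containing <title> and <revision>/<timestamp> elements); tags are matched by local name, so the export namespace version doesn't matter.

# Minimal sketch: stream a MediaWiki XML dump and report each page's latest
# revision timestamp, without ever holding the whole file in memory.
from lxml import etree

def localname(tag):
    """Strip the '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""

def iter_latest_timestamps(path):
    """Yield (title, latest revision timestamp) for every <page> in the dump."""
    for _event, elem in etree.iterparse(path, events=("end",)):
        if localname(elem.tag) != "page":
            continue
        title, latest = None, None
        for child in elem.iter():
            name = localname(child.tag)
            if name == "title" and title is None:
                title = child.text
            elif name == "timestamp":
                # ISO 8601 timestamps compare correctly as plain strings.
                if latest is None or (child.text or "") > latest:
                    latest = child.text
        yield title, latest
        # Drop the finished <page> subtree (and earlier siblings) to bound memory.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

if __name__ == "__main__":
    for title, ts in iter_latest_timestamps("commonswiki-pages-meta-history.xml"):
        print(ts, title)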
Alex
Pywikipediabot uses cElementTree for Python, which is fast as hell.
We've been using cElementTree for a long time in wiki-network (https://github.com/volpino/wiki-network), a suite of scripts for analyzing Wikipedia dumps, in particular for social network analysis. It's really fast even on huge dumps like enwiki-pages-meta-history.
It's open source, so you are welcome to use it and contribute to the project!
Excellent, thanks guys. I'm assuming that I shouldn't have to worry about malformed xml (hopefully, haha), which makes it even easier/faster.
Alex
The dumps are well-formed XML, of course; the problem is that the tags are not always in the same order, the revisions are not always in chronological order... and of course the revision text is a real mess!
I suggest you have a look at our library and start by using it to build simple scripts. It's really easy! All you have to do is write a process_<tag> method for every tag you care about (e.g. process_title for the title tag). Have a look at https://github.com/volpino/wiki-network/blob/master/revisions_page.py for an example; it's a simple script that takes a pages-meta-history dump and extracts the revisions of a specific page set to a CSV file.
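Purely as an illustration of that process_<tag> pattern (the actual wiki-network base class and method names will differ), here is a small sketch using the standard library's ElementTree, whose C accelerator modern Pythons pick up automatically; the dump filename is hypothetical.

# Illustrative sketch: dispatch each closed element to a process_<tag> method.
import xml.etree.ElementTree as ET

class DumpProcessor:
    """Call self.process_<localname>(elem) for each element as it is closed."""

    def run(self, path):
        for _event, elem in ET.iterparse(path, events=("end",)):
            name = elem.tag.rsplit("}", 1)[-1]  # strip the export namespace
            handler = getattr(self, "process_" + name, None)
            if handler is not None:
                handler(elem)
            if name == "page":
                elem.clear()  # free each finished page subtree

class TitlePrinter(DumpProcessor):
    def process_title(self, elem):
        print(elem.text)

if __name__ == "__main__":
    TitlePrinter().run("pages-meta-history.xml")  # hypothetical dump filename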
Feel free to write me for more information ;)