Hey guys,
Sorry for breaking the thread: I just subscribed, so this will probably break Mailman's threading headers.
This is very exciting news, and IA would love to have a copy! We're more interested in being a historical mirror (on our item infrastructure) than a live rsync/http/ftp mirror, but perhaps we can also work something out for mirroring the latest dumps. (How big are the last two or so?)
I suppose the next step is for Ariel and me to talk about technical procedures and details, et cetera, but I just wanted to subscribe to this mailing list and introduce myself.
Ariel, when you have a minute to chat, shoot me an email (or skype). I'm thinking we just pull things at whatever frequency you guys push out the data to your.org (which may or may not be scheduled yet) and throw them into new items on the cluster.
Others' thoughts are, of course, always welcome.
Thanks!
Alex Buie
Collections Group
Internet Archive, a registered California non-profit library
abuie@archive.org
Of course, right after I sent this, I got pointed here: https://meta.wikimedia.org/wiki/Mailing_lists#Using_digests
Sorry 'bout that, heh.
For space requirements etc. for the XML dumps, see http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps (these figures are pretty up to date).
For "dumps" of images, we have no such thing; this rsync mirror is the first thing out of the gate and we can't possibly generate multiple copies of it on different dates as we do for the xml dumps.
Come find me on irc (wikimedia-tech on freenode) or send me an email off list and we can talk through the technical end of things.
I'm already working on automating copying the XML archives over to you guys at 6-month intervals or so.
Ariel
Ok, great. I'm cycling right now but I'll /join when I get back.
As far as image "dumps" go, I was unclear; I just meant the rsync mirror. We'll talk more later!
Thanks again.
Alex
On 02/04/12 21:55, Ariel T. Glenn wrote:
For "dumps" of images, we have no such thing; this rsync mirror is the first thing out of the gate and we can't possibly generate multiple copies of it on different dates as we do for the xml dumps.
That's not too hard to do. You just copy the image tree with hardlinks, making a versioned snapshot; the next rsync will then only replace modified images (unless you added the --inplace parameter, in which case you presumably know what you're doing). You could also use --link-dest instead of manually building the hardlink copies.
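For what it's worth, here is a minimal sketch of that hardlink-snapshot idea, driving rsync from Python. The source URL and snapshot directory are hypothetical placeholders; the flags used (-a, --delete, --link-dest) are standard rsync options. Each dated directory ends up looking like a full copy of the tree while sharing storage for unchanged files with the previous snapshot.

# Minimal sketch: dated, hardlinked snapshots of an rsync module.
# The source module and local snapshot root below are made-up placeholders.
import datetime
import os
import subprocess

SOURCE = "rsync://mirror.example.org/wikimedia-images/"   # hypothetical module
SNAPSHOT_ROOT = "/data/commons-snapshots"                 # hypothetical local path

def make_snapshot():
    os.makedirs(SNAPSHOT_ROOT, exist_ok=True)
    today = datetime.date.today().isoformat()
    previous = sorted(d for d in os.listdir(SNAPSHOT_ROOT) if d != today)
    cmd = ["rsync", "-a", "--delete"]
    if previous:
        # Unchanged files become hardlinks into the newest existing snapshot,
        # so each additional snapshot only costs the space of changed files.
        cmd.append("--link-dest=" + os.path.join(SNAPSHOT_ROOT, previous[-1]))
    cmd += [SOURCE, os.path.join(SNAPSHOT_ROOT, today) + "/"]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    make_snapshot()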
I also have a little script that uses rsync's --list-only output to generate a list to feed to --files-from; it takes an arbitrary begin date, so I can pack the files into daily- or weekly-sized units, which would be less work for Ariel and the WMF.
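Not that script, but a minimal sketch of the same idea, assuming rsync's default --list-only line format (permissions, size, YYYY/MM/DD, HH:MM:SS, path); the source URL and output filename are arbitrary placeholders. The resulting list can then be handed to rsync via --files-from.

# Minimal sketch: build a --files-from list of files modified on/after a date,
# using the output of `rsync --list-only`.
import subprocess
import sys

def files_since(source, begin_date):
    """Yield paths whose listed mtime is on or after begin_date ('YYYY/MM/DD')."""
    listing = subprocess.run(
        ["rsync", "--recursive", "--list-only", source],
        check=True, capture_output=True, text=True,
    ).stdout
    for line in listing.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].startswith("-"):
            continue  # keep regular files only; skip directories, links, noise
        _perms, _size, date, _time, name = fields
        if date >= begin_date:  # YYYY/MM/DD compares correctly as a plain string
            yield name

if __name__ == "__main__":
    # e.g.  python files_since.py rsync://mirror.example.org/module/ 2012/04/01
    src, begin = sys.argv[1], sys.argv[2]
    with open("files-from.txt", "w") as out:  # feed this file to rsync --files-from
        for path in files_since(src, begin):
            out.write(path + "\n")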
Yes, the hardlink approach would work fine under the existing setup; what I don't know, and what needs to be figured out, is what we will do when images are moved into Swift, Real Soon Now.
Ariel
Does the rsync mirror allow downloading Commons images by date? I mean, in day-by-day packages. That is the method we wanted to use at WikiTeam to archive Wikimedia Commons.
Not as such. However, I intend to produce a list of images per project (some handwaving here) with the date of last upload, and excerpts from this could be used as input to rsync.
Ariel
No, not currently, but I'm working with the your.org guys to get a copy of the mirror mounted over NFS; then I should be able to combine that with the XML dumps for the images and process media from specific time periods.
I wonder how well python's lxml handles multigigabyte XML files... Guess we'll see :)
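To give a sense of it, here is a minimal sketch of streaming a dump that size with lxml's iterparse, clearing each finished subtree so memory stays bounded. The dump filename is hypothetical, and the element names assume the standard MediaWiki export schema (<page> containing <title> and <revision>/<timestamp> elements); tags are matched by local name, so the export namespace version doesn't matter.

# Minimal sketch: stream a MediaWiki XML dump and report each page's latest
# revision timestamp, without ever holding the whole file in memory.
from lxml import etree

def localname(tag):
    """Strip the '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""

def iter_latest_timestamps(path):
    """Yield (title, latest revision timestamp) for every <page> in the dump."""
    for _event, elem in etree.iterparse(path, events=("end",)):
        if localname(elem.tag) != "page":
            continue
        title, latest = None, None
        for child in elem.iter():
            name = localname(child.tag)
            if name == "title" and title is None:
                title = child.text
            elif name == "timestamp":
                # ISO 8601 timestamps compare correctly as plain strings.
                if latest is None or (child.text or "") > latest:
                    latest = child.text
        yield title, latest
        # Drop the finished <page> subtree (and earlier siblings) to bound memory.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

if __name__ == "__main__":
    for title, ts in iter_latest_timestamps("commonswiki-pages-meta-history.xml"):
        print(ts, title)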
Alex
Pywikipediabot uses cElementTree for Python, which is fast as hell.
We've been using cElementTree for a long time in wiki-network (https://github.com/volpino/wiki-network), a suite of scripts for analyzing Wikipedia dumps, in particular for social network analysis. It's really fast even on huge dumps like enwiki-pages-meta-history.
It's open source, so you are welcome to use it and contribute to the project!
Excellent, thanks guys. I'm assuming that I shouldn't have to worry about malformed xml (hopefully, haha), which makes it even easier/faster.
Alex
The dumps are well-formed XML, of course; the problem is that the tags are not always in the same order, the revisions are not always in chronological order... and of course the revision text is a real mess!
I suggest you have a look at our library and start by using it to build simple scripts. It's really easy! All you have to do is write a process_<tag> method for every tag you care about (e.g. process_title for the title tag). Have a look at https://github.com/volpino/wiki-network/blob/master/revisions_page.py for an example; it's a simple script that takes a pages-meta-history dump and extracts the revisions of a specific page set to a CSV file.
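Purely as an illustration of that process_<tag> pattern (the actual wiki-network base class and method names will differ), here is a small sketch using the standard library's ElementTree, whose C accelerator modern Pythons pick up automatically; the dump filename is hypothetical.

# Illustrative sketch: dispatch each closed element to a process_<tag> method.
import xml.etree.ElementTree as ET

class DumpProcessor:
    """Call self.process_<localname>(elem) for each element as it is closed."""

    def run(self, path):
        for _event, elem in ET.iterparse(path, events=("end",)):
            name = elem.tag.rsplit("}", 1)[-1]  # strip the export namespace
            handler = getattr(self, "process_" + name, None)
            if handler is not None:
                handler(elem)
            if name == "page":
                elem.clear()  # free each finished page subtree

class TitlePrinter(DumpProcessor):
    def process_title(self, elem):
        print(elem.text)

if __name__ == "__main__":
    TitlePrinter().run("pages-meta-history.xml")  # hypothetical dump filename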
Feel free to write me for more information ;)