i've started running an image dump for en.wp using a version of trickle with large-file support (the last one died after 2GB). if this works i'll set up regular image dumps again along with the db backups.
the copy is running slowly to avoid overloading the fileserver, so the dump may not be entirely up to date when it finishes.
kate.
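[A throttled copy of this sort might look roughly like the sketch below; the build flag, host, paths, and rate are illustrative assumptions, not the actual setup. The 2GB failure suggests a trickle binary built without 64-bit file offsets.]

    # rebuild trickle with large-file support (64-bit file offsets),
    # assuming a stock autoconf source tree
    CFLAGS="-D_FILE_OFFSET_BITS=64" ./configure && make && make install

    # run the copy in standalone mode (-s) with the download rate capped
    # at ~300 KB/s so the fileserver stays responsive
    trickle -s -d 300 scp fileserver:/export/upload/enwiki-images.tar /dumps/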
> i've started running an image dump for en.wp using a version of trickle with large-file support (the last one died after 2GB). if this works i'll set up regular image dumps again along with the db backups.
> the copy is running slowly to avoid overloading the fileserver, so the dump may not be entirely up to date when it finishes.
This sounds like great news. Any idea when it will be completed? I would love to include images ASAP. Where would we download the dump from, though?
On Wed, 25 May 2005, Kate Turner wrote:
> i've started running an image dump for en.wp using a version of trickle with large-file support (the last one died after 2GB). if this works i'll set up regular image dumps again along with the db backups.
Can you set it up so it can be rsync'ed?
It will greatly save your bandwidth and mine, at the expense of some CPU.
Cheers, Andy!
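[On the server side, "setting it up" could be as little as an rsync daemon module exporting the image tree read-only; the module name and path below are hypothetical.]

    # illustrative /etc/rsyncd.conf entry
    [enwiki-images]
        path = /export/upload/wikipedia/en
        comment = en.wikipedia image tree
        read only = yes
        max connections = 5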
Andy Rabagliati wrote in gmane.science.linguistics.wikipedia.technical:
> Can you set it up so it can be rsync'ed?
do you want to rsync the tar file of all images, or the image directory as it's stored on disk?
i don't see any benefit from the latter, and the former is not something that's feasible right now, although it may be doable in future...
> It will greatly save your bandwidth and mine, at the expense of some CPU.
we currently have a lot more problems with CPU (and disk) use than with bandwidth, particularly around images & the dumps service.
kate.
On Thu, 26 May 2005, Kate Turner wrote:
> Andy Rabagliati wrote in gmane.science.linguistics.wikipedia.technical:
>> Can you set it up so it can be rsync'ed?
> do you want to rsync the tar file of all images, or the image directory as it's stored on disk?
The latter.
> i don't see any benefit from the latter, and the former is not something that's feasible right now, although it may be doable in future...
I keep an identical tree this side, and rsync does the rest.
Even if you change the tree arrangement, I can write a script to re-arrange this end. But... don't :-)
If the archive has only changed by 5%, I only download 5%. I can ignore archive trees and rescaled thumbnails.
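[The client side of that scheme might look like the following sketch; the server address and module name are hypothetical, matching the daemon example above.]

    # pull only the files that changed, drop anything removed upstream,
    # and skip deleted-file archives and rescaled thumbnails
    rsync -av --delete \
        --exclude 'archive/' --exclude 'thumb/' \
        rsync://dumps.example.org/enwiki-images/ /mirror/enwiki-images/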
>> It will greatly save your bandwidth and mine, at the expense of some CPU.
> we currently have a lot more problems with CPU (and disk) use than with bandwidth, particularly around images & the dumps service.
Bandwidth is the priority in Africa. However, the sheer volume of information means that even an out-of-date copy of Wikipedia is very usable. My worry sometimes is that a newer SQL dump refers to newer pictures, while I have some perfectly appropriate ones here :-)
So perhaps I should use an SQL dump of the same vintage as the picture archive for 'least astonishment'.
And, one day, there could be a trickle-back of edits, maybe moderated at the remote end.
Erik has offered access to other (SQL?) update methods, but I am busy with other things and have not had time to investigate.
However, rsync is understandable to me, optimal for my needs, and takes little of my time. And I'll just download the SQL dumps.
Cheers, Andy!
Kate:
> i've started running an image dump for en.wp using a version of trickle with large-file support (the last one died after 2GB). if this works i'll set up regular image dumps again along with the db backups.
> the copy is running slowly to avoid overloading the fileserver, so the dump may not be entirely up to date when it finishes.
This is great news, but it's also worth noting that the dump will not include images from the Commons. If you're trying to set up a complete mirror, that will become increasingly difficult as free material gets moved over there, and the Commons dump itself will be prohibitively large for most users.
To address this, I wrote a very basic Perl script a while ago that makes a dump of the images a wiki uses that exist only on the Commons. It's in /home/erik/extractdb.pl. I'm sure it could be done a lot faster, though. Ideally, such a solution could be used to create combined dumps that include *all* the images used in a particular wiki.
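[The gist of that extraction, sketched against the MediaWiki tables of the period; the host and output names are illustrative, and this is not necessarily how extractdb.pl does it.]

    # list images referenced by enwiki that have no local copy,
    # i.e. the ones that live only on Commons
    mysql -h db.example.org -N -e "
        SELECT DISTINCT il_to
          FROM enwiki.imagelinks
          LEFT JOIN enwiki.image ON img_name = il_to
         WHERE img_name IS NULL" > commons-only-images.txt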
From a legal standpoint, we have to be careful with distributing image dumps separately from the metadata that includes the licensing information, as many licenses prohibit this.
Erik
Erik Moeller wrote in gmane.science.linguistics.wikipedia.technical:
> To address this, I wrote a very basic Perl script a while ago that makes a dump of the images a wiki uses that exist only on the Commons. It's in /home/erik/extractdb.pl. I'm sure it could be done a lot faster, though.
in fact, it should be slower rather than faster; that's the idea of trickle. otherwise it's too much load on albert.
> Ideally, such a solution could be used to create combined dumps that include *all* the images used in a particular wiki.
yes.
> From a legal standpoint, we have to be careful with distributing image dumps separately from the metadata that includes the licensing information, as many licenses prohibit this.
can we just include the image description pages in the image dumps? that shouldn't increase their size by much, i wouldn't have thought.
kate.
Kate:
>> To address this, I wrote a very basic Perl script a while ago that makes a dump of the images a wiki uses that exist only on the Commons. It's in /home/erik/extractdb.pl. I'm sure it could be done a lot faster, though.
> in fact, it should be slower rather than faster; that's the idea of trickle. otherwise it's too much load on albert.
Indeed. I should have said "efficiently." The script above is a hack.
> can we just include the image description pages in the image dumps?
Yes, including pages with cur_namespace 6 (Image) and 10 (Template) should be sufficient; 10 because most copyright information makes use of templates.
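[A minimal sketch of that, assuming the old single-table 'cur' schema; the host and database names are illustrative.]

    # dump image description pages (namespace 6) and templates
    # (namespace 10) alongside the image tarball
    mysqldump -h db.example.org --where="cur_namespace IN (6, 10)" \
        enwiki cur > enwiki-image-pages.sql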
Erik