Jamie Morken wrote:
Hi,
What do you mean by "opening"? enwiki pages-meta-history is hard due to its size, not because Ariel or Tomasz are more stupid than any volunteer. I trust them to do it at least as well as a volunteer would. Of course, if you can perform better I'm all for giving you a shell to fix it, and the scripts are there for improvement as well.
I wasn't aware that the dump scripts were publicly available. Where can they be downloaded from, or are they part of MediaWiki?
They are in http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ although the files look a bit old, so perhaps there are some uncommitted changes? /me looks for offenders
What exactly do you need from the images? Which image dumps do you want? Do you have enough terabytes to store them? Access to that data has been given by request in the past. If it's not there it's because: a) Those dumps would take a lot of space.
I don't think that is a valid reason; thumbnail dumps of all the images from enwiki would probably be smaller than the current enwiki pages-meta-history bz2 file.
We have thumbs in lots of sizes. Which size do you want the thumbs in? It's easy to tar all the images used on a wiki, since that's tracked in the database, but it's not at all easy to know at which exact size each of them was used.
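For illustration, a minimal sketch of how such a file list could be pulled out of the database before handing it to tar. This is not the actual dump code; it assumes the standard MediaWiki imagelinks/image tables, and the connection details are placeholders.

#!/usr/bin/env python
# Sketch: list locally stored files that are actually used in articles,
# using the standard MediaWiki schema (imagelinks.il_to, image.img_name,
# image.img_size).  Connection details below are placeholders.
import MySQLdb  # or any other DB-API driver

conn = MySQLdb.connect(host="db-host", user="user", passwd="secret", db="enwiki")
cur = conn.cursor()

# Distinct local files referenced from articles, with their full-size bytes.
cur.execute("""
    SELECT DISTINCT il.il_to, img.img_size
    FROM imagelinks il
    JOIN image img ON img.img_name = il.il_to
""")

total_bytes = 0
with open("files-to-tar.txt", "w") as out:
    for name, size in cur.fetchall():
        total_bytes += size
        out.write(name + "\n")

print("%d files, %.1f GB" % (cur.rowcount, total_bytes / 1024.0 ** 3))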
enwiki has a total of 858,979 local files totalling 229 GB (and there's still Commons). 2,357,967 unique images (37,050,694 uses) are used in its articles. Assuming 20 KB per image thumb (is that a good value?), that's 48 GB, more than the 31.9 GB of the (heavily compressed) pages-meta-history.xml.7z, but we would need to agree on a size. They would tie at about 14 KB per thumb.
Even if all thumbs were unrealistically small, 1 KB each, they would still add up to several GB.
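To make the back-of-envelope numbers above easy to check, here is a small sketch; the 20 KB-per-thumbnail average is only the assumption from the previous paragraph, not a measured value.

# Back-of-envelope check of the figures above.
unique_images = 2357967      # unique images used in enwiki articles
thumb_kib = 20               # assumed average thumbnail size, in KiB
history_gib = 31.9           # pages-meta-history.xml.7z, in GiB

thumb_dump_gib = unique_images * thumb_kib / 1024.0 / 1024.0
print("estimated thumbnail dump: %.1f GiB" % thumb_dump_gib)   # ~45 GiB, same ballpark as the 48 GB above

# Average thumbnail size at which both dumps would be the same size:
break_even_kib = history_gib * 1024 * 1024 / unique_images
print("break-even thumbnail size: %.1f KiB" % break_even_kib)  # ~14 KiB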
b) Nobody feels particularly interested in them.
I disagree; there has been a lot of interest in having image dumps available for download. There was a discussion on this recently on the xmldatadumps list, which basically concluded that subsets of images (i.e. enwiki thumbnails) would be useful.
I am unable to find it, although a thread like that does ring a bell.
There are wiki pages dedicated to how to download images, precisely because there are no image dumps available. Is the Wikimedia Foundation interested in hosting image dumps again? If they are, maybe we can start a discussion on how to write the script and which image dumps to start with.
cheers, Jamie
On Tue, Sep 7, 2010 at 9:50 AM, Platonides <Platonides@gmail.com> wrote:
enwiki has a total of 858,979 local files totalling 229 GB (and there's still Commons). 2,357,967 unique images (37,050,694 uses) are used in its articles. Assuming 20 KB per image thumb (is that a good value?), that's 48 GB, more than the 31.9 GB of the (heavily compressed) pages-meta-history.xml.7z, but we would need to agree on a size. They would tie at about 14 KB per thumb.
Even if all thumbs were unrealistically small, 1 KB each, they would still add up to several GB.
Comparing the size to pages-meta-history isn't all that fair, since the images won't change much: you only need to do the base copy once, and then on each later run you just add or update the ones that have changed or been added, and delete the ones that are gone (see the sketch below).
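A rough sketch of that incremental approach, for what it's worth; the paths and the manifest format are made up for illustration, and in practice rsync does essentially the same job.

#!/usr/bin/env python
# Sketch of an incremental image dump: keep a manifest of (size, mtime) per
# file from the previous run, copy only new or changed files, and drop files
# that have disappeared.  SOURCE/DEST and the manifest format are placeholders.
import os, json, shutil

SOURCE = "/srv/images/enwiki"        # placeholder
DEST = "/srv/dumps/enwiki-images"    # placeholder
MANIFEST = os.path.join(DEST, "manifest.json")

old = {}
if os.path.exists(MANIFEST):
    with open(MANIFEST) as f:
        old = json.load(f)

new = {}
for root, _dirs, files in os.walk(SOURCE):
    for name in files:
        path = os.path.join(root, name)
        rel = os.path.relpath(path, SOURCE)
        st = os.stat(path)
        new[rel] = [st.st_size, int(st.st_mtime)]
        if old.get(rel) != new[rel]:                 # new or changed file
            target = os.path.join(DEST, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(path, target)

for rel in set(old) - set(new):                      # files that are gone
    target = os.path.join(DEST, rel)
    if os.path.exists(target):
        os.remove(target)

with open(MANIFEST, "w") as f:
    json.dump(new, f)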
Also, does that figure take into account the fair-use images, which we wouldn't be able to dump?
-Peachey