Hi!
I'm trying to build dumps similar to those on static.wikipedia.org (I want to create dumps for an offline Wikipedia on mobile Linux devices). The dumps seem to be incomplete in some way. Because it is a small Wikipedia that I can understand, I first tried als.wikipedia.org. The problem is that some image description pages seem to be missing. For example:
The page http://als.wikipedia.org/wiki/W%C3%BCrenlos uses the image http://als.wikipedia.org/wiki/Datei:Wuerenlos_AG.jpg. When I search the XML dump, I can find the reference, but not the image description page itself, whereas other image description pages do exist.
Can someone explain this issue to me?
Kind regards, Christian Reitwießner
Christian Reitwießner wrote:
When I search the XML dump, I can find the reference, but not the image description page itself, whereas other image description pages do exist.
The file and the image description page are stored on Wikimedia Commons, the central repository for images:
http://commons.wikimedia.org/wiki/File:Wuerenlos_AG.jpg
Marcus Buck (User:Slomox)
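For anyone who wants to check this programmatically, here is a minimal sketch (Python) that asks the MediaWiki API whether a file's description page lives locally or on the shared Commons repository; the wiki URL, file title, and User-Agent string below are only illustrative values based on this thread's example:

    # Sketch: ask the wiki's API whether a file used locally is hosted on
    # this wiki itself or on the shared repository (Wikimedia Commons).
    # The wiki URL, file title and User-Agent below are illustrative only.
    import json
    import urllib.parse
    import urllib.request

    API = "https://als.wikipedia.org/w/api.php"

    def file_repository(title):
        params = urllib.parse.urlencode({
            "action": "query",
            "titles": title,
            "prop": "imageinfo",
            "format": "json",
        })
        req = urllib.request.Request(
            API + "?" + params,
            headers={"User-Agent": "offline-dump-check/0.1 (example)"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        # "local"  -> description page stored on this wiki
        # "shared" -> file and description page live on Commons
        # ""       -> the file does not exist at all
        return page.get("imagerepository", "")

    print(file_repository("Datei:Wuerenlos_AG.jpg"))  # expected: shared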
Since people are redesigning the dumps right now, might I suggest that providing better integration with / information about Commons-hosted images would actually be useful. As far as I know, the current system has no way to distinguish between Commons images and missing images except by downloading the Commons dump files. That can be frustrating, since the Commons dumps are larger (and hence more trouble to work with) than all but a handful of other wikis.
-Robert Rohde
Robert Rohde wrote:
As far as I know, the current system has no way to distinguish between Commons images and missing images except by downloading the Commons dump files.
You only need the image.sql dump from Commons to determine whether an image exists there (it also includes other useful and not-so-useful data such as file type, image size, metadata, ...). http://download.wikimedia.org/commonswiki/20090510/
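As a rough illustration of that offline check, here is a minimal sketch (Python) that scans the gzipped image.sql dump for a file name instead of importing the table into MySQL; the dump file name below is an assumption based on the usual commonswiki-YYYYMMDD-image.sql.gz naming, and SQL escaping of unusual titles is ignored:

    # Sketch: check whether a file name appears in the Commons image.sql
    # dump without importing it into MySQL.  img_name is the first column
    # of every row, so an existing file shows up as ('File_name', ... in
    # the INSERT statements.  The dump file name is an assumption, and SQL
    # escaping of unusual titles (quotes etc.) is not handled.
    import gzip

    DUMP = "commonswiki-20090510-image.sql.gz"

    def image_in_commons_dump(file_name, dump_path=DUMP):
        needle = ("('" + file_name.replace(" ", "_") + "',").encode("utf-8")
        tail = b""
        with gzip.open(dump_path, "rb") as dump:
            while True:
                chunk = dump.read(1 << 20)          # 1 MiB at a time
                if not chunk:
                    return False
                if needle in tail + chunk:
                    return True
                tail = chunk[-(len(needle) - 1):]   # catch matches across chunks

    print(image_in_commons_dump("Wuerenlos_AG.jpg"))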
Platonides wrote:
You only need the image.sql dump from Commons to determine whether an image exists there.
This is strange: the image dump is larger than the pages-articles dump. I assume this is because the first dump is in SQL format and the second is in XML format, which is more efficient. Nevertheless, thanks for the hint; using that file, the import should be faster.
Christian
On 5/11/09 3:14 PM, Christian Reitwießner wrote:
This is strange: the image dump is larger than the pages-articles dump. I assume this is because the first dump is in SQL format and the second is in XML format, which is more efficient.
They contain completely different information, in different formats, with different compression.
-- brion
On Mon, May 11, 2009 at 3:07 PM, Platonides Platonides@gmail.com wrote:
You only need the image.sql dump from Commons to determine whether an image exists there.
That wouldn't get you file descriptions or copyright status, etc. If your goal is something like mirroring a wiki, you really need access to the description pages as well.
At present, the main solution is to copy all of Commons, which is overkill for many applications. It would be nice if the dump generator had a way of parsing out only the relevant Commons content.
-Robert Rohde
Robert Rohde wrote:
It would be nice if the dump generator had a way of parsing out only the relevant Commons content.
I'd expect a "commons selected dump" to be pretty similar to pages-articles. What you can do is request just the images used via Special:Export or the API (depending on how small those wikis really are, it may or may not be feasible).
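For a small wiki, that per-image approach might look roughly like the following sketch (Python), which POSTs a list of file description page titles to Commons' Special:Export; the title list and User-Agent string are only illustrative:

    # Sketch: fetch only the description pages of Commons-hosted files via
    # Special:Export instead of downloading the full Commons dump.  The
    # title list and User-Agent are illustrative; the exported XML is just
    # printed here rather than merged into a local dump.
    import urllib.parse
    import urllib.request

    EXPORT = "https://commons.wikimedia.org/wiki/Special:Export"

    def export_description_pages(titles):
        form = urllib.parse.urlencode({
            "pages": "\n".join(titles),   # one page title per line
            "curonly": "1",               # current revisions only
        }).encode("utf-8")
        req = urllib.request.Request(
            EXPORT,
            data=form,
            headers={"User-Agent": "offline-dump-check/0.1 (example)"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    print(export_description_pages(["File:Wuerenlos_AG.jpg"]))

The same list of titles can also be passed to the API's action=query with its export parameter, if the API route is preferred.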