Hi!
I'm trying to build dumps similar to those on static.wikipedia.org (I want to create dumps for an offline Wikipedia on mobile Linux devices). The dumps seem to be incomplete in some way. Because it is a small Wikipedia that I can understand, I first tried als.wikipedia.org. The problem is that some image description pages seem to be missing. For example:
The page http://als.wikipedia.org/wiki/W%C3%BCrenlos uses the image http://als.wikipedia.org/wiki/Datei:Wuerenlos_AG.jpg. When I search the XML dump, I can find the reference, but not the image description page itself, whereas other image description pages do exist.
Can someone explain this issue to me?
Kind regards, Christian Reitwießner
Christian Reitwießner wrote:
When I search the XML dump, I can find the reference, but not the image description page itself, whereas other image description pages do exist.
The file and the image description page are stored on Wikimedia Commons, the central repository for images:
http://commons.wikimedia.org/wiki/File:Wuerenlos_AG.jpg
Marcus Buck (User:Slomox)
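For anyone who wants to check this programmatically, here is a minimal sketch (Python) that asks the MediaWiki API whether a file's description page lives locally or on the shared Commons repository; the wiki URL, file title, and User-Agent string below are only illustrative values based on this thread's example:

    # Sketch: ask the wiki's API whether a file used locally is hosted on
    # this wiki itself or on the shared repository (Wikimedia Commons).
    # The wiki URL, file title and User-Agent below are illustrative only.
    import json
    import urllib.parse
    import urllib.request

    API = "https://als.wikipedia.org/w/api.php"

    def file_repository(title):
        params = urllib.parse.urlencode({
            "action": "query",
            "titles": title,
            "prop": "imageinfo",
            "format": "json",
        })
        req = urllib.request.Request(
            API + "?" + params,
            headers={"User-Agent": "offline-dump-check/0.1 (example)"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        # "local"  -> description page stored on this wiki
        # "shared" -> file and description page live on Commons
        # ""       -> the file does not exist at all
        return page.get("imagerepository", "")

    print(file_repository("Datei:Wuerenlos_AG.jpg"))  # expected: shared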
Since people are redesigning the dumps right now, might I suggest that providing better integration with / information about Commons-hosted images would actually be useful. As far as I know, the current system has no way to distinguish between Commons images and missing images except by downloading the Commons dump files. That can be frustrating, since the Commons dumps are larger (and hence more trouble to work with) than all but a handful of other wikis.
-Robert Rohde
Robert Rohde wrote:
As far as I know, the current system has no way to distinguish between Commons images and missing images except by downloading the Commons dump files.
You only need the image.sql dump from Commons to determine whether an image exists there (it also includes other useful and not-so-useful data such as file type, image size, metadata, ...). http://download.wikimedia.org/commonswiki/20090510/
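As a rough illustration of that offline check, here is a minimal sketch (Python) that scans the gzipped image.sql dump for a file name instead of importing the table into MySQL; the dump file name below is an assumption based on the usual commonswiki-YYYYMMDD-image.sql.gz naming, and SQL escaping of unusual titles is ignored:

    # Sketch: check whether a file name appears in the Commons image.sql
    # dump without importing it into MySQL.  img_name is the first column
    # of every row, so an existing file shows up as ('File_name', ... in
    # the INSERT statements.  The dump file name is an assumption, and SQL
    # escaping of unusual titles (quotes etc.) is not handled.
    import gzip

    DUMP = "commonswiki-20090510-image.sql.gz"

    def image_in_commons_dump(file_name, dump_path=DUMP):
        needle = ("('" + file_name.replace(" ", "_") + "',").encode("utf-8")
        tail = b""
        with gzip.open(dump_path, "rb") as dump:
            while True:
                chunk = dump.read(1 << 20)          # 1 MiB at a time
                if not chunk:
                    return False
                if needle in tail + chunk:
                    return True
                tail = chunk[-(len(needle) - 1):]   # catch matches across chunks

    print(image_in_commons_dump("Wuerenlos_AG.jpg"))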
Platonides wrote:
You only need the image.sql dump from Commons to determine whether an image exists there.
This is strange: the image dump is larger than the pages-articles dump. I assume this is because the first dump is in SQL format and the second is in XML format, which is more efficient. Nevertheless, thanks for the hint; using that file, the import should be faster.
Christian
On 5/11/09 3:14 PM, Christian Reitwießner wrote:
This is strange: the image dump is larger than the pages-articles dump. I assume this is because the first dump is in SQL format and the second is in XML format, which is more efficient.
They contain completely different information, in different formats, with different compression.
-- brion
On Mon, May 11, 2009 at 3:07 PM, Platonides Platonides@gmail.com wrote:
You only need the image.sql dump from Commons to determine whether an image exists there.
That wouldn't get you file descriptions or copyright status, etc. If your goal is something like mirroring a wiki, you really need access to the description pages as well.
At present, the main solution is to copy all of Commons, which is overkill for many applications. It would be nice if the dump generator had a way of parsing out only the relevant Commons content.
-Robert Rohde
Robert Rohde wrote:
It would be nice if the dump generator had a way of parsing out only the relevant Commons content.
I'd expect a "commons selected dump" to be pretty similar to pages-articles. What you can do is request just the images used via Special:Export or the API (depending on how small those wikis really are, it may or may not be feasible).
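For a small wiki, that per-image approach might look roughly like the following sketch (Python), which POSTs a list of file description page titles to Commons' Special:Export; the title list and User-Agent string are only illustrative:

    # Sketch: fetch only the description pages of Commons-hosted files via
    # Special:Export instead of downloading the full Commons dump.  The
    # title list and User-Agent are illustrative; the exported XML is just
    # printed here rather than merged into a local dump.
    import urllib.parse
    import urllib.request

    EXPORT = "https://commons.wikimedia.org/wiki/Special:Export"

    def export_description_pages(titles):
        form = urllib.parse.urlencode({
            "pages": "\n".join(titles),   # one page title per line
            "curonly": "1",               # current revisions only
        }).encode("utf-8")
        req = urllib.request.Request(
            EXPORT,
            data=form,
            headers={"User-Agent": "offline-dump-check/0.1 (example)"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    print(export_description_pages(["File:Wuerenlos_AG.jpg"]))

The same list of titles can also be passed to the API's action=query with its export parameter, if the API route is preferred.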