On Sun, Oct 2, 2011 at 5:28 PM, Platonides <platonides@gmail.com> wrote:

Kilian Kluge wrote:

There is a problem with the German letter ß (the umlauts ä, ö and ü all
work): the pictures aren't downloaded. I don't have the time to look
into that now, so I'll keep downloading all the others and will try to find
a solution tomorrow (and then download the ß-ones).

So if your language has some special characters that are accepted on
Commons and widely used (unfortunately, the German word for street is
Straße ;) ) you should test that first. Once I have a workaround, I'll
let you know.

I don't even know how your script can work.
You get a list of images (e.g. "Foo.jpg Bar.jpg") and then you call wget with that. wget expects URLs, not filenames.

The list actually has the URLs in it; it's not a list of plain filenames. I should have mentioned that.

I suspect your ß problems are related to encodings. You are calling wget without even quoting the argument (quoting alone isn't enough, since you would also need to escape the quote characters, but it would be an improvement). I think you will have problems with &, ' and ". Also, you are executing that in the shell, so it seems you have a command injection vulnerability. I hope nobody called his file Monument`rm -rf /`.JPG :)
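
To illustrate the two points above, here is a minimal Python sketch, not the script under discussion (which isn't shown in this thread). It assumes the list holds full URLs and that wget is on the PATH. The function names percent_encode_path, build_wget_command and download are made up for this example, as is the example.org URL. The first function percent-encodes non-ASCII characters such as ß as UTF-8; the other two hand the URL to wget as a single argument without going through a shell, so &, quotes and backticks are never interpreted.

```python
import subprocess
from urllib.parse import quote, urlsplit


def percent_encode_path(url: str) -> str:
    """Percent-encode the path of a URL as UTF-8, leaving existing %XX escapes alone."""
    parts = urlsplit(url)
    # "/" stays a path separator; "%" is kept so that URLs which are
    # already encoded do not get encoded a second time.
    return parts._replace(path=quote(parts.path, safe="/%")).geturl()


def build_wget_command(url: str, dest_dir: str = ".") -> list[str]:
    """Return an argument list for wget; the URL is one argv entry, never shell text."""
    # "--" ends option parsing, so a URL starting with "-" is not read as a flag.
    return ["wget", "--directory-prefix", dest_dir, "--", url]


def download(url: str, dest_dir: str = ".") -> int:
    """Run wget without a shell and return its exit code."""
    # Passing a list (with the default shell=False) means no shell ever
    # parses the URL, so backticks, quotes and "&" stay literal.
    return subprocess.run(build_wget_command(url, dest_dir)).returncode


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(percent_encode_path("https://example.org/wiki/Straße.jpg"))
```

If the command really has to pass through a shell, Python's shlex.quote can quote a single argument, but avoiding the shell altogether is the simpler option.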

The ß problem kind of solved itself. I downloaded a new list with the tool Martin aka DerHexer provided (see elya's mail) and now everything's fine. It took about 12 hours to download all files without ß, now I'm running a slightly altered script that downloads the missing ones.

Calling wget separately for each file is a bit inefficient; I would recommend using wget -i if you can.
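
A rough sketch of the wget -i approach in Python, assuming the URLs are already in a list: write them to a file, one per line, and start wget once with --input-file (the long form of -i). The helper names and the urls.txt file name are placeholders invented for this example. A single wget run avoids starting a new process per image and lets wget reuse connections to the same host where the server allows it.

```python
import subprocess


def format_url_list(urls):
    """Return the text of a wget input file: one URL per line, blank entries skipped."""
    cleaned = [u.strip() for u in urls if u.strip()]
    return "".join(u + "\n" for u in cleaned)


def build_batch_command(list_path, dest_dir="."):
    """Return the argument list for one wget run over the whole URL file."""
    return ["wget", "--input-file", list_path, "--directory-prefix", dest_dir]


def download_all(urls, list_path="urls.txt", dest_dir="."):
    """Write the URL file, then call wget once for all of it; returns wget's exit code."""
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write(format_url_list(urls))
    # No shell is involved, so special characters in the URLs are never interpreted.
    return subprocess.run(build_batch_command(list_path, dest_dir)).returncode
```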

Hmm, time isn't really an issue and I'm like 90% done right now ;-) How much faster is it actually? The limiting factor is my internet connection anyway, isn't it? Even though it's really fast, I need between 1 and 4 seconds per image, so it took roughly 12 hours to download all the ones without ß (about 55 GB).
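
One way to approach the question is to work out the average throughput from the figures quoted above. This is a back-of-the-envelope Python sketch that assumes 55 GB means 55 * 10^9 bytes and that the 12 hours were spent downloading continuously; the function and variable names are made up for this example.

```python
def average_throughput(total_bytes, seconds):
    """Return the average transfer rate in bytes per second."""
    return total_bytes / seconds


TOTAL_BYTES = 55 * 10**9         # "about 55 GB", assuming decimal gigabytes
ELAPSED_SECONDS = 12 * 60 * 60   # "roughly 12 hours"

bytes_per_second = average_throughput(TOTAL_BYTES, ELAPSED_SECONDS)
megabits_per_second = bytes_per_second * 8 / 1e6

# At 1 to 4 seconds per image, 12 hours covers this many images.
images_at_4s = ELAPSED_SECONDS // 4
images_at_1s = ELAPSED_SECONDS // 1

if __name__ == "__main__":
    print(f"{bytes_per_second / 1e6:.2f} MB/s, {megabits_per_second:.1f} Mbit/s")
    print(f"between {images_at_4s} and {images_at_1s} images")
```

That comes out at about 1.27 MB/s, or roughly 10 Mbit/s. If the connection can sustain noticeably more than that, part of the time per image is probably per-request overhead rather than transfer time, and a single wget -i run might help; if not, the line itself is the limit.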

Thanks for your suggestions!

Kilian