Hi Platonides,
On Sun, Oct 2, 2011 at 5:28 PM, Platonides <platonides@gmail.com> wrote:
Kilian Kluge wrote:
There is a problem with the German letter ß (the umlauts ä, ö and ü all work): pictures with ß in their names aren't downloaded. I don't have time to look into that now, so I'll keep downloading all the others and try to find a solution tomorrow (and then download the ß ones).
So if your language has special characters that are accepted on Commons and widely used (unfortunately, the German word for street is Straße ;) ), you should test that first. Once I have a workaround, I'll let you know.
I don't even know how your script can work. You get a list of images (e.g. "Foo.jpg Bar.jpg") and then you call wget with that. wget expects URLs, not filenames.
The list actually has the URLs in it; it's not a list of plain filenames. I should have mentioned that.
I suspect your ß problems are related to encodings. You are calling wget without even quoting its argument (you would also need to escape any quote characters inside it, but quoting alone would already be an improvement). I think you will have problems with &, ' and ". Also, you are executing that in the shell, so it seems you have a command injection vulnerability. I hope nobody named their file Monument`rm -rf /`.JPG :)
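To illustrate the quoting and encoding points above, here is a minimal sketch in Python. It is not the script under discussion: the helper names (encode_url, wget_command, download) and the example.org URLs are made up for illustration, and it assumes the list holds full URLs whose paths are UTF-8. The idea is to percent-encode the path and to hand wget an argument list, so no shell ever sees the filename.

```python
import subprocess
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode non-ASCII and special characters in the URL path (hypothetical helper)."""
    parts = urlsplit(url)
    # safe="/%" leaves path separators and already-encoded sequences alone;
    # everything else outside the unreserved set (ß, spaces, backticks, ...) is encoded as UTF-8.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def wget_command(url):
    """Build the argument list for one download; "--" marks the end of options."""
    return ["wget", "--", encode_url(url)]


def download(url):
    # A list without shell=True is passed to wget directly, so characters such as
    # backticks, &, ' and " are never interpreted by a shell.
    subprocess.run(wget_command(url), check=True)
```

With this, a name like Monument`rm -rf /`.JPG reaches wget as a single encoded argument and is never executed.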
The ß problem kind of solved itself. I downloaded a new list with the tool Martin aka DerHexer provided (see elya's mail) and now everything's fine. It took about 12 hours to download all files without ß, now I'm running a slightly altered script that downloads the missing ones.
Calling wget once per file is a bit inefficient; I would recommend using wget -i if you can.
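A rough sketch of the wget -i approach, again in Python and under assumptions: the function names and the urls.txt filename are hypothetical, and the URLs are taken to be already encoded. It writes one URL per line to a file and starts wget a single time with -i (--input-file), which avoids spawning one process per image and may let wget reuse its connection to the server.

```python
import subprocess


def write_url_list(urls, list_path):
    """Write one URL per line, skipping blanks; this is the format wget -i reads."""
    count = 0
    with open(list_path, "w", encoding="utf-8") as handle:
        for url in urls:
            url = url.strip()
            if url:
                handle.write(url + "\n")
                count += 1
    return count


def download_all(urls, list_path="urls.txt"):
    # One wget process for the whole list instead of one per file.
    if write_url_list(urls, list_path) > 0:
        subprocess.run(["wget", "-i", list_path], check=True)
```

If bandwidth is the bottleneck, the gain is probably small; it mostly saves per-process and per-connection overhead.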
Hmm, time isn't really an issue and I'm like 90% done right now ;-) How much faster would it actually be? The limiting factor is my internet connection anyway, isn't it? Even though it's really fast, I need between 1 and 4 seconds per image, so it took roughly 12 hours to download all the ones without ß (about 55 GB).
Thanks for your suggestions!
Kilian