This is a triple-crosspost. I suggest you reply to wikitech-l only.
A mistake I made caused the loss of 496 full-resolution images from Wikimedia servers.
I have recovered as many images as I can, drawing on the following sources:
* Squid cache (pmtpa, knams and yaseo)
* May 8 backup of some wikis on storage1
* Duplicates with the same signature, found on the same or other wikis
That brought the number lost down from about 3000 to the current 496. For the remaining files, I made a copy of their thumbnail directories:
http://upload.wikimedia.org/lost-image-thumb-backup/
A list of missing images can be found here:
http://noc.wikimedia.org/~tstarling/missing-images-2008-09
If anyone has any ideas about where to find more backup files, I'd be willing to hear them. Otherwise, the community will just have to reupload as many as possible.
The technical details were as follows: I fixed a bug in File.php, and without checking what other changes were made to it, deployed the most recent version of the file on the Wikimedia servers, without also updating the rest of MediaWiki. Because FileRepo::$thumbDir was unset, LocalFile::migrateThumbFile() had the effect of deleting the source image for any thumbnail request which reached the backend. I reverted the change after about 20 minutes, following a report on IRC.
My sincere apologies.
-- Tim Starling
On Fri, Sep 5, 2008 at 11:11 AM, Tim Starling tstarling@wikimedia.org wrote:
This is a triple-crosspost. I suggest you reply to wikitech-l only.
A mistake I made caused the loss of 496 full-resolution images from Wikimedia servers.
I have recovered as many images as I can, drawing on the following sources:
- Squid cache (pmtpa, knams and yaseo)
- May 8 backup of some wikis on storage1
- Duplicates with the same signature, found on the same or other wikis
That brought the number lost down from about 3000 to the current 496. For the remaining files, I made a copy of their thumbnail directories:
http://upload.wikimedia.org/lost-image-thumb-backup/
A list of missing images can be found here:
http://noc.wikimedia.org/~tstarling/missing-images-2008-09
If anyone has any ideas about where to find more backup files, I'd be willing to hear them. Otherwise, the community will just have to reupload as many as possible.
At least one of them ( Clan_member_crest_badge_-_Clan_MacTavish.svg ) was reuploaded in a coincidence :-)
How about a script adding a message to the talk page of the respective uploader?
Magnus
2008/9/5 Tim Starling tstarling@wikimedia.org:
A mistake I made caused the loss of 496 full-resolution images from Wikimedia servers.
*facepalm* One of them just had to be the Flag of Palestine, didn't it ... ;-p
- d.
Could we perhaps get a list of links to these images' image pages? That way we might be able to recover a few of them by noting that their original source is still available.
2008/9/5 Andre Engels andreengels@gmail.com:
Could we perhaps get a list of links to these images' image pages? That way we might be able to recover a few of them by noting that their original source is still available.
I've made the list of links at http://meta.wikimedia.org/wiki/Missing_images_2008-09
(Just by formatting the original list http://noc.wikimedia.org/~tstarling/missing-images-2008-09)
-- [[cs:User:Mormegil | Petr Kadlec]]
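A minimal sketch of how such a list of links could be derived from upload paths, assuming each entry has the layout /<project>/<lang>/<hash>/<hash>/<name> and that image pages live in the "Image:" namespace. The function name image_page_url is hypothetical, and the real list may contain entries (archive paths, for instance) that do not fit this pattern.

```python
from urllib.parse import quote


def image_page_url(upload_path: str) -> str:
    """Map an upload path such as /wikipedia/ja/0/01/Foo.jpg to an image page URL.

    Assumes the layout /<project>/<lang>/<hash>/<hash>/<name>.
    """
    parts = upload_path.strip("/").split("/")
    if len(parts) != 5:
        raise ValueError(f"unexpected path layout: {upload_path!r}")
    project, lang, _h1, _h2, name = parts
    # Assumption: Commons is served from commons.wikimedia.org,
    # everything else from <lang>.<project>.org.
    if lang == "commons":
        host = "commons.wikimedia.org"
    else:
        host = f"{lang}.{project}.org"
    return f"http://{host}/wiki/Image:{quote(name)}"
```

Paths that do not match the assumed layout raise ValueError rather than producing a guessed link.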
Petr Kadlec wrote:
2008/9/5 Andre Engels andreengels@gmail.com:
Could we perhaps get a list of links to these images' image pages?
I've made the list of links at http://meta.wikimedia.org/wiki/Missing_images_2008-09
And I've made a list augmented with each image's uploader/editor(s). http://meta.wikimedia.org/wiki/Missing_images_%2B_editors_2008-09
On Fri, Sep 5, 2008 at 6:16 PM, Steve Summit scs@eskimo.com wrote:
Petr Kadlec wrote:
2008/9/5 Andre Engels andreengels@gmail.com:
Could we perhaps get a list of links to these images' image pages?
I've made the list of links at http://meta.wikimedia.org/wiki/Missing_images_2008-09
And I've made a list augmented with each image's uploader/editor(s). http://meta.wikimedia.org/wiki/Missing_images_%2B_editors_2008-09
It turned out that I had even more than I thought, thanks to Platonides, who has been running a bot on my system that has the stale mirror. The bot has been patiently mirroring every file uploaded to Commons. The files in his directory added another 150 to the 308 that I had, and a number of other people filled in some as well.
I'm still generating SHA1SUMs, so I may still find a few more based on content hashes.
The last concrete number I heard was 47 missing, but I think it's probably fewer than that now.
I'm not sure I'd call 496, out of however many hundreds of thousands of images we have, "massive".
Is there enough metainformation available to derive the uploaders or recent editors of the lost images? That'd make it much easier for concerned editors to grep -- er, search :-) -- for images they might be in a position to reupload.
On Fri, Sep 5, 2008 at 6:11 AM, Tim Starling tstarling@wikimedia.org wrote:
This is a triple-crosspost. I suggest you reply to wikitech-l only.
A mistake I made caused the loss of 496 full-resolution images from Wikimedia servers.
[snip]
http://noc.wikimedia.org/~tstarling/missing-images-2008-09
If anyone has any ideas about where to find more backup files, I'd be willing to hear them. Otherwise, the community will just have to reupload as many as possible.
[snip]
I have 30 of the 496 images in that list based on an exact path match. It's possible that I have more based on hash matches for images which were moved between sites or 'renamed' after my last sync.
I have some chores to run, but I will later pull the hashes from the database and check for hash matches.
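A rough sketch of what a hash-match check could look like, assuming the wanted hashes are available as hex SHA-1 strings keyed to the missing file names. The function names and the shape of the wanted mapping are illustrative assumptions, not the tooling actually used.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 of a file, reading it in chunks."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_by_hash(mirror_root: Path, wanted: dict[str, str]) -> dict[str, Path]:
    """wanted maps hex SHA-1 -> missing file name; returns name -> local copy.

    Matching on content means a file that was renamed or moved between
    sites after the last sync is still found.
    """
    found: dict[str, Path] = {}
    for path in sorted(mirror_root.rglob("*")):
        if path.is_file():
            name = wanted.get(sha1_of(path))
            if name is not None and name not in found:
                found[name] = path
    return found
```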
I would likely have had nearly all of them if the rsync push to me had not been down most of the year.
:(
Gregory Maxwell wrote:
On Fri, Sep 5, 2008 at 6:11 AM, Tim Starling tstarling@wikimedia.org wrote:
This is a triple-crosspost. I suggest you reply to wikitech-l only.
^^^^^^^^^^^^^^^^^^^^^^^^ I think some people missed this line.
A mistake I made caused the loss of 496 full-resolution images from Wikimedia servers.
[snip]
http://noc.wikimedia.org/~tstarling/missing-images-2008-09
If anyone has any ideas about where to find more backup files, I'd be willing to hear them. Otherwise, the community will just have to reupload as many as possible.
[snip]
I have 30 of the 496 images in that list based on an exact path match. It's possible that I have more based on hash matches for images which were moved between sites or 'renamed' after my last sync.
I have some chores to run, but I will later pull the hashes from the database and check for hash matches.
I would likely have had nearly all of them if the rsync push to me had not been down most of the year.
If it helps, this file has the hashes already:
http://noc.wikimedia.org/~tstarling/pass-3-targets-hashes
-- Tim Starling
On Fri, Sep 5, 2008 at 8:54 AM, Tim Starling tstarling@wikimedia.org wrote:
If it helps, this file has the hashes already: http://noc.wikimedia.org/~tstarling/pass-3-targets-hashes
Thanks. Saved me a step… and fortunately I already had base conversion code handy.
Sadly, it takes a long time to SHA1 many terabytes of data. I started the process this morning, but I had wrongly assumed that the xargs parallel argument (-P) wouldn't produce badly interleaved output, since it didn't in a limited test. It turns out it did, so I had to start the hashing over again.
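One way to sidestep interleaved output is to let the workers only compute and have a single writer emit whole lines. The following is a sketch of that idea in Python rather than xargs; hash_tree and sha1_line are hypothetical names, and the "hash, two spaces, path" line format is an assumption modelled on sha1sum output.

```python
import hashlib
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha1_line(path: Path) -> str:
    """Hash one file and return a complete 'hash  path' line."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return f"{h.hexdigest()}  {path}"


def hash_tree(root: Path, workers: int = 4, out=sys.stdout) -> None:
    """Hash every file under root in parallel, writing one line per file."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, and only this thread
        # writes, so every line reaches the output intact.
        for line in pool.map(sha1_line, files):
            out.write(line + "\n")
```

The workers never touch the output stream, so there is nothing for them to interleave.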
(Might I suggest, beyond not invoking unlink(), that if your filesystem can handle some additional inode pressure, you make daily or weekly hardlink snapshots in a directory tree inaccessible to the web front end? It's not as good as a real backup system, but it's cheap and easy. On my system (xfs) I have a dozen or so hardlink snapshots of the Wikimedia image collection: while I was getting updates, I was creating snapshots which roughly coincided with the released database dumps.)
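A minimal sketch of a hardlink snapshot, assuming the snapshot directory sits on the same filesystem as the image tree and outside it. The function name hardlink_snapshot is hypothetical.

```python
import os
from pathlib import Path


def hardlink_snapshot(source: Path, snapshot: Path) -> int:
    """Recreate source's directory tree under snapshot, hard-linking each file.

    Returns the number of files linked. No file data is copied; each
    snapshot costs only directory entries.
    """
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(source):
        target_dir = snapshot / Path(dirpath).relative_to(source)
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            os.link(Path(dirpath) / name, target_dir / name)
            linked += 1
    return linked
```

A snapshot like this survives an unlink() of the original, because the data stays reachable through the second link. It does not protect against a file being overwritten in place, since both names point at the same inode.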
Since the hashing is going to take a while I'll hop on IRC and pass you a link to a tar with the file name matches. Turns out that I have *most* of them based on name match alone. (dunno why my earlier count was wrong… perhaps a unicode handling bug on my part, I'd just woken up when I sent my prior email)
On Fri, Sep 5, 2008 at 6:11 AM, Tim Starling tstarling@wikimedia.org wrote:
[snip]
The technical details were as follows: I fixed a bug in File.php, and without checking what other changes were made to it, deployed the most recent version of the file on the Wikimedia servers, without also updating the rest of MediaWiki. Because FileRepo::$thumbDir was unset, LocalFile::migrateThumbFile() had the effect of deleting the source image for any thumbnail request which reached the backend. I reverted the change after about 20 minutes, following a report on IRC.
My sincere apologies.
-- Tim Starling
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
And mine as well. I introduced the $thumbDir code in r40385. I thought I had set a sane default of 'thumb/' in the constructor (which would work per the current behavior of using a hardcoded 'thumb/'). Is there a code path in which $thumbDir isn't being set? If so, that needs fixing asap. Would a revert be in order, or is everything ok as-is?
-Chad
Chad wrote:
On Fri, Sep 5, 2008 at 6:11 AM, Tim Starling tstarling@wikimedia.org wrote:
[snip]
The technical details were as follows: I fixed a bug in File.php, and without checking what other changes were made to it, deployed the most recent version of the file on the Wikimedia servers, without also updating the rest of MediaWiki. Because FileRepo::$thumbDir was unset, LocalFile::migrateThumbFile() had the effect of deleting the source image for any thumbnail request which reached the backend. I reverted the change after about 20 minutes, following a report on IRC.
My sincere apologies.
And mine as well. I introduced the $thumbDir code in r40385. I thought I had set a sane default of 'thumb/' in the constructor (which would work per the current behavior of using a hardcoded 'thumb/'). Is there a code path in which $thumbDir isn't being set? If so, that needs fixing asap. Would a revert be in order, or is everything ok as-is?
If you had followed my example and used an accessor function, instead of having the File class access member variables of the repo directly, then there would have been no problem. Adding an accessor is good style in any case, and you should make that change. But it wasn't your fault.
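The accessor idea can be sketched roughly as follows. This is Python rather than MediaWiki's PHP, and the classes below are simplified, hypothetical stand-ins for FileRepo and File, not the real code. The point illustrated is only that the caller never reads the raw member, so an unset value has a single place where it is handled.

```python
class FileRepo:
    DEFAULT_THUMB_DIR = "thumb"

    def __init__(self, root, thumb_dir=None):
        self.root = root
        # May be None, e.g. when the repo was configured without it.
        self._thumb_dir = thumb_dir

    def get_thumb_dir(self):
        # The one place that decides what an unset value means.
        return self._thumb_dir or self.DEFAULT_THUMB_DIR


class File:
    def __init__(self, repo, rel_path):
        self.repo = repo
        self.rel_path = rel_path

    def get_thumb_path(self):
        # Goes through the accessor instead of reading repo._thumb_dir.
        return f"{self.repo.root}/{self.repo.get_thumb_dir()}/{self.rel_path}"
```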
I patched two files in quick succession: GlobalFunctions.php and then File.php. With GlobalFunctions.php, I checked the diff carefully for any dependencies before I updated it on Wikimedia. There were no changes other than my own. With File.php, I assumed it would be OK and didn't check. I didn't think about it at the time, I was working quickly. Call it cognitive bias, loss of concentration, laziness, whatever. Not your fault.
There was a second programming error here, and that was the fact that I put an unlink() call in the code in the first place. It didn't seem dangerous at the time, but obviously migrateThumbFile() is a recipe for disaster if there's a potential for adverse input coming from getThumbPath().
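One defensive option, sketched here under assumptions, is to refuse to delete anything that does not resolve to a regular file inside the thumbnail tree. The helper unlink_thumb is hypothetical and is not part of MediaWiki.

```python
from pathlib import Path


def unlink_thumb(candidate: Path, thumb_root: Path) -> bool:
    """Delete candidate only if it is a regular file strictly inside thumb_root.

    Returns True if the file was deleted, False if the request was refused.
    """
    candidate = candidate.resolve()
    thumb_root = thumb_root.resolve()
    if thumb_root not in candidate.parents:
        return False  # outside the thumbnail tree: never touch it
    if not candidate.is_file():
        return False
    candidate.unlink()
    return True
```

With a check like this, a bad path coming from upstream results in a refused delete rather than a lost source file.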
However, the thumb directory is inherently temporary, and lots of things delete from it. I think I'd be most comfortable not having the thumbDir feature at all. Is there some reason for it?
-- Tim Starling
On Fri, Sep 5, 2008 at 11:14 AM, Tim Starling tstarling@wikimedia.org wrote:
[snip]
However, the thumb directory is inherently temporary, and lots of things delete from it. I think I'd be most comfortable not having the thumbDir feature at all. Is there some reason for it?
-- Tim Starling
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Customization options for sysadmins. Reverted in r40504. Largely useless unless someone can think up a use-case for _needing_ it to be a different location than /thumb.
In any case, I've removed it pending a reason for it (or at least a better implementation with accessors and the like).
-Chad
Tim Starling wrote: <snip>
A list of missing images can be found here:
I used your list to generate a basic gallery :
http://noc.wikimedia.org/~hashar/200809-missing/
Maybe it can help people.
On Fri, 05 Sep 2008 20:11:10 +1000, Tim Starling wrote:
This is a triple-crosspost. I suggest you reply to wikitech-l only.
A mistake I made caused the loss of 496 full-resolution images from Wikimedia servers.
I have recovered as many images as I can, drawing on the following sources:
- Squid cache (pmtpa, knams and yaseo)
- May 8 backup of some wikis on storage1
- Duplicates with the same signature, found on the same or other wikis
That brought the number lost down from about 3000 to the current 496. For the remaining files, I made a copy of their thumbnail directories:
http://upload.wikimedia.org/lost-image-thumb-backup/
A list of missing images can be found here:
http://noc.wikimedia.org/~tstarling/missing-images-2008-09
If anyone has any ideas about where to find more backup files, I'd be willing to hear them. Otherwise, the community will just have to reupload as many as possible.
The technical details were as follows: I fixed a bug in File.php, and without checking what other changes were made to it, deployed the most recent version of the file on the Wikimedia servers, without also updating the rest of MediaWiki. Because FileRepo::$thumbDir was unset, LocalFile::migrateThumbFile() had the effect of deleting the source image for any thumbnail request which reached the backend. I reverted the change after about 20 minutes, following a report on IRC.
My sincere apologies.
-- Tim Starling
I just checked that list with my collection; it looks like I've got about 250 of them. Is there someplace I can drop a tarball or something?
Steve Sanbeg wrote:
I just checked that list with my collection; it looks like I've got about 250 of them. Is there someplace I can drop a tarball or something?
We don't have any FTP upload server set up if that's what you mean. The easiest thing would be if you could set up an HTTP server that I can download the tarball from. If that's not feasible, grab me on IRC and we'll sort something out.
-- Tim Starling
An "out of the blue" idea that I haven't checked: Are those pages stored in archive.org? Because if yes, then a copy of the image may also be there.
Hojjat (aka Huji)
On Sat, Sep 6, 2008 at 12:26 PM, Huji huji.huji@gmail.com wrote:
An "out of the blue" idea that I haven't checked: Are those pages stored in archive.org? Because if yes, then a copy of the image may also be there.
I think what you'll find is that most mirrors and copies do not have the full resolution image. I think we had thumbs for all of the remainders already.
On 9/6/08, Gregory Maxwell gmaxwell@gmail.com wrote:
On Sat, Sep 6, 2008 at 12:26 PM, Huji huji.huji@gmail.com wrote:
An "out of the blue" idea that I haven't checked: Are those pages stored in archive.org? Because if yes, then a copy of the image may also be there.
I think what you'll find is that most mirrors and copies do not have the full resolution image. I think we had thumbs for all of the remainders already.
Well I was hoping otherwise. I hoped that some crawlers like archive.org may store not only the image page, but also the full res image (which is linked from the image page). I tested some examples, and it seems they don't even store the image pages!
Huji
On Mon, Sep 08, 2008 at 04:10:48PM +0100, MinuteElectron wrote:
Huji wrote:
I tested some examples, and it seems they don't even store the image pages!
The Wayback Machine does not release archived data until six months after it is captured.
And, annoyingly enough, they also block access to data if a current robots.txt says to... even if the domain has changed hands. That makes little sense to me, but what can you do; they have a staff of, what, 6?
Cheers, -- jra
Whoops... it is actually a miracle that big mistakes like this haven't occurred before over the years! Which says a lot about the high quality of the developers and maintainers of the site. Don't worry too much, Tim, it will work itself out. No more 24-hour days behind the computer though ;)
Walter van Kalken (waerth)
Tim Starling wrote:
A list of missing images can be found here:
Some images on ja.wikipedia seem to be lost too. Is it due to the same accident?
/wikipedia/ja/0/01/Shinkiryu_station.jpg /wikipedia/ja/0/02/Totsuka_station_nishiguchi.jpg /wikipedia/ja/0/04/Tsurugamineeki.jpg ... and at least a hundred more missing.
Brevam wrote:
Tim Starling wrote:
A list of missing images can be found here:
Some images on ja.wikipedia seem to be lost too. Is it due to the same accident?
/wikipedia/ja/0/01/Shinkiryu_station.jpg /wikipedia/ja/0/02/Totsuka_station_nishiguchi.jpg /wikipedia/ja/0/04/Tsurugamineeki.jpg ... and at least a hundred more missing.
That's an... interesting failure mode. That makes three different types of breakage I've seen so far: missing files, empty files and now directory entries where there should be files. Are these really all from the same bug?
Ilmari Karonen wrote:
Brevam wrote:
Tim Starling wrote:
A list of missing images can be found here:
Some images on ja.wikipedia seem to be lost too. Is it due to the same accident?
/wikipedia/ja/0/01/Shinkiryu_station.jpg /wikipedia/ja/0/02/Totsuka_station_nishiguchi.jpg /wikipedia/ja/0/04/Tsurugamineeki.jpg ... and at least a hundred more missing.
That's an... interesting failure mode. That makes three different types of breakage I've seen so far: missing files, empty files and now directory entries where there should be files. Are these really all from the same bug?
Yes. The bug itself deleted the file and put a directory entry in its place. I wrote a shell script to remove the directory entry and then do a wget to fetch the file from the squid cache. Wget created a zero-length file for all the cache misses. Some of those files were subsequently deleted.
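A sketch of how the zero-length stubs could be avoided: write the fetched data to a temporary file and move it into place only when the fetch returned something. The fetch callable stands in for whatever retrieves the file from the cache; it and the function name restore_file are assumptions for illustration. The sketch also assumes the stray directory entry has already been removed.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def restore_file(dest: Path, fetch: Callable[[], Optional[bytes]]) -> bool:
    """Write dest only if fetch() returns non-empty data; never leave a stub."""
    data = fetch()
    if not data:
        return False  # cache miss: leave nothing behind
    # Temporary file in the same directory, so the final rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, dest)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return True
```

A cache miss then leaves the path absent, which is easier to detect later than an empty file.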
-- Tim Starling
On Sat, Sep 20, 2008 at 9:38 PM, Tim Starling tstarling@wikimedia.org wrote:
Yes. The bug itself deleted the file and put a directory entry in its place. I wrote a shell script to remove the directory entry and then do a wget to fetch the file from the squid cache. Wget created a zero-length file for all the cache misses. Some of those files were subsequently deleted.
So does that mean there are more files which were not included in the prior list of missing files that I should check for?
Gregory Maxwell wrote:
On Sat, Sep 20, 2008 at 9:38 PM, Tim Starling tstarling@wikimedia.org wrote:
Yes. The bug itself deleted the file and put a directory entry in its place. I wrote a shell script to remove the directory entry and then do a wget to fetch the file from the squid cache. Wget created a zero-length file for all the cache misses. Some of those files were subsequently deleted.
So does that mean there are more files which were not included in the prior list of missing files that I should check for?
The list of missing files was derived from the initial scan for directory entries where files should have been. I'm not sure why files would have been missing from that list. We'll probably have to check the whole file repository against the DB. You could write a script for that if you feel like doing something to help.
-- Tim Starling
Tim Starling wrote:
Gregory Maxwell wrote:
On Sat, Sep 20, 2008 at 9:38 PM, Tim Starling tstarling@wikimedia.org wrote:
Yes. The bug itself deleted the file and put a directory entry in its place. I wrote a shell script to remove the directory entry and then do a wget to fetch the file from the squid cache. Wget created a zero-length file for all the cache misses. Some of those files were subsequently deleted.
So does that mean there are more files which were not included in the prior list of missing files that I should check for?
The list of missing files was derived from the initial scan for directory entries where files should have been. I'm not sure why files would have been missing from that list. We'll probably have to check the whole file repository against the DB. You could write a script for that if you feel like doing something to help.
Just running something like "find -type d" on the image directory and filtering out the expected legitimate entries would be a good start.
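A rough Python equivalent of that scan, as a sketch: walk the image tree, skip the subtrees where directories named after files are expected, and report any remaining directory whose name looks like an uploaded file. The suffix list and the skipped directory names are assumptions and would need adjusting to the real layout.

```python
import os
from pathlib import Path

# Assumption: a small sample of upload extensions, not the full list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".ogg"}


def suspicious_dirs(image_root, skip_names=frozenset({"thumb", "temp"})):
    """Return directories whose names look like uploaded files.

    Subtrees named in skip_names are pruned, since thumbnail directories
    are legitimately named after their source file.
    """
    hits = []
    for dirpath, dirnames, _filenames in os.walk(image_root):
        # Prune in place so os.walk does not descend into skipped trees.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_names)
        for d in dirnames:
            if Path(d).suffix.lower() in IMAGE_SUFFIXES:
                hits.append(Path(dirpath) / d)
    return hits
```

The "archive" subtree is deliberately not skipped here, so stray directories there are reported too.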
By the way, there also seem to be plenty of these under the "archive" directory, e.g. /wikipedia/en/archive/0/00/20060414204303!Uakari_male.jpg/ This probably has something to do with the problems we've been having with thumbnail generation in image histories.
wikitech-l@lists.wikimedia.org