Hi,
I was told yesterday that the mailman/pipermail archives were broken, in that permalinks were no longer linking to the messages they used to link to (therefore not being "permalinks" at all).
I know this happened at least once in the past, when the archives were rebuilt. Retroactively fixing permalinks on-wiki and elsewhere is a nightmare (particularly for old messages used to source early Wikimedia history), and we're still finding tons of obsolete links today. I'm hoping that whatever caused the permalinks to be changed again can be swiftly reverted, so that we don't end up with another huge pile of obsolete links.
Does anyone have any more information about what happened this time, and if there's any chance links will be returned to their previous state? I haven't been able to find a thread or recent bug about this issue.
Thanks,
On Thu, Aug 16, 2012 at 2:00 AM, Guillaume Paumier gpaumier@wikimedia.org wrote:
I was told yesterday that the mailman/pipermail archives were broken, in that permalinks were no longer linking to the messages they used to link to (therefore not being "permalinks" at all).
Hi Guillaume,
The last time we had to rebuild the archives was about 2 weeks ago. Unfortunately this is a major drawback of removing messages from archives, as you pointed out, and we are aware of it. We had a thread there, though, that contained private information, and we did not want to refuse the request of the person affected to remove their data. A subsequent request that followed shortly after was actually rejected for this very reason. In the future such requests will more likely be rejected, and if removal is unavoidable we will just XXX out the information instead of removing complete threads, to avoid this happening again. Everybody on this list, please be extra careful about posting private information to a public list that you might regret in the future. Sorry for breaking links; we are aware URLs should never change if at all possible.
The reference ticket is RT-3281.
Thanks Daniel. I don't understand, how can a message need to be removed completely? I can't imagine anything which couldn't just be redacted by leaving at least the message's "skeleton" as demanded by https://wikitech.wikimedia.org/view/Remove_a_message_from_mailing_list_archive
Nemo
On Thu, Aug 16, 2012 at 1:49 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Thanks Daniel. I don't understand, how can a message need to be removed completely?
In this case the request was for a complete thread to be removed. Since many people reply with full quotes it usually repeats the information in almost every message. ("TOFU"-posting). But you are right, even in these cases we should, and will, just replace content of every message with a "deleted" message.
Daniel Zahn wrote:
On Thu, Aug 16, 2012 at 1:49 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Thanks Daniel. I don't understand, how can a message need to be removed completely?
In this case the request was for a complete thread to be removed. Since many people reply with full quotes it usually repeats the information in almost every message. ("TOFU"-posting). But you are right, even in these cases we should, and will, just replace content of every message with a "deleted" message.
What is your plan to clean up the mess you made?
MZMcBride
On 17 August 2012 02:23, MZMcBride z@mzmcbride.com wrote:
Daniel Zahn wrote:
In this case the request was for a complete thread to be removed. Since many people reply with full quotes it usually repeats the information in almost every message. ("TOFU"-posting). But you are right, even in these cases we should, and will, just replace content of every message with a "deleted" message.
What is your plan to clean up the mess you made?
Rewrite the sucky archiver in Mailman?
One thing I would like to see is Google indexing of the WMF archive enabled again. All the third-party archives not under our control are in the search engines, there's not actually any sane reason not to have the official archive indexed - unless it's just to reduce the noise of complaints from people who erroneously think it's possible to remove their own words from the Internet. (We used to substitute it with ht://dig, which was so incredibly awful that nothing at all was a reasonable alternative.)
- d.
I doubt fixing this requires rewriting mailman. It only requires dummy messages to be reinserted where they've been deleted and the archives to be rebuilt after this, just as if the correct procedure had been followed from the start. This, by the way, is by some orders of magnitude easier and quicker than fixing all the thousands of broken links across all the wikis.
While we're at it, maybe someone will understand why the August archive is now full of "no subject" emails which seem to come from other eras and have the most random ids. http://lists.wikimedia.org/pipermail/wikitech-l/2012-August/thread.html#1052
Nemo
Hi,
On Fri, Aug 17, 2012 at 12:38 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
I doubt fixing this requires rewriting mailman. It only requires dummy messages to be reinserted where they've been deleted and the archives to be rebuilt after this
I've added your suggestion to a new RT ticket to "Attempt to fix mailman/pipermail permalinks", and will let the list know if it's not possible.
On Fri, Aug 17, 2012 at 3:38 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
I doubt fixing this requires rewriting mailman. It only requires dummy messages to be reinserted where they've been deleted and the archives to be rebuilt after this, just as if the correct procedure had been followed from the start.
7 messages have been deleted. 4 have been between the messages "Code review backlog.." by Jeroen and "Daring to consider .." by Roan. 3 have been between "Code review backlog .." by Daniel Friesen and "Save to userspace.." by PetrB.
I have inserted 7 fake messages in exactly these places, keeping the original message IDs, "in-reply-to" and timestamps.
I am rebuilding the archives again right now but it takes a while. I really hope this fixes it now.
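For anyone repeating this in the future, a placeholder of the kind described above could be built roughly as follows. This is only a sketch in Python, not what was actually run on the server: the function name, the stub sender address, the subject and the body text are all made up for illustration. It keeps the Message-ID, In-Reply-To and Date headers so that the stub can occupy the slot of the original when the archive is sorted and threaded.

```python
from email.message import Message

# Headers kept so the stub sorts and threads where the original did.
KEPT_HEADERS = ("Message-ID", "In-Reply-To", "Date")


def make_placeholder(original, sender="root@example.org"):
    """Build a content-free stand-in for `original` (hypothetical helper)."""
    stub = Message()
    stub["From"] = sender                  # assumed stub sender
    stub["Subject"] = "(message removed)"  # assumed subject line
    for name in KEPT_HEADERS:
        value = original.get(name)
        if value is not None:
            stub[name] = value
    # The private body is dropped entirely and replaced by a notice.
    stub.set_payload("This message has been removed from the archive.\n")
    return stub
```

Python's standard mailbox.mbox class allows a message to be assigned to an existing key, so a stub like this could replace the original in place in a copy of the .mbox file before the rebuild.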
David Gerard wrote:
On 17 August 2012 02:23, MZMcBride z@mzmcbride.com wrote:
Daniel Zahn wrote:
In this case the request was for a complete thread to be removed. Since many people reply with full quotes it usually repeats the information in almost every message. ("TOFU"-posting). But you are right, even in these cases we should, and will, just replace content of every message with a "deleted" message.
What is your plan to clean up the mess you made?
Rewrite the sucky archiver in Mailman?
I always figured it was a "feature" of Mailman that it's so difficult to modify the archives. They're really not supposed to be tampered with.
One thing I would like to see is Google indexing of the WMF archive enabled again. All the third-party archives not under our control are in the search engines, there's not actually any sane reason not to have the official archive indexed - unless it's just to reduce the noise of complaints from people who erroneously think it's possible to remove their own words from the Internet. (We used to substitute it with ht://dig, which was so incredibly awful that nothing at all was a reasonable alternative.)
Yes, this probably makes sense. Bugzilla went the same route (excluded from search engines, everyone relied on mirrors of the wikibugs-l mailing list, finally allowed back in to search engine indices).
The situation is even more bleak for private lists. With those lists, there's no way to search the lists at all, as they're excluded from external search engines indices and the internal search has been disabled for years. The relevant bug is https://bugzilla.wikimedia.org/17390.
As MaxSem commented, perhaps Mailman ought to be re-evaluated as the mailing list software, though I've yet to come across (m)any software packages that are better, unfortunately.
MZMcBride
On 17 August 2012 11:46, MZMcBride z@mzmcbride.com wrote:
As MaxSem commented, perhaps Mailman ought to be re-evaluated as the mailing list software, though I've yet to come across (m)any software packages that are better, unfortunately.
There isn't really anything better. It's ridiculously better than any of its predecessors, which I recall with a shudder.
- d.
On 17 August 2012 12:17, David Gerard dgerard@gmail.com wrote:
On 17 August 2012 11:46, MZMcBride z@mzmcbride.com wrote:
As MaxSem commented, perhaps Mailman ought to be re-evaluated as the mailing list software, though I've yet to come across (m)any software packages that are better, unfortunately.
There isn't really anything better. It's ridiculously better than any of its predecessors, which I recall with a shudder.
Lamson/Librelist is pretty good (and a LOT more recent - couple of years old at most).
https://github.com/zedshaw/lamson/tree/master/examples/librelist
Tom
On Fri, Aug 17, 2012 at 7:17 AM, David Gerard dgerard@gmail.com wrote:
On 17 August 2012 11:46, MZMcBride z@mzmcbride.com wrote:
As MaxSem commented, perhaps Mailman ought to be re-evaluated as the mailing list software, though I've yet to come across (m)any software packages that are better, unfortunately.
There isn't really anything better. It's ridiculously better than any of its predecessors, which I recall with a shudder.
I think none of our problems (that I've seen mentioned here so far, at least) will be fixed in Mailman 2 releases; OTOH, Mailman 3 isn't that far away IIRC (but I don't know the timeline exactly).
-Jeremy
Hi,
On Thu, Aug 16, 2012 at 7:07 PM, Daniel Zahn dzahn@wikimedia.org wrote:
The last time we had to rebuild the archives was about 2 weeks ago. Unfortunately this is a major drawback of removing messages from archives, as you pointed out, and we are aware of it. We had a thread there, though, that contained private information, and we did not want to refuse the request of the person affected to remove their data. A subsequent request that followed shortly after was actually rejected for this very reason. In the future such requests will more likely be rejected, and if removal is unavoidable we will just XXX out the information instead of removing complete threads, to avoid this happening again. Everybody on this list, please be extra careful about posting private information to a public list that you might regret in the future. Sorry for breaking links; we are aware URLs should never change if at all possible.
Thank you for the explanation, Daniel.
Guillaume Paumier wrote:
I was told yesterday that the mailman/pipermail archives were broken, in that permalinks were no longer linking to the messages they used to link to (therefore not being "permalinks" at all).
This is pretty devastating. It's difficult to overstate the importance of Mailman archives in documenting Wikimedia's history (or even history before Wikimedia was a concept). I've come across links such as the one at https://en.wikipedia.org/wiki/Wikipedia:Tim_Starling_Day that I can't even find anywhere in the Mailman archives any longer. :-(
MZMcBride
On Fri, Aug 17, 2012 at 4:26 AM, MZMcBride z@mzmcbride.com wrote:
Guillaume Paumier wrote:
I was told yesterday that the mailman/pipermail archives were broken, in that permalinks were no longer linking to the messages they used to link to (therefore not being "permalinks" at all).
This is pretty devastating. It's difficult to overstate the importance of Mailman archives in documenting Wikimedia's history (or even history before Wikimedia was a concept). I've come across links such as the one at https://en.wikipedia.org/wiki/Wikipedia:Tim_Starling_Day that I can't even find anywhere in the Mailman archives any longer. :-(
MZMcBride
Many historical Signpost articles are affected as well: https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=piper...
BTW, here's Brion dreaming about a stable archiving system in 2007 ... http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/28993
In the same year, the lead developer of Mailman said that fixing this problem of breaking URLs was "absolutely critical" (http://mail.python.org/pipermail/mailman-developers/2007-July/019632.html ) and some ideas were thrown around (http://wiki.list.org/display/DEV/Stable+URLs ), but apparently this huge data integrity problem still hasn't been solved.
Tilman, thanks for those links.
I thought the base-32 encoded hash of Message-Id discussed in [http://wiki.list.org/display/DEV/Stable+URLs] gives us a straightforward and effective solution to the problem. Ten characters or so should be plenty. This would produce URLs like, [http://lists.wikimedia.org/pipermail/wikitech-l/2012-August/OHRDQGOX35.html]
We could prefix these with a parent directory that serves as a versioning scheme for our hash, allowing us to create forwarding rules if the permalink rules change in the future. For example (and I have no experience, this might not work), we can generate an ".htaccess" at the root of old archive directories, which redirects each of the old sequential URLs to the new, hashed location.
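To make the idea concrete, here is a rough Python sketch of both halves: deriving a short base-32 name from the Message-Id, and emitting one redirect line per old sequential URL. Everything in it is an assumption for illustration: SHA-1 as the hash, stripping the angle brackets before hashing, the "v1" version directory and where it sits in the path, and the function names. The redirect line uses Apache's Redirect directive, and whether that fits the actual server setup is untested.

```python
import base64
import hashlib

HASH_LENGTH = 10  # "ten characters or so"


def message_id_hash(message_id, length=HASH_LENGTH):
    """Base-32 encoded SHA-1 of the Message-Id, truncated to `length` chars."""
    # Strip the angle brackets so "<a@b>" and "a@b" give the same name.
    bare = message_id.strip().strip("<>")
    digest = hashlib.sha1(bare.encode("utf-8")).digest()
    return base64.b32encode(digest).decode("ascii")[:length]


def redirect_rule(list_name, month, old_number, message_id, version="v1"):
    """One redirect line mapping an old sequential URL to the hashed one."""
    old_path = f"/pipermail/{list_name}/{month}/{old_number:06d}.html"
    new_path = (
        f"/pipermail/{list_name}/{version}/{month}/"
        f"{message_id_hash(message_id)}.html"
    )
    return f"Redirect permanent {old_path} {new_path}"


# Hypothetical Message-Id, only to show the shape of the output.
print(redirect_rule("wikitech-l", "2012-August", 61691, "<abc@example.org>"))
```

Because the name depends only on the Message-Id, rebuilding the archive or removing other messages would not change it, which is the whole point of the scheme.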
-Adam
Mailman 3 already has code to add this X-Message-ID-Hash header, and integrate with mail archiving tools.
-Adam
On Fri, Aug 17, 2012 at 8:00 AM, Tilman Bayer tbayer@wikimedia.org wrote:
Many historical Signpost articles are affected as well: https://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=piper...
All messages I removed on August 2nd were posted in April 2012 (on the 9th and 10th). Since the message numbering just counts up by date, I don't see how this would have affected historical posts from before then.
(see the date view vs. thread view https://lists.wikimedia.org/mailman/private/wmfall/2012-April/date.html#star...)
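The effect of sequential numbering can be illustrated with a toy model. This is a simplified Python sketch, not pipermail's actual code: it assumes each message simply gets the next integer in date order, and the message names are invented.

```python
def assign_numbers(messages, start=0):
    """Number messages sequentially in the order given (toy model)."""
    return {name: start + index for index, name in enumerate(messages)}


archive = ["a", "b", "c", "d", "e"]  # already in date order
before = assign_numbers(archive)

# Deleting "b" and renumbering: "a" is unaffected, later messages shift by -1.
after_delete = assign_numbers([m for m in archive if m != "b"])

# Putting a placeholder in the slot of "b" keeps every later number stable.
with_placeholder = assign_numbers(["a", "placeholder-for-b", "c", "d", "e"])

print(before["c"], after_delete["c"], with_placeholder["c"])  # 2 1 2
```

In this model, messages dated before the deleted one keep their numbers, and only later ones move.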
Currently you will get "Private archive file not found" when trying to look at the wikitech-l archives, because the rebuilding process is running; it is now working on the year 2010. I also made a backup of the .mbox file before editing, of course.
On 12-08-16 02:00 AM, Guillaume Paumier wrote:
I was told yesterday that the mailman/pipermail archives were broken, in that permalinks were no longer linking to the messages they used to link to (therefore not being "permalinks" at all).
Is the current state of the archives related to these events? It appears to be only text files, with no indices, and improper sorting....what's going on!?
Maybe someone is rebuilding the archives? Could we have gotten notice about that?
On Fri, Aug 17, 2012 at 2:55 PM, Mark Holmquist mtraceur@member.fsf.org wrote:
Maybe someone is rebuilding the archives? Could we have gotten notice about that?
Ohh.. yes, absolutely, i sent messages about it, yet they did not arrive on the list until just now since the mailbox is locked during the rebuilding process :/
The rebuilding is now done. I inserted 7 messages you can see as from "mailman root at wikimedia.org", like here:
https://lists.wikimedia.org/mailman/private/wikitech-l/2012-April/059880.htm...
Alright, so I inserted the exact number of messages I deleted on Aug. 2 in the same places/dates; that should bring message numbering and links back to the state they were in before I deleted that thread. As others have mentioned, there have been other inconsistencies in the archives before, so you can most likely still find other issues, but to the best of my knowledge they should be unrelated. In particular, anything older than April 2012 should not have been affected by my recent change.
On 18/08/12 00:42, Daniel Zahn wrote:
Alright, so inserted the exact number of messages i deleted on Aug. 2 in the same places/dates, that should bring message numbering and links back to the same state before i deleted that thread. As others have mentioned before there have been other inconsistencies in it before though, so you can most likely still find other issues but to the best of my knowledge they should be unrelated. Especially anything that is older than April 2012 should not have been affected by my recent change.
Thanks Daniel, I hope the original sender, as well as people sending those mails, handle them more carefully in the future, to avoid this. I don't see anything "obviously bad" there, but if it was bad to the person making the request, that's fine by me.
Regards
Hi,
On Sat, Aug 18, 2012 at 12:42 AM, Daniel Zahn dzahn@wikimedia.org wrote:
Alright, so inserted the exact number of messages i deleted on Aug. 2 in the same places/dates, that should bring message numbering and links back to the same state before i deleted that thread. As others have mentioned before there have been other inconsistencies in it before though, so you can most likely still find other issues but to the best of my knowledge they should be unrelated. Especially anything that is older than April 2012 should not have been affected by my recent change.
Thanks for your efforts, Daniel.
It doesn't appear that they've been entirely successful from what I can see (details below), but I appreciate that you've gone out of your way to try to fix this.
== Examples ==
After April 2012: The link http://lists.wikimedia.org/pipermail/wikitech-l/2012-July/061691.html was posted on meta to reference a message of mine from July 2012. That ID (061691) had to be changed to 061614 after the rebuild from 2 weeks ago (i.e. a translation of -77). After yesterday's rebuild, it's now at ID 061621 (a translation of +7 consistent with the 7 empty messages you've reinserted).
Before April 2012: The link http://lists.wikimedia.org/pipermail/wikitech-l/2004-February/008418.html was posted recently on wiki to reference the 2004 server move from San Diego to Tampa. That link now points to an unrelated message. I've tried a translation of -77 but I don't think that's the original message either (there are several messages from Feb. 2004 about the server move).
So, it appears that the archives have been corrupted inconsistently besides the simple translations of -77 or -7. Someone can probably verify that with other links (e.g. from the Signpost pages).
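The arithmetic for the July 2012 example can be checked with a few lines of Python. The helper names are made up; the three IDs are the ones quoted above.

```python
def offset(old_id, new_id):
    """Signed difference between two zero-padded archive IDs."""
    return int(new_id) - int(old_id)


def shift_id(old_id, delta):
    """Apply an offset and restore the six-digit zero padding."""
    return f"{int(old_id) + delta:06d}"


posted = "061691"                # ID when the link was posted on meta
after_first_rebuild = "061614"   # after the rebuild two weeks ago
after_second_rebuild = "061621"  # after yesterday's rebuild

first_shift = offset(posted, after_first_rebuild)
second_shift = offset(after_first_rebuild, after_second_rebuild)
net_shift = offset(posted, after_second_rebuild)
print(first_shift, second_shift, net_shift)  # -77 7 -70
```

A call such as shift_id("061691", net_shift) returns "061621" for this one link, but as the 2004 example shows, no single offset is known to hold across the whole archive.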
This is also consistent with the fact, pointed out by MZMcBride, that the August 2012 archive page contains several "No subject" messages that clearly don't belong there. They've had their headers removed, and they have IDs like 001363 or 004210 (that would roughly put them around November 2002 and May 2003 respectively).
The conclusion is that the archives have probably been irrevocably corrupted and that we'll have to fix all links manually (we can't use a bot since there is no consistent translation of IDs).
If I remember right, the issue with deleting old mails was not just that the ids shifted by the number of deleted mails, but that new versions of the software rebuilt the archive differently. Hence the changed numbering.
Can we restore the old files from backups?
On 18 August 2012 19:35, Platonides Platonides@gmail.com wrote:
Can we restore the old files from backups?
How far back do we have backups? Is there any automated way to detect corruption in the archives?
- d.
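One possible approach, offered only as a sketch: record which Message-ID each archive number points to before any rebuild, and compare that snapshot with the archive afterwards. How the snapshots are collected is left open here; the function name and the dictionary layout (archive number mapped to Message-ID) are assumptions, and the sample data is invented.

```python
def changed_ids(before, after):
    """Return archive numbers whose Message-ID differs between snapshots."""
    moved = {}
    for number in sorted(set(before) | set(after)):
        old, new = before.get(number), after.get(number)
        if old != new:
            moved[number] = (old, new)
    return moved


# Invented snapshots: message 2 was removed and message 3 slid into its slot.
snapshot_before = {1: "<a@example.org>", 2: "<b@example.org>", 3: "<c@example.org>"}
snapshot_after = {1: "<a@example.org>", 2: "<c@example.org>"}

for number, (old, new) in changed_ids(snapshot_before, snapshot_after).items():
    print(number, old, "->", new)
```

An empty result would mean every permalink still points at the same message; any entry is a link that has silently changed target.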
First of all, thank you Ryan, Faidon, Brandon and others for the related thread. Indeed it feels like a rough environment sometimes, especially on lists, sometimes also on IRC and I really appreciated seeing colleagues step in for me in that way.
On Fri, Aug 17, 2012 at 1:59 PM, MZMcBride z@mzmcbride.com wrote:
I've always found you to be incredibly helpful on IRC, on the mailing lists, and elsewhere and I've always appreciated having you around. I apologize if my initial message suggested otherwise.
That said, also thank you for that, MZ. apology gladly accepted.
I read your reply to Guillom's post as "shit happens." And it most certainly does. But you said that the archives were last rebuilt two weeks ago, which is where the timeline kind of fell apart in my head.
That's understandable. To me it sounded like "he just walked away", when most of the messages had been sent during the night in PST and I had just arrived at the office. Until Guillaume brought it up on the list, I thought of it as a drawback of removing mails, one which I had mentioned before and which makes us do as few removals as possible, but which we would have to live with, as it had happened before.
There was no communication to the list and its members about the archive being rebuilt two weeks ago and the consequences of doing so.
One of the reasons for not sending any announcement about this to the list was that it was about removing private data, so I did not want to go "Look, here is this private data I am now going to delete". Of course I could still have pointed out that the archives were being rebuilt without giving the details.
Ok, back to the technical issue:
On Sat, Aug 18, 2012 at 1:19 AM, Guillaume Paumier gpaumier@wikimedia.org wrote:
After April 2012: The link http://lists.wikimedia.org/pipermail/wikitech-l/2012-July/061691.html was posted on meta to reference a message of mine from July 2012. That ID (061691) had to be changed to 061614 after the rebuild from 2 weeks ago (i.e. a translation of -77). After yesterday's rebuild, it's now at ID 061621 (a translation of +7 consistent with the 7 empty messages you've reinserted).
:( This is really unfortunate, but sorry, i don't have an explanation for the difference of -77 unless there has been another rebuild that i am not aware of or it actually was broken before my latest change...or it's a mailman bug..:/
So, it appears that the archives have been corrupted inconsistently besides the simple translations of -77 or -7. Someone can probably verify that with other links (e.g. from the Signpost pages).
Is it really random, or is it at least a consistent -77 for everything after April 2012? But as Ryan pointed out, this whole issue has happened several times in the past, so I expect you could always find broken links somewhere, depending on when they were created.
the August 2012 archive page contains several "No subject" messages
I don't know where these come from. Are they really new since this incident? I would really prefer to not delete anything at this point and break links once again.
On Sat, Aug 18, 2012 at 11:35 AM, Platonides Platonides@gmail.com wrote:
If I remember right, the issue of deleting old mails was not just that the ids moved the number of deleted mails, but that when rebuilding the archive, new versions rebuilt it differently. Thus the changed numeration.
That would indeed explain inconsistencies in old links before April. I think though that we have rebuilt archives more than once with this current mailman version. Still would explain corruption from the past though.
Can we restore the old files from backups?
Again, really unfortunate, but we can't at this point. Backups are going back one week.
I'm afraid the best i can do now is to help fixing external links. If somebody wants to point me to WP pages with broken links to mailman archives, i would gladly help to fix them in an edit sprint.
Daniel Zahn dzahn@wikimedia.org Operations Engineer
I don't know where these come from. Are they really new since this incident? I would really prefer to not delete anything at this point and break links once again.
Those emails are definitely new. I didn't see them in the archive earlier this month when I was going through it.
Backups are going back one week.
Usually I'm not one to complain about grammatically improper sentences, as I write them regularly, but you lost me on this sentence. Do our backups only go back a week? Or are backups only taken once a week?
Thank you, Derric Atzrott
On Mon, Aug 20, 2012 at 11:29 AM, Derric Atzrott datzrott@alizeepathology.com wrote:
Do our backups only go back a week? Or are backups only taken once a week?
They are taken daily and go back one week. To be exact, 6 "Daily" slots exist in the Amanda backup.
Is this issue fixed?
I found a link on https://bugzilla.wikimedia.org/show_bug.cgi?id=35785 which points to http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/059838.html but I don't see any relation between the bug and the thread on wikitech.
The bug was opened on 2012-04-07.
Best regards, Helder