[courtesy copy to foundation-l, though I suggest that discussion, if any, be centralised on wikitech-l]
Hi all, the search index for the mailing list archives was last rebuilt in January. After asking about this here and in other places, I learnt (and obviously had to accept) that rebuilding the search index is a resource-intensive process that has resulted in crashes.
To put it bluntly, I dare suggest from a non-technical POV that the "htdig" (that's the name, isn't it?) experiment has failed. If we can only update our search index every 6 months or so, it is pointless to have it.
Instead, I suggest that http://lists.wikimedia.org/robots.txt be modified so as to allow Google (and other search engines) to crawl /pipermail/ again. I do not really see the privacy issues here: Nabble, Gmane etc. are Google-searchable as well, and I really don't see the point in barring Google from our own archive.
To be honest, I do not even remember why we decided to bar Google from http://lists.wikimedia.org/pipermail. Was it due to privacy concerns? If so, which ones, and why is lists.wikimedia.org as an archive different from Nabble/Gmane?
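For readers who want to see what such a robots.txt change would mean for a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rule sets are assumptions for illustration only; the actual contents of the robots.txt on lists.wikimedia.org are not quoted anywhere in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule sets; the real robots.txt is not shown in this thread.
BLOCKING = """\
User-agent: *
Disallow: /pipermail/
"""

ALLOWING = """\
User-agent: *
Disallow:
"""

ARCHIVE_URL = "http://lists.wikimedia.org/pipermail/wikitech-l/"


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(can_crawl(BLOCKING, ARCHIVE_URL))  # False: /pipermail/ is disallowed
print(can_crawl(ALLOWING, ARCHIVE_URL))  # True: an empty Disallow permits everything
```

An empty `Disallow:` line is the conventional way to say "nothing is off limits", so lifting the restriction would amount to dropping or emptying the rule that covers /pipermail/.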
Thanks, Michael
On 27.04.2008 19:30:19, Michael Bimmler wrote:
Instead, I suggest that http://lists.wikimedia.org/robots.txt be modified as to allow Google (and other search engines) to crawl /pipermail/ again. I do not really see the privacy issues of this, nabble, gmane etc. are google-searchable as well and I really don't see the point in barring Google from our own archive.
Then our mailinglists should already be accessible through Google via 'nabble, gmane etc.', no?
If I am very honest, I do not even remember anymore, why we decided to bar Google from http://lists.wikimedia.org/pipermail. Was it due to privacy concerns? If so, which, and why is lists.wikimedia.org as an archive different from Nabble/Gmane?
I remember some cases where people got in really serious trouble due to their discussions on a mailing list, for example when the employer googled their name and found obscenities. It just caused much work for admins to remove single mails from the archives.
Leon
Hoi, When it is clear to people that obscenities can be found and are found, it may help people not to utter obscenities. Many people argue that Foundation-l is irrelevant, but given that there are few other ways to reach the people who are interested in matters concerning the foundation itself, they are imho wrong. Many people argue that one of the reasons is the excessive volume on the list and the quality of the discourse. I could argue that allowing Google to spider our mail archive might help somewhat. Thanks, GerardM
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
"Michael Bimmler" mbimmler@gmail.com wrote:
[...] Instead, I suggest that http://lists.wikimedia.org/robots.txt be modified as to allow Google (and other search engines) to crawl /pipermail/ again. I do not really see the privacy issues of this, nabble, gmane etc. are google-searchable as well and I really don't see the point in barring Google from our own archive. [...]
On a related topic: Unfortunately, in most (all?) Wikimedia lists at gmane at the moment the email addresses are encrypted, which seems odd as they can be downloaded at pipermail. So I'd suggest removing that restriction as well.
Tim
Michael Bimmler wrote:
To put it bluntly, I dare suggest from a non-technical POV that the "htdig" (that's the name, isn't it?) experiment has failed. If we can only update our search index every 6 months or so, it is pointless to have it.
Yeah, it doesn't work as well as advertised.
Instead, I suggest that http://lists.wikimedia.org/robots.txt be modified as to allow Google (and other search engines) to crawl /pipermail/ again. I do not really see the privacy issues of this, nabble, gmane etc. are google-searchable as well and I really don't see the point in barring Google from our own archive.
For the meantime, I'm going to have to recommend not doing this (see my notes below for why).
As you note, it's already possible to search via third-party archives. It would probably not be difficult to replace the broken htdig search form with a link to a nice offsite archive, though.
If I am very honest, I do not even remember anymore, why we decided to bar Google from http://lists.wikimedia.org/pipermail.
Because:
a) The current mailman/pipermail system makes it a *huge* pain in the butt to remove mails from archives on request
b) I got tired of the volume of requests to remove mails from archives, with the consequent time required in handling them
c) With the wildly popular wikimedia.org domain out of the running, third-party list archives aren't as visible in general search engine results
d) Therefore, the volume of requests goes down
e) and I don't feel bad turning down most of the remaining requests.
If and when mailman's archiving system is fixed up to make it quick & easy to take a mail out of archives (eg, *not* involving shutting down all mail processing, rebuilding an entire list's archives since 2001, and discovering that all the links are now broken because mailman's internal behavior has changed in the intervening years and it splits up messages differently), then I'll be happy to pop us back into general search engine indexes.
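To illustrate why a full rebuild can break existing links, here is a simplified sketch. It assumes archive pages are named purely by their position in the list's message store, which is a deliberate simplification of what pipermail does; the function name and the Message-IDs are hypothetical.

```python
def assign_urls(message_ids):
    """Map each Message-ID to a sequence-numbered page, in archive order.

    Simplifying assumption: the page name depends only on position.
    """
    return {mid: f"{n:06d}.html" for n, mid in enumerate(message_ids)}


# Hypothetical archive of four messages.
archive = ["<a@example.org>", "<b@example.org>", "<c@example.org>", "<d@example.org>"]
before = assign_urls(archive)

# Remove one message on request, then rebuild the whole archive from scratch.
rebuilt = [mid for mid in archive if mid != "<b@example.org>"]
after = assign_urls(rebuilt)

# Every message after the removed one now lives at a different URL.
broken = sorted(mid for mid in rebuilt if before[mid] != after[mid])
print(broken)  # ['<c@example.org>', '<d@example.org>']
```

Under this assumption, removing a single early message shifts every later page, which is one way a rebuild of archives going back to 2001 could invalidate a large number of inbound links.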
Was it due to privacy concerns? If so, which, and why is lists.wikimedia.org as an archive different from Nabble/Gmane?
That'd be c) above.
-- brion vibber (brion @ wikimedia.org)
On Mon, Apr 28, 2008 at 1:55 AM, Brion Vibber brion@wikimedia.org wrote:
For the meantime, I'm going to have to recommend not doing this (see my notes below for why).
Okay, point taken. So, for the meantime, I'll wait for some bored developer to a) fix mailman and introduce a nice 'remove this message' feature b) develop a working full text search for pipermail archives of which the index can be rebuilt a tad more often.
Anyway, the point about "replace the broken htdig search form with a link to a nice offsite archive" would probably be a good idea if this can be done as a short-term measure ;-)
Thanks for the explanations! Michael
On 28.04.2008, 14:47 Michael wrote:
Okay, point taken. So, for the meantime, I'll wait for some bored developer to a) fix mailman and introduce a nice 'remove this message' feature b) develop a working full text search for pipermail archives of which the index can be rebuilt a tad more often.
Anyway, the point about "replace the broken htdig search form with a link to a nice offsite archive," would probably be a good idea if this can be done as a short time measure ;-)
Thanks for the explanations! Michael
There definitely must be another open-source mailing list manager, written after the stone age. (Kicks the dead horse: you wouldn't find this mail in the archives, for some reason - mailman ignores me :( ).
2008/4/28 Max Semenik maxsem.wiki@gmail.com:
There definitely must be another open-source mailing list manager, written after the stone age. (Kicks the dead horse: you wouldn't find this mail in the archives, for some reason - mailman ignores me :( ).
Just thank your lucky stars we're not using majordomo. MailMan is *heavenly* after that.
- d.
On Mon, Apr 28, 2008 at 04:36:21PM +0400, Max Semenik wrote:
There definitely must be another open-source mailing list manager, written after the stone age. (Kicks the dead horse: you wouldn't find this mail in the archives, for some reason - mailman ignores me :( ).
Mailman is quite nice, actually. It's *pipermail* that sucks..
I gather it merely doesn't suck *enough* to get someone motivated to replace it.
Cheers, -- jra
On 5/8/08, Jay R. Ashworth jra@baylink.com wrote:
On Mon, Apr 28, 2008 at 04:36:21PM +0400, Max Semenik wrote:
There definitely must be another open-source mailing list manager, written after the stone age. (Kicks the dead horse: you wouldn't find this mail in the archives, for some reason - mailman ignores me :( ).
Mailman is quite nice, actually. It's *pipermail* that sucks..
I gather it merely doesn't suck *enough* to get someone motivated to replace it.
Cheers,
-- jra
Mailman has problems, too. Yes, I can believe that pipermail fails to publish my emails although they are properly saved to the database, but mailman also fails to send me back a copy of my messages despite my having checked "Receive your own posts to the list" in my preferences. (Sending this from the Gmail web interface instead of The Bat! to check whether my problems are related to it.)
Max Semenik wrote:
Mailman has problems, too. Yes, I can believe that pipermail fails to publish my emails although they are properly saved to the database, but mailman also fails to send me back a copy of my messages despite checked "Receive your own posts to the list" in preferences. (Sending this from Gmail web interface instead of The Bat! to check if my problems are related to it).
I'm afraid you're blaming the wrong victim here; I just checked the mail server logs, and they show that the corresponding copy of your message was delivered to Google mail servers at 17:35:21 UTC, roughly two minutes after your message had been received by Mailman.
2008/5/10 Mark Bergsma mark@wikimedia.org:
Max Semenik wrote:
Mailman has problems, too. Yes, I can believe that pipermail fails to publish my emails although they are properly saved to the database, but mailman also fails to send me back a copy of my messages despite checked "Receive your own posts to the list" in preferences. (Sending this from Gmail web interface instead of The Bat! to check if my problems are related to it).
I'm afraid you're blaming the wrong victim here; I just checked the mail server logs, and they show that the corresponding copy of your message was delivered to Google mail servers at 17:35:21 UTC, roughly two minutes after your message had been received by Mailman.
Yes. This is caused by a Google misfeature: Gmail WILL NOT show you copies of messages you sent, even if you want it to, and this cannot be turned off.
http://mail.google.com/support/bin/answer.py?answer=6588
I went there, ticked "not helpful" and when asked why said "it doesn't tell me how to switch this Gmail misfeature off." You may care to do something similar.
- d.
On Thu, May 8, 2008 at 9:38 PM, Jay R. Ashworth jra@baylink.com wrote:
Mailman is quite nice, actually. It's *pipermail* that sucks..
Oh, I'd have a wishlist for mailman, too. Something like an automated and sophisticated log of administrative actions, and user accounts with "list admin" flags, so that the software can actually recognise *who* changed option X, who moderated user Y and who forcibly unsubscribed user Z.
Moreover, a "reason for moderation" field in the subscribers table would be nice, too. It's a bit of a nuisance if you have to remodel all this off-mailman (especially if you're used to MediaWiki-style logging of actions).
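A rough sketch of what such a log could look like as a data structure follows. Nothing here is part of Mailman; the class names, field names and action labels are all hypothetical and only illustrate the "who did what to whom, and why" record described above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AdminAction:
    """One logged administrative action (hypothetical schema)."""
    admin: str    # who performed the action
    action: str   # e.g. "moderate", "unsubscribe", "set_option"
    target: str   # affected subscriber or option
    reason: str = ""  # the "reason for moderation" field
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ActionLog:
    """Append-only log that can be queried per target."""

    def __init__(self) -> None:
        self._entries: List[AdminAction] = []

    def record(self, admin: str, action: str, target: str, reason: str = "") -> AdminAction:
        entry = AdminAction(admin, action, target, reason)
        self._entries.append(entry)
        return entry

    def for_target(self, target: str) -> List[AdminAction]:
        return [e for e in self._entries if e.target == target]


log = ActionLog()
log.record("admin_a", "moderate", "user_y@example.org", reason="repeated off-topic posts")
log.record("admin_b", "unsubscribe", "user_z@example.org", reason="requested by list owner")

for entry in log.for_target("user_y@example.org"):
    print(entry.admin, entry.action, entry.reason)
```

Keeping entries immutable and append-only mirrors the MediaWiki-style action log mentioned above, where past entries are read but not edited.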
Michael
My general view on this is that if somebody sends an e-mail, that is their problem. If they don't want their employer to find it, they should have thought of that before they sent it.
Mark
2008/4/28 Mark Williamson node.ue@gmail.com:
My general view on this is that if somebody sends an e-mail, that is their problem. If they don't want their employer to find it, they should have thought of that before they sent it.
People occasionally slip up horribly. But almost all requests for message removal I've *ever* seen are specious.
As for the search - we have lots of experience with that rather nice Lucene search thing. Can anything usable be done with that on mail.wikimedia.org without crippling the box?
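To make the search question more concrete, here is a toy inverted index in Python. It is not Lucene and makes no claim about how Lucene or htdig work internally; it only sketches the general idea that new messages can be added to an index one at a time, without rebuilding everything. The class name and message IDs are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> set of message IDs containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        """Index one message; earlier messages are left untouched."""
        for term in set(tokenize(text)):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return IDs of messages containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        results = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            results &= self._postings.get(term, set())
        return results


index = TinyIndex()
index.add("msg-1", "Rebuilding the htdig search index crashed the box")
index.add("msg-2", "Please allow Google to crawl pipermail")
print(sorted(index.search("search index")))  # ['msg-1']

# A new mail arrives later: only that message is indexed.
index.add("msg-3", "Lucene search index for the mail archive")
print(sorted(index.search("search index")))  # ['msg-1', 'msg-3']
```

Whether incremental updates like this would keep the load on the mail server acceptable is exactly the open question; the sketch only shows that a search index does not inherently require a full rebuild for each new message.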
- d.
David Gerard wrote:
2008/4/28 Mark Williamson node.ue@gmail.com:
My general view on this is that if somebody sends an e-mail, that is their problem. If they don't want their employer to find it, they should have thought of that before they sent it.
People occasionally slip up horribly. But almost all requests for message removal I've *ever* seen are specious.
Most requests go like this:
"I sent a mail to the list (accidentally instead of offlist / forgetting about my default work email signature) and it includes my (real name / private phone number / address) which I don't want on the internet, can you please remove it from the archives so it's not the top Google hit forever?"
We may scoff and roll our eyes and say "If you don't want it on Google, don't post it on a public list!" but the point is they *didn't* intend to publish it. Refusing to remove them on principle violates the "don't be a dick" rule, from which all other ethical principles can be logically derived.
So, I'm stuck at refusing to remove them most of the time because it's a very disruptive operation, and doing annoying things to the search to reduce the "Google signature" of the mails still floating in the archive.
As for the search - we have lots of experience with that rather nice Lucene search thing. Can anything usable be done with that on mail.wikimedia.org without crippling the box?
Hypothetically, if someone develops it (or can find an existing patch to mailman).
-- brion
Well, I think there's a point where it goes from being a dick to just not bending over backwards to comply with unreasonable requests. I think this is a case of the latter.
Mark
"Mark Williamson" node.ue@gmail.com wrote in message news:849f98ed0804281327h3c1f2e07w612a3c37ce32024c@mail.gmail.com...
Well, I think there's a point where it goes from being a dick to just not bending over backwards to comply with unreasonable requests. I think this is a case of the latter.
Mark
"Hello - you're through to Royal Mail, Elizabeth speaking. How may I help?"
"I've just posted a letter, can I have it back please?"
"No. I'm sorry, that's not possible."
"I included something I shouldn't have done - and it's really important that the recipient doesn't see it!"
"I'm sorry, you should have thought of that before you posted it."
"I didn't realise! You see, the thing is I wrote it on the back of some scrap paper and I've just realised that there was some highly confidential information on the other side. It's vital that it doesn't get delivered."
"Oh well, if it was just a mistake then that's a different story. Of course we'll be happy to drop everything else we're doing and look through the 80 MILLION items of mail that we deliver every single day in order to retrieve your letter. It will take a week or so, but I'm sure it doesn't matter if it inconveniences EVERYONE else."
Hmmm... I think not... :-)
- Mark Clements (HappyDog)
My husband, who works in a lab of about 100 people, still has a very fond memory.
It was early afternoon when one of the final-year PhD students, a friend of my husband, sent a desperate love letter to one of the permanent researchers, who was leaving the lab for 6 months or so. She had no idea he was in love with her. Nor did anyone, actually.
With a click, the email left for the general mailing list. The student immediately realized the error and called the IT admin on the spot.
The IT admin managed to clean the mailboxes of about 1/3 of the list, those who had their computers shut off (no automatic delivery), but about 60 people read the message within the next 5 minutes. Several of them rushed to the IT admin to inform him of the mess. But too late.
I read the message. It was fabulously romantic and quite hot. The lady fortunately had left the lab, but the student spent the next 6 months hiding in corners.
(He was lucky not to earn a sexual harassment lawsuit on top of it.)
Ant
Anthere wrote:
With a click, the email left for the general mailing list. The student immediately realized the error and called the IT admin on the spot. The IT admin managed to clean the mailboxes of about 1/3 of the list...
Wow. That admin was evidently the very antithesis of the [[BOFH]]...
On Mon, Apr 28, 2008 at 2:59 PM, Brion Vibber brion@wikimedia.org wrote:
So, I'm stuck at refusing to remove them most of the time because it's a very disruptive operation, and doing annoying things to the search to reduce the "Google signature" of the mails still floating in the archive.
About how many emails *have* you removed? I've come across at least half a dozen of them, and I haven't even been looking for them. It seems like it can't be all that hard to do.