On 24 January 2011 13:55, Ashar Voultoiz hashar+wmf@free.fr wrote:
Hi,
We got the email validation stuff sorted out properly tonight. We even have JavaScript tests (thanks Krinkle)!
Revisions got reviewed by Brion and bugs 959 & 22449 are now fixed.
I opened bug https://bugzilla.wikimedia.org/26910 as a merge request for Roan.
Thanks everyone!
-- Ashar Voultoiz
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Mon, Jan 24, 2011 at 2:08 PM, Conrad Irwin conrad.irwin@gmail.com wrote:
Out of interest, do you know what percentage of emails in the database don't validate under the new scheme?
Conrad
That's actually a wise thing to check -- most failures will probably be legitimately bogus entries, but if we can find any that don't validate but *do* work (e.g. they've been confirmed as functional) that's info we need to report upstream as well -- the new code is using the specs for HTML 5's client-side form validation, which is starting to go into the latest generation of browsers.
In theory the validation rules should be pretty liberal, and you should need to do something very esoteric to not pass. (The old validation regexes from ~2004-2005 got kicked out for failing to deal with things like '+' which turned out to be more common than we thought.)
Folks actually already pushed a fix upstream to the whatwg spec page to allow single-part domains like 'localhost', needed for local-network testing and perhaps some weird intranet setups.
-- brion
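For readers who want something concrete, here is a minimal sketch of a check in the spirit of the HTML 5 rules described above. The pattern is an approximation written for illustration (local part made of atext characters and dots, domain made of one or more dot-separated labels). It is not copied from User::isValidEmailAddr or mediawiki.util.js, and the name looksLikeValidEmail is hypothetical.

```javascript
// Approximation of the HTML 5 "valid e-mail address" grammar (an assumption,
// not MediaWiki's actual pattern):
//   local part: one or more atext characters or dots
//   domain:     one or more dot-separated labels of letters, digits, hyphens
const HTML5_STYLE_EMAIL =
  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/;

// Hypothetical helper name; returns true when the address fits the grammar.
function looksLikeValidEmail(addr) {
  return HTML5_STYLE_EMAIL.test(addr);
}

// '+' in the local part and single-part domains are accepted.
console.log(looksLikeValidEmail('user+tag@example.org'));
console.log(looksLikeValidEmail('foo@localhost'));
// A trailing dot on the domain or a missing '@' is rejected.
console.log(looksLikeValidEmail('user@example.org.'));
console.log(looksLikeValidEmail('no-at-sign'));
```

Being this liberal is deliberate: the check only rejects input that is very unlikely to be deliverable, and leaves real deliverability to the confirmation mail.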
It would seem that bug https://bugzilla.wikimedia.org/show_bug.cgi?id=23710 falls under that category; note that it is still marked as new. Can it be tied to this process?
Regards, Andrew
On Mon, Jan 24, 2011 at 3:50 PM, Billinghurst billinghurst@gmail.com wrote:
It would seem that bug https://bugzilla.wikimedia.org/show_bug.cgi?id=23710 falls under that category; note that it is still marked as new. Can it be tied to this process?
That's an issue about clickable links in the body of outgoing mails generated by the system, and is not related to the format or validation of email addresses.
It should be addressed (either by ensuring that links inserted into email are escaped clearly, or that they're arranged nicely in brackets that email clients commonly understand as delimiters, or by supplementing the plaintext emails with HTML emails that can mark their links explicitly) but is an entirely separate issue.
-- brion
On Mon, Jan 24, 2011 at 4:02 PM, Platonides Platonides@gmail.com wrote:
The original spec had feedback based precisely on enwiki numbers. http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-August/022220.html
So about 100? Note that there are invalid addresses marked as confirmed in Wikipedia.
Ok so from the breakdown at http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-August/022237.html with 202 email address records that were marked as confirmed, but failed the proposed validation check at the time and couldn't be corrected by stripping whitespace:
The breakdown of the 202 is as follows.
Reordered into:
Now allowed by the current revision of the HTML 5 spec as implemented in User::isValidEmailAddr:
- Single trailing dot in local part: 40 (prohibited by RFC but plausibly deliverable)
- Multiple consecutive dots: 20 (prohibited by RFC but plausibly deliverable)
Easily correctable by the user removing the extra bits upon being prompted, as doing so would not change the actual delivery:
- Single trailing dot in domain part: 100 (prohibited by RFC but plausibly deliverable)
- Valid address in angle brackets (with other junk around it): 21 (permitted by RFC, kind of, and plausibly deliverable)
- Comment: 3 (permitted by RFC and plausibly deliverable)
v---- LINE OF DOOM ---v
Clearly wrong in typical context, should indeed be rejected (or changed to @localhost for legit cases):
- No @: 9 (unlikely to be deliverable)
Not quite sure what's going on but most look like stray chars that would be ignored or else invalid and possibly bogusly marked as confirmed:
- Miscellaneous: 9 (one containing [NO]@[SPAM], two with trailing >, one in "quotes", one with single leading dot in local part, two with single leading comma in local part, one with leading ": ", one with leading "")
So from the August 2009 survey on English Wikipedia, that leaves 18 email addresses out of over 3 million listed as confirmed, of which a few *might* be deliverable addresses that could not be fixed by the user tweaking them during input (ie, they actually rely on those extra chars being there in order to be delivered to the right person).
To me it sounds like we're pretty good with this; it wouldn't hurt to make sure that existing addresses that are stored funny (e.g. with extra whitespace or trailing dots on the domain name) continue to work as they have previously.
Also wouldn't hurt to do a current survey, and to include some other language sites.
Of interest -- gmail's validation rules were also posted in that thread: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-August/022268.html
-- brion
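If someone does run a fresh survey, a rough bucketing along the lines of the breakdown above might look like the sketch below. The function names, category labels and regexes are all hypothetical and only cover the easier buckets; they are not the queries used for the 2009 numbers.

```javascript
// Rough, illustrative classifier for addresses that fail validation.
// Category names are made up here and only loosely mirror the breakdown.
function classifyFailure(addr) {
  if (!addr.includes('@')) {
    return 'no-at';
  }
  // Something like: Joe <joe@example.org>
  if (/<[^<>\s]+@[^<>\s]+>/.test(addr)) {
    return 'angle-brackets';
  }
  // Exactly one trailing dot after the domain, e.g. joe@example.org.
  if (/@[^@]*[^.@]\.$/.test(addr)) {
    return 'trailing-dot-domain';
  }
  return 'miscellaneous';
}

// Count how many addresses fall into each bucket.
function tallyFailures(addrs) {
  const counts = {};
  for (const addr of addrs) {
    const category = classifyFailure(addr);
    counts[category] = (counts[category] || 0) + 1;
  }
  return counts;
}

console.log(tallyFailures(['localuser', 'Joe <joe@example.org>', 'joe@example.org.']));
```

The first three buckets are the ones a user could fix when prompted, so a large 'miscellaneous' count would be the thing worth looking at by hand.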
Brion Vibber (2011-01-25 02:51):
On Mon, Jan 24, 2011 at 4:02 PM, Platonides Platonides@gmail.com wrote:
The original spec had feedback based precisely on enwiki numbers. http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-August/022220.html
So about 100? Note that there are invalid addresses marked as confirmed in wikipedia.
Ok so from the breakdown at http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-August/022237.html with 202 email address records that were marked as confirmed, but failed the proposed validation check at the time and couldn't be corrected by stripping whitespace: [...]
Could you check for validated addresses containing commas in the user name part? The RegExp from mediawiki.util.js did/does allow them.
Regards, Nux.
On 25/01/11 23:37, Maciej Jaros wrote: <snip>
Could you check for validated addresses containing commas in the user name part? The RegExp from mediawiki.util.js did/does allow them.
Regards, Nux.
Nux opened bug 26948 for the comma issue (assigned myself).
https://bugzilla.wikimedia.org/26948
On Mon, Jan 24, 2011 at 8:51 PM, Brion Vibber brion@pobox.com wrote:
So from the August 2009 survey on English Wikipedia, that leaves 18 email addresses out of over 3 million listed as confirmed, of which a few *might* be deliverable addresses that could not be fixed by the user tweaking them during input (ie, they actually rely on those extra chars being there in order to be delivered to the right person).
But note that I counted only the worsening of false negatives, not the improvement in false positives -- I only looked at confirmed addresses. It seems likely that a vastly larger number of undeliverable addresses would be rejected at an early stage with the more restrictive check, allowing users to correct them before submitting so that the e-mail doesn't get lost in the ether. So it seems like a clear improvement overall.
However, the code should probably at least strip whitespace (including internal whitespace, not just trailing/leading) onkeypress/oninput/onsubmit or such. This came up in something like 0.1% of the sample. Non-techy users would probably get pretty confused by a trailing space messing the whole thing up.
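A minimal sketch of that stripping, assuming it runs before the validity check. The names stripWhitespace and attachWhitespaceStripper are hypothetical, and wiring it to the input event is just one of the options mentioned above.

```javascript
// Remove all whitespace, including internal whitespace, from a typed or
// pasted e-mail address before it is validated.
function stripWhitespace(value) {
  return value.replace(/\s+/g, '');
}

// Hypothetical wiring for a browser: clean the field as the user types.
// Not called here, so the snippet also runs outside a browser.
function attachWhitespaceStripper(input) {
  input.addEventListener('input', function () {
    const cleaned = stripWhitespace(input.value);
    if (cleaned !== input.value) {
      input.value = cleaned;
    }
  });
}

console.log(stripWhitespace(' foo @example.org \n'));
```

Rewriting the value on every input event can move the caret in some browsers, so doing it once on submit may be the less surprising choice.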
I suggest that for Firefox 4, we shortcut the JavaScript logic and just use type=email. It's pretty nicely designed. Here's a test case (this is a URL, just paste all the lines in Firefox 4's URL bar):
data:text/html,<form><input type=email name=email placeholder=E-mail> <br><input placeholder="Some other input"><br><input type=submit></form>
Firefox 4b9 seems not to implement the latest spec update, so foo@localhost doesn't work, but that should be an easy fix. If someone is willing to poke at their source code, I imagine it'd be possible to get the updated check in the final release. Unfortunately, WebKit's form constraint validation implementation is broken all over the place, and Opera's is pretty ugly (like only checking validity onsubmit), so we should probably blacklist them if we use any form validation features.
On Wed, Jan 26, 2011 at 3:09 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Firefox 4b9 seems not to implement the latest spec update, so foo@localhost doesn't work, but that should be an easy fix. If someone is willing to poke at their source code, I imagine it'd be possible to get the updated check in the final release.
Was already done, actually:
* Aryeh Gregor Simetrical+wikilist@gmail.com [Wed, 26 Jan 2011 15:09:17 -0500]:
However, the code should probably at least strip whitespace (including internal whitespace, not just trailing/leading) onkeypress/oninput/onsubmit or such. This came up in something like 0.1% of the sample. Non-techy users would probably get pretty confused by a trailing space messing the whole thing up.
Surely it should. In a very similar manner, I had trouble with a local MediaWiki installation (old 1.14; I haven't checked newer ones): when I created user accounts and sent the credentials via email, people were unable to log in, because when you select a line of text with the mouse, Thunderbird sometimes copies a line feed character into the clipboard, which was then pasted into the password field, so the password didn't match. Users were frustrated. I explained to them that a line feed is placed into the clipboard, which is visible when you paste it into a text editor. I am unsure which browsers they were using; maybe some browsers strip 13 / 10 from text inputs, maybe they don't. Dmitriy
On Thu, Jan 27, 2011 at 1:58 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
Surely it should. In a very similar manner, I had trouble with a local MediaWiki installation (old 1.14; I haven't checked newer ones): when I created user accounts and sent the credentials via email, people were unable to log in, because when you select a line of text with the mouse, Thunderbird sometimes copies a line feed character into the clipboard, which was then pasted into the password field, so the password didn't match. Users were frustrated. I explained to them that a line feed is placed into the clipboard, which is visible when you paste it into a text editor. I am unsure which browsers they were using; maybe some browsers strip 13 / 10 from text inputs, maybe they don't.
HTML5 specifies that they should, for passwords:
"User agents must not allow users to insert U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR) characters into the value." http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-ty...
The value sanitization algorithm also makes sure this holds for default values and script-inserted values.
* Aryeh Gregor Simetrical+wikilist@gmail.com [Thu, 27 Jan 2011 14:27:21 -0500]:
HTML5 specifies that they should, for passwords:
"User agents must not allow users to insert U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR) characters into the value."
http://www.whatwg.org/specs/web-apps/current-work/multipage/states-of-the-ty...
The value sanitization algorithm also makes sure this holds for default values and script-inserted values.
Oops. My mistake: it seems that Thunderbird appends an extra space character (32) to the end of the selection in the clipboard instead (when the password is on a separate line of text and one selects the complete line with the mouse), not CR / LF. However, as the password field's value is hidden, users cannot tell why they cannot log in when copying and pasting the password from Thunderbird. It would be more user-friendly if trim() were used. Dmitriy
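To illustrate the two behaviours discussed here, a small sketch: stripNewlines mirrors what the quoted spec text requires of browsers (no LF or CR in the value), and trimPastedValue adds the trim() suggested above. Both names are hypothetical, and whether trimming is acceptable for passwords is an assumption, since it would break a password that really starts or ends with a space.

```javascript
// Drop LINE FEED (U+000A) and CARRIAGE RETURN (U+000D) characters, as the
// HTML 5 text quoted above requires for single-line inputs.
function stripNewlines(value) {
  return value.replace(/[\r\n]/g, '');
}

// Additionally remove leading and trailing whitespace, e.g. the stray
// space a mail client may append when a whole line is selected.
function trimPastedValue(value) {
  return stripNewlines(value).trim();
}

console.log(JSON.stringify(trimPastedValue('secret ')));
```

If applied, it would have to happen both when the password is set and when it is checked, otherwise existing passwords with surrounding spaces would stop matching.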