Please take a quick look at the images for the languages you know: https://www.dropbox.com/sh/i2af7xvn4y593gc/-RRtFyoJji/captchas
If you spot something we're not aware of, please comment: https://bugzilla.wikimedia.org/show_bug.cgi?id=5309#c32
Nemo
For Finnish, about half of the captchas are about as meaningful as a random sequence of letters, because the generator has chosen low-frequency words and even lower-frequency inflected forms.
I would recommend sticking to dictionary forms, choosing high-frequency words from a corpus, or both.
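As a rough illustration of the corpus suggestion above, here is a minimal Python sketch that keeps only the most frequent word forms found in a text. The function name, the `top_n` and `min_length` parameters and the toy sample are assumptions made for this example; none of them come from the actual captcha script.

```python
import re
from collections import Counter

def frequent_words(corpus_text, top_n=1000, min_length=4):
    """Return up to top_n of the most frequent word forms in corpus_text."""
    # \w+ matches runs of Unicode word characters, so letters such as ä and ö are kept.
    tokens = re.findall(r"\w+", corpus_text.lower())
    # Keep purely alphabetic tokens that are long enough to be useful in a captcha.
    counts = Counter(t for t in tokens if len(t) >= min_length and t.isalpha())
    return [word for word, _ in counts.most_common(top_n)]

# Toy sample (hypothetical): two common forms and one rare, heavily inflected form.
sample = "talo talo talo kissa kissa taloissansakaan"
print(frequent_words(sample, top_n=2))
```

Restricting the result to dictionary forms would additionally need a lemma list, which this sketch does not attempt.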
-Niklas
Serbian (sr) in some cases uses a combination of Cyrillic and Latin scripts, which is a bit awkward. CAPTCHA should be in only one script, preferably Latin, as some users of projects in Serbian don't have Cyrillic readily set on their computers.
Cheers, Filip
Mediawiki-i18n mailing list Mediawiki-i18n@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n
The same is true for Croatian - they should only use the Latin script.
Best regards, Bence
Yes, Wiktionary is not perfect. People unhappy with it are encouraged to edit. :) Alternatively, if someone produces suitable word lists in 150+ languages from higher quality sources, I'll be happy to use them.
Filip Maljković, 30/03/2014 13:09:
Serbian (sr) in some cases uses a combination of Cyrillic and Latin scripts, which is a bit awkward.
Indeed, the situation of Serbian, Croatian and Serbo-Croatian on Wiktionary is super-confusing. If it's too hard to get consistent output we'll just skip them.
CAPTCHA should be in only one script
True. This is easy to do in PHP with the ICU libraries, but I've yet to find a Python interface (as with https://github.com/mitsuhiko/babel/issues/89). Anyway, we'll add a sanitisation of the word lists one way or another.
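For illustration only, a single-script check can be approximated in plain Python without ICU. The sketch below assumes that the first word of a character's Unicode name ("LATIN", "CYRILLIC", ...) is a good enough stand-in for its script, which is a heuristic and not the real Unicode Script property; the function names are made up for this example.

```python
import unicodedata

def scripts_used(word):
    """Return the set of script-like name prefixes of the letters in word."""
    scripts = set()
    for ch in word:
        # Only letters are inspected; digits, punctuation and combining marks are skipped.
        if ch.isalpha():
            # Heuristic: "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_single_script(word):
    """True if all letters of word appear to belong to one script."""
    return len(scripts_used(word)) <= 1

print(is_single_script("Beograd"))                        # Latin only
print(is_single_script("Beo" + "\u0433\u0440\u0430\u0434"))  # Latin + Cyrillic mix
```

A word list could then be sanitised by dropping every entry for which `is_single_script` returns false.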
Nemo
In this style, many of the Malayalam captchas are too difficult to read; in fact, only some images are readable (image_deb406cc_f8419aa5c2a1d891.png, image_7a8d523a_6b8546bbc5dc3608.png, image_d3e539c0_f856b4c90b2ceeeb.png etc. are some of the difficult ones). The image image_5dbc3fc3_0e1a119f02c122b8.png (ാനിതംബം) uses a vowel sign at the beginning of the captcha, which is not common and almost impossible to type with most transliteration keyboards. The word നിതംബം after the vowel sign ാ means exactly 'buttocks'. BTW, the word ലിംഗം, which means 'sexual organ', is repeated a couple of times in the images.
In the image image_b5d2be0d_7223dc2282b35e15.png, the readable part is problematic (the last letter is not recognizable). In the readable part, the letters appear as ച െവി, where two different vowel signs are applied to the same letter (probably impossible to type). This may be a rendering error, which is very common in many rendering engines. The actual word may be ചെവി. The same kind of problem occurs in image_077ebd23_d890a7083e967d92.png, where two vowel symbols appear together. It appears that the vowel sign ാ was used independently to create these captchas, which should be avoided.
https://en.wikipedia.org/wiki/Malayalam_alphabet
Praveen
As said, if you find https://en.wiktionary.org/?oldid=23646739 or others offensive, please just edit and add {{context|vulgar}} or |obscene or whatever appropriate.
praveenp, 30/03/2014 16:11:
It appears that the vowel sign ാ was used independently to create these captchas, which should be avoided.
Yes, all the problems you mention seem just to be consequences of this. The entry in question is https://en.wiktionary.org/wiki/%E0%B4%BE Is there some generalisable learning here? Exclude letters? (Wiktionary experts should tell us if they're all tagged as such.) Only use "words" of at least two Unicode characters?
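One possible generalisation, sketched here in Python purely as an illustration: reject entries that are shorter than two code points or that begin with a combining mark (Unicode general category M*), since such entries cannot stand alone as a typed word. The function name and the `min_length` default are assumptions for this example, not part of the existing patch.

```python
import unicodedata

def is_standalone_word(word, min_length=2):
    """True if word is long enough and does not start with a combining mark."""
    # len() counts Unicode code points in Python 3.
    if len(word) < min_length:
        return False
    # Categories Mn, Mc and Me are combining marks, e.g. Malayalam vowel signs.
    return not unicodedata.category(word[0]).startswith("M")

print(is_standalone_word("\u0d3e"))                    # lone vowel sign
print(is_standalone_word("\u0d1a\u0d46\u0d35\u0d3f"))  # consonant-initial word
```

A length threshold in code points would also drop legitimate one-character words in some languages, so it would probably need to be configurable per language.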
Nemo
Ukrainian seems to be OK at first glance. But where are these captchas to be used? Would they be tied to the user's interface language or to the content language of the wiki? I'm just wondering whether this would cause problems on wikis in languages that e.g. your mobile phone doesn't support. --Base
On Sunday 30 March 2014 08:40 PM, Federico Leva (Nemo) wrote:
As said, if you find https://en.wiktionary.org/?oldid=23646739 or others offensive, please just edit and add {{context|vulgar}} or |obscene or whatever appropriate.
I'll try.
Yes, all the problems you mention seem just to be consequences of this. The entry in question is https://en.wiktionary.org/wiki/%E0%B4%BE
Could you also check for rendering issues? In image_077ebd23_d890a7083e967d92.png the vowel sign appears detached after the letter, as മുംബ ൈ; the correct form is മുംബൈ (without the space). The images image_00896685_3f5db13f53a2f352.png, image_35628971_fbfc5b67d488e883.png, image_b5d2be0d_7223dc2282b35e15.png etc. share a similar problem.
Is there some generalisable learning here? Exclude letters? (Wiktionary experts should tell us if they're all tagged as such.) Only use "words" of at least two Unicode characters?
Vowel signs should not start a captcha (or any of the words in a captcha), and no two vowel signs should appear side by side.
Vowel signs for Malayalam: ാ, ി, ീ, ു, ൂ, ൃ, െ, േ, ൈ, ൊ, ോ, ൗ, ൌ. Other signs (the same rule should also be applied to these): ്, ം, ഃ
Vowel letters should not be in the middle of a word (or captcha). Vowel letters: അ, ആ, ഇ, ഈ, ഉ, ഊ, ഋ, ഌ, എ, ഏ, ഐ, ഒ, ഔ
(These rules possibly apply to other Indic languages too, because their vowel letters and vowel signs behave very much like those of Malayalam.)
If possible, do not include Malayalam chillu characters [1] in captchas (at least for now), because two encodings have been possible for them since Unicode 5.1.0. Normalization is enabled only on the ml wikis, and the bug to enable normalization on all Wikimedia wikis is still pending [2].
If possible, limit the Malayalam block to U+0D02 to U+0D57, because other characters (except chillu characters) are not popular and probably not even mapped on keyboards. Within that range, U+0D3A, U+0D3D and U+0D4E should also be avoided, as they face similar uncertainty.
[1]: http://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters [2]: https://bugzilla.wikimedia.org/show_bug.cgi?id=45476
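The rules above could be sketched roughly as follows in Python. This is only one reading of them, with these assumptions: the "other signs" are only barred from starting a word (a vowel sign followed by ം is accepted, as in മുംബൈ); "middle of a word" is read as "anywhere except the first position"; the vowel-letter range U+0D05..U+0D14 is used as a shortcut and therefore also covers U+0D13, which is not in the list above; and the function name is hypothetical.

```python
import unicodedata

# Dependent vowel signs from the list above.
VOWEL_SIGNS = set("\u0d3e\u0d3f\u0d40\u0d41\u0d42\u0d43"
                  "\u0d46\u0d47\u0d48\u0d4a\u0d4b\u0d4c\u0d57")
# Virama, anusvara and visarga.
OTHER_SIGNS = set("\u0d4d\u0d02\u0d03")
# Independent vowel letters (assumption: whole range U+0D05..U+0D14).
VOWEL_LETTERS = {chr(cp) for cp in range(0x0D05, 0x0D15)}
# Code points inside the allowed range that should still be avoided.
AVOIDED = {"\u0d3a", "\u0d3d", "\u0d4e"}

def is_typeable_malayalam(word):
    """Apply the suggested rules to a single Malayalam word."""
    # NFC joins two-part vowel signs, so they are not mistaken for two adjacent signs.
    word = unicodedata.normalize("NFC", word)
    if not word:
        return False
    # Range limit U+0D02..U+0D57; this also excludes both chillu encodings
    # (atomic U+0D7A..U+0D7F and the older sequences that use U+200D).
    if any(not 0x0D02 <= ord(ch) <= 0x0D57 or ch in AVOIDED for ch in word):
        return False
    # No sign may start the word.
    if word[0] in VOWEL_SIGNS or word[0] in OTHER_SIGNS:
        return False
    for prev, ch in zip(word, word[1:]):
        # No two vowel signs side by side.
        if prev in VOWEL_SIGNS and ch in VOWEL_SIGNS:
            return False
        # Independent vowel letters only at the start of the word.
        if ch in VOWEL_LETTERS:
            return False
    return True
```

With these rules a consonant-initial word such as ചെവി passes, while an entry that begins with the vowel sign ാ is rejected.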
"Filip Maljković" dungodung@gmail.com writes:
Serbian (sr) in some cases uses a combination of Cyrillic and Latin scripts, which is a bit awkward. CAPTCHA should be in only one script, preferably Latin, as some users of projects in Serbian don't have Cyrillic readily set on their computers.
Knowing some Serbian minds :-), better make captchas with two lines of identical text, one Latin and one Cyrillic, and accept either input.
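Accepting either script could be sketched like this in Python: both the expected text and the typed answer are folded to Serbian Latin before comparison. The mapping is the standard one-to-one Serbian Cyrillic to Latin correspondence; the function names are assumptions for this example, and nothing like this exists in the current patch.

```python
# Serbian Cyrillic -> Latin, lower case only (input is lower-cased first).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def to_latin(text):
    """Fold text to lower-case Serbian Latin; non-Cyrillic characters pass through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text.lower())

def answers_match(expected, typed):
    """True if the typed answer equals the expected text in either script."""
    return to_latin(expected) == to_latin(typed)

print(answers_match("Београд", "beograd"))
print(to_latin("љубав"))
```

Folding towards Latin is chosen because the reverse direction is ambiguous: the Latin digraphs lj, nj and dž cannot always be told apart from two separate letters.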
Purodha
There is no Kannada (India) in the list
Regards, Pavanaja
-----Original Message----- From: mediawiki-i18n-bounces@lists.wikimedia.org [mailto:mediawiki-i18n-bounces@lists.wikimedia.org] On Behalf Of Federico Leva (Nemo) Sent: 30 March 2014 15:11 To: MediaWiki internationalisation Subject: [Mediawiki-i18n] Please view and comment CAPTCHA images in 154 languages
Re: GV captchas
I suppose it depends what you're going for. At present it looks reasonable, words combined together arbitrarily to type out. The one thing I would suggest is maybe excluding words with a cedilla (ç) as not everyone knows how to type this, and they may confuse similar words with c or c-cedilla, but it's not crucial.
If you actually want the captchas to make any sense in terms of word combination and construction, that would be a whole different issue. There's inflection, rules on what happens when words are run together (spelling changes for one), and so on.
You don't need to worry much about swearwords because it's deeply unlikely that any online Manx source includes them; people didn't write that stuff down when dictionaries were being compiled. There may well be some awkward terms
*Specific problems:*
Quite a few of the l look like i in this font, which seems problematic.
There are also a number of captchas that resemble actual words, but aren't. I don't know whether this is the intention or not. I also don't know whether you care about using proper names or not.
I don't recognise the word "leight" from "leightfoalsey" and can't find it in the dictionary. (https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...) Should this be "leigh"?
Neuscanshoily ??? https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im... Looks like "neuscanshoil" with a random -y added, a hangover from English behaviour?
"Broaçhaaue" doesn't mean anything. "Broaçh" does but "aaue" doesn't (though Aaue is a proper name) so this is a weird combination. https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...
Perick is also a proper name https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...
This one is particularly hard to make out https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...
The form "vaayl" is a rare grammar-induced form of an unusual word "faayl" (turf-cutting spade). It's easy enough to make out, but if you're aiming for this to be words people might know then I'd change it. I might change this to "faayl" anyway. https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...
Donal is also a proper noun https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...
Hard to read, could be "hiu shee" or "niu shee" https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...
This one means "arctic castration" (spoiy = castration). Not obscene, but maybe not for everyone? https://www.dropbox.com/sh/i2af7xvn4y593gc/sRpxnsZPXk/captchas/gv#lh:null-im...
Cheers, Shimmin
Today I made a couple of patches that should address most of the problems reported, as well as handle RTL languages and a multilingual blacklist. I'm mostly using some Unicode magic which is quite well hidden in some obscure libraries; we'll see if it works. :)
In case it's not clear, for now I'm focusing on the *MediaWiki* side of the matter; the Wikimedia side, i.e. where to use what and how, is something we'll worry about when we actually have this option (or others) available in the codebase.
A couple questions below.
P. Blissenbach, 31/03/2014 17:13:
captchas having two lines of identical text [...] and accept either input.
This would need to be filed as separate enhancement request.
Shimmin, 31/03/2014 20:02:
If you actually want the captchas to make any sense in terms of word combination and construction, that would be a whole different issue. There's inflection, rules on what happens when words are run together (spelling changes for one), and so on.
I suppose you're only talking of the morphological side here, right? The current patch contains a couple lines to handle hyphenation for Finnish, because it was originally provided by Nikerabbit, but we're definitely not going to build a universal grammar of univerbation in a MediaWiki script. Unless someone comes up with a general solution I think we'll drop that part.
If this turns out to be confusing, I'd rather just show the two (or N) words as separate words, what do you think? This can be done in a separate patch; once we introduce some other security improvements, I think the challenge of identifying where one word ends and the next starts may be redundant.
Quite a few of the l look like i in this font, which seems problematic.
This is indeed a problem with sans serif fonts but the broad majority thinks they are better. We can try to pick clearer fonts but most help will come from words being familiar to humans. I may upload more tests with this font, though: https://commons.wikimedia.org/wiki/File:AndBasR.pdf
Should this be "leigh"?
Yes. If incorrect, please edit: https://en.wiktionary.org/?oldid=23059687
Looks like "neuscanshoil" with a random -y added, a hangover from English behaviour?
Same problem as with Malayalam and others; the latest version will avoid combining single letters with other words.
[...] (though Aaue is a proper name) [...]
Perick is also a proper name [...]
Do others think proper names are a problem? If yes they might be easy enough to remove, usually they're tagged as such on Wiktionary. Otherwise, this adds some cheap variety in our dictionaries.
The form "vaayl" is a rare grammar-induced form of an unusual word
In this case it's again a proper noun, no idea how correct or how current: https://en.wiktionary.org/?oldid=21902154
Hard to read, could be "hiu shee" or "niu shee"
It was "hiu": no "niu" in our dictionary. If the latter is a valid word, you should add it to Wiktionary and then we can try to figure out something to exclude confusable words.
Once again, the proposed approach is to rely on a mix of Unicode magic and self-healing (wiki) dictionary. Neither is enough alone.
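One conceivable way to exclude confusable words, sketched in Python: fold characters that look alike in the captcha font to a common "skeleton" and drop every word whose skeleton collides with that of another word. The folding table below is purely hypothetical (it would have to be tuned to the font in use), and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical look-alike table: each character maps to the glyph it is mistaken for.
LOOKALIKES = str.maketrans({"l": "i", "1": "i", "n": "h", "0": "o"})

def skeleton(word):
    """Fold a word so that visually similar spellings become identical."""
    return word.lower().translate(LOOKALIKES)

def drop_confusable(words):
    """Keep only words whose skeleton is shared with no other word in the list."""
    groups = defaultdict(set)
    for word in words:
        groups[skeleton(word)].add(word)
    return [word for word in words if len(groups[skeleton(word)]) == 1]

# "niu" is used here only as a hypothetical second entry.
print(drop_confusable(["hiu", "niu", "shee"]))
```

This only helps once both spellings are present in the dictionary, which is why it depends on the word list being complete.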
This one means "arctic castration" (spoiy = castration). Not obscene, but maybe not for everyone?
Well, it could fall under "obscene" for some definition of the word. I'm now also blacklisting "pejorative" and "offensive" words; those who care can try and see if their label edits survive on the wiki. https://en.wiktionary.org/wiki/Wiktionary:Context_labels
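The label-based blacklist amounts to something like the following Python sketch. It assumes the dictionary has already been parsed into (word, labels) pairs, which is a made-up data shape for this example; only the four label names come from this thread.

```python
# Context labels mentioned in this thread as reasons to skip a word.
BLACKLISTED_LABELS = {"vulgar", "obscene", "pejorative", "offensive"}

def allowed_words(entries):
    """Return the words whose labels contain no blacklisted label.

    entries: iterable of (word, labels) pairs, labels being an iterable of strings.
    """
    return [
        word
        for word, labels in entries
        if not BLACKLISTED_LABELS & {label.lower() for label in labels}
    ]

# Placeholder entries, not real dictionary data.
entries = [("alpha", []), ("beta", ["Obscene"]), ("gamma", ["archaic"])]
print(allowed_words(entries))
```

Because the labels are read from the wiki, correcting a label on Wiktionary changes the outcome without touching the script.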
Nemo
On 01/04/2014 23:30, Federico Leva (Nemo) wrote:
I suppose you're only talking of the morphological side here, right? The current patch contains a couple lines to handle hyphenation for Finnish, because it was originally provided by Nikerabbit, but we're definitely not going to build a universal grammar of univerbation in a MediaWiki script. Unless someone comes up with a general solution I think we'll drop that part.
If this turns out to be confusing, I'd rather just show the two (or N) words as separate words, what do you think? This can be done in a separate patch; once we introduce some other security improvements, I think the challenge of identifying where one word ends and the next starts may be redundant.
I thought you probably weren't trying for that, just wanted to check! It shouldn't really matter unless there are any desperately unclear images where people need to guess, and I haven't seen any.
Should this be "leigh"?
Yes. If incorrect, please edit: https://en.wiktionary.org/?oldid=23059687
The dictionary entry is correct, with no -t. Not sure where that can be coming from.
The form "vaayl" is a rare grammar-induced form of an unusual word
In this case it's again a proper noun, no idea how correct or how current: https://en.wiktionary.org/?oldid=21902154
Ah, got it. This is also grammar (it's the vocative/genitive). I would tend to recommend only using dictionary forms of words as some inflections are quite obscure, but it's not a huge problem. Also I appreciate getting only dictionary forms may be a challenge.
Hard to read, could be "hiu shee" or "niu shee"
It was "hiu": no "niu" in our dictionary. If the latter is a valid word, you should add it to Wiktionary and then we can try to figure out something to exclude confusable words.
Once again, the proposed approach is to rely on a mix of Unicode magic and self-healing (wiki) dictionary. Neither is enough alone.
There's no niu that I know of, but it'd be a valid word and there are many obscure terms around, so really this is an issue of the image being unclear, I suppose, especially the apparent contrast between the first and second Hs.
Thanks, Shimmin
Hi Nemo,
I don't think CAPTCHA in Gujarati (gu) is a good option, not only because current CAPTCHA images like 9bf7ca13_4229ad03a9b83341.png, 1c678215_eec3d3740f31001e.png, 10bb46b0_7bc39a5f06ec4b4c.png and many more have incorrect signs, but mainly because hardly any Gujarati-speaking person would be able to type Gujarati using an independent keyboard layout.
The situation of Indian scripts, and especially of Gujarati, for which I can speak with certainty, is that we don't have easy keyboards available as other scripts do. Editors/users of this language use an English keyboard even for typing Gujarati.
I would suggest having the CAPTCHA in English only for all gu wikis; if needed, we can provide an option for users to opt in to a Gujarati CAPTCHA. If we do decide to go down that line, let me know; I can assist with words and point out what is wrong (typographically) in the words currently used for CAPTCHA.
Thanks, Dhaval (User Dsvyas) Admin: gu.wp & gu.ws