I have noticed that some people from other Wikipedias change our Swedish å on the Swedish Wikipedia when they make language links. I have thought about why, and I suppose it's because they see something different than we do.
In the normal HTML, our å is written &aring;, ä is written &auml; and ö is written &ouml;. Is this the present configuration in the PHP script, or do our strange letters confuse people so much that they change them because they look like errors from their point of view?
Dan Koehl
ICQ#: 4046787
On Fri, Feb 21, 2003 at 01:05:46PM +0100, Dan Koehl wrote:
I have noticed that some people from other Wikipedias change our Swedish ? on the Swedish Wikipedia when they make language links. I have thought about why, and I suppose it's because they see something different than we do.
In the normal HTML, our ? is written &aring;, ä is written &auml; and ö is written &ouml;. Is this the present configuration in the PHP script, or do our strange letters confuse people so much that they change them because they look like errors from their point of view?
It's not that they do that. This may be automatically done by their browser.
BTW, we should add a proper HTML http-equiv meta Content-Type tag with the charset to all wiki pages, so that a page saved to disk carries charset information too.
On Fri, 21 Feb 2003, Tomasz Wegrzanowski wrote:
On Fri, Feb 21, 2003 at 01:05:46PM +0100, Dan Koehl wrote:
I have noticed that some people from other Wikipedias change our Swedish ? on the Swedish Wikipedia when they make language links. I have thought about why, and I suppose it's because they see something different than we do.
In the normal HTML, our ? is written &aring;, ä is written &auml; and ö is written &ouml;. Is this the present configuration in the PHP script, or do our strange letters confuse people so much that they change them because they look like errors from their point of view?
It's not that they do that. This may be automatically done by their browser.
And by Tomasz's email program too...
Andre
On Fri, 21 Feb 2003, Dan Koehl wrote:
I have noticed that some people from other Wikipedias change our Swedish å on the Swedish Wikipedia when they make language links. I have thought about why, and I suppose it's because they see something different than we do.
In the normal HTML, our å is written &aring;, ä is written &auml; and ö is written &ouml;. Is this the present configuration in the PHP script, or do our strange letters confuse people so much that they change them because they look like errors from their point of view?
In what way are they changed?
The most likely explanation seems to me that it is a problem of the browser. I know from my own browser that when I go to edit a page, some non-standard signs are being changed into question marks by the editor. Perhaps the å is one of those for some people?
Regarding your question about the PHP-script: No, the PHP-script leaves the å as it is.
Possible solutions:
* Notify those users for whom this happens
* Replace å by &aring;
Andre Engels
I'm sorry, the letter å is too common in our language for it to make sense to replace it with the code &aring;. At least for new Wikipedians, it has to be done automatically, like any modern HTML editor would do.
Otherwise we're back to making editing difficult, when it should be easy??
Since ä and ö, which are present in the German languages, are working OK, while å is the unique Swedish letter (not present in any other language?), I still believe there is something with the script and how it translates letters and code. I'm pretty sure that somewhere in the code there are lines which make ä and ö work, while å is lacking. Am I wrong?
- Dan
Andre wrote: Possible solutions:
- Notify those users for who this happens
- Replace å by &aring;
Andre Engels
Wikitech-l mailing list Wikitech-l@wikipedia.org http://www.wikipedia.org/mailman/listinfo/wikitech-l
What's this all about?
$wgDBminWordLen = 3; # Match this to your MySQL fulltext
Fred
On Fri, 21 Feb 2003, Fred Bauder wrote:
What's this all about?
$wgDBminWordLen = 3; # Match this to your MySQL fulltext
MySQL's FULLTEXT indexing is used for the search function; the index by default ignores words shorter than some number of letters (3 or 4?).
Our search function does a primitive boolean search by parsing the search query into words (eg, "dime a dozen" -> "dime", "a", "dozen") and doing separate MATCH queries on each one using the index, then ANDing the results together logically. So an article has to match all words to come up in the results. But, the "a" is too short (and anyway a stopword -- another issue itself) and so ignored by the search; it doesn't match *any* articles.
So, we have to know which words are going to be ignored by MySQL's fulltext search so we can skip them. "dime a dozen" -> "dime" and "dozen".
This would go away using MySQL 4, which has a boolean search mode for its fulltext searching, but since there are a lot of mysql3 installations out there we'd have to keep this around as an option.
See the MySQL docs: http://www.mysql.com/doc/en/Fulltext_Search.html
-- brion vibber (brion @ pobox.com)
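The word-splitting approach Brion describes can be sketched roughly as follows. This is a minimal Python illustration, not the wiki's actual PHP code: `MIN_WORD_LEN`, `split_query`, `build_where` and the column name `article_text` are hypothetical names chosen for the example, and the minimum length of 3 simply mirrors the `$wgDBminWordLen = 3` setting quoted above.

```python
import re

# Hypothetical stand-in for $wgDBminWordLen; must match how MySQL was built.
MIN_WORD_LEN = 3


def split_query(query, min_len=MIN_WORD_LEN):
    """Split a query into words and drop those the fulltext index ignores."""
    words = re.findall(r"\w+", query)
    return [w for w in words if len(w) >= min_len]


def build_where(words, column="article_text"):
    """AND together one MATCH clause per word; values go in as parameters."""
    clause = " AND ".join(f"MATCH({column}) AGAINST (%s)" for _ in words)
    return clause, list(words)


words = split_query("dime a dozen")
clause, params = build_where(words)
print(words)   # ['dime', 'dozen']
print(clause)
```

Because "a" never reaches the WHERE clause, it can no longer force an empty result the way a separate MATCH on an unindexed word would.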
Brion Vibber wrote:
MySQL's FULLTEXT indexing is used for the search function; the index by default ignores words shorter than some number of letters (3 or 4?).
Our search function does a primitive boolean search by parsing the search query into words (eg, "dime a dozen" -> "dime", "a", "dozen") and doing separate MATCH queries on each one using the index, then ANDing the results together logically. So an article has to match all words to come
If you just let MySQL 3.x match the phrase "dime a dozen", it will return any entries that contain either "dime" or "dozen" and give higher ranks to those that contain both or lots of these words. It's more of an "or" than an "and", but it works very well. Is it really worth the hassle to go through the splitting and anding? Do all users really want the strict "and"? When I use Google I don't really want an empty hit list.
On Mon, 24 Feb 2003, Lars Aronsson wrote:
If you just let MySQL 3.x match the phrase "dime a dozen", it will return any entries that contain either "dime" or "dozen" and give higher ranks to those that contain both or lots of these words.
Yup.
It's more of an "or" than an "and", but it works very well. Is it really worth the hassle to go through the splitting and anding? Do all users really want the strict "and"?
All I know is I hear complaints when I change to the straight search method as a performance hack and people find that "John Smith" turns up mostly results like "John Wigglesworth" and "Michael Smith".
-- brion vibber (brion @ pobox.com)
(Brion Vibber vibber@aludra.usc.edu): On Mon, 24 Feb 2003, Lars Aronsson wrote:
If you just let MySQL 3.x match the phrase "dime a dozen", it will return any entries that contain either "dime" or "dozen" and give higher ranks to those that contain both or lots of these words.
Yup.
It's more of an "or" than an "and", but it works very well. Is it really worth the hassle to go through the splitting and anding? Do all users really want the strict "and"?
All I know is I hear complaints when I change to the straight search method as a performance hack and people find that "John Smith" turns up mostly results like "John Wigglesworth" and "Michael Smith".
Also, the min-length thing is independent of that choice anyway. By default, MySQL won't index any word shorter than 4 letters, and we got lots of complaints that you couldn't search for "PNG" or "XP". MySQL has to be recompiled to change that limit, and the local setting just informs the wiki software of how MySQL was compiled so it knows what it can hand to the indexer.
On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
Also, the min-length thing is independent of that choice anyway. By default, MySQL won't index any word shorter than 4 letters, and we got lots of complaints that you couldn't search for "PNG" or "XP". MySQL has to be recompiled to change that limit, and the local setting just informs the wiki software of how MySQL was compiled so it knows what it can hand to the indexer.
Well, yes and no. If we don't parse the query into separate words, it doesn't matter how much we hand to it: a MATCH AGAINST( "Windows XP" ) will still turn up all the results for "Windows", whereas MATCH AGAINST( "Windows" ) AND MATCH AGAINST ("XP") returns nothing at all -- that's why we need to be aware of it and remove the too-short words from the search using our hackish boolean system.
However, it would be nice to spit out a little message to the effect of "The word 'XP' has been ignored in your search because MySQL doesn't like it," whatever the search method.
-- brion vibber (brion @ pobox.com)
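The "little message" suggested above might look something like this. It is only a sketch in Python: the function name `ignored_words_notice` and the wording of the notice are invented for illustration, and the minimum length is passed in because it depends on how the local MySQL was compiled.

```python
def ignored_words_notice(query, min_len):
    """Return a notice naming the words too short for the fulltext index."""
    ignored = [w for w in query.split() if len(w) < min_len]
    if not ignored:
        return ""
    quoted = ", ".join(f"'{w}'" for w in ignored)
    return f"Ignored in your search (shorter than {min_len} letters): {quoted}"


print(ignored_words_notice("Windows XP", 4))
# Ignored in your search (shorter than 4 letters): 'XP'
```

The notice is independent of the search method, so it could be shown whether the query is split into words or handed to MATCH as a whole.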
(Brion Vibber vibber@aludra.usc.edu):
Well, yes and no. If we don't parse the query into separate words, it doesn't matter how much we hand to it: a MATCH AGAINST( "Windows XP" ) will still turn up all the results for "Windows", whereas MATCH AGAINST( "Windows" ) AND MATCH AGAINST ("XP") returns nothing at all -- that's why we need to be aware of it and remove the too-short words from the search using our hackish boolean system.
However, it would be nice to spit out a little message to the effect of "The word 'XP' has been ignored in your search because MySQL doesn't like it," whatever the search method.
Yeah, the issues are intertwined a bit. BTW, I compiled MySQL on the wikipedia server with a minimum word length of 2, so "XP" works fine.
On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
Yeah, the issues are intertwined a bit. BTW, I compiled MySQL on the wikipedia server with a minimum word length of 2, so "XP" works fine.
Still doesn't help with "vitamin E" or "C sharp"...
Any reason we can't take it down to 1?
-- brion vibber (brion @ pobox.com)
(Brion Vibber vibber@aludra.usc.edu): On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
Yeah, the issues are intertwined a bit. BTW, I compiled MySQL on the wikipedia server with a minimum word length of 2, so "XP" works fine.
Still doesn't help with "vitamin E" or "C sharp"...
Any reason we can't take it down to 1?
Two problems I can see: first, "A" and "I" clearly can't have meaningful indexes, so even if we make "Vitamin C" searchable, it won't work for "Vitamin A", causing confusion. Also, single letters often appear in articles as links (see list of diseases, for example) or as outline labels, etc., which would further make single-letter searches less meaningful.
On Mon, 24 Feb 2003, Lee Daniel Crocker wrote:
(Brion Vibber vibber@aludra.usc.edu): Still doesn't help with "vitamin E" or "C sharp"...
Any reason we can't take it down to 1?
Two problems I can see: first, "A" and "I" clearly can't have meaningful indexes, so even if we make "Vitamin C" searchable, it won't work for "Vitamin A", causing confusion.
24/26ths less confusion than none of them working, I'd wager.
Also, single letters often appear in articles as links (see list of diseases, for example) or as outline labels, etc., which would further make single-letter searches less meaningful.
In most cases they'd be meaningful _enough_ though, for two reasons:
a) alphabetical lists, "I", and "a" are very rare in article titles, which we search separately from body text. Yes, there's going to be the occasional spurious "Biographical index -- C through G", but does that negate the utility of returning "C programming language" (and a few other pages) instead of nothing to the hapless kid searching for that hip programming language?
b) in most cases such searches will be in conjunction with other words. "Vitamin C" or "Malcolm X" will appear more often in conjunction when they are in fact mentioned together than in random unrelated lists. There will be some false positives, but that's better than many many false negatives.
-- brion vibber (brion @ pobox.com)
On my implementation of the Wikipedia3 software at http://www.internet-encyclopedia.info/wiki.phtml
Via a control panel I can access the field user_rights, which I could edit:
Field parameters
Field name:
Data type: tinyblob
Allow nulls? Yes / No
Default value:
Part of primary key? Yes / No

Table user in database wikidb

Field name   Type      Allow nulls?  Key   Default value  Extras
user_rights  tinyblob  No            None

What should I put in the cell user_rights for myself to make myself (or someone else) either a sysop or developer?
Or is there some other way and something I am missing?
Fred
On Sat, Mar 08, 2003 at 06:26:00AM -0700, Fred Bauder wrote:
On my implementation of the Wikipedia3 software at http://www.internet-encyclopedia.info/wiki.phtml
Via a control panel I can access the field user_rights, which I could edit:
Field parameters
Field name:
Data type: tinyblob
Allow nulls? Yes / No
Default value:
Part of primary key? Yes / No

Table user in database wikidb

Field name   Type      Allow nulls?  Key   Default value  Extras
user_rights  tinyblob  No            None

What should I put in the cell user_rights for myself to make myself (or someone else) either a sysop or developer?
You should put "sysop" there to become a sysop.
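For anyone without a control panel, the same change could presumably be made with a plain UPDATE statement. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the `user_name` column and the account name "Fred" are assumptions made for the example, and only the table name `user`, the field `user_rights` and the value "sysop" come from the thread.

```python
import sqlite3

# In-memory SQLite database standing in for the wiki's MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user ("
    " user_name TEXT PRIMARY KEY,"           # assumed column, not from the thread
    " user_rights TEXT NOT NULL DEFAULT '')"
)
conn.execute("INSERT INTO user (user_name) VALUES (?)", ("Fred",))


def grant(conn, name, rights):
    """Set the user_rights field for one account."""
    conn.execute(
        "UPDATE user SET user_rights = ? WHERE user_name = ?", (rights, name)
    )
    conn.commit()


grant(conn, "Fred", "sysop")
row = conn.execute(
    "SELECT user_rights FROM user WHERE user_name = ?", ("Fred",)
).fetchone()
print(row[0])  # sysop
```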
On Fri, 21 Feb 2003, Dan Koehl wrote:
I'm sorry, the letter å is too common in our language for it to make sense to replace it with the code &aring;. At least for new Wikipedians, it has to be done automatically, like any modern HTML editor would do.
Otherwise we're back to making editing difficult, when it should be easy??
Since ä and ö, which are present in the German languages, are working OK, while å is the unique Swedish letter (not present in any other language?), I still believe there is something with the script and how it translates letters and code. I'm pretty sure that somewhere in the code there are lines which make ä and ö work, while å is lacking. Am I wrong?
Yes, you are wrong. The script is not doing anything, as I said before, it is the browsers (and/or more specifically the editors on them) that are at fault. As far as I understand it, these editors do not have å in their character set, while they do have ä and ö. Note that a similar thing happened in Tomasz Wegrzanowski's email. And the email surely won't go through Wikipedia's PHP script, will it?
Another way to see that it is not caused by the PHP script, is that it happens for some people but not all - the PHP script just sees the text that is given as input, so if it causes problems with some users, those users must be inputting a different text.
Andre
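Andre's point about character sets can be checked with a small Python sketch. The thread never says which charset the affected editors or mail programs used, so ISO-8859-2 (Latin-2) here is only an assumed example of a charset that contains ä and ö but not å; ISO-8859-1 (Latin-1) contains all three. The helper name `survives` is made up for the illustration.

```python
def survives(char, charset):
    """Return True if the character can be encoded in the given charset."""
    try:
        char.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for ch in "åäö":
    print(ch, survives(ch, "iso-8859-1"), survives(ch, "iso-8859-2"))

# A program limited to Latin-2 that substitutes "?" for unknown characters:
mangled = "Malmö och Skåne".encode("iso-8859-2", errors="replace").decode("iso-8859-2")
print(mangled)  # Malmö och Sk?ne
```

The å turns into a question mark while ä and ö pass through untouched, which matches the pattern in the quoted email above and happens without any wiki script being involved.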
Ok, well, the problem is not that great; I have corrected two files in two months, which is nothing. If there's nothing to do, it's not the end of the world.
Strange, though, that the browser accepts ä and ö but not å.
I suppose it's because a large number of European users are German, while there are too few of the old Swedes; that's why å was not implemented.
Nothing to do, and it's not a big problem anyhow. Over and out. - Dan
----- Original Message -----
From: "Andre Engels" engels@uni-koblenz.de
To: wikitech-l@wikipedia.org
Sent: Friday, February 21, 2003 2:53 PM
Subject: Re: [Wikitech-l] Re: [Wikitech-l] Swedish umlaut å, ä, and ö.
On Fri, 21 Feb 2003, Dan Koehl wrote:
I'm sorry, the letter å is too common in our language for it to make sense to replace it with the code &aring;. At least for new Wikipedians, it has to be done automatically, like any modern HTML editor would do.
Otherwise we're back to making editing difficult, when it should be easy??
Since ä and ö, which are present in the German languages, are working OK, while å is the unique Swedish letter (not present in any other language?), I still believe there is something with the script and how it translates letters and code. I'm pretty sure that somewhere in the code there are lines which make ä and ö work, while å is lacking. Am I wrong?
Yes, you are wrong. The script is not doing anything, as I said before, it is the browsers (and/or more specifically the editors on them) that are at fault. As far as I understand it, these editors do not have å in their character set, while they do have ä and ö. Note that a similar thing happened in Tomasz Wegrzanowski's email. And the email surely won't go through Wikipedia's PHP script, will it?
Another way to see that it is not caused by the PHP script, is that it happens for some people but not all - the PHP script just sees the text that is given as input, so if it causes problems with some users, those users must be inputting a different text.
Andre
On Fri, 21 Feb 2003, Dan Koehl wrote:
Since ä and ö which is present in the German langauges is working OK, while å is the unique Swedish letter (not present in any other language?) I stille belive there is something with the script, and how it translates letters and code. Im pretty sure that somewehere in the code theres lines which makes ä and ö working, while å is lacking. Am I wrong?
No, the wiki doesn't make any attempt to convert characters of input text to html references. It does to a limited degree convert incoming URLs, so URLs can come in either UTF-8 or the native character set. And outgoing language links have named references turned into latin-1 bytes, and numeric references turned into utf-8 bytes in the generated URL. But except on the Esperanto wiki where we do some replacement of /[cghjsu]x/ into UTF-8 and vice-versa due to the limited availability of appropriate keyboard layout drivers, text is not touched, but saved as given to us by the submitting browser.
If you can point out a specific example, I can check the web server logs and see what web browser is being used that might be doing this.
-- brion vibber (brion @ pobox.com)
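The Esperanto replacement of /[cghjsu]x/ mentioned above can be sketched as follows. This is a simplified Python illustration and not the wiki's PHP code: the names `X_PAIRS`, `x_to_unicode` and `unicode_to_x` are invented, and the sketch assumes every letter-plus-x pair is meant as an accented letter, with no way of escaping a literal "x".

```python
import re

# Esperanto x-convention: a base letter followed by "x" stands for the
# accented letter.
X_PAIRS = {
    "c": "ĉ", "g": "ĝ", "h": "ĥ", "j": "ĵ", "s": "ŝ", "u": "ŭ",
    "C": "Ĉ", "G": "Ĝ", "H": "Ĥ", "J": "Ĵ", "S": "Ŝ", "U": "Ŭ",
}
# Reverse table for going from accented letters back to the x-convention.
TO_X = str.maketrans({accented: plain + "x" for plain, accented in X_PAIRS.items()})


def x_to_unicode(text):
    """Convert x-convention input (e.g. 'cx') to the accented letters."""
    return re.sub(r"([cghjsuCGHJSU])[xX]", lambda m: X_PAIRS[m.group(1)], text)


def unicode_to_x(text):
    """Convert accented letters back to the x-convention for editing."""
    return text.translate(TO_X)


print(x_to_unicode("ehxosxangxo cxiujxauxde"))  # eĥoŝanĝo ĉiuĵaŭde
print(unicode_to_x("eĥoŝanĝo ĉiuĵaŭde"))        # ehxosxangxo cxiujxauxde
```

Going in both directions lets users without an Esperanto keyboard layout type and edit in plain ASCII while the stored text keeps the real letters.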
wikitech-l@lists.wikimedia.org