Hi,
Google miscategorizes the language of some of the Hungarian Wikipedia pages. E.g. it thinks that our Adolf Hitler article is in Czech. Examples:
http://hu.wikipedia.org/wiki/Adolf_Hitler
http://hu.wikipedia.org/wiki/Janis_Joplin
http://hu.wikipedia.org/wiki/Milgram-k%C3%ADs%C3%A9rlet
http://hu.wikipedia.org/wiki/Vietnami_h%C3%A1bor%C3%BA
http://hu.wikipedia.org/wiki/Brad_Pitt
http://hu.wikipedia.org/wiki/Pamela_Anderson
As far as I can see, the HTTP response header contains: "Content-Language: hu"
and the returned page has:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="hu" lang="hu" dir="ltr">
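One way to double-check the two declarations mentioned above is to collect them programmatically. The following is only a minimal sketch: the names `HtmlLangParser` and `declared_languages` are made up for illustration, and the header and markup values are simply the ones quoted in this message, not a fresh fetch of the page.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Records the lang and xml:lang attributes of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.langs = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and not self.langs:
            self.langs = {k: v for k, v in attrs if k in ("lang", "xml:lang")}


def declared_languages(headers, body):
    """Return every language declaration found in the headers and markup."""
    found = {}
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-language":
            found["Content-Language"] = value
    parser = HtmlLangParser()
    parser.feed(body)
    found.update(parser.langs)
    return found


# Values copied from the message above (assumed, not re-fetched).
headers = {"Content-Language": "hu"}
body = '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="hu" lang="hu" dir="ltr">'
print(declared_languages(headers, body))
```

If all three declarations agree on "hu", as they do here, the page itself is labelled consistently, and any wrong classification would have to come from the search engine's own detection.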
Unfortunately when our target audience (Hungarian web users, who may not speak English) search for "Adolf Hitler" in Hungarian they won't get any Wikipedia hits. :(
So what else can we do to make Google believe that these pages are in Hungarian and not some other language?
Is there a (semi) formal address where I can submit a bug report about this to Google?
Thanks, nyenyec
On 9/2/05, Nyenyec N nyenyec@gmail.com wrote:
So what else can we do to make Google believe that these pages are in Hungarian and not some other language?
Is there a (semi) formal address where I can submit a bug report about this to Google?
The google blog's email addy is a good place to start; they've just recently posted two items on their translation services.
gblog@google.com
Blog here: http://googleblog.blogspot.com/
Also, see http://groups.google.com/group/google.public.translators (newsgroup google.public.translators)
Google miscategorizes the language of some of the Hungarian Wikipedia pages. E.g. it thinks that our Adolf Hitler article is in czech.
How do you know they are miscategorising the language?
http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Ahu.wikipedia.org
This makes it seem like they haven't indexed the page at all, not that they've marked it as the wrong language.
Angela.
Because on many occasions Google does find the page in question in a certain other language.
According to Google, the Hungarian Wikipedia has 3870 articles written "in English": http://www.google.com/search?hl=en&lr=lang_en&as_q=h%C3%A1bor%C3%BA&...
Over 52,000 articles "in Czech": http://www.google.com/search?hl=en&lr=lang_cs&as_q=h%C3%A1bor%C3%BA&...
Two articles which Google thinks are in Chinese (simplified): http://www.google.com/search?lr=lang_zh-CN&as_sitesearch=hu.wikipedia.or...
The Hungarian article about the National anthem of Russia is supposedly in traditional Chinese: http://www.google.com/search?lr=lang_zh-CN&as_sitesearch=hu.wikipedia.or...
And so on.
Regards, Endre (KovacsUr@huwiki)
----- Original Message ----- From: "Angela" beesley@gmail.com To: "Wikimedia developers" wikitech-l@wikimedia.org Sent: Friday, September 02, 2005 10:22 PM Subject: Re: [Wikitech-l] Assisting Google's language recognition?
Google miscategorizes the language of some of the Hungarian Wikipedia pages. E.g. it thinks that our Adolf Hitler article is in czech.
How do you know they are miscategorising the language?
http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Ahu.wikipedia.org
This makes it seem like they haven't indexed the page at all, not that they've marked it as the wrong language.
Angela.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
On 9/2/05, Angela beesley@gmail.com wrote:
Google miscategorizes the language of some of the Hungarian Wikipedia pages. E.g. it thinks that our Adolf Hitler article is in czech.
How do you know they are miscategorising the language? http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Ahu.wikipedia.org
This makes it seem like they haven't indexed the page at all, not that they've marked it as the wrong language.
I believe this is correct. We've had problems in the past with overloading wikipedia so the crawl has been throttled way back.
Aren't almost all of the pages Google requests cached? At any rate, our robots.txt [1] makes no mention of this, so if this is the case we are either doing user-agent checking and throttling them dynamically, or we have a negotiated agreement with them... Is this the case?
/Alterego
[1] http://en.wikipedia.org/robots.txt
On 9/3/05, Evan Martin evanm@google.com wrote:
On 9/2/05, Angela beesley@gmail.com wrote:
Google miscategorizes the language of some of the Hungarian Wikipedia pages. E.g. it thinks that our Adolf Hitler article is in czech.
How do you know they are miscategorising the language?
http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Ahu.wikipedi...
This makes it seem like they haven't indexed the page at all, not that they've marked it as the wrong language.
I believe this is correct. We've had problems in the past with overloading wikipedia so the crawl has been throttled way back.
Evan Martin wrote:
On 9/2/05, Angela beesley@gmail.com wrote:
Google miscategorizes the language of some of the Hungarian Wikipedia pages. E.g. it thinks that our Adolf Hitler article is in czech.
How do you know they are miscategorising the language? http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Ahu.wikipedia.org
This makes it seem like they haven't indexed the page at all, not that they've marked it as the wrong language.
I believe this is correct. We've had problems in the past with overloading wikipedia so the crawl has been throttled way back.
Throttled back to what? I remember having problems in late 2003 and early 2004, but we have 30 times as much hardware now, so we could handle a proportionally higher crawl rate, say 10 or 20 requests per second. That is, unless one of the other system administrators wants to disagree.
This doesn't mean every man and his dog can try to crawl our site with wget at 20 requests per second. Our number of users is higher, and those users still have to be careful not to take any more than their fair share of the resources, or else they risk being blocked without notice. I would still recommend a crawl delay of at least 2 seconds for users without specific approval. But I think we're in a position now where we can make an exception for the big search engines.
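For anyone writing their own crawler, the recommended delay could be enforced with something like the sketch below. This is an illustration only, not anything Wikimedia provides: `CrawlThrottle` is a hypothetical helper, and the injectable `clock` and `sleep` parameters exist purely so the behaviour can be checked without actually waiting.

```python
import time


class CrawlThrottle:
    """Blocks so that successive requests are at least `delay` seconds apart."""

    def __init__(self, delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay      # minimum gap between requests, in seconds
        self._clock = clock     # injectable for testing (assumption of this sketch)
        self._sleep = sleep
        self._last = None       # time of the previous request, if any

    def wait(self):
        """Sleep if the previous request was too recent, then record this one."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would create one `CrawlThrottle(delay=2.0)` and call `wait()` before every request. A rate of 20 requests per second, by comparison, corresponds to a delay of 0.05 seconds.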
-- Tim Starling
On 9/3/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Evan Martin wrote:
I believe this is correct. We've had problems in the past with overloading wikipedia so the crawl has been throttled way back.
Throttled back to what? I remember having problems in late 2003 and early 2004, but we have 30 times as much hardware now, so we could handle a proportionally higher crawl rate, say 10 or 20 requests per second. That is, unless one of the other system administrators wants to disagree.
We've upped the rate a bit. Please let me know if it causes any problems.
(Also, is there an official contact for this sort of thing? That is: who knows / gets to make the final decision of whether wikipedia can handle more load?)
The "official contact" would probably be this mailing list, or an individual developer (you can tell who they are here because they speak more authoritatively about things).
Having said that, I hope you don't mind if I stray quite a bit from the TaH and ask if you know what's up with the Google in Your Language program? It seems that queries have gone unanswered for quite some time now, except occasionally by seemingly-automated responses, and whoever is currently coordinating the program doesn't seem to check the Usenet group for Google translators anymore as they once did.
Also, the group is getting EXTREMELY difficult to sift through, with the hundreds of posts by people asking for gmail invites, or wanting to translate gmail into their language -- is there a way to hide those annoying messages so that we can read about more expedient issues?
Cheers Mark
On 07/09/05, Evan Martin evanm@google.com wrote:
On 9/3/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Evan Martin wrote:
I believe this is correct. We've had problems in the past with overloading wikipedia so the crawl has been throttled way back.
Throttled back to what? I remember having problems in late 2003 and early 2004, but we have 30 times as much hardware now, so we could handle a proportionally higher crawl rate, say 10 or 20 requests per second. That is, unless one of the other system administrators wants to disagree.
We've upped the rate a bit. Please let me know if it causes any problems.
(Also, is there an official contact for this sort of thing? That is: who knows / gets to make the final decision of whether wikipedia can handle more load?)
This doesn't sound like the correct mailing list for that.
/Alterego
On 9/7/05, Mark Williamson node.ue@gmail.com wrote:
The "official contact" would probably be this mailing list, or an individual developer (you can tell who they are here because they speak more authoritatively about things).
Having said that, I hope you don't mind if I stray quite a bit from the TaH and ask if you know what's up with the Google in Your Language program? It seems that queries have not been being responded to for quite some time now except occasionally by seemingly-automated responses, and whoever is currently coordinating the program doesn't seem to check the Usenet group for Google translators anymore as they once did.
Also, the group is getting EXTREMELY difficult to sift through, with the hundreds of posts by people asking for gmail invites, or wanting to translate gmail into their language -- is there a way to hide those annoying messages so that we can read about more expedient issues?
Cheers Mark
On 07/09/05, Evan Martin evanm@google.com wrote:
On 9/3/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Evan Martin wrote:
I believe this is correct. We've had problems in the past with overloading wikipedia so the crawl has been throttled way back.
Throttled back to what? I remember having problems in late 2003 and early 2004, but we have 30 times as much hardware now, so we could handle a proportionally higher crawl rate, say 10 or 20 requests per second. That is, unless one of the other system administrators wants to disagree.
We've upped the rate a bit. Please let me know if it causes any problems.
(Also, is there an official contact for this sort of thing? That is: who knows / gets to make the final decision of whether wikipedia can handle more load?)
-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE
On 9/7/05, Evan Martin evanm@google.com wrote:
(Also, is there an official contact for this sort of thing? That is: who knows / gets to make the final decision of whether wikipedia can handle more load?)
The "official contacts" would be Brion Vibber brion@pobox.com and Domas Mituzas domas.mituzas@gmail.com.
http://meta.wikimedia.org/wiki/Chief_Technical_Officer http://meta.wikimedia.org/wiki/Hardware_Officer
But, asking on this mailing list may be better since others on the development/sys admin team may have opinions on the matter.
Angela.
Evan Martin wrote:
We've upped the rate a bit. Please let me know if it causes any problems.
You're making this sound like you work for Google and have some say or control over this, but you seem to specifically avoid saying this explicitly (or I missed it). Could you clarify this?
On 9/10/05, Timwi timwi@gmx.net wrote:
Evan Martin wrote:
We've upped the rate a bit. Please let me know if it causes any problems.
You're making this sound like you work for Google and have some say or control over this, but you seem to specifically avoid saying this explicitly (or I missed it). Could you clarify this?
http://mail.wikipedia.org/pipermail/wikitech-l/2005-August/031229.html
Dori wrote:
On 9/10/05, Timwi timwi@gmx.net wrote:
Evan Martin wrote:
We've upped the rate a bit. Please let me know if it causes any problems.
You're making this sound like you work for Google and have some say or control over this, but you seem to specifically avoid saying this explicitly (or I missed it). Could you clarify this?
http://mail.wikipedia.org/pipermail/wikitech-l/2005-August/031229.html
Thanks!
Angela wrote:
http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Ahu.wikipedia.org
This makes it seem like they haven't indexed the page at all, not that they've marked it as the wrong language.
Uhm... URLs cannot contain spaces, so this query is never going to return anything.
Instead, you probably meant http://www.google.com/search?q=inurl%3AAdolf_Hitler+site%3Ahu.wikipedia.org, which does return two search results (the article and its talk page), although they do not have a snippet or even the page title. This means that Google knows that these pages are linked to from somewhere, but hasn't indexed them yet. I find that surprising because the page has existed for 4 months now; even at only 1 page crawl per second, you can crawl over 10 million pages in that time.
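The crawl estimate above can be checked with a few lines of arithmetic. This is just a sketch: the function name `pages_crawled` is made up, and "4 months" is taken as roughly 120 days, which is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def pages_crawled(days, pages_per_second=1.0):
    """Pages fetched in `days` days at a constant crawl rate."""
    return int(days * SECONDS_PER_DAY * pages_per_second)


# 4 months approximated as 120 days (assumption).
print(pages_crawled(120))
```

At one page per second this gives 120 x 86,400 = 10,368,000 pages, which is where the "over 10 million" figure comes from.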
Timwi
Take a look at this query, then: http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Awikipedia.o...
Cheers, Endre
----- Original Message ----- From: "Timwi" timwi@gmx.net To: wikitech-l@wikimedia.org Sent: Tuesday, September 06, 2005 8:57 PM Subject: [Wikitech-l] Re: Assisting Google's language recognition?
Uhm... URLs cannot contain spaces, so this query is never going to return anything.
Kovács Endre János wrote:
Take a look at this query, then: http://www.google.com/search?q=inurl%3A%22Adolf+Hitler%22+site%3Awikipedia.o...
Alright -- so it does the same kind of "punctuation-insensitive matching" as it does with phrases without "inurl"; but that doesn't change the fact that http://en.wikipedia.org/wiki/Adolf_Hitler still doesn't appear in the search results because it has an underscore there.
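To make the space versus underscore point concrete, here is a small sketch of how the URLs in this thread are put together. The helper names `article_url` and `google_query_url` are hypothetical, and the encoding shown is plain percent-encoding from Python's standard library. It reproduces the URLs quoted in this thread, but the wiki software may treat some other characters differently.

```python
from urllib.parse import quote, urlencode


def article_url(title, host="hu.wikipedia.org"):
    """Build an article URL: spaces become underscores, then UTF-8 percent-encoding."""
    path_title = title.replace(" ", "_")
    return "http://" + host + "/wiki/" + quote(path_title)


def google_query_url(query):
    """Build a search URL; in the query string a space is encoded as '+'."""
    return "http://www.google.com/search?" + urlencode({"q": query})


print(article_url("Adolf Hitler"))
print(article_url("Vietnami háború"))
print(google_query_url('inurl:"Adolf Hitler" site:hu.wikipedia.org'))
```

The article URL never contains a space, only an underscore, while the `+` in the search URL is just the encoded space inside the quoted phrase being searched for.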