As I have reported earlier on wikide-l and wikipedia-l, I digitized two small encyclopedias in September and early October, one in German and one in English, and made them available on Wikisource for proofreading and reference. Every book page has a wiki subpage of its own, presenting the scanned image and the OCR text.
Since then I have monitored how fast Google has been to index these titles. More than half of the German pages were indexed within a few weeks, which is in line with my experience of how fast Google can be. But to my surprise, only very few of the English pages have yet been indexed.
It was only this summer that wikisource.org was split into language subdomains like de.wikisource.org and en.wikisource.org, so it is understandable that pages in the new subdomains still have a low Google rank. However, this is the same for all languages, and doesn't explain the difference that I see between the German and the English language subdomain.
Today de.wikisource.org reports having 2311 articles and 4638 pages. A word that occurs on every page is "letzte" ("recent" in Recent Changes) and Google gives 770 hits for the search http://www.google.com/search?q=site%3Ade.wikisource.org+letzte
Of the 4638 pages, 443 are subpages to http://de.wikisource.org/wiki/Meyers_Blitz-Lexikon and Google finds 418 of them, http://www.google.com/search?q=site%3Ade.wikisource.org+%22Meyers+Blitz-Lexi...
This means 17% of the de.wikisource pages are indexed, but 94% of the pages from the book I scanned.
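For anyone who wants to repeat the measurement, the site-restricted queries above can be built as in the following minimal sketch. The helper name `site_query_url` is made up for illustration; only the URL pattern comes from the links in this mail.

```python
from urllib.parse import urlencode


def site_query_url(site, term):
    """Build a Google search URL restricted to one site.

    `site_query_url` is a hypothetical helper, not part of any tool
    mentioned in this thread; it only reproduces the URL pattern above.
    """
    # The "site:" operator limits results to the given host.
    query = f"site:{site} {term}"
    # urlencode percent-encodes ":" as %3A and turns spaces into "+".
    return "http://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(site_query_url("de.wikisource.org", "letzte"))
    print(site_query_url("en.wikisource.org", "recent"))
```

The word passed as `term` has to be one that appears on every page of the wiki, otherwise the hit count undercounts the indexed pages.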
En.wikisource.org today has 19,006 articles and 23,780 total pages. A word that occurs on every page is "recent" and Google gives 12,500 hits for the query http://www.google.com/search?q=site%3Aen.wikisource.org+recent
Of the 23,780 pages, 2791 are subpages to http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work but Google only finds 4 of them, http://www.google.com/search?q=site%3Aen.wikisource.org+%22The+New+Student%2...
This means 53% of the en.wikisource.org pages are indexed, but only 0.14% of the book pages I scanned.
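The percentages for both subdomains can be rechecked with a few lines of Python. This is only a sketch: it treats Google's reported hit counts as exact, which is an assumption, and the function name `indexed_share` is invented for the example.

```python
def indexed_share(hits, total):
    """Return the share of pages that Google reports, in percent."""
    return 100.0 * hits / total


# (Google hits, total pages) as quoted in this mail.
counts = {
    "de.wikisource.org, all pages": (770, 4638),
    "Meyers Blitz-Lexikon subpages": (418, 443),
    "en.wikisource.org, all pages": (12500, 23780),
    "The New Student's Reference Work subpages": (4, 2791),
}

if __name__ == "__main__":
    for label, (hits, total) in counts.items():
        print(f"{label}: {indexed_share(hits, total):.2f}%")
```

Note that the site-wide figures divide by the total page count, not by the article count.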
I can understand that some enthusiastic Germans have linked to "Meyers Blitz-Lexikon" and increased its Google rank. But it also seems that there is a negative Google rank for "The New Student's Reference Work". Has it been trapped in some spam filter?
Lars Aronsson wrote:
It was only this summer that wikisource.org was split into language subdomains like de.wikisource.org and en.wikisource.org, so it is understandable that pages in the new subdomains still have a low Google rank. However, this is the same for all languages, and doesn't explain the difference that I see between the German and the English language subdomain.
This is wrong; it is not the same for all languages. The English one is privileged because wikisource.org/wiki/Anything redirects you to it.
Therefore, from Google's point of view, de.wikisource.org is "new", but en.wikisource.org is just a new name of a site that is "old".
Hence, I'm not surprised that the Google spider gives more priority to de.wikisource.org. In fact, it may still have been spidering the site when you put the stuff up, and so came across a link to it. Meanwhile, Google may think it has finished an indexing run of the English one and be taking a break before giving it another go.
Timwi
Timwi wrote:
This is wrong; it is not the same for all languages. The English one is privileged because wikisource.org/wiki/Anything redirects you to it.
No it doesn't. See for yourself: http://wikisource.org/wiki/Asdflkj
Note that there is a *multilingual* Wikisource at wikisource.org, and a number of *monolingual* Wikisources at XX.wikisource.org. This is a horrible ugly situation, but apparently people preferred that. *shrug*
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
No it doesn't. See for yourself: http://wikisource.org/wiki/Asdflkj
I did see for myself, and I did notice it doesn't actually forward you to a URL with "en." in front, but I just assumed that it's the English one nonetheless... :)
Thanks for the clarification, Timwi
On 11/1/05, Lars Aronsson lars@aronsson.se wrote:
Since then I have monitored how fast Google has been to index these titles. More than half of the German pages were indexed within a few weeks, which is in line with my experience of how fast Google can be. But to my surprise, only very few of the English pages have yet been indexed.
[lots snipped]
Thanks for the detailed mail. I just wanted to let you know that the right people have been notified, and they're looking into it.
wikitech-l@lists.wikimedia.org