I could not store this comment on the blog server. Feel free to put it there if you can, or forward it elsewhere, if you see fit.
Since an interwiki link needing propagation may exist only once, in one specific page of one specific wiki, all pages with the potential for interwiki linking in each language of a project need to be read. There is no reason not to have a single bot doing this, but as pywikipediabot is currently structured, it always operates starting from a selection of pages of one individual wiki only. These selections may be huge, such as all articles in the English Wikipedia (but no non-article pages, such as templates or category pages, and no other language). So with the current structure it is advisable to have, for each language wiki, at least one bot starting from it regularly, propagating the links that exist "here only" to the remaining wikis.
There is another sad thing to mention. If only one link could not be set - be it because of an edit conflict, a transient network error, server overload, or because a bot is not allowed to access a specific wiki - the entire bot run for all linked articles in this interwiki class has to be repeated just to add this single missing link. The majority of interwiki bots serve only a comparatively small number of wikis. It's hard to get a single bot to serve all language wikis. It requires a lot of labour due to the sheer number of wikis there are: each and every wiki requires an individual account to be set up and an individual bot application following rules individual to each wiki, which you have to find, read, understand, and obey; proceedings and procedures vary, and are in part contradictory between wikis. Even if you follow their rules, some wiki communities, or their bureaucrats, just don't approve the bot, for one reason or another, or for none at all.
An "interwiki class" is the set of pages each (needing to be) linked to each other in the same class. Such classes can be as little as two pages, and as big as one page from each wiki in a family.
A slightly redesigned interwiki bot reading the replicated databases and tables on the toolserver could collect class information much more efficiently than interwiki.py currently does by exporting groups of articles from each wiki. Provided there is no significant replication lag, it would even be more up to date when it comes to updating pages, because of its vastly higher speed of collecting the members of a class. Such a redesign would also make it easier to implement various helpful new ways of selecting which pages to look at, e.g. "language='all', title='Amadeus Mozart'", or selections using SQL wildcards or regular expressions, etc.
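As a rough sketch of how such a selection could work against the replicas (the host name "sql-s1" and connection details are hypothetical; langlinks and page are standard MediaWiki tables):

    import MySQLdb  # MySQL bindings as available on the toolserver

    # Hypothetical connection: every wiki has a replicated database
    # named like enwiki_p; credentials omitted here.
    conn = MySQLdb.connect(host="sql-s1", db="enwiki_p")
    cur = conn.cursor()

    # All language links leaving matching pages, straight from the
    # replica -- no page text needs to be downloaded or parsed, and
    # SQL wildcards come for free.
    cur.execute("""SELECT page_title, ll_lang, ll_title
                     FROM langlinks
                     JOIN page ON page_id = ll_from
                    WHERE page_namespace = 0
                      AND page_title LIKE %s""", ("Amadeus%",))
    for source, lang, target in cur.fetchall():
        print source, lang, target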
Greetings. Purodha.
I'm not sure I understand you.
Searching for "Amadeus Mozart" in the replicated databases could help, yes, but the number of articles that share a common string across different languages is quite small, isn't it? It works for some specific concepts and personalities, but most article titles need to be translated, and a search using wildcards or regexps is not going to help for those.
Honestly, the pywikipedia team has changed a bit these last months, and editing via the API will soon be available: I've been telling myself for days that interwiki.py will sooner or later need a rewrite. But this is not that easy.
I understand your concept of "interwiki class", but finding such a class does not appear to be that obvious.
If you have a general pseudo-algorithm that can outline a specific class of articles on the same subject, please share it. But I think that the actual behavior -- starting from a specific page, building the interwiki links graph, and indexing the cycles -- even if not optimal, cannot be avoided that easily.
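For what it's worth, the behaviour you describe amounts to computing the connected component of the starting page in the language-link graph. A minimal sketch, with the actual link fetching kept abstract:

    def interwiki_class(start, get_langlinks):
        """Collect all (lang, title) pairs reachable from `start` by
        repeatedly following language links -- one "interwiki class"."""
        seen = set([start])
        todo = [start]
        while todo:
            page = todo.pop()
            # get_langlinks may parse page text, ask the API, or query a
            # replicated langlinks table -- that part stays pluggable.
            for linked in get_langlinks(page):
                if linked not in seen:
                    seen.add(linked)
                    todo.append(linked)
        return seen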
Nicolas Dumazet wrote:
If you have a general pseudo-algorithm that can outline a specific class of articles on the same subject, please share it. But I think that the actual behavior -- starting from a specific page, building the interwiki links graph, and indexing the cycles -- even if not optimal, cannot be avoided that easily.
No, it can't be avoided, but Purodha is right in that using the toolserver dbs would be faster. Now, I don't know how interwiki.py is structured, but I think it calls for different pluggable modules behind whatever is doing get_interwikis_from_page(). So you could have one acting as it does now, another obtaining the data via the API, and yet another one directly querying the langlinks table at the toolserver.
Directly querying the langlinks table not only saves time querying the wiki, but allows querying interwikis for only those wikis you're writing to. This also opens up the possibility of completely changing the source wiki concept, and instead querying each wiki db for links to a target wiki.
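A sketch of how that pluggability could look (all names here are invented for illustration; only the toolserver variant is fleshed out):

    class LanglinksBackend:
        """Common interface: return [(lang, title), ...] for one page."""
        def get_interwikis_from_page(self, lang, title):
            raise NotImplementedError

    class PageTextBackend(LanglinksBackend):
        """Current behaviour: download the page and parse its [[xx:...]] links."""

    class ApiBackend(LanglinksBackend):
        """Ask the MediaWiki API for the language links instead."""

    class ToolserverBackend(LanglinksBackend):
        """Read the replicated langlinks table directly."""
        def __init__(self, connections):
            # one open database connection per language code
            self.connections = connections

        def get_interwikis_from_page(self, lang, title):
            cur = self.connections[lang].cursor()
            cur.execute("""SELECT ll_lang, ll_title
                             FROM langlinks
                             JOIN page ON page_id = ll_from
                            WHERE page_namespace = 0
                              AND page_title = %s""", (title,))
            return cur.fetchall()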
Platonides wrote:
Directly querying the langlinks table not only saves time querying the wiki, but allows querying interwikis for only those wikis you're writing to. This also opens up the possibility of completely changing the source wiki concept, and instead querying each wiki db for links to a target wiki.
Coincidentally, yesterday I released a MediaWiki extension which, if accepted on Wikimedia projects, may make interwiki bots much less busy. See http://meta.wikimedia.org/wiki/A_newer_look_at_the_interwiki_link
Interesting. I have thought quite a bit about interwiki links myself, and studied some aspects of them for my thesis. A short writeup of a slightly different proposal for interwiki linking can be found here:
http://brightbyte.de/page/Ideas_for_a_smarter_inter-language_link_system
It works with a central database, but without a central wiki, and keeps interwiki-maintenance largely "in-place".
-- Daniel
About a year ago, I developed a monkey patch for pywikipediabot which uses the database (and wikiproxy if I'm right). It's in /home/valhallasw/libs/python/pywikipedia/wikipedia_ts.py; just add
import wikipedia_ts
after import wikipedia and it should work. Or not, as it's from before the database server split et cetera. Feel free to use it as a base for further code (it's MIT licensed :))
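In other words, a script using it starts like this (assuming wikipedia_ts.py is on your Python path):

    import wikipedia      # the pywikipedia core
    import wikipedia_ts   # monkey-patches wikipedia to read via the toolserver DB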
--valhallasw
I'm CCing Wikitech; I suggest we follow this thread there.
Nikola Smolenski wrote:
(thread about interwiki bots at toolserver)
Coincidentally, yesterday I released a MediaWiki extension which, if accepted on Wikimedia projects, may make interwiki bots much less busy. See http://meta.wikimedia.org/wiki/A_newer_look_at_the_interwiki_link
It also works by manual writing of the interwikis. I don't think that's the right way.
* You're not taking page moves into account. What will you do when a page is moved (by a low-tech user who knows nothing about the global wiki)?
* The articles will still have a 'preferred' title at the interwiki wiki. That means discussing article titles: "Move to the English name", "No, that's not it", "Interwikis with pages in Chinese are ugly!"...
IMHO it should be a shared table referencing the wiki and page ids. Then you provide a Special page showing all pages in that group. You'd reference it as 'include this page into the group XX:sometitle is in'. You can also provide some space for free-form commenting (such as explaining the difference from another page). Obviously, all of that must be properly logged, which with SUL should be much easier.
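A minimal sketch of such a shared table (column names invented; keying on page ids means the grouping survives page moves and renames):

    -- one row per member page; rows with the same group id form
    -- one "interwiki class"
    CREATE TABLE interlanguage_member (
      ilm_group int unsigned NOT NULL,   -- id of the group
      ilm_wiki  varbinary(32) NOT NULL,  -- e.g. 'enwiki'
      ilm_page  int unsigned NOT NULL,   -- page_id on that wiki
      PRIMARY KEY (ilm_wiki, ilm_page),
      KEY (ilm_group)
    );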
On Wednesday 11 June 2008 23:03:28 Platonides wrote:
I'm CCing Wikitech; I suggest we follow this thread there.
I'm answering on foundation-l, given that I don't follow wikitech-l, you do follow foundation-l, and the issues you raise are more community- than software-related.
Nikola Smolenski wrote:
(thread about interwiki bots at toolserver)
Coincidentally, yesterday I released a MediaWiki extension which, if accepted on Wikimedia projects, may make interwiki bots much less busy. See http://meta.wikimedia.org/wiki/A_newer_look_at_the_interwiki_link
It also works by manual writing of the interwikis. I don't think that's the right way.
* You're not taking page moves into account. What will you do when a page is moved (by a low-tech user who knows nothing about the global wiki)?
I am taking page moves into account. Right now, when a page is moved, if it has 20 interwiki links, someone has to update 20 pages on 20 Wikipedias. With the extension, someone has to update a single page on a single wiki - clearly something that is easier to do.
* The articles will still have a 'preferred' title at the interwiki wiki. That means discussing article titles: "Move to the English name", "No, that's not it", "Interwikis with pages in Chinese are ugly!"...
I proposed an easy and fair solution: use the name of the page on the first wiki that covered the topic. If a topic was first written about on the Vietnamese Wikipedia, use the Vietnamese name. Either way, redirects work, and even edit wars of this kind should pose no problem.
IMHO it should be a shared table referencing the wiki and page ids. Then you provide a Special page showing all pages in that group. You'd reference it as 'include this page into the group XX:sometitle is in'. You can also provide some space for free-form commenting (such as explaining the difference from another page). Obviously, all of that must be properly logged, which with SUL should be much easier.
Everything that you described already exists, except for the special page. The shared table is the langlinks table on the central wiki; you reference it by using {{#interlanguage:sometitle}}; free-form commenting is the text on the central wiki page; and it is properly logged in the page history.
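For example (title chosen only for illustration): each local article would contain

    {{#interlanguage:Amadeus Mozart}}

and the page [[Amadeus Mozart]] on the central wiki would carry the ordinary [[xx:...]] language links, plus any free-form notes, with changes visible in its history.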
Nikola Smolenski wrote:
On Wednesday 11 June 2008 23:03:28 Platonides wrote:
I'm CCing Wikitech; I suggest we follow this thread there.
I'm answering on foundation-l, given that I don't follow wikitech-l, you do follow foundation-l, and the issues you raise are more community- than software-related.
I don't, and I disagree on that, but if the messages arrive at one of the other two lists, I'll read them anyway :)
Nikola Smolenski wrote:
(thread about interwiki bots at toolserver)
Coincidentally, yesterday I released a MediaWiki extension which, if accepted on Wikimedia projects, may make interwiki bots much less busy. See http://meta.wikimedia.org/wiki/A_newer_look_at_the_interwiki_link
It also works by manual writing of the interwikis. I don't think that's the right way.
* You're not taking page moves into account. What will you do when a page is moved (by a low-tech user who knows nothing about the global wiki)?
I am taking page moves into account. Right now, when a page is moved, if it has 20 interwiki links, someone has to update 20 pages on 20 Wikipedias. With the extension, someone has to update a single page on a single wiki - clearly something that is easier to do.
But the page has 20 interwikis to the right version. So a
* The articles will still have a 'preferred' title at the interwiki wiki. That means discussing article titles: "Move to the English name", "No, that's not it", "Interwikis with pages in Chinese are ugly!"...
I proposed an easy and fair solution: use the name of the page on the first wiki that covered the topic. If a topic was first written about on the Vietnamese Wikipedia, use the Vietnamese name. Either way, redirects work, and even edit wars of this kind should pose no problem.
I know. But I think avoiding any name is better.
IMHO it should be a shared table referencing the wiki and page ids. Then you provide a Special page showing all pages in that group. You'd reference it as 'include this page into the group XX:sometitle is in'. You can also provide some space for free-form commenting (such as explaining the difference from another page). Obviously, all of that must be properly logged, which with SUL should be much easier.
Everything that you described already exists, except for the special page. The shared table is the langlinks table on the central wiki; you reference it by using {{#interlanguage:sometitle}}; free-form commenting is the text on the central wiki page; and it is properly logged in the page history.
Mmm, you're right. I'd prefer using page_ids, but someone more of a db guy than me should determine the efficiency difference of using ll_title (the page title) instead. I notice now that ll_title can't hold every wiki title, as it's a varchar(255) that includes the namespace, while titles are stored as varchar(255) without namespace everywhere else.
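For reference, the langlinks definition in MediaWiki's maintenance/tables.sql looks roughly like this (quoted from memory, so check it against the version you run):

    CREATE TABLE /*$wgDBprefix*/langlinks (
      -- page_id of the referring page
      ll_from int unsigned NOT NULL default 0,
      -- language code of the target
      ll_lang varbinary(20) NOT NULL default '',
      -- title of the target, including the namespace
      ll_title varchar(255) binary NOT NULL default '',
      UNIQUE KEY (ll_from, ll_lang),
      KEY (ll_lang, ll_title)
    );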