By request, here is an update for interested persons on the status of the rewrite branch.
The major purpose of the rewrite branch is to implement a bot framework that uses the new MediaWiki API (see http://www.mediawiki.org/wiki/API for details) instead of the old approach of scraping HTML wiki pages. Note that many other potential areas for a rewrite have been suggested on this list, and at http://www.botwiki.sno.cc/wiki/Rewrite, but most of those are not currently being pursued due to a lack of resources.
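For readers who have not used the API, the difference from HTML scraping can be sketched roughly as follows. This is only an illustration in modern Python 3, not code from the rewrite branch: the helper names build_content_query and extract_text are hypothetical, the sample response is hand-written rather than fetched from a wiki, and the response layout assumed is the classic JSON format in which the page text sits under the "*" key.

```python
import json
from urllib.parse import urlencode


def build_content_query(api_url, title):
    """Return an API URL asking for the latest revision text of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def extract_text(response_body):
    """Pull the wikitext out of a JSON response in the classic format."""
    data = json.loads(response_body)
    pages = data["query"]["pages"]
    # Pages are keyed by page id; only one title was requested here.
    page = next(iter(pages.values()))
    return page["revisions"][0]["*"]


# Hand-written sample response (an assumption, not real wiki output).
SAMPLE = (
    '{"query": {"pages": {"42": {"title": "Sandbox", '
    '"revisions": [{"*": "Hello [[world]]"}]}}}}'
)

url = build_content_query("http://en.wikipedia.org/w/api.php", "Sandbox")
text = extract_text(SAMPLE)
print(url)
print(text)
```

The point is that the bot receives structured data and never has to parse rendered HTML.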
The software in the rewrite branch currently is runnable, but incomplete, and with limited documentation. For the most part, the bot programming interface is intended to be very similar to the interface used in the current pywikipedia trunk, so that bot programs can be ported easily, but there are significant changes that we have started to document in the file README-conversion.txt. To date, most methods of the Page object that read from the wiki have been implemented; you can, for example, instantiate a Page object, get its text, and retrieve its templates, links, categories, backlinks, and so forth (interwiki links other than language links are not yet implemented). Methods that provide Site-wide lists of pages (allpages, allcategories, randompages, etc.) have not yet been implemented, but this is next on my to-do list. Most methods in the existing framework that manipulate wiki text have not yet been ported (things like replaceCategoryLinks), but these should require very few changes.
The ability to save changes to the wiki is *not* yet implemented. Note that the MediaWiki API does now have the ability to edit pages, but this has not yet been activated on any WMF wikis, so once editing is implemented in the bot framework, it still will be of limited use.
At the moment, I am doing most of the development work on this branch. Valhallasw contributed the http interface, nicdumz has contributed some user-related methods, and although he has not contributed directly, I have stolen^H^H^H^H^H^Hbeen inspired by some of Bryan Tong Minh's ideas from his mwclient project. Let me make it clear that I do not consider this "my" project by any means; anyone who is willing and able to contribute will be most welcome. It may be helpful for any new contributors to announce which aspects of the code they are planning to work on, to avoid duplication of effort.
Russ
On Mon, May 5, 2008 at 8:19 PM, Russell Blau russblau@imapmail.org wrote: [...]
I have stolen^H^H^H^H^H^Hbeen inspired by some of Bryan Tong Minh's ideas from his mwclient project.
Note that there is full license compatibility between the projects so copying useful parts is not a problem :)
I would encourage any dev / bot owner fluent enough in python to give the rewrite a try, particularly if you use scripts fetching a lot of data from MediaWiki.
I wrote, for example, a maintenance script for the French translation project. It fetches hundreds of pages, does some mumbo jumbo magic on them, and eventually uses that data to update ~200 summary pages.
I first wrote it using our trunk pywikipedia (since editing through the API is not yet available, I first thought that it was the only way). It was very, very *slow*. I wondered what improvement I would get from the rewrite, so I rewrote my script to use the rewrite for all the page-fetching parts. I don't have precise figures, but I would say that the latter version was probably 3 to 10 times faster.
In addition to being faster, the more we use the rewrite, the more bugs we will be able to detect and correct, and the easier it will be to merge the branch when API editing becomes available.
(Speaking of debugging, if you're annoyed by the debug output while writing your scripts, putting import logging; logging.getLogger().setLevel(logging.INFO) in your script header will help.)
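Spelled out as a script header, the logging tip might look like the sketch below. It uses only the standard logging module; the logger name "myscript" and the messages are made up for illustration.

```python
import logging

# Configure the root logger once, near the top of the script.
logging.basicConfig(format="%(levelname)s: %(message)s")
logging.getLogger().setLevel(logging.INFO)

log = logging.getLogger("myscript")  # hypothetical logger name
log.debug("raw API response ...")    # suppressed at INFO level
log.info("fetched 200 pages")        # still shown

# Loggers without their own level inherit the root logger's level.
debug_enabled = log.isEnabledFor(logging.DEBUG)
```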
Damnit... SF mailing lists strike again...
^_^ Any chance of also improving the usability of the interface? The user-config.py file is quite basic: it holds nothing but basic data and configuration, so it could easily be simplified to a code-modifiable format like ini or xml. I see a lot of users whose initial issue is "Why does my configuration file not work?", or who don't even know how to use it.
Then there are the people with family file issues... Honestly, the rewrite is already trying to move to an API format, and there is very little data in family files which we can't get from the API... Almost nothing except URL-related things.
Ok, bots need to be run from the cli, otherwise options can't be passed to them, so I'm not saying we should use a GUI. However, python is built with the ability to act as a shell.
Why don't we turn this into something run by an interactive shell? People already need to open up a cli, but why make them try to figure out a confusing set of other files when we could probably let them handle the data over the cli?
So to start up, someone would type this to enter the shell:
python pywiki.py
To run a bot:
run replace wiki=foo start=! regex=True "Bar[0-9]" "Baz5"
The wiki= parameter would accept one of four forms: 'id', which would use that wiki in the user's default language; 'lang', which would use the default wiki in a different language; 'lang.id', which would use that wiki in that language; or, for quick and easy use when some new wiki asks "Can you quickly replace some text on the wiki for us?", a url. If there is a :// in the input, it is taken as a url such as "http://naruto.wikia.com" and the needed info is grabbed from the API. That will work for 99% of wikis, but you may need to add an alias if a wiki needs special config. To deal with the user stuff we can let them set users:
config user en.wikipedia normal=FooMan sysop=FooOp
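The four wiki= forms described above could be told apart along the following lines. This is a sketch of the idea only, not an existing pywikipedia function. The proposal does not say how a bare 'id' is distinguished from a bare 'lang', so the lookup in a known-language set is an assumption, as are the names parse_wiki_spec, WikiTarget, DEFAULT_LANG, DEFAULT_WIKI and KNOWN_LANGS.

```python
from typing import NamedTuple, Optional


class WikiTarget(NamedTuple):
    lang: Optional[str]
    wiki_id: Optional[str]
    url: Optional[str]


# Hypothetical defaults and language list, for illustration only.
DEFAULT_LANG = "en"
DEFAULT_WIKI = "wikipedia"
KNOWN_LANGS = {"en", "fr", "de", "ja"}


def parse_wiki_spec(spec):
    """Classify a wiki= value as url, lang.id, lang, or id."""
    if "://" in spec:
        # Raw url: everything else would be fetched from the API later.
        return WikiTarget(None, None, spec.rstrip("/"))
    if "." in spec:
        lang, wiki_id = spec.split(".", 1)
        return WikiTarget(lang, wiki_id, None)
    if spec in KNOWN_LANGS:
        # Bare language code: default wiki in that language.
        return WikiTarget(spec, DEFAULT_WIKI, None)
    # Anything else is treated as a wiki id in the default language.
    return WikiTarget(DEFAULT_LANG, spec, None)


for example in ("anime", "fr", "en.anime", "http://naruto.wikia.com"):
    print(example, "->", parse_wiki_spec(example))
```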
And instead of long family files, just let them specify ids which are mapped to urls, like git's remote command:
wiki add en.anime http://anime.wikia.com
And it'll retrieve the info from the API... Of course, for a wiki like Wikipedia:
wiki add en.wikipedia http://en.wikipedia.org/w
And the data will be grabbed from the API for the namespaces, versions and such. To refresh our namespaces and other data from the site:
wiki refresh en.wikipedia
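As a rough sketch of what such a refresh could ask the API for: MediaWiki's action=query with meta=siteinfo returns general site data and the namespace list. The helper names build_siteinfo_query and parse_namespaces are hypothetical, the sketch assumes api.php sits directly under the base url given to "wiki add" (true for http://en.wikipedia.org/w, not necessarily elsewhere), and the sample response is hand-written in the classic JSON layout.

```python
import json
from urllib.parse import urlencode


def build_siteinfo_query(base_url):
    """Build a siteinfo request, assuming api.php lives under base_url."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general|namespaces",
        "format": "json",
    }
    return base_url.rstrip("/") + "/api.php?" + urlencode(params)


def parse_namespaces(response_body):
    """Map namespace number -> local namespace name."""
    data = json.loads(response_body)
    namespaces = data["query"]["namespaces"]
    return {int(ns_id): ns["*"] for ns_id, ns in namespaces.items()}


# Hand-written sample response (an assumption, not real wiki output).
SAMPLE = (
    '{"query": {"namespaces": {"0": {"id": 0, "*": ""}, '
    '"1": {"id": 1, "*": "Talk"}, "14": {"id": 14, "*": "Category"}}}}'
)

siteinfo_url = build_siteinfo_query("http://en.wikipedia.org/w")
namespaces = parse_namespaces(SAMPLE)
print(siteinfo_url)
print(namespaces)
```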
When there are extra options that config needs and that we can't get through the API, they can be set. (set-prop is kinda svn inspired... option, param, conf, prop, or whatever you think works best and most people will understand; we can pick.) Say we needed to set the nicepath for some reason and the default wasn't good...
wiki prop en.wikipedia nicepath /wiki/$1
Of course, all this kind of data is retrievable:
wiki prop en.wikipedia nicepath
/wiki/$1
config user en.wikipedia
Normal: FooMan
Sysop: FooOp
wiki base en.anime
http://anime.wikia.com
And a wiki can be changed if needed, say if the wiki moved urls (my sync bots broke when Wikia moved a wiki or two, and their redirect broke index.php urls):
wiki switch en.anime http://en.anime.wikia.com
Kind of an idea...
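To make the idea concrete, here is a minimal sketch of how the "wiki" commands above could be wired up with Python's standard cmd module. Nothing here exists in pywikipedia: the class name PyWikiShell and its in-memory storage are assumptions, and the sketch only records what the user types; it does not contact any API or persist anything.

```python
import cmd
import io


class PyWikiShell(cmd.Cmd):
    """Hypothetical shell; command names follow the proposal above."""

    prompt = "pywiki> "

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.wikis = {}  # alias -> {"base": url, "props": {name: value}}

    def _say(self, text):
        self.stdout.write(text + "\n")

    def do_wiki(self, line):
        """wiki add|switch ALIAS URL; wiki base ALIAS; wiki prop ALIAS NAME [VALUE]"""
        args = line.split()
        if len(args) < 2:
            self._say("usage: wiki add|switch|base|prop ALIAS ...")
            return
        action, alias, rest = args[0], args[1], args[2:]
        if action == "add" and len(rest) == 1:
            self.wikis[alias] = {"base": rest[0], "props": {}}
        elif alias not in self.wikis:
            self._say("unknown wiki: " + alias)
        elif action == "switch" and len(rest) == 1:
            self.wikis[alias]["base"] = rest[0]
        elif action == "base" and not rest:
            self._say(self.wikis[alias]["base"])
        elif action == "prop" and len(rest) == 2:
            self.wikis[alias]["props"][rest[0]] = rest[1]  # set a property
        elif action == "prop" and len(rest) == 1:
            self._say(self.wikis[alias]["props"].get(rest[0], ""))  # read it
        else:
            self._say("usage: wiki add|switch|base|prop ALIAS ...")

    def do_quit(self, line):
        """Leave the shell."""
        return True


# Drive the shell non-interactively and capture its replies.
out = io.StringIO()
shell = PyWikiShell(stdout=out)
for command in (
    "wiki add en.anime http://anime.wikia.com",
    "wiki base en.anime",
    "wiki switch en.anime http://en.anime.wikia.com",
    "wiki base en.anime",
    "wiki prop en.anime nicepath /wiki/$1",
    "wiki prop en.anime nicepath",
):
    shell.onecmd(command)
```

For interactive use one would call PyWikiShell().cmdloop() instead of feeding commands through onecmd.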
~Daniel Friesen (Dantman) of:
- The Gaiapedia (http://gaia.wikia.com)
- Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
- and Wiki-Tools.com (http://wiki-tools.com)
On Fri, May 9, 2008 at 3:12 PM, DanTMan dan_the_man@telus.net wrote:
Damnit... SF mailing lists strike again...
^_^ Any chance of also improving the usability of the interface? The user-config.py file is quite basic: it holds nothing but basic data and configuration, so it could easily be simplified to a code-modifiable format like ini or xml. I see a lot of users whose initial issue is "Why does my configuration file not work?", or who don't even know how to use it.
I like the flexibility of the current system. XML sucks for config files because it is too complicated, and ini files only allow one level of depth. Besides, users who do not know how to configure their bot should not run it at all.
Then there are the people with family file issues... Honestly, the rewrite is already trying to move to an API format, and there is very little data in family files which we can't get from the API... Almost nothing except URL-related things.
Agree
Ok, bots need to be run from the cli, otherwise options can't be passed to them, so I'm not saying we should use a GUI. However, python is built with the ability to act as a shell.
Why don't we turn this into something run by an interactive shell? People already need to open up a cli, but why make them try to figure out a confusing set of other files when we could probably let them handle the data over the cli?
So to start up, someone would type this to enter the shell:
python pywiki.py
To run a bot:
run replace wiki=foo start=! regex=True "Bar[0-9]" "Baz5"
That could be an additional interface, but should never be the only one.
But in any case this is middleware or even frontend talk whereas the current development focuses on backend, as far as I know. But it is worth exploring.
Bryan
pywikipedia-l@lists.wikimedia.org