This message didn't get through the first time.
---------- Forwarded message ----------
From: Bryan Tong Minh bryan.tongminh@gmail.com
Date: Nov 5, 2007 9:36 PM
Subject: Re: [Pywikipedia-l] SVN: [4507] branches/rewrite/pywikibot/data/api.py
To: Merlijn van Deen valhallasw@arctus.nl
On Nov 5, 2007 5:12 PM, Merlijn van Deen valhallasw@arctus.nl wrote:
Although I appreciate your efforts, I still want to ask you to wait with any other rewrite commits until you have read, understood, and reacted to this (fairly long) email.
The purpose of the rewrite was a) to restructure the framework b) to have consistent formatting and documentation c) to move to the API and possibly d) to add i18n support
I would also like to give some thoughts on the rewrite. As some of you might already know, I have already created a fairly complete framework based on the API and thus have some experience with it.
First of all, the advantage of the API is that you don't need screen scraping, except for actions that change content. However, all the information those actions require is available through the API, so the only thing that still requires screen scraping is finding out whether the action was successful.
What is very important is that we clearly separate the different layers that a framework consists of and get rid of functions like replaceExceptInWhatever in the main module. In my opinion a proper framework consists of three separate layers:
* High-level
* Middleware
* Lowware or core
The core functionality should consist of methods to get and put raw page data, such as Page.get, Page.categories, Site.recentchanges, etc. The middleware consists of commonly used functions such as replaceIn, replaceImage, replaceCategory. The high-level software is the bot itself. It performs tasks by calling the functions of the middleware and core. This separation must be such that one can use the low-ware part on its own, without dependencies on the middleware and high-level layers. Dependencies should only be top down, never bottom up. A real separation would make the code much clearer and easier to maintain.
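A minimal sketch of this separation, using an in-memory dict in place of a real wiki. The names Page, replace_in and run_bot are hypothetical and only illustrate the direction of the dependencies; they are not the existing pywikipedia functions.

```python
# --- core (low-ware): raw page data only, no knowledge of higher layers ---
class Page:
    """Hypothetical core page object backed by an in-memory store."""

    def __init__(self, store, title):
        self._store = store  # dict standing in for the wiki
        self.title = title

    def get(self):
        return self._store[self.title]

    def put(self, text):
        self._store[self.title] = text


# --- middleware: commonly used helpers that depend only on the core ---
def replace_in(page, old, new):
    """Replace old with new in the page text; return True if the text changed."""
    text = page.get()
    updated = text.replace(old, new)
    if updated != text:
        page.put(updated)
        return True
    return False


# --- high-level: the bot itself, calling middleware and core ---
def run_bot(store, titles, old, new):
    """Return the titles of the pages that were changed."""
    return [t for t in titles if replace_in(Page(store, t), old, new)]
```

Page never refers to replace_in or run_bot, so the core could be used on its own.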
Related to this is the question of i18n. I strongly believe that low-ware should never output things to stdout or ask something from stdin. Communication to higher layers should happen through the use of return values and exceptions. I18n is part of the middleware, or specific to a bot, but never the task of the core.
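To illustrate, here is a rough sketch in which the core only returns a value or raises, and the translated message is chosen higher up. The exception names, the MESSAGES table and the function names are assumptions made up for this example.

```python
class PageError(Exception):
    """Base class for errors raised by the (hypothetical) core layer."""


class NoPage(PageError):
    """Raised when the requested page does not exist."""


def get_text(store, title):
    """Core function: returns raw text or raises; never prints or prompts."""
    try:
        return store[title]
    except KeyError:
        raise NoPage(title) from None


# Localised messages live outside the core (middleware or the bot itself).
MESSAGES = {
    "en": {"nopage": "Page {0} does not exist."},
    "nl": {"nopage": "Pagina {0} bestaat niet."},
}


def describe(store, title, lang="en"):
    """Bot-level code: decides what, if anything, to show the user."""
    try:
        return get_text(store, title)
    except NoPage as err:
        return MESSAGES[lang]["nopage"].format(err.args[0])
```

The core stays free of any language-specific text, so a bot can handle NoPage silently, log it, or translate it as it sees fit.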
The core itself can also be divided into sublayers:
* Python equivalents of the functions that the API provides
* Abstract Page/List objects
* Generic API function
* Generic communication layer
The first item covers the functions used from the outside, such as Site.recentchanges() and Page.put(). The API has several list/generator functions that all behave the same way; an abstract parent class would prevent duplicated code. The generic API function translates a function call into the appropriate HTTP request. The communication layer initiates and handles the connection to the server in an (optionally) persistent fashion.
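A possible shape for the abstract list object, as a sketch only. The request callable stands in for the generic API function, and the reply format used here (the "items" and "continue" keys, the "offset" parameter) is invented for the example; it is not the actual MediaWiki response format.

```python
class ApiList:
    """Abstract parent for list queries that share the same continuation logic."""

    list_name = None  # set by subclasses

    def __init__(self, request, **params):
        self._request = request  # callable standing in for the generic API function
        self._params = params

    def __iter__(self):
        params = dict(self._params, action="query", list=self.list_name)
        while True:
            reply = self._request(params)
            yield from reply["items"]
            if "continue" not in reply:
                return
            # Feed the continuation values back into the next request.
            params = dict(params, **reply["continue"])


class RecentChanges(ApiList):
    list_name = "recentchanges"


class AllPages(ApiList):
    list_name = "allpages"


def fake_request(params):
    """Stand-in transport: serves five items, two per request."""
    data = ["a", "b", "c", "d", "e"]
    start = int(params.get("offset", 0))
    reply = {"items": data[start:start + 2]}
    if start + 2 < len(data):
        reply["continue"] = {"offset": start + 2}
    return reply
```

Each concrete list only sets list_name; the paging loop is written once in the parent class.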
What is important to consider is where the error correction is put. Some errors are recoverable after a retry. The lower the level at which such a retry is placed, the less code duplication is required. However, retry code placed too low may end up catching errors that are too generic. An example of this is slave lag or Retry-After. This is probably something that should be caught in either the second or the third layer of the core. HTTP errors should probably be caught in the third or the fourth layer. User blockage should be detected in either the second or the third layer and propagate through the low-ware and middleware to the bot, which can optionally handle it or pass it on to the user.
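As a sketch of retry code that is not too generic: the helper below retries only on one specific, recoverable error and lets everything else propagate upwards. ServerLag, UserBlocked and with_retry are hypothetical names, and the injectable sleep argument exists only to make the example easy to try out.

```python
import time


class ServerLag(Exception):
    """Hypothetical recoverable error; carries the suggested wait in seconds."""

    def __init__(self, retry_after):
        super().__init__(retry_after)
        self.retry_after = retry_after


class UserBlocked(Exception):
    """Hypothetical non-recoverable error; must reach the bot untouched."""


def with_retry(call, max_tries=3, sleep=time.sleep):
    """Retry call() on ServerLag only; every other exception propagates."""
    for attempt in range(1, max_tries + 1):
        try:
            return call()
        except ServerLag as lag:
            if attempt == max_tries:
                raise  # give up and let the higher layers decide
            sleep(lag.retry_after)
```

Because only ServerLag is caught, a UserBlocked raised inside call() passes straight through to the bot.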
A thing to consider is how foolproof the framework is supposed to be. Pywikipediabot is used by many different users, from advanced users to absolute beginners. Beginners probably want the framework to catch many common exceptions and act for them, while advanced users want to keep things under their own control. Also, the use of persistent HTTP connections makes the framework less foolproof. A persistent HTTP connection makes any object that uses it automatically unsuitable for sharing between threads. Of course one should always use proper locking when sharing between threads, but we all know that that is something that does not always happen.
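One way the framework could do the locking itself, as a sketch: wrap the persistent connection so that only one thread can use it at a time. SharedConnection and its send callable are hypothetical names for this example.

```python
import threading


class SharedConnection:
    """Wraps a persistent connection so only one thread uses it at a time."""

    def __init__(self, send):
        self._send = send  # hypothetical callable that performs one request
        self._lock = threading.Lock()

    def request(self, data):
        # Serialise access: a second thread waits here until the first is done.
        with self._lock:
            return self._send(data)
```

This trades some parallelism for safety, which may suit beginners; advanced users could still talk to the connection directly and do their own locking.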
Those are my thoughts so far. Thank you for reading them; they are probably a little bit messy and unstructured, but I put them down as they entered my head.
Cheers, Bryan