This message didn't get through the first time.
---------- Forwarded message ----------
From: Bryan Tong Minh <bryan.tongminh(a)gmail.com>
Date: Nov 5, 2007 9:36 PM
Subject: Re: [Pywikipedia-l] SVN: [4507] branches/rewrite/pywikibot/data/api.py
To: Merlijn van Deen <valhallasw(a)arctus.nl>
On Nov 5, 2007 5:12 PM, Merlijn van Deen <valhallasw(a)arctus.nl> wrote:
> Although I appreciate your efforts, I still want to ask you to wait with
> any other rewrite commits until you have read, understood, and reacted to
> this (fairly long) email.
>
> The purpose of the rewrite was
> a) to restructure the framework
> b) to have consistent formatting and documentation
> c) to move to the API
> and possibly
> d) to add i18n support
>
I would also like to give some thoughts on the rewrite. As some of you
might already know, I have already created a fairly complete framework
based on the API, and thus have some experience with it.
First of all, the advantage of the API is that you don't need screen
scraping, except for the actions that change content. However, all the
information required to perform those actions is available through the
API, so the only thing that still requires screen scraping is finding
out whether the action was successful.
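To give a concrete (and purely illustrative) example: fetching page
text becomes a single request to api.php. The endpoint and helper name
below are placeholders, not code from the rewrite.

    import json
    import urllib.parse
    import urllib.request

    def fetch_wikitext(title, endpoint="https://en.wikipedia.org/w/api.php"):
        # Standard MediaWiki query API: ask for the latest revision's content.
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        })
        with urllib.request.urlopen(endpoint + "?" + params) as response:
            data = json.load(response)
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]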
What is very important is that we clearly separate the different
layers that a framework consists of and get rid of functions like
replaceExceptInWhatever in the main module. In my opinion a proper
framework consists of three separate layers:
* High-level
* Middleware
* Low-ware or core
The core functionality should consist of methods to get and put raw
page data, such as Page.get, Page.categories, Site.recentchanges, etc.
The middleware consists of commonly used functions such as replaceIn,
replaceImage, replaceCategory. The high-level software is the bot
itself. It performs tasks by calling the functions of the middleware
and the core.
This separation must be such that one can use the low-ware part on its
own, without any dependency on the middleware and the high-level code.
Dependencies should only point top-down, never bottom-up. A real
separation would make the code much clearer and easier to maintain.
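A rough sketch of what I have in mind (all names here are made up for
illustration; api_get_text/api_put_text simply stand in for whatever
the core ends up calling):

    # --- core: raw page access only, no output, no i18n ---
    class Page:
        def __init__(self, site, title):
            self.site = site
            self.title = title

        def get(self):
            return self.site.api_get_text(self.title)      # raw wikitext

        def put(self, text, comment):
            self.site.api_put_text(self.title, text, comment)

    # --- middleware: common editing helpers, built only on the core ---
    def replace_in(page, old, new, comment):
        text = page.get()
        if old in text:
            page.put(text.replace(old, new), comment)

    # --- high level: the bot itself, calling middleware and core ---
    def run_bot(site, titles):
        for title in titles:
            replace_in(Page(site, title), "[[Foo]]", "[[Bar]]",
                       "Foo was renamed to Bar")

Note that nothing in the core knows about the bot, and nothing in the
middleware prints anything; the dependencies only point downwards.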
Related to this is the question of i18n. I strongly believe that
low-ware should never output things to stdout or ask something from
stdin. Communication to higher layers should happen through the use of
return values and exceptions. I18n is part of middleware, or specific
to a bot, but never the task of the core.
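As a small sketch of that principle (names invented for the example):
the core signals the problem with an exception, and the bot decides
how, and in which language, to tell the user.

    class IsRedirectPage(Exception):
        """Raised by the core; no message is printed at this level."""

    def core_get_text(raw_text, title):
        # Core layer: report problems via exceptions, never via stdout.
        if raw_text.lower().startswith("#redirect"):
            raise IsRedirectPage(title)
        return raw_text

    def bot_get_text(raw_text, title, messages):
        # Bot layer: user-facing, translated output lives here.
        try:
            return core_get_text(raw_text, title)
        except IsRedirectPage as err:
            print(messages["redirect-skipped"] % err.args[0])
            return None

    messages_en = {"redirect-skipped": "Skipping %s: it is a redirect."}
    bot_get_text("#REDIRECT [[Bar]]", "Foo", messages_en)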
The core itself can also be divided into sublayers:
* Python equivalents of the functions that the API provides
* Abstract Page/List objects
* Generic API function
* Generic communication layer
The first item consists of the functions that are used from the
outside, functions such as Site.recentchanges() and Page.put().
The API has several list/generator functions which behave the same
way. An abstract parent class would prevent duplicating code.
The generic API function translates a function call to an appropriate
HTTP request.
The comm layer initiates and handles the connection to the server in
an (optionally) persistent fashion.
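Again purely as an illustration of the four sublayers (the endpoint,
function and parameter names are placeholders, not a proposal for the
actual module layout):

    import json
    import urllib.parse
    import urllib.request

    # 4. generic communication layer (could keep the connection alive)
    def http_get(url):
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    # 3. generic API function: turns keyword arguments into an HTTP request
    def api_request(endpoint, **params):
        params["format"] = "json"
        return http_get(endpoint + "?" + urllib.parse.urlencode(params))

    # 2. abstract list/generator behaviour shared by all query lists
    def api_list(endpoint, list_name, **params):
        params.update(action="query", list=list_name)
        data = api_request(endpoint, **params)
        for item in data["query"][list_name]:
            yield item

    # 1. the functions the outside world actually calls
    def recentchanges(endpoint, limit=10):
        return api_list(endpoint, "recentchanges", rclimit=limit)

A real implementation would of course also have to follow the
query-continue parameters the API returns, handle errors, and so on.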
What is important to consider is where the error correction is put.
Some errors are recoverable after retry. The lower in level such a
retry is placed, the less code duplication is required. However, retry
code placed too low may end up catching overly generic errors.
An example of this is slave lag and the Retry-After header. This is
probably something that should be caught in either the second or the
third layer. HTTP errors should probably be caught in the third or the
fourth layer. A user block should be detected in either the second or
the third layer and propagate through the low-ware and middleware to
the bot, which can optionally handle it or pass it on to the user.
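To make the Retry-After case concrete (the do_request callable and the
retry count are hypothetical; the point is only at which level the
retry could live, not the exact error handling):

    import time

    class UserBlocked(Exception):
        """Propagates up to the bot, which may handle it or show it."""

    def api_request_with_retry(do_request, max_retries=5):
        # do_request() performs one HTTP round trip and returns
        # (status, headers, data); it stands in for the lower comm layer.
        for _ in range(max_retries):
            status, headers, data = do_request()
            if status == 503 and "Retry-After" in headers:
                time.sleep(int(headers["Retry-After"]))   # recoverable: retry here
                continue
            if isinstance(data, dict) and data.get("error", {}).get("code") == "blocked":
                raise UserBlocked(data["error"].get("info", ""))  # let it propagate
            return data
        raise RuntimeError("giving up after %d retries" % max_retries)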
A thing to consider is how foolproof the framework is supposed to be.
Pywikipediabot is used by many different users, from advanced users to
absolute beginners. Beginners probably want the framework to catch
many common exceptions and act on their behalf, while advanced users
want to keep things under their own control.
Also, the use of persistent HTTP connections makes the framework less
foolproof. Persistent HTTP connections make an object that uses them
automatically unsuitable for sharing between threads. Of course one
should always use proper locking when sharing objects between threads,
but we all know that that does not always happen.
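For example (SharedConnection is an invented name; the point is only
where the lock has to live):

    import http.client
    import threading

    class SharedConnection:
        def __init__(self, host):
            self._conn = http.client.HTTPSConnection(host)  # persistent connection
            self._lock = threading.Lock()

        def get(self, path):
            with self._lock:              # serialise access between threads
                self._conn.request("GET", path)
                return self._conn.getresponse().read()

Forgetting that lock is exactly the kind of mistake a beginner will
make, so the framework should probably take care of it for them.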
So far my thoughts. Thank you for reading; it's probably a little
bit messy and unstructured, but I put it down as it entered my head.
Cheers,
Bryan