New subject: [Pywikipedia-l] Rewrite thoughts (includes replies to the 4507-thread)

6 Nov 2007

      Hello all,
@Misza:
...
Ok now. As I am a complete lamer when it comes to team coding, coding
standards etc., I'll take your words for it, but have nonetheless
several
...
lame questions.
Your comments are very valuable to me, actually. And lame questions don't
exist ;)
@Misza
...
If team coding is a democracy then I vote for
camelCaseWithFirstWordLowercase.
It's a strong standard in Java (which is a close cousin to Python) and
is
...
frequent in Python too (take Twisted for an example).
@Russell
...
Misza will be disappointed; PEP 8 says to use lowercase with underscores
for function and method names.  ;)
I have to say I like getInfoFromBlah better than get_info_from_blah..
maybe it's just because the underscore is relatively hard to type ;). For
classes we should keep the CamelCaseWithFirstWordUppercase style mentioned
in the PEP.
@Misza:
...
Fine. I suggest the most common 1 indentation level = 4 spaces.
I like 4 spaces, even though I know some people like true tabs better ;)
...
...

UTF8 encoding for all files (and using u'' for all strings), with

UTF8
...
...
BOM
I'll assume gvim does that for me. o_O
If you tell it to, I think it will. The BOM is just three characters in
front of the text that define the text is in UTF8 format. (BOM stands for
Byte Order Marker, which is necessary for UTF-16/32, where the byte order
read depends on the system)
@Misza
...
...
One module per class
Java style again? ;)
Not directly inspired by Java, but apparently Java does it this way, too :)
...
A module corresponds to one .py file (like http.py or api.py from our
example) and a folder of these (data/) is a package, not module? Despite
of this:
...
...
...
...
import data
data
<module 'data' from 'data/__init__.py'>
Technically, that's the module 'data/__init__.py' of the package data, I
suppose :)
@Russell
...
...
One module per class
I'd make an exception to this one to allow including subclasses in the
same module, such as Page, ImagePage, and Category.
@Rob
...
One module per class is a bit too much of a javaism for me too. It is
not really efficient since it requires quite a lot of "import"s in the
code, and actually finding a class becomes quite expensive!
On this point, I agree with Russell. The current layout is Site and Page
objects in wikipedia.py and Category in catlib.py. Having all subclasses
of Page in Page.py (or something like that) sounds right to me.
@Misza:
...
...
Because adding structure afterwards is much harder, I think we first
should decide on what modules/classes we want, then defining what
functions we want and what these functions should do. After that, we
can
...
...
start writing unit tests and functions.
Where do we start? Edit http://www.botwiki.sno.cc/wiki/Rewrite and add
classes until someone says enough?
Well, something like that, yes. We do not need to have all functions
defined in complete detail to start at the framework, but we do need to
have a general setup for modules/packages and how they will interconnect.
Bryan has some very good points about this, which I will address further
on in this mail.
...
...
This is not a very important point, but it's kinda interesting. With
the
...
...
new framework, it's easier to restructure existing functions, making
translations easier.
...
Ok now, can you elaborate on this? I don't think I'm getting the point -
we already same i18n - bots have localized edit summaries, the framework
knows #REDIRECT and namespace names locales as well.
...
Are you talking about some whole new level of i18n?
Interface i18n: 'Wilt U deze wijziging doorvoeren (J/N)' instead of 'Do
you want to save this edit (Y/N)' ;)
@Rob:
...
...
On unit testing -- it may be difficult to write unit tests for methods
that access the wiki, because the value returned will depend on the
contents of the wiki at any given time.  Maybe if we have a dedicated
test wiki with at least some pages that are locked, so that they give
predictable values, that would be a way around the problem.
...
You are talking about testing, not unit testing. For unit testing you
will not need a sensibly filled wiki. The code that needs sensible data
will be tested separately from the wiki-fetching code. Each piece of
separable functionality should be tested separately with true unit
tests.
Let me check if I understand unit testing then: the only functions that
will need unit testing with a 'live test wiki' would be the functions that
either put or read from the wiki, correct? Any other functions that do
processing should just use a test text; i.e. externalLinks() should read
from a test text saved and return the links therein?
@Bryan:
...
What is very important is that we clearly separate the different
layers that a framework consists of and get rid of functions like
replaceExceptInWhatever in the main module. In my opinion a proper
framework consists of three separate layers:

High-level
Middleware
Lowware or core

Sounds like a good starting point for adding structure.
...
The core functionality should consists of methods to get and put raw
page data, such as Page.get, Page.categories, Site.recentchanges, etc.
The middleware consists of commonly used functions such as replaceIn,
replaceImage, replaceCategory. The high-level software is the bot
itself. It performs tasks by calling the functions of the middle ware
and core.
I'm not sure whether bots should be part of the *framework* in the first
place. In my opinion, the framework should be installable through
easy-install for example- in a central location in any case. This means
that the framework resides in /usr/lib/python2.5/site-packages/ while the
bots reside in /home/valhallasw/bots (or sth).
...
Related to this, the question of i18n. I strongly believe that
low-ware should never output things to stdout or ask something from
stdin.
Yes and no. I think there should be a core module that handles the output
in a structured manner, but it should not be used in other core modules.
I.e. core module 'comm layer' should not use core module 'stdout'. I do
agree that i18n is a higher level function than the pure output, and hence
should be 'middleware'.
...
The core itself can also be divided into sublayers:

Python equivalents of the functions that the API provides
Abstract Page/List objects
Generic API function
Generic communication layer

(...)
I think this is pretty much what we indeed need. I was also thinking if it
would not be easier to just generate Page objects from an API query used
in the python source. For example:
http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Mai...
would generate a Page 'Main Page' with only the linkedpages loaded, while
http://en.wikipedia.org/w/api.php?action=query&generator=templates&t...
would generate a list of Page objects that would be very suitable for
interwiki.py
this does mean we will need a more sophisticated way of finding what data
is already loaded and what data is not. Maybe we should use __getattr__ to
more logically shape the class? Some nice warnings about non-preloaded
data being used would be neat, too :)
...
What is important to consider is where the error correction is put.
Some errors are recoverable after retry. The lower in level such a
retry is placed, the less code duplication is required. However, retry
code placed to low may cause to catch to generic errors.
'make sure you only catch errors you expect' :). In general, we should
only catch errors that the user does not need to hear about. This means
that even retry-after errors might need to traverse to the upper layer so
the user at least gets a message 'Server load too high, retrying in ...'.
If there is a way to get this information back to the user without
backtracking completely, I'd like to hear :)
...
Also the use of persistent HTTP connections makes the framework less
fool proof. Persistent HTTP connections makes an object that uses them
automatically unsuitable for sharing between threads.
Is that so? A quick sketchup:
page.get():
  lock = threading.Lock() #thread safe lock
  lock.acquire()
  page_get_queue.append((self, query, lock))
  lock.acquire(1)
with the page getter releasing the lock. But I am probably ignoring some
important thread safety now :)
Then, last but not least:
@Russell:
...
This raises an important question.  Do we want to continue trying to
support every MediaWiki installation, including those that haven't
upgraded to more recent versions of MW?  There are a number of wikis
in the families/ directory now that don't have any API support at all.
PHP4 support will be dropped in december. All sites should update to PHP5
and then should upgrade to mediawiki 1.10+.
And as Bryan described:
...
Branching is the key. The trunk should always be in sync with the
trunk of MediaWiki. Additionally each time a MediaWiki version is
released, the trunk should be forked into a compatible branch.
I think we should start working with the current SVN release (or rather
the version used on wikimedia sites), and branch when a new stable
mediawiki version is released. Bug fixes can be made, but new features
will only be in trunk, and we don't have to guarantee the trunk is
compatible with older versions of mediawiki.
...
Personally, I only use my bots on Wikimedia Foundation sites, so it's not
an issue for me.  But it may be for others.  If we decide to go API-only,
then users of the non-current MediaWiki installations will have to use the
"old" pywikipediabot.
So be it. The old framework works, it's just not as efficient and neat as
the new one will be ;)
-- Merlijn

Re: [Pywikipedia-l] Rewrite thoughts (includes replies to the 4507-thread)