Hello all,
@Misza:
Ok now. As I am a complete lamer when it comes to team coding, coding
standards etc., I'll take your words for it, but have nonetheless several
lame questions.
Your comments are very valuable to me, actually. And lame questions don't exist ;)
@Misza
If team coding is a democracy then I vote for camelCaseWithFirstWordLowercase. It's a strong standard in Java (which is a close cousin to Python) and is frequent in Python too (take Twisted for an example).
@Russell
Misza will be disappointed; PEP 8 says to use lowercase with underscores
for function and method names. ;) I have to say I like getInfoFromBlah better than get_info_from_blah.. maybe it's just because the underscore is relatively hard to type ;). For classes we should keep the CamelCaseWithFirstWordUppercase style mentioned in the PEP.
@Misza:
Fine. I suggest the most common 1 indentation level = 4 spaces.
I like 4 spaces, even though I know some people like true tabs better ;)
- UTF8 encoding for all files (and using u'' for all strings), with UTF8 BOM
I'll assume gvim does that for me. o_O
If you tell it to, I think it will. The BOM is just three bytes in front of the text that indicate the text is in UTF-8 format. (BOM stands for Byte Order Mark, which is necessary for UTF-16/32, where the byte order depends on the system.)
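To make the BOM concrete, here is a minimal Python sketch using only the standard `codecs` module. It shows the three bytes and how the `utf-8-sig` codec strips them on decoding. The sample text is just an illustration, not taken from any real file.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Hypothetical file contents: BOM followed by UTF-8 encoded text.
raw = bom + "Wilt U deze wijziging doorvoeren".encode("utf-8")

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character,
# while 'utf-8-sig' removes it.
with_bom = raw.decode("utf-8")
without_bom = raw.decode("utf-8-sig")

print(len(bom), repr(with_bom[0]), without_bom[:4])
```

Whether a given tool expects, tolerates, or chokes on those three bytes is a separate question that the sketch does not answer.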
@Misza
One module per class
Java style again? ;)
Not directly inspired by Java, but apparently Java does it this way, too :)
A module corresponds to one .py file (like http.py or api.py from our example) and a folder of these (data/) is a package, not a module? Despite this:
>>> import data
>>> data
<module 'data' from 'data/__init__.py'>
Technically, that's the module 'data/__init__.py' of the package data, I suppose :)
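The module/package distinction can be checked directly. The sketch below builds a throwaway package in a temporary directory (the names `data_pkg_demo` and `http_demo` are made up for the example) and inspects what Python reports after importing it.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package 'data_pkg_demo' with one submodule 'http_demo'.
# Both names are made up so they do not clash with real modules.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "data_pkg_demo")
os.mkdir(pkg_dir)
for name in ("__init__.py", "http_demo.py"):
    with open(os.path.join(pkg_dir, name), "w") as handle:
        handle.write("# empty demo file\n")

sys.path.insert(0, root)
importlib.invalidate_caches()
package = importlib.import_module("data_pkg_demo")
submodule = importlib.import_module("data_pkg_demo.http_demo")

# A package is a module object loaded from __init__.py that also has __path__.
print(type(package).__name__, os.path.basename(package.__file__))
print(hasattr(package, "__path__"), hasattr(submodule, "__path__"))
```

So a package shows up as a module object too; what sets it apart is the `__path__` attribute, which a plain .py module lacks.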
@Russell
One module per class
I'd make an exception to this one to allow including subclasses in the
same module, such as Page, ImagePage, and Category.
@Rob
One module per class is a bit too much of a javaism for me too. It is
not really efficient since it requires quite a lot of "import"s in the code, and actually finding a class becomes quite expensive!
On this point, I agree with Russell. The current layout is Site and Page objects in wikipedia.py and Category in catlib.py. Having all subclasses of Page in Page.py (or something like that) sounds right to me.
@Misza:
Because adding structure afterwards is much harder, I think we first
should decide on what modules/classes we want, then define what functions we want and what these functions should do. After that, we can
start writing unit tests and functions.
Where do we start? Edit http://www.botwiki.sno.cc/wiki/Rewrite and add
classes until someone says enough?
Well, something like that, yes. We do not need to have all functions defined in complete detail to start on the framework, but we do need to have a general setup for modules/packages and how they will interconnect. Bryan has some very good points about this, which I will address further on in this mail.
This is not a very important point, but it's kinda interesting. With the new framework, it's easier to restructure existing functions, making translations easier.
Ok now, can you elaborate on this? I don't think I'm getting the point -
we already have some i18n - bots have localized edit summaries, and the framework knows the localized #REDIRECT and namespace names as well.
Are you talking about some whole new level of i18n?
Interface i18n: 'Wilt U deze wijziging doorvoeren (J/N)' instead of 'Do you want to save this edit (Y/N)' ;)
@Rob:
On unit testing -- it may be difficult to write unit tests for methods
that access the wiki, because the value returned will depend on the contents of the wiki at any given time. Maybe if we have a dedicated test wiki with at least some pages that are locked, so that they give predictable values, that would be a way around the problem.
You are talking about testing, not unit testing. For unit testing you
will not need a sensibly filled wiki. The code that needs sensible data will be tested separately from the wiki-fetching code. Each piece of separable functionality should be tested separately with true unit tests.
Let me check if I understand unit testing then: the only functions that will need unit testing with a 'live test wiki' would be the functions that either put or read from the wiki, correct? Any other functions that do processing should just use a test text; i.e. externalLinks() should read from a test text saved and return the links therein?
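A unit test against a saved test text could look roughly like the sketch below. `external_links()` and its regular expression are hypothetical stand-ins for externalLinks(), not the framework's actual implementation; the point is only that the function takes text and needs no wiki.

```python
import re
import unittest

# Hypothetical stand-in for externalLinks(): it works on plain text only,
# so it can be tested without touching a wiki.
URL_PATTERN = re.compile(r"https?://[^\s\]\|<>]+")

def external_links(text):
    """Return all http(s) URLs found in the given wikitext."""
    return URL_PATTERN.findall(text)

# Saved test text with known contents.
TEST_TEXT = (
    "See [http://example.org/a first] and http://example.org/b.html\n"
    "[[Internal link]] is not external."
)

class ExternalLinksTest(unittest.TestCase):
    def test_finds_links_in_saved_text(self):
        self.assertEqual(
            external_links(TEST_TEXT),
            ["http://example.org/a", "http://example.org/b.html"],
        )

    def test_text_without_links(self):
        self.assertEqual(external_links("[[Only internal]]"), [])

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExternalLinksTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```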
@Bryan:
What is very important is that we clearly separate the different layers that a framework consists of and get rid of functions like replaceExceptInWhatever in the main module. In my opinion a proper framework consists of three separate layers:
- High-level
- Middleware
- Lowware or core
Sounds like a good starting point for adding structure.
The core functionality should consist of methods to get and put raw page data, such as Page.get, Page.categories, Site.recentchanges, etc. The middleware consists of commonly used functions such as replaceIn, replaceImage, replaceCategory. The high-level software is the bot itself. It performs tasks by calling the functions of the middleware and core.
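As a rough illustration of the three layers, here is a minimal sketch. `FakeSite`, `replace_category` and `category_move_bot` are invented for the example, and the in-memory site only exists so the code runs offline; a real core would talk to the wiki instead.

```python
# Core: raw page access only. FakeSite is a hypothetical in-memory stand-in
# for a real wiki so the sketch runs offline.
class FakeSite:
    def __init__(self, pages):
        self.pages = dict(pages)

class Page:
    def __init__(self, site, title):
        self.site = site
        self.title = title

    def get(self):
        return self.site.pages[self.title]

    def put(self, text):
        self.site.pages[self.title] = text

# Middleware: commonly used text operations built on plain strings.
def replace_category(text, old, new):
    return text.replace("[[Category:%s]]" % old, "[[Category:%s]]" % new)

# High level: the bot, which only combines core and middleware calls.
def category_move_bot(site, titles, old, new):
    changed = []
    for title in titles:
        page = Page(site, title)
        text = page.get()
        new_text = replace_category(text, old, new)
        if new_text != text:
            page.put(new_text)
            changed.append(title)
    return changed

site = FakeSite({"A": "Text [[Category:Old]]", "B": "No category here"})
changed = category_move_bot(site, ["A", "B"], "Old", "New")
print(changed, site.pages["A"])
```

Because the middleware function works on plain strings, it can be tested without any Page or Site object at all.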
I'm not sure whether bots should be part of the *framework* in the first place. In my opinion, the framework should be installable through easy_install, for example, and in any case live in a central location. This means that the framework resides in /usr/lib/python2.5/site-packages/ while the bots reside in /home/valhallasw/bots (or something similar).
Related to this, the question of i18n. I strongly believe that low-ware should never output things to stdout or ask something from stdin.
Yes and no. I think there should be a core module that handles the output in a structured manner, but it should not be used in other core modules. I.e. core module 'comm layer' should not use core module 'stdout'. I do agree that i18n is a higher level function than the pure output, and hence should be 'middleware'.
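One possible reading of this split, as a sketch: a small core output class is the only code that writes to a stream, and interface i18n sits on top of it as middleware. The `Output` class, the `MESSAGES` table and the `save-edit` key are all made up for the example.

```python
import io
import sys

# Core output module: the only place that writes to a stream.
class Output:
    def __init__(self, stream=None):
        self.stream = stream if stream is not None else sys.stdout

    def write(self, text):
        self.stream.write(text + "\n")

# Middleware: interface i18n. The message table is a made-up example.
MESSAGES = {
    "save-edit": {
        "en": "Do you want to save this edit (Y/N)",
        "nl": "Wilt U deze wijziging doorvoeren (J/N)",
    },
}

def translate(key, lang, fallback="en"):
    """Look up a message, falling back to English for unknown languages."""
    entry = MESSAGES[key]
    return entry.get(lang, entry[fallback])

# High level: the bot decides what to show and in which language.
buffer = io.StringIO()
out = Output(buffer)
out.write(translate("save-edit", "nl"))
out.write(translate("save-edit", "fr"))  # no French entry: falls back to English
```

Passing the stream in makes the output layer easy to capture in tests, and other core modules never need to import it.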
The core itself can also be divided into sublayers:
- Python equivalents of the functions that the API provides
- Abstract Page/List objects
- Generic API function
- Generic communication layer
(...)
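The lower sublayers could stack roughly as in the sketch below. Everything here is an assumption made for illustration: the `RecordingTransport` fake, the `api_request` and `page_links` names, and the shape of the canned response (modelled on a `prop=links` query result). A real communication layer would perform the HTTP request and decode the reply.

```python
from urllib.parse import urlencode

# Generic communication layer: anything callable that takes a URL and
# returns a decoded response. This fake one just records the URL.
class RecordingTransport:
    def __init__(self, canned_response):
        self.canned_response = canned_response
        self.requested = []

    def __call__(self, url):
        self.requested.append(url)
        return self.canned_response

# Generic API function: turns keyword arguments into an api.php request.
def api_request(transport, base_url, **params):
    query = urlencode(sorted(params.items()))
    return transport(base_url + "?" + query)

# Python equivalent of one API feature, built on the generic function.
def page_links(transport, base_url, title):
    data = api_request(transport, base_url, action="query", prop="links",
                       titles=title, format="json")
    pages = data["query"]["pages"].values()
    return [link["title"] for page in pages for link in page.get("links", [])]

# Assumed response shape, used here only as canned test data.
canned = {"query": {"pages": {"1": {"title": "Main Page",
                                    "links": [{"ns": 0, "title": "Help"}]}}}}
transport = RecordingTransport(canned)
links = page_links(transport, "http://example.org/w/api.php", "Main Page")
print(links, transport.requested[0])
```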
I think this is pretty much what we indeed need. I was also thinking if it would not be easier to just generate Page objects from an API query used in the python source. For example:
http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Mai... would generate a Page 'Main Page' with only the linkedpages loaded, while
http://en.wikipedia.org/w/api.php?action=query&generator=templates&t... would generate a list of Page objects that would be very suitable for interwiki.py
This does mean we will need a more sophisticated way of finding what data is already loaded and what data is not. Maybe we should use __getattr__ to more logically shape the class? Some nice warnings about non-preloaded data being used would be neat, too :)
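A minimal sketch of the __getattr__ idea, assuming a hypothetical `loader` callable that fetches one property of one page. The `LOADABLE` tuple and the constructor signature are invented for the example; preloaded data is served silently, anything else triggers a warning and a fetch.

```python
import warnings

class Page:
    """Page that knows which properties were preloaded by a query."""

    # Hypothetical set of properties that can be loaded on demand.
    LOADABLE = ("links", "templates", "categories")

    def __init__(self, title, loader, **preloaded):
        self.title = title
        self._loader = loader      # callable(title, prop) -> data
        self._data = dict(preloaded)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_") or name not in self.LOADABLE:
            raise AttributeError(name)
        if name not in self._data:
            warnings.warn("%s of %r was not preloaded; fetching now"
                          % (name, self.title), stacklevel=2)
            self._data[name] = self._loader(self.title, name)
        return self._data[name]

calls = []

def fake_loader(title, prop):
    # Stand-in for a real API request.
    calls.append((title, prop))
    return ["Template:Stub"]

page = Page("Main Page", fake_loader, links=["Help"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    links = page.links          # preloaded: no warning, no fetch
    templates = page.templates  # not preloaded: warning plus one fetch
    again = page.templates      # cached after the first fetch
```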
What is important to consider is where the error correction is put. Some errors are recoverable after retry. The lower in level such a retry is placed, the less code duplication is required. However, retry code placed too low may end up catching errors that are too generic.
'make sure you only catch errors you expect' :). In general, we should only catch errors that the user does not need to hear about. This means that even retry-after errors might need to traverse to the upper layer so the user at least gets a message 'Server load too high, retrying in ...'. If there is a way to get this information back to the user without backtracking completely, I'd like to hear :)
Also the use of persistent HTTP connections makes the framework less foolproof. Persistent HTTP connections make an object that uses them automatically unsuitable for sharing between threads.
Is that so? A quick sketchup:
    page.get():
        lock = threading.Lock()  # thread safe lock
        lock.acquire()
        page_get_queue.append((self, query, lock))
        lock.acquire(1)
with the page getter releasing the lock. But I am probably ignoring some important thread safety now :)
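A runnable variant of this idea, with two substitutions that should be stated plainly: it uses `queue.Queue` instead of a plain list and a `threading.Event` instead of the double-acquired lock. `fake_fetch` is a stand-in for a request over the persistent connection, which only the single getter thread ever touches.

```python
import queue
import threading

# One worker thread owns the (hypothetical) persistent connection; other
# threads hand it requests through a queue and wait for the answer.
request_queue = queue.Queue()

def fake_fetch(title):
    # Stand-in for a request over the persistent HTTP connection.
    return "text of %s" % title

def page_getter():
    while True:
        item = request_queue.get()
        if item is None:          # sentinel: shut down the worker
            break
        title, result, done = item
        result["text"] = fake_fetch(title)
        done.set()                # plays the role of releasing the lock

def get_page(title):
    result, done = {}, threading.Event()
    request_queue.put((title, result, done))
    done.wait()                   # block until the getter has answered
    return result["text"]

worker = threading.Thread(target=page_getter)
worker.start()

texts = {}

def caller(title):
    texts[title] = get_page(title)

callers = [threading.Thread(target=caller, args=(t,)) for t in ("A", "B", "C")]
for thread in callers:
    thread.start()
for thread in callers:
    thread.join()

request_queue.put(None)
worker.join()
print(sorted(texts.items()))
```

This does not address error handling in the getter thread; an exception there would leave callers waiting, so a real version would need to pass failures back as well.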
Then, last but not least: @Russell:
This raises an important question. Do we want to continue trying to support every MediaWiki installation, including those that haven't upgraded to more recent versions of MW? There are a number of wikis in the families/ directory now that don't have any API support at all.
PHP4 support will be dropped in December. All sites should update to PHP5 and then should upgrade to MediaWiki 1.10+. And as Bryan described:
Branching is the key. The trunk should always be in sync with the trunk of MediaWiki. Additionally each time a MediaWiki version is released, the trunk should be forked into a compatible branch.
I think we should start working with the current SVN release (or rather the version used on Wikimedia sites), and branch when a new stable MediaWiki version is released. Bug fixes can be made, but new features will only be in trunk, and we don't have to guarantee the trunk is compatible with older versions of MediaWiki.
Personally, I only use my bots on Wikimedia Foundation sites, so it's not an issue for me. But it may be for others. If we decide to go API-only, then users of the non-current MediaWiki installations will have to use the "old" pywikipediabot.
So be it. The old framework works, it's just not as efficient and neat as the new one will be ;)
-- Merlijn
Indeed. See also: http://en.wikipedia.org/wiki/Unit_testing
"Ideally, each test case is independent from the others; mock objects and test harnesses can be used to assist testing a module in isolation."
Rob
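To illustrate the mock-object approach from the quote, here is a small sketch using the standard `unittest.mock` module. The `Page` class and `is_redirect()` function are hypothetical; `Page.get()` deliberately raises so that any accidental wiki access in a unit test is obvious.

```python
import unittest
from unittest import mock

class Page:
    """Hypothetical page class whose get() would normally hit the wiki."""

    def __init__(self, title):
        self.title = title

    def get(self):
        raise RuntimeError("network access is not allowed in unit tests")

def is_redirect(page):
    """Processing code under test: needs page text, not a live wiki."""
    return page.get().lstrip().upper().startswith("#REDIRECT")

class IsRedirectTest(unittest.TestCase):
    def test_redirect_page(self):
        page = Page("Old title")
        with mock.patch.object(Page, "get", return_value="#REDIRECT [[New]]"):
            self.assertTrue(is_redirect(page))

    def test_normal_page(self):
        page = Page("Article")
        with mock.patch.object(Page, "get", return_value="Plain text"):
            self.assertFalse(is_redirect(page))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsRedirectTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```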
2007/11/6, Merlijn van Deen valhallasw@arctus.nl:
Let me check if I understand unit testing then: the only functions that will need unit testing with a 'live test wiki' would be the functions that either put or read from the wiki, correct? Any other functions that do processing should just use a test text; i.e. externalLinks() should read from a test text saved and return the links therein?
On Nov 6, 2007 4:41 PM, Merlijn van Deen valhallasw@arctus.nl wrote
What is important to consider is where the error correction is put. Some errors are recoverable after retry. The lower in level such a retry is placed, the less code duplication is required. However, retry code placed too low may end up catching errors that are too generic.
'make sure you only catch errors you expect' :). In general, we should only catch errors that the user does not need to hear about. This means that even retry-after errors might need to traverse to the upper layer so the user at least gets a message 'Server load too high, retrying in ...'. If there is a way to get this information back to the user without backtracking completely, I'd like to hear :)
Wait callbacks:
    def get_wait_token(self):
        token = hex(random.randint(0, sys.maxint))
        self.wait_tokens[token] = 0
        return token

    def wait(self, token, reason):
        self.wait_callback(self.wait_tokens[token], reason)
        time.sleep(self.wait_tokens[token] * self.timeout)
        self.wait_tokens[token] += 1
    ...
    # print is a statement in Python 2, so it cannot be used inside a lambda
    def print_retry(retry, reason):
        print 'Retry %s: %s' % (retry, reason)
    site.wait_callback = print_retry
    ...
    token = site.get_wait_token()
    if 'Retry-After' in headers:
        self.wait(token, 'server lag')
        try_again()
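For anyone who wants to try the wait-callback idea, here is a self-contained variant written for Python 3. It is a sketch under assumptions, not the framework's code: `ServerLagError`, `get_with_retry` and the injectable `sleep` parameter are additions for the example, and only the one expected error is caught so that everything else still propagates.

```python
import random
import sys
import time

class Site:
    def __init__(self, timeout=5, sleep=time.sleep):
        self.timeout = timeout
        self.sleep = sleep                       # injectable for tests
        self.wait_tokens = {}
        self.wait_callback = lambda retry, reason: None

    def get_wait_token(self):
        token = hex(random.randint(0, sys.maxsize))
        self.wait_tokens[token] = 0
        return token

    def wait(self, token, reason):
        retry = self.wait_tokens[token]
        self.wait_callback(retry, reason)        # tell the upper layer
        self.sleep(retry * self.timeout)         # back off longer each time
        self.wait_tokens[token] = retry + 1

class ServerLagError(Exception):
    """Hypothetical error raised when the server sends Retry-After."""

def get_with_retry(site, fetch, max_retries=3):
    token = site.get_wait_token()
    for _ in range(max_retries):
        try:
            return fetch()
        except ServerLagError:                   # only the error we expect
            site.wait(token, "server lag")
    return fetch()                               # final try: errors propagate

messages = []
naps = []
site = Site(timeout=5, sleep=naps.append)        # record sleeps instead of waiting
site.wait_callback = lambda retry, reason: messages.append(
    "Retry %s: %s" % (retry, reason))

state = {"calls": 0}

def flaky_fetch():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise ServerLagError()
    return "page text"

text = get_with_retry(site, flaky_fetch)
print(text, messages)
```

The callback gives the bot a chance to print 'Server load too high, retrying in ...' while the retry itself stays in the lower layer.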
Foreword: This email was sent to Bryan previously (because GMail's reply sends the reply to the sender of the email being replied to, not to the mailing list). I'm sending it again, to let others read it too.
Someone spoke about democracy here?!
As a citizen (!) I vote for: camelCase for functions and variables and CamelCase for class names. (The difference between the two is obvious, isn't it?!)
About BOM, I hope every editor has a way to add it to the beginning of the file. (In MediaWiki code, when a BOM was added by Notepad, I had to remove it to make the code work correctly, and it was a pain in the ass; now, I have this pessimistic feeling about adding it for Pywikipedia.)
Moving to the API is great, but it has its limitations. For example, to my knowledge, it is impossible to get a list of more than 500 pages from the API, so perhaps we need to find a way to get a bigger list (for example for autonomous tasks).
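One way around a per-request limit is to fetch in batches and keep asking until nothing is left. The sketch below assumes a hypothetical `fetch_batch(limit, start)` callable that returns a batch plus a marker for where the next batch starts; whether and how the API exposes such a marker is not established in this thread, so the backend here is a fake.

```python
def iter_all_pages(fetch_batch, limit=500):
    """Yield titles from successive batches until no continuation is left.

    fetch_batch(limit, start) is a hypothetical callable returning
    (titles, next_start), where next_start is None after the last batch.
    """
    start = None
    while True:
        titles, start = fetch_batch(limit, start)
        for title in titles:
            yield title
        if start is None:
            break

# Fake backend with 1200 page titles, served at most `limit` at a time.
ALL_TITLES = ["Page %04d" % number for number in range(1200)]

def fake_fetch_batch(limit, start):
    begin = 0 if start is None else start
    end = begin + limit
    next_start = end if end < len(ALL_TITLES) else None
    return ALL_TITLES[begin:end], next_start

titles = list(iter_all_pages(fake_fetch_batch))
print(len(titles))
```

Written as a generator, the caller can stop early without fetching the remaining batches.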
About the backward compatibility issue some people raised previously, I think it is a good idea to start tagging versions of Pywikipediabot. For example, version 1.0 works with HTML parsing, version 1.1 implements API only and works with MW 1.11 and above, version 1.5 works with API only and MW 1.12 and above, and so on.
I strongly support the idea of separating the framework from the bots. With that in mind, I think we should use a different approach to i18n and l10n. Consequently, I'm making no changes to the /pywikipedia/messages branch until we know how the new i18n system is going to work.
Finally, as many of you may know or guess, I'm not a professional programmer, so my comments may look a little silly, or worded in a way you don't usually expect. So excuse me about that.
Hojjat (aka Huji)
Huji wrote:
About BOM, I hope every editor has a way to add it to the beginning of the file. (In MediaWiki code, when a BOM was added by Notepad, I had to remove it to make the code work correctly, and it was a pain in the ass; now, I have this pessimistic feeling about adding it for Pywikipedia.)
UTF-8 files should not contain a BOM. According to the Unicode BOM FAQ[1], "UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts." In Python, the encoding is specified by an explicit "# -*- coding" line; not only is there no need for a BOM, but having one there screws up Python's interpretation of the file.
[1] http://unicode.org/faq/utf_bom.html#25
Russ
pywikipedia-l@lists.wikimedia.org