[Pywikipedia-l] Regex help needed

Bináris wikiposta at gmail.com
Mon Jun 27 08:55:23 UTC 2011


Hi! Please help me.

Hungarian dates are in the form yyyy. mm. dd., or yyyy. <monthname> dd.,
without leading zeros.
In running text we use month names, so I replace numeric months with month
names, and I remove leading zeros from day numbers.
The line in fixes.py for January is:
            (ur'(\d{1,4}(?:\]\])?)\. ?01\. ?(\d\d?)', ur'\1. január \2'),
This is OK, no problem up to this point.
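
For anyone who wants to try the rule outside the bot, here is a minimal
sketch. It assumes Python 3, so the Python 2 ur'' prefix becomes r''; the
names JANUARY and apply_january are made up for this example and are not
part of fixes.py.

```python
import re

# The January rule quoted above, as a (pattern, replacement) pair.
# Python 3 raw strings replace the Python 2 ur'' prefix here.
JANUARY = (r'(\d{1,4}(?:\]\])?)\. ?01\. ?(\d\d?)', r'\1. január \2')


def apply_january(text):
    """Apply the single January rule to text (hypothetical helper)."""
    pattern, replacement = JANUARY
    return re.sub(pattern, replacement, text)


print(apply_january('2011. 01. 15.'))   # numeric month with spaces
print(apply_january('2011.01.5-én'))    # no spaces, hyphen and suffix
```

Whatever follows the day number is left untouched by this rule, which is
why the dot question has to be handled separately.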

The rule is that the day number has to be followed by a dot, unless it is
followed by a hyphen and a suffix.
The first level of enhancement is to add a dot where necessary:

   - if there is a dot there, don't remove it in any case (a hyphen is often
   used erroneously, and I don't want to make a bigger problem)
   - if there is no dot, but the day is followed by a hyphen, don't put a
   dot
   - if there is anything but a dot or a hyphen after the day number, put a
   dot after the number
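
One possible way to express the three cases above is sketched below. It is
not a tested fixes.py entry. It assumes Python 3 and splits the job into two
(pattern, replacement) pairs, because a replacement string cannot branch:
the conditional (?(id/name)yes-pattern|no-pattern) only chooses between
sub-patterns inside the regex. The names YEAR_MONTH, JANUARY_RULES and
fix_january are hypothetical.

```python
import re

# Year, optional closing wiki-link brackets, and the numeric month "01".
YEAR_MONTH = r'(\d{1,4}(?:\]\])?)\. ?01\. ?'

JANUARY_RULES = [
    # Day followed by a dot (consumed and written back) or by anything
    # other than a hyphen: the output always ends in a dot.
    # (?!\d) stops the regex from cutting "15" down to "1" on backtracking.
    (YEAR_MONTH + r'(\d\d?)(?!\d)(?:\.|(?!-))', r'\1. január \2.'),
    # Day followed directly by a hyphen: convert the month, add no dot.
    (YEAR_MONTH + r'(\d\d?)(?=-)', r'\1. január \2'),
]


def fix_january(text):
    """Apply both January rules in order (hypothetical helper)."""
    for pattern, replacement in JANUARY_RULES:
        text = re.sub(pattern, replacement, text)
    return text


for sample in ('2011. 01. 15.', '2011. 01. 15-én',
               '2011. 01. 15 volt', '2011. 01. 15.-20.'):
    print(sample, '->', fix_january(sample))
```

An existing dot is consumed by the first rule and written back, so it is
never lost, even when a hyphen comes right after it.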

I experimented with the (?(id/name)yes-pattern|no-pattern) syntax (
http://docs.python.org/py3k/library/re.html), but without any useful result.
Can you help me? There will be further levels once this task is solved,
because users are very creative in making errors.
Further problems are:

   - A hyphen is often used instead of an en dash when writing an interval
   between two dates. In this case a dot and a space are required between the
   day number and the en dash. I don't want to correct this in this session
   or this fix if it is too difficult, but the dot should not be removed in
   this case (whether or not a space was written).
   - Sometimes a hyphen with no dot is correct, but there is an extra space
   that should be removed. This can be recognized by listing a limited set of
   suffixes after the hyphen in the regex.
   - Sometimes a word follows the day but the space is omitted and should be
   supplied.
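
The last two items could perhaps be handled by follow-up rules that run
after the month has been converted. This is a rough sketch only, assuming
Python 3. The suffix list is a small made-up sample and not the complete
set, and the names SUFFIXES, FOLLOW_UP_RULES and tidy_january are
hypothetical. The space is removed on either side of the hyphen, which is
also an assumption.

```python
import re

# Sample suffixes only (an assumption); the real list would be longer.
SUFFIXES = r'(?:án|én|jén|ig|től|tól)'

FOLLOW_UP_RULES = [
    # Stray space around the hyphen before a known suffix: "15 -én" -> "15-én".
    (r'(január \d\d?) ?- ?(' + SUFFIXES + r')\b', r'\1-\2'),
    # Dot directly followed by a letter: supply the missing space.
    # [^\W\d_] matches any Unicode letter in a Python 3 str pattern.
    (r'(január \d\d?\.)(?=[^\W\d_])', r'\1 '),
]


def tidy_january(text):
    """Apply the follow-up rules in order (hypothetical helper)."""
    for pattern, replacement in FOLLOW_UP_RULES:
        text = re.sub(pattern, replacement, text)
    return text


for sample in ('2011. január 15 -én', '2011. január 15.volt',
               '2011. január 15.-20.'):
    print(sample, '->', tidy_january(sample))
```

Neither rule touches a dot followed by a hyphen, so a date interval written
with a hyphen keeps its dot.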


-- 
Bináris