Hi! Please help me.
Hungarian dates are written as yyyy. mm. dd. or yyyy. <monthname> dd., without leading zeros. In running text we use the month names, so I replace numbered months with named months and remove leading zeros from day numbers. The line in fixes.py for January is:
(ur'(\d{1,4}(?:]])?). ?01. ?(\d\d?)', ur'\1. január \2'),
This works without problems so far.
The rule is that the day number has to be followed by a dot, unless it is followed by a hyphen and a suffix. The first level of enhancement is to add a dot where necessary:
- if there is already a dot, don't remove it in any case (a hyphen is often used erroneously, and I don't want to make the problem bigger)
- if there is no dot, but the day is followed by a hyphen, don't add a dot
- if there is anything but a dot or a hyphen after the day number, put a dot after the number
I made some experiments with the (?(id/name)yes-pattern|no-pattern) syntax (http://docs.python.org/py3k/library/re.html), but without any useful result. Can you help me? There will be further levels once this task is solved, because users are very creative in making errors. Further problems are:
- A hyphen is often used instead of an ndash when describing an interval between two dates. In this case a dot and a space are required between the day number and the ndash. I don't want to correct this in this session or this fix if it is too difficult, but the dot should not be removed in this case (whether they write a space or not).
- Sometimes a hyphen with no dot is correct, but there is an extra space that should be removed. This can be recognized by listing a limited set of suffixes after the hyphen in the regex.
- Sometimes there is a word after the day, but the space is omitted and should be supplied.
Bináris wikiposta@gmail.com wrote:
Hi! Please help me.
You might want to use two separate regexes: one that checks for a hyphen after the number and leaves the suffix intact, and another that checks for the absence of a hyphen ([^-], for example, though you might want to check for Unicode hyphens as well) and substitutes a dot.
It is important to check that those two regexes do not match the same strings.
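A minimal sketch of the two-regex idea, written for Python 3 rather than the ur'' literals used in fixes.py. The names KEEP, ADD_DOT and fix_january are made up for this illustration, the dots are escaped on the assumption that literal dots are meant, and only the ASCII hyphen is handled. The two lookaheads are what keep the patterns from matching the same string: one requires a dot or hyphen after the day, the other forbids a dot, a hyphen, or a further digit.

```python
import re

# Hypothetical names; this is not code from fixes.py.
# Day followed by a dot or a hyphen: name the month, leave what follows alone.
KEEP = re.compile(r'(\d{1,4})\. ?01\. ?0?(\d{1,2})(?=[.-])')

# Day followed by anything else (or by the end of the text): name the month
# and add the missing dot. \d is excluded so that backtracking cannot stop
# in the middle of a two-digit day.
ADD_DOT = re.compile(r'(\d{1,4})\. ?01\. ?0?(\d{1,2})(?![\d.-])')


def fix_january(text):
    """Apply both substitutions one after the other."""
    text = KEEP.sub(r'\1. január \2', text)
    return ADD_DOT.sub(r'\1. január \2.', text)
```

Because the first pattern never consumes the dot or hyphen it looks at, an existing dot or a hyphenated suffix is carried over unchanged.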
//Saper
Yes, this is definitely an option, I just would like to know if there is a way to unify them. I think increasing the number of lines in a fix will slow down the bot more than increasing the complexity of the existing lines, am I right?
2011/6/27 Marcin Cieslak saper@saper.info
You might want to use two separate regexes
Given the speed of fetching/storing pages I don't think that speed of the regular expression makes any difference. Running two compiled RE's one after the other in sequence on the page text should be very fast.
//Saper
OK, then I will make separate lines. The only issue is that any enhancement or correction will be more complicated this way (which is another reason to put as many features in one line as possible).
Thought I'd point out a couple of useful things I've come across when doing regex work (in Python, but also in other languages):
1: The re.VERBOSE flag. Lets you write your regular expressions using multiline strings (you'll have to escape whitespace, or use \s though), and also add comments. Makes it a lot easier to understand what you've been thinking when you come back to your code two months later to change it.
2: Using functions instead of strings as the replacement in sub(). If you're looking to do a fair amount of conditional logic in your replacement, it might be more easily written by having a function do it, rather than attempt to do it all with a regex.
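Both points can be sketched together, assuming Python 3 and the January case from the start of the thread. DATE and name_month are hypothetical names, and the pattern is a simplified variant of the fixes.py line with escaped dots, not the original. The function decides what follows the day: an existing dot or hyphen is kept, and a dot is supplied when neither is present.

```python
import re

# Hypothetical pattern for illustration; literal spaces are written as [ ]
# because re.VERBOSE ignores unescaped whitespace outside character classes.
DATE = re.compile(r"""
    (?P<year>\d{1,4})     # year
    \.[ ]?                # dot after the year, optional space
    01                    # January written as a number
    \.[ ]?                # dot after the month, optional space
    0?(?P<day>\d{1,2})    # day, dropping one leading zero
    (?!\d)                # do not stop inside a longer number
    (?P<tail>[.-]?)       # a dot or hyphen that is already there, if any
""", re.VERBOSE)


def name_month(match):
    """Replacement function: keep an existing dot or hyphen, else add a dot."""
    tail = match.group('tail') or '.'
    return '%s. január %s%s' % (match.group('year'), match.group('day'), tail)
```

It would be used as DATE.sub(name_month, text); the conditional logic lives in ordinary Python instead of in the pattern.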
My $.02.
Cheers, Morten
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
2011/6/28 Morten Wang nettrom@gmail.com
2: Using functions instead of strings as the replacement in sub(). If you're looking to do a fair amount of conditional logic in your replacement, it might be more easily written by having a function do it, rather than attempt to do it all with a regex.
Does the pair replace.py + fixes.py allow me to use my own functions in replacements? I don't want to lose the comfort of replace.py; it is well written.
As far as I can tell from looking at the code, the replacement is done by replaceExcept() in textlib.py, and the documentation there clearly states that you can use a function for replacing (there's a check if the replacement is a callable function, and if it is that function is called, thus working the same way as re.sub() does).
The documentation in replace.py states that both parts of the tuple for replacement must be a string, which appears to contradict textlib.py, but I think that might be a bug. While I haven't tested this, it appears to me that writing a function and referring to that should work. Of course you'd have to test it to be sure.
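The behaviour described above can be imitated with the standard re module alone. The sketch below is not pywikipedia code: apply_replacements is a hypothetical stand-in for the loop that walks through a fix, written for Python 3, and it only shows that a list of (old, new) tuples can mix template strings and functions because re's sub() accepts both.

```python
import re


def apply_replacements(text, replacements):
    """Hypothetical stand-in for a fix runner; not the replace.py code."""
    for old, new in replacements:
        if isinstance(old, str):
            old = re.compile(old)      # accept uncompiled patterns too
        # sub() takes a template string or a function of the match object
        text = old.sub(new, text)
    return text


def strip_leading_zero(match):
    """Rewrite the day number without a leading zero."""
    return match.group(1) + str(int(match.group(2)))


REPLACEMENTS = [
    # plain string replacement: name the month
    (r'(\d{1,4})\. ?01\. ?(?=\d)', r'\1. január '),
    # function replacement: clean up the day that follows
    (re.compile(r'(január )(\d{1,2})(?!\d)'), strip_leading_zero),
]
```

Whether replace.py passes a function through untouched still has to be tested against the real scripts, as noted above.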
Cheers, Morten
You have just opened a new horizon in front of me. I will do the test and then rewrite the documentation of replace.py if this works.
I found a documentation bug in textlib.py:
At the beginning of replaceExcept(), the line
old - a compiled regular expression
should be
old - a compiled or uncompiled regular expression
(replaceExcept will compile it if it has the type string or unicode.)
Thanks for that. Fixed in r9323.
Greetings xqt
----- Original Message ----- From: Bináris wikiposta@gmail.com To: Pywikipedia discussion list pywikipedia-l@lists.wikimedia.org Date: 30.06.2011 14:08 Subject: Re: [Pywikipedia-l] Regex help needed
I found a documentation bug in textlib.py:
At the beginning of replaceExcept(), the line
old - a compiled regular expression
should be
old - a compiled or uncompiled regular expression
(replaceExcept will compile it if it has the type string or unicode.)
-- Bináris
I made some very promising experiments with functions that I will soon publish.
In the first one, replacement test is a function that asks me to choose between to options. In the second one the complete "replacements" list of the fix is generated by a function. :-) The 3rd one will be my original task, to analyze the match and give an appropriate replacement.
I have a problem with the first one: although I can write the old text on screen and ask for a choice, I cannot show the surroundings of the match. So I either have to give a blind answer or open the page in a browser first, which slows the process down. After I have made the choices for all the matches, replace.py will print a showDiff, but not before I choose.
At the moment I cannot modify the edit summary according to the actual replacements, but I am not sure it is worth dealing with.
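The second experiment, generating the whole replacements list with a function, might look roughly like the sketch below. It assumes Python 3, and month_replacements is a hypothetical name; the pattern is a simplified variant of the January line quoted at the start of the thread, with escaped dots and an optional leading zero on the day, not the pattern actually used in the fix.

```python
import re

MONTHS = ['január', 'február', 'március', 'április', 'május', 'június',
          'július', 'augusztus', 'szeptember', 'október', 'november',
          'december']


def month_replacements():
    """Build one (pattern, replacement) tuple per month, 01 to 12."""
    pairs = []
    for number, name in enumerate(MONTHS, start=1):
        # %02d inserts the two-digit month number into the pattern
        pattern = r'(\d{1,4})\. ?%02d\. ?0?(\d{1,2})(?!\d)' % number
        pairs.append((pattern, r'\1. %s \2' % name))
    return pairs
```

The returned list has the same shape as a hand-written list of tuples, so a change to the pattern only has to be made in one place.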
Sorry for the mistakes in the previous letter.
2011/7/1 Bináris wikiposta@gmail.com
In the first one, replacement text (not "test") is a function that asks me to choose between two (not "to") options.
Sorry for being silly, I have just realized that I should have tried match.start and match.end, but now I have to leave my beloved computer.
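A rough sketch of that idea, assuming Python 3 and assuming the match object handed to the replacement function refers to the whole page text (this has not been checked against replace.py). The names context and interactive are invented for the example, and the ask parameter exists only so that the prompt can be swapped out when testing.

```python
import re


def context(match, width=30):
    """Return the match with up to `width` characters on either side."""
    text = match.string
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    return (text[start:match.start()] + '>>>' + match.group() + '<<<'
            + text[match.end():end])


def interactive(options, ask=input):
    """Build a replacement function that shows the context, then asks."""
    def replace(match):
        print(context(match))
        answer = ask('Choose one of %r (anything else keeps the original): '
                     % (options,))
        # fall back to the matched text if the answer is not a listed option
        return answer if answer in options else match.group()
    return replace
```

The result of interactive([...]) would go where the replacement string normally goes, so the surroundings of each match are printed before the choice is made.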