The last mail in this thread is only six years old, but I am replying to an older one because it shows more of the topic, and some of you may have forgotten what we used to talk about. :-)
The problem was that people on Wikipedia often link from one article to a section of another article badly, somehow copying the link from the URL or the like instead of from the source of the target page. This results in unreadable wikitext and in several false positives during spelling corrections with replace.py. An example is in the quoted text.
The secondary problem was that there is no direct function to "reverse engineer" these urlencoded section titles, and decoding is not always unambiguous.
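To illustrate both points: MediaWiki's old-style section anchors are percent-encoding with '%' swapped for '.', so a generic decoder can be sketched like this (the function name is mine, not part of any existing library):

```python
import re
from urllib.parse import unquote

def decode_dot_anchor(anchor: str) -> str:
    """Decode a dot-encoded section anchor, e.g. '.C3.A1' -> 'á'.

    Turn every '.XX' hex pair back into '%XX', then let
    urllib.parse.unquote() do the UTF-8 percent-decoding.
    """
    percent = re.sub(r'\.([0-9A-Fa-f]{2})', r'%\1', anchor)
    return unquote(percent)

# Works for the common case:
#   decode_dot_anchor('J.C3.A1nos') -> 'János'
# but shows why decoding is ambiguous: a title that genuinely contains
# '1.2C' is mangled, because '.2C' also decodes as a comma:
#   decode_dot_anchor('Version_1.2C') -> 'Version_1,'
```

This ambiguity is exactly why blindly decoding every dot pair is unsafe and why checking against the real section titles of the target page (as described below) is needed.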
By now I have suffered enough to solve this issue anyway, so I made a workaround as follows:
- I use replace.py, which has the basic infrastructure, and search for links in the [[A#B|C]] format. (In fact I only search for [[A#B|, ignoring C]].)
- The second parameter of the replacement is a function.
- The function tries to get the section titles of the target page, as Dr. Trigon suggested. On failure it sends me a coloured message about the missing page and returns the original text, so I get a "No changes were necessary" message from replace.py and can treat the issue manually if necessary. It also tries to follow a redirect, in case the target page was renamed in the meantime.
- If the function has the target's sections, it begins to replace the encoded characters one by one, for example section = section.replace('.C3.A1', u'á'). Several replacements like this follow each other. The trick is that I let go of a general solution that works smoothly on every wiki and for every possible text, and I let go of the idea of writing the nicest code in the world as well. I just enumerate the characters that actually occur on huwiki and extend the list when necessary.
- If the decoded section title is among the titles fetched above, we are happy and replace the link. If not, the function sends me a message in another colour and again returns the original.
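The steps above can be sketched as a replacement callback. This is a minimal self-contained mock: the SECTIONS dict stands in for actually fetching the target page's section titles from the wiki (and for redirect handling), and the decode table lists only characters assumed to occur, in the spirit of the post; all names here are hypothetical.

```python
import re

# Stand-in for querying the live wiki for a page's section titles;
# the real bot fetches these (following redirects) and reports misses.
SECTIONS = {
    'Budapest': ['Történet', 'Földrajz'],
}

# Enumerated replacements for characters that actually occur;
# extend the table when a new character shows up.
DOT_DECODE = [('.C3.B6', 'ö'), ('.C3.A9', 'é'), ('.C3.A1', 'á')]

# Match [[A#B| and capture A (target page) and B (encoded section).
LINK = re.compile(r'\[\[([^\[\]|#]+)#([^\[\]|]+)\|')

def fix_link(match: re.Match) -> str:
    target, section = match.group(1), match.group(2)
    titles = SECTIONS.get(target)
    if titles is None:
        # Missing page: keep the original text so the caller reports
        # "No changes were necessary" and the case is handled manually.
        return match.group(0)
    for encoded, plain in DOT_DECODE:
        section = section.replace(encoded, plain)
    if section in titles:
        return '[[%s#%s|' % (target, section)
    # Decoded title not found among the real sections: leave it alone.
    return match.group(0)
```

Usage: LINK.sub(fix_link, '[[Budapest#T.C3.B6rt.C3.A9net|history]]') yields '[[Budapest#Történet|history]]', while links to unknown pages or unknown sections pass through unchanged.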
This is efficient enough for me for now. Maybe I will implement some caching to make it faster.
When collecting candidates, a mock replacement instead of the real function makes things faster. Another possibility is to collect candidates for the spelling-correction task and run this replacement on that collection before the spelling corrections.