Hi folks, the last mail in this thread is only 6 years old, but I am replying to an older one, because it shows more of the topic, and some of you may have forgotten what we used to talk about. :-)
The problem was that people in Wikipedia often link from one article to a section of another article in an ugly way, copying the link from the URL or whatnot instead of from the source of the target page. This results in unreadable wikitext and several false positives during spelling corrections with replace.py. An example is in the quoted text. The secondary problem was that there is no direct function to "reverse engineer" these urlencoded section titles, and the decoding is not always unambiguous.
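To illustrate why the decoding is not always definite, here is a minimal sketch of the encoding direction. The function name encode_anchor is my own, and the rule it applies (spaces to underscores, percent-encoding, then "%" replaced by ".") is an assumption that reproduces the example anchor in the quoted text; it is not taken from the MediaWiki source.

```python
from urllib.parse import quote


def encode_anchor(section_title):
    """Approximate the semi-urlencoded anchor form of a section title.

    Assumption: spaces become underscores, the result is percent-encoded
    as UTF-8, and every '%' is then written as '.'.
    """
    encoded = quote(section_title.replace(' ', '_'), safe='')
    return encoded.replace('%', '.')


# The section title from the quoted example.
print(encode_anchor('Partraszállás Szicíliában (Huskey hadművelet)'))

# Two different titles that collapse to the same anchor: a literal ".28"
# in a title cannot be told apart from an encoded "(".
print(encode_anchor('Version (') == encode_anchor('Version .28'))
```

Because a literal dot followed by two hex digits looks exactly like an encoded byte, a decoder cannot know which one was meant without checking against the real section titles of the target page.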
By now I have suffered enough to solve this issue anyway, so I made a workaround as follows:
- I use replace.py, which has the basic infrastructure. I search for links in [[A#B|C]] format. (As a matter of fact, I only search for [[A#B|, disregarding C]].)
- The second parameter of the replacement is a function.
- The function tries to get the section titles of the target page, as Dr. Trigon suggested. Upon failure it sends me a coloured message about the missing page, then returns the original text, so I get a "No changes were necessary" message from replace.py, and I can treat the issue manually if necessary. It also tries to follow the redirect if the target has been renamed in the meantime.
- If the function has the target sections, it begins to replace the encoded characters one by one. For example, section = section.replace('.C3.A1', u'á'). Several replacements like this follow each other. So the trick is that I gave up on a general solution that works smoothly on every wiki and for every possible text. I also let go of the idea of the nicest code in the world. I just enumerate the characters which actually occur in huwiki, and extend the list if necessary.
- If the decoded section title is among the titles fetched above, we are happy and replace it. If not, the function sends me a message in another colour and returns the original again.
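The steps above can be sketched roughly as follows. This is a simplified, standalone illustration and not my actual script: it uses plain re instead of replace.py, the names LINK_RE, ENCODED_CHARS, decode_section and fix_link are made up for the example, and get_section_titles is a hypothetical callable that returns the headings of a page or None if the page cannot be fetched. The coloured messages and the redirect handling are left out.

```python
import re

# Illustrative subset of the encoded sequences; the real list is extended
# whenever a new character turns up.
ENCODED_CHARS = [
    ('.C3.A1', 'á'),
    ('.C3.AD', 'í'),
    ('.C5.B1', 'ű'),
    ('.28', '('),
    ('.29', ')'),
    ('_', ' '),
]

# Matches "[[A#B|" only; the link text C and the closing brackets are untouched.
LINK_RE = re.compile(r'\[\[(?P<page>[^\[\]|#]+)#(?P<section>[^\[\]|]+)\|')


def decode_section(section):
    """Replace the known encoded sequences one by one."""
    for encoded, char in ENCODED_CHARS:
        section = section.replace(encoded, char)
    return section


def fix_link(match, get_section_titles):
    """Return the readable link prefix, or the original text on any doubt."""
    titles = get_section_titles(match.group('page'))
    if titles is None:
        # Target page not available: leave the link for manual treatment.
        return match.group(0)
    decoded = decode_section(match.group('section'))
    if decoded in titles:
        return '[[%s#%s|' % (match.group('page'), decoded)
    # Decoded title is not a real heading of the target: change nothing.
    return match.group(0)


# Usage with a fake lookup table standing in for the wiki.
known = {'Második világháború': ['Partraszállás Szicíliában (Huskey hadművelet)']}
text = ('[[Második világháború#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban'
        '_.28Huskey_hadm.C5.B1velet.29|Szicília]]')
print(LINK_RE.sub(lambda m: fix_link(m, known.get), text))
```

Checking the decoded title against the real headings is what keeps the crude character list safe: a wrong guess simply results in no change.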
This has sufficient efficiency for me and for now. Maybe I will implement some caching in order to make it faster. When we collect candidates, a mock replacement instead of the function makes it faster. Another possibility is to collect candidates for the spelling correction task, and run this replacement on that collection before spelling corrections.
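The caching idea could look something like this minimal sketch. Nothing here is implemented yet; fetch_section_titles is a hypothetical stand-in for the slow lookup of a target page, and the fake dictionary replaces the wiki for the sake of a runnable example.

```python
from functools import lru_cache

FETCH_COUNT = {'calls': 0}  # only here to show that the cache is used


def fetch_section_titles(page_title):
    """Hypothetical slow lookup; returns a tuple of headings or None."""
    FETCH_COUNT['calls'] += 1
    fake_wiki = {
        'Második világháború': ('Partraszállás Szicíliában (Huskey hadművelet)',),
    }
    return fake_wiki.get(page_title)


@lru_cache(maxsize=None)
def cached_section_titles(page_title):
    """Remember the headings of each target page for the rest of the run."""
    return fetch_section_titles(page_title)


cached_section_titles('Második világháború')
cached_section_titles('Második világháború')
print(FETCH_COUNT['calls'])
```

This helps when many articles link to sections of the same target page, which is typical for large articles.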
2011-08-10 10:28 GMT+02:00 Merlijn van Deen valhallasw@arctus.nl:
Hi Binaris,
2011/8/10 Bináris wikiposta@gmail.com
No, it is not really; if you edit the section http://hu.wikipedia.org/w/index.php?title=M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA&action=edit&section=26 you will see the original section title:
Editing the section is not really the relevant point. The relevant point (which I missed) is the following:
The wikitext parser understands [[Második világháború#Partraszállás Szicíliában (Huskey hadművelet)]] should be a link to http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29
Note that the section title is in the semi-urlencoded format. This is the only correct format; the following will not work:
http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partras... Szicíliában (Huskey hadművelet)
http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partraszállás%20Szicíliában%20(Huskey%20hadművelet)
So I agree with
This one is *technically* correct but useless if you want to read the link during edit
but not with
or when you move the mouse on it.
So the question is still valid: do we have a tool for unencoding them, or do I have to write it myself?
import wikipedia
p = wikipedia.Page('hu', u'wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29")')
p
Page{[[hu:Wikipedia.Page("hu", u"Második világháború#Partraszállás Szicíliában (Huskey hadművelet)")]]}
however, this seems broken:
p.title()
Wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29.22.29
So -- yes, the code is already there, as pwb is able to decode the section title, as indicated by the representation (returned by Page.__repr__ or Page.__str__). However, title() seems to have some bug.
Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks (line 314).
Best, Merlijn
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l