Hi folks,
the last mail in this thread is only 6 years old, but I am replying to an
older one because it shows more of the topic, and some of you may have
forgotten what we used to talk about. :-)
The problem was that people on Wikipedia often link from an article to a
section of another article badly, copying the link from the URL or whatnot
instead of from the source of the target page. This results in unreadable
source text and several false positives during spelling corrections with
replace.py. An example is in the quoted text.
The secondary problem was that there is no direct function to "reverse
engineer" these urlencoded section titles, and decoding is not always
unambiguous.
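To illustrate why decoding is not always definite, here is a minimal sketch of a generic decoder. It assumes Python 3 and assumes the anchor uses uppercase two-digit hex escapes introduced by a dot, as in the quoted example; `decode_anchor` is a hypothetical helper name, not an existing pywikipedia function.

```python
import re
from urllib.parse import unquote_to_bytes

# A run of one or more ".XX" escapes, e.g. ".C3.A1" (assumed uppercase hex).
_ESCAPED_RUN = re.compile(r'(?:\.[0-9A-F]{2})+')


def decode_anchor(anchor):
    """Best-effort decoding of a dot-encoded section anchor (hypothetical helper)."""
    def _decode(match):
        # Turn ".C3.A1" into "%C3%A1" and percent-decode it to raw bytes.
        raw = unquote_to_bytes(match.group(0).replace('.', '%'))
        try:
            return raw.decode('utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8, so it was probably literal text: keep it.
            return match.group(0)

    return _ESCAPED_RUN.sub(_decode, anchor).replace('_', ' ')
```

A title that literally contains a dot followed by two hex digits, such as "Apollo.11", would be mangled by this decoder, because the encoded form does not tell us whether ".11" was an escape or plain text. That is why the result still has to be checked against the real section titles of the target page.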
By now I have suffered enough to solve this issue anyway, so I made a
workaround as follows:
- I use replace.py, which has the basic infrastructure. I search for links
of the form [[A#B|C]]. (As a matter of fact, I only search for [[A#B|,
ignoring C]].)
- The second parameter of the replacement is a function.
- The function tries to get the section titles of the target page, as
Dr. Trigon suggested. Upon failure it sends me a coloured message about the
missing page, then returns the original text, so I get a "No changes were
necessary" message from replace.py, and I can treat the issue manually if
necessary. It also tries to follow the redirect if the target was renamed
in the meantime.
- If the function has the target sections, it begins to replace encoded
characters one by one. For example section = section.replace('.C3.A1',
u'á'). Several replacements like this follow each other. So the trick is
that I gave up on a general solution which works smoothly on every
wiki and every possible text. I also let go of the idea of the nicest code
in the world. I just enumerate the characters which actually occur in
huwiki, and extend the list if necessary.
- If the decoded section title is among the titles obtained above, we are
happy and replace it. If not, the function sends me a message in another
colour and returns the original again.
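The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not my actual script: it is written for Python 3, `make_fixer`, `LINK_RE` and `REPLACEMENTS` are names made up for the example, and `get_section_titles` is a hypothetical callable that returns the set of section titles of a page, or None if the page is missing. Fetching the titles, following redirects and the coloured output are left out.

```python
import re

# Only the characters that actually occur; extend the list when needed.
REPLACEMENTS = [
    ('.C3.A1', 'á'),
    ('.C3.AD', 'í'),
    ('.C5.B1', 'ű'),
    ('.28', '('),
    ('.29', ')'),
    ('_', ' '),
]

# Matches "[[A#B|" and ignores the rest of the link ("C]]").
LINK_RE = re.compile(r'\[\[(?P<page>[^\[\]|#]+)#(?P<section>[^\[\]|]+)\|')


def make_fixer(get_section_titles):
    """Build a replacement function usable as the second argument of re.sub.

    get_section_titles is a hypothetical callable: page title -> set of
    section titles, or None when the page cannot be found.
    """
    def fix(match):
        page = match.group('page')
        titles = get_section_titles(page)
        if titles is None:
            print('Missing page: %s' % page)
            return match.group(0)  # leave the link unchanged
        section = match.group('section')
        for old, new in REPLACEMENTS:
            section = section.replace(old, new)
        if section not in titles:
            print('No such section in %s: %s' % (page, section))
            return match.group(0)  # leave the link unchanged
        return '[[%s#%s|' % (page, section)

    return fix
```

With a lookup function in hand, `LINK_RE.sub(make_fixer(lookup), text)` rewrites only the links whose decoded section title really exists on the target page; everything else is returned untouched.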
This is efficient enough for me for now. Maybe I will implement
some caching in order to make it faster.
When only collecting candidates, using a mock replacement instead of the
function makes the run faster. Another possibility is to collect candidates
for the spelling correction task, and run this replacement on that
collection before the spelling corrections.
2011-08-10 10:28 GMT+02:00 Merlijn van Deen <valhallasw(a)arctus.nl>:
Hi Binaris,
2011/8/10 Bináris <wikiposta(a)gmail.com>
No, it is not really; if you edit the section
<http://hu.wikipedia.org/w/index.php?title=M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA&action=edit&section=26>
you will see the original section title:
Editing the section is not really the relevant point. The relevant point
(which I missed) is the following:
The wikitext parser understands that [[Második világháború#Partraszállás
Szicíliában (Huskey hadművelet)]] should be a link to
http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29
Note that the section title is in the semi-urlencoded format. This is the
only correct format; the following will not work:
http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partra…
Szicíliában (Huskey hadművelet)
<http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partraszállás%20Szicíliában%20(Huskey%20hadművelet)>
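The semi-urlencoded anchor format shown above can be approximated with a few lines of Python 3. This is a sketch inferred from the example URL in this mail, not the wiki software's own code: `encode_anchor` is a hypothetical name, and edge cases (for instance characters that Python's `quote` and the wiki treat differently) are not covered.

```python
from urllib.parse import quote


def encode_anchor(section_title):
    """Approximate the dot-encoded anchor of a section title (hypothetical helper)."""
    # Spaces become underscores, then the UTF-8 bytes are percent-encoded.
    quoted = quote(section_title.replace(' ', '_'), safe=':')
    # The anchor uses "." where ordinary URL encoding uses "%".
    return quoted.replace('%', '.')
```

For the section title in this thread it produces the same fragment as the working URL, which makes it handy for checking a decoded title: encode the candidate again and compare it with the anchor found in the link.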
So I agree with
This one is *technically* correct but useless if
you want to read the
link during edit
but not with
or when you move the mouse on it.
So the question is still valid: do we have a tool for unencoding them, or do
I have to write it myself?
>> import wikipedia
>> p = wikipedia.Page('hu', u'wikipedia.Page("hu",
u"Második
világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey
hadm.C5.B1velet.29")')
>> p
Page{[[hu:Wikipedia.Page("hu", u"Második világháború#Partraszállás
Szicíliában (Huskey hadművelet)")]]}
however, this seems broken:
>> p.title()
Wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.
C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29.22.29
So -- yes, the code is already there (pwb is able to decode the section
title, as indicated by the representation returned by Page.__repr__ or
Page.__str__). However, title() seems to have some bug.
Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks
(line 314).
Best,
Merlijn
_______________________________________________
Pywikipedia-l mailing list
Pywikipedia-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
--
Bináris