Happy Monday,
There are strange people who make links like this (kind of urlencoded?): [[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]] So the section title must have been copied from the URL. Do we have a ready-made tool to fix these?
Hello Bináris,
Based on the page [1], this actually seems correct: if you check the TOC, it uses this exact same format.
[1] http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partras...
It is indeed a 'sort of urlencoded' section title. However, this is how the sections are named (check the span ids), so I think this is correct.
Best, Merlijn
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
2011/8/8 Merlijn van Deen valhallasw@arctus.nl
> It is indeed a 'sort of urlencoded' section title. However, this is how the sections are named (check the span ids), so I think this is correct.
No, it is not really; if you edit the section http://hu.wikipedia.org/w/index.php?title=M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA&action=edit&section=26 you will see the original section title: Partraszállás Szicíliában (Huskey hadművelet)
The problem is just that some people copy it from the TOC or the URL bar. In Hungarian, the two forms will always differ if there is any accented letter in the title, but section links should be written in a readable form. In languages without accents this happens less often, for example with parenthesized titles like this one. This one is *technically* correct but useless if you want to read the link during edit or when you move the mouse on it.
So the question is still valid: do we have a tool for decoding them, or do I have to write it myself? Thanks for your effort!
Bináris
Hi Binaris,
2011/8/10 Bináris wikiposta@gmail.com
> No, it is not really; if you edit the section http://hu.wikipedia.org/w/index.php?title=M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA&action=edit&section=26 you will see the original section title:
Editing the section is not really the relevant point. The relevant point (which I missed) is the following:
The wikitext parser understands that [[Második világháború#Partraszállás Szicíliában (Huskey hadművelet)]] should be a link to http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partras...
Note that the section title is in the semi-urlencoded format. This is the only correct format; the following will not work: http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partras... Szicíliában (Huskey hadművelet)
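To illustrate the 'semi-urlencoded' format, here is a minimal Python 3 sketch. It assumes the legacy MediaWiki scheme: spaces become underscores, the UTF-8 bytes are percent-encoded, colons stay literal, and every '%' is then written as '.'. The name legacy_anchor is made up for this example, and edge cases (for instance '~' or markup inside headings) may well differ from what MediaWiki really does.

```python
from urllib.parse import quote


def legacy_anchor(section_title):
    """Approximate the dot-encoded anchor derived from a section heading.

    Hypothetical helper: this mirrors the format seen in this thread,
    it is not taken from the MediaWiki or pywikipedia source.
    """
    underscored = section_title.strip().replace(" ", "_")
    # Percent-encode the UTF-8 bytes of everything except letters,
    # digits and the characters "_", ".", "-", "~".
    encoded = quote(underscored, safe="")
    # Colons stay literal; every remaining "%" is written as ".".
    return encoded.replace("%3A", ":").replace("%", ".")


print(legacy_anchor("Partraszállás Szicíliában (Huskey hadművelet)"))
# Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29
```

The printed anchor matches the encoded form of the link discussed in this thread, with underscores where the wikitext link has spaces.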
So I agree with
> This one is *technically* correct but useless if you want to read the link during edit
but not with
> or when you move the mouse on it.
> So the question is still valid: do we have a tool for decoding them, or do I have to write it myself?
>>> import wikipedia
>>> p = wikipedia.Page('hu', u'wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29")')
>>> p
Page{[[hu:Wikipedia.Page("hu", u"Második világháború#Partraszállás Szicíliában (Huskey hadművelet)")]]}
However, this seems broken:
>>> p.title()
Wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29.22.29
So, yes, the code is already there: pwb is able to decode the section title, as shown by the representation returned by Page.__repr__ or Page.__str__. However, title() seems to have a bug.
Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks (line 314).
Best, Merlijn
Thank you, Merlijn, I will try it! Alternatively, I can make a fix and use replace.py because there is a limited set of these characters that usually occur in titles.
(Reminder: this thread is on encoded section titles that are copied from URL bar to wikitext rather than from wikitext to wikitext, and thus they are pretty unreadable for humans.)
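A general decoder does not need a character list. The following Python 3 sketch is only an illustration: decode_anchor and decode_links are hypothetical names, not pywikipedia functions, and the link regex is a simplification. It is also naive on purpose: every run of .XX codes that forms valid UTF-8 gets decoded, so a literal dot followed by two hex digits in a real section title would be changed as well.

```python
import re

# One or more consecutive ".XX" codes (uppercase hex, as in the anchors above).
DOT_ESCAPE = re.compile(r"((?:\.[0-9A-F]{2})+)")
# [[Title#anchor]] or [[Title#anchor|label]]; simplified on purpose.
LINK = re.compile(r"\[\[([^\[\]|#]+)#([^\[\]|]+)(\|[^\[\]]*)?\]\]")


def decode_anchor(anchor):
    """Turn dot-encoded bytes back into text where they are valid UTF-8."""
    def run_to_text(match):
        raw = bytes.fromhex(match.group(1).replace(".", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: leave the run exactly as it was.
            return match.group(0)
    return DOT_ESCAPE.sub(run_to_text, anchor).replace("_", " ")


def decode_links(wikitext):
    """Rewrite the section part of every wikilink in readable form."""
    def fix(match):
        title, anchor = match.group(1), match.group(2)
        label = match.group(3) or ""
        return "[[%s#%s%s]]" % (title, decode_anchor(anchor), label)
    return LINK.sub(fix, wikitext)


sample = (u"[[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban "
          u".28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]")
print(decode_links(sample))
```

A function like decode_links could be plugged into a bot's text-fixing step; whether that should live in replace.py fixes or elsewhere is left open here.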
2011/8/10 Merlijn van Deen valhallasw@arctus.nl
> So -- yes, the code is already there (as pwb is able to decode the section title, as indicated by the representation returned by Page.__repr__ or Page.__str__). However, title() seems to have some bug.
For me they still return encoded titles. :-(
> Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks (line 314).
Fail again. If you go to http://hu.wikipedia.org/wiki/Mafia_II#T.C3.B6rt.C3.A9net and edit the section, you will find an example in the 2nd paragraph: [[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]
My code is:
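>>> import wikipedia as p
>>> site = p.getSite('hu')
>>> import cosmetic_changes as cc
>>> bot = cc.CosmeticChangesToolkit(site)
>>> title = u'Mafia II'
>>> lap = p.Page(site, title)  # lap = page
>>> text = lap.get()
>>> text2 = bot.cleanUpLinks(text)
>>> text == text2  # True means cleanUpLinks changed nothing
True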
Did I miss something? I can solve the problem by listing the characters most frequently used in huwiki, but I would prefer a nice and general solution.
Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1 in section title will also appear the same. Is there any way to decide if .C3.A1 stands for *á* or for .C3.A1? I guess the likelihood of someone writing a literal .C3.A1 into the section title is very small, so this question may be theoretical, but I am a theoretical man. :-)
2012/3/28 Bináris wikiposta@gmail.com
While this was a theoretical problem, I created a practical one. There are characters with a shorter code, such as the quotation mark (.22) and parentheses (.28, .29). Have a look at this section title: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:_2012.03.22 You will see that the first two .22's (marked here with red, excuse me if this causes a problem for someone) are encoded quotation marks, while the last (blue) one is a literal .22 as part of a date (Hungarian date order is yyyy. mm. dd.). I simply don't see any chance for a bot to tell the difference, other than searching for all the section titles in question (as well as anchor templates) and trying to make a reverse match. So this is something very easy to spoil and almost hopeless to correct.
*:-(*
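The reverse match could look roughly like the Python 3 sketch below. It assumes the real section titles of the target page are already available as a list (how they are fetched is left out), and it assumes the legacy encoding works as seen in this thread. Both function names are hypothetical. Because the comparison goes from the readable title to the anchor, the literal .22 in the date is no longer a problem.

```python
from urllib.parse import quote


def legacy_anchor(section_title):
    """Dot-encode a heading the way the anchors in this thread look (assumed)."""
    underscored = section_title.strip().replace(" ", "_")
    return quote(underscored, safe="").replace("%3A", ":").replace("%", ".")


def readable_section(anchor, section_titles):
    """Return the real section title whose anchor equals `anchor`, else None."""
    # Spaces and underscores are interchangeable in the link target.
    wanted = anchor.strip().replace(" ", "_")
    for title in section_titles:
        if legacy_anchor(title) == wanted:
            return title
    return None


titles = [u"Bevezetés", u'"Dátum": 2012.03.22']
print(readable_section(".22D.C3.A1tum.22:_2012.03.22", titles))
# "Dátum": 2012.03.22
```

If nothing matches, the function returns None and the link is best left alone; anchor templates would need a separate lookup.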
Colouring was a second class idea, sorry: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:_...
On 28 March 2012 13:20, Bináris wikiposta@gmail.com wrote:
> While this was a theoretical problem, I created a practical one. There are characters with a shorter code, such as the quotation mark (.22) and parentheses (.28, .29). Have a look at this section title: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:_... You will see that the first two .22's (marked here with red, excuse me if this causes a problem for someone) are encoded quotation marks, while the last (blue) one is a literal .22 as part of a date (Hungarian date order is yyyy. mm. dd.). I simply don't see any chance for a bot to tell the difference, other than searching for all the section titles in question (as well as anchor templates) and trying to make a reverse match. So this is something very easy to spoil and almost hopeless to correct.
Another example is listed in the bug report at: http://sourceforge.net/tracker/?func=detail&group_id=93107&atid=6031... - #802.11n becomes Page{[[Page#IEEE 802n]]} (because \x11 is a non-printable character).
However, I think we might be able to work around these two problems, as only characters outside of ASCII are escaped /and/ the result has to be a correct UTF-8 string. Again: check the MediaWiki source.
Best, Merlijn
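A cautious variant of that workaround, as a Python 3 sketch: decode only those .XX runs that form a complete multi-byte (non-ASCII) UTF-8 character and leave everything else untouched. The name decode_non_ascii is hypothetical. Note the limitation: single-byte codes such as .22, .28 and .29 are deliberately not decoded here, so those cases stay unresolved, and a literal '.11' as in '802.11n' is left alone.

```python
import re

# A UTF-8 lead byte followed by the right number of continuation bytes,
# each written as ".XX" with uppercase hex digits.
MULTIBYTE = re.compile(
    r"\.(?:C[2-9A-F]|D[0-9A-F])\.[89AB][0-9A-F]"   # two-byte characters
    r"|\.E[0-9A-F](?:\.[89AB][0-9A-F]){2}"         # three-byte characters
    r"|\.F[0-4](?:\.[89AB][0-9A-F]){3}"            # four-byte characters
)


def decode_non_ascii(anchor):
    """Decode dot-encoded non-ASCII characters; keep ASCII codes as they are."""
    def to_char(match):
        raw = bytes.fromhex(match.group(0).replace(".", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Overlong forms, surrogates etc.: leave the text unchanged.
            return match.group(0)
    return MULTIBYTE.sub(to_char, anchor)


print(decode_non_ascii(".22D.C3.A1tum.22:_2012.03.22"))
# .22Dátum.22:_2012.03.22
```

This makes accented titles readable without touching ambiguous dates or version numbers, at the price of leaving quotation marks and parentheses encoded.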
If you have the whole page content accessible, you can try to use 'Page.getSections()'... that might help?
Greetings DrTrigon
2012/3/28 Bináris wikiposta@gmail.com:
> For me they still return encoded titles. :-(
...they do for me, too, strangely enough. Considering it did work back in August (which you can check by fetching r9429), someone broke it. Bisecting indicates this revision:
------------------------------------------------------------------------
r9489 | xqt | 2011-09-03 12:07:18 +0200 (Sat, 03 Sep 2011) | 1 line

revert r5856 due to bug #2989218
------------------------------------------------------------------------
https://www.mediawiki.org/w/index.php?title=Special:Code/pywikipedia/9489
http://sourceforge.net/tracker/?func=detail&group_id=93107&atid=6031...
> Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks (line 314).
It seems I was wrong about that. It indeed does not fix this, even though it /does/ fix %xx URLs.
> Another issue: á is encoded as .C3.A1. However, a literal .C3.A1 in section title will also appear the same. Is there any way to decide if .C3.A1 stands for á or for .C3.A1?
Check the MediaWiki source, as it's MediaWiki that does this transformation.
Best, Merlijn