[Pywikipedia-l] Urlencoded section titles


Bináris

8 Aug 2011

2:27 p.m.


Merlijn van Deen

8 Aug

2:37 p.m.

Hello Bináris,

Based on the page [1], this actually seems correct. If you check the TOC, it uses this exact same format: http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partra…

It is indeed a 'sort of urlencoded' section title. However, this is how the sections are named (check the span ids), so I think this is correct.

Best, Merlijn

2011/8/8 Bináris <wikiposta(a)gmail.com>

> Happy Monday,
>
> There are strange people who make such links (kind of urlencoded?):
>
> [[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]
>
> So the section title must have been copied from the URL. Do we have a ready tool to fix these?
>
> -- Bináris
> _______________________________________________
> Pywikipedia-l mailing list
> Pywikipedia-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l

Bináris

10 Aug

9:17 a.m.

2011/8/8 Merlijn van Deen <valhallasw(a)arctus.nl>

No, it is not really; if you edit the section <http://hu.wikipedia.org/w/index.php?title=M%C3%A1sodik_vil%C3%A1…> you will see the original section title:

Partraszállás Szicíliában (Huskey hadművelet)

The problem is just that some people copy it from the TOC or the URL bar. In Hungarian, the two forms will always differ if there is any accented letter in the title. But sections should be wikilinked in a readable form. In languages without accents this happens less often, for example with parenthesized titles like this one.

This one is *technically* correct but useless if you want to read the link during edit or when you move the mouse on it.

So the question is still valid: do we have a tool for unencoding them, or do I have to write it myself?

Thanks for your effort!

Bináris

Merlijn van Deen

10:28 a.m.

Hi Binaris, 2011/8/10 Bináris <wikiposta(a)gmail.com>

> No, it is not really; if you edit the section <http://hu.wikipedia.org/w/index.php?title=M%C3%A1sodik_vil%C3%A1…> you will see the original section title:

Editing the section is not really the relevant point. The relevant point (which I missed) is the following: the wikitext parser understands that

[[Második világháború#Partraszállás Szicíliában (Huskey hadművelet)]]

should be a link to

http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partra…

Note that the section title is in the semi-urlencoded format. This is the only correct format; the following will not work:

http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partra… Szicíliában (Huskey hadművelet)

So I agree with

> This one is *technically* correct but useless if you want to read the link during edit

but not with

> or when you move the mouse on it.

> So the question is still valid: do we have a tool for unencoding them or I have to write it myself?

>>> import wikipedia
>>> p = wikipedia.Page('hu', u'wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29")')
>>> p
Page{[[hu:Wikipedia.Page("hu", u"Második világháború#Partraszállás Szicíliában (Huskey hadművelet)")]]}

However, this seems broken:

>>> p.title()
Wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29.22.29

So -- yes, the code is already there, as pwb is able to decode the section title (as indicated by the representation returned by Page.__repr__ or Page.__str__). However, title() seems to have some bug.

Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks (line 314).

Best, Merlijn

Bináris

10:38 a.m.

Thank you, Merlijn, I will try it! Alternatively, I can make a fix and use replace.py because there is a limited set of these characters that usually occur in titles. -- Bináris

Bináris

28 Mar

10:22 a.m.

(Reminder: this thread is on encoded section titles that are copied from URL bar to wikitext rather than from wikitext to wikitext, and thus they are pretty unreadable for humans.) 2011/8/10 Merlijn van Deen <valhallasw(a)arctus.nl>


So -- yes, the code is already there (as pwb is able to decode the section title, as indicated by the representation (returned by Page.__repr__ or Page.__str__). However, title() seems to have some bug.

For me they return still encoded titles. :-(


Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks (line 314).

Fail again. If you go to http://hu.wikipedia.org/wiki/Mafia_II#T.C3.B6rt.C3.A9net and edit the section, you will find an example in the 2nd paragraph:

[[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]

My code is:

>>> import wikipedia as p
>>> site = p.getSite('hu')
>>> import cosmetic_changes as cc
>>> bot = cc.CosmeticChangesToolkit(site)
>>> title = u'Mafia II'
>>> lap = p.Page(site, title)
>>> text = lap.get()
>>> text2 = bot.cleanUpLinks(text)
>>> text == text2
True

Did I miss something? I can solve the problem by listing the most frequent characters used in huwiki, but I would prefer a nice and general solution.

Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1 in a section title will also appear the same. Is there any way to decide if .C3.A1 stands for *á* or for .C3.A1? I guess the likelihood of someone writing a literal .C3.A1 into a section title is very small, so this question may be theoretical, but I am a theoretical man. :-)

-- Bináris

Bináris

1:20 p.m.

2012/3/28 Bináris <wikiposta(a)gmail.com>


Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1 in section title will also appear the same. Is there any way to decide if .C3.A1 stands for *á* or for .C3.A1? I guess the likelihood of someone writing a literal .C3.A1 into the section title is very small, so this question may be theoretical, but I am a theoretical man. :-)

While this was a theoretical problem, I created a practical one. There are characters with a shorter code, such as the quotation mark (.22) and parentheses (.28, .29). Have a look at this section title: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22 :_2012.03.22

You will see that the first two .22's (marked here with red, excuse me if this causes a problem for someone) are encoded quotation marks, while the last (blue) one is a literal .22 as part of a date (Hungarian date order is yyyy. mm. dd.). I simply don't see any chance to tell the difference by bot, other than searching for all section titles in question (as well as anchor templates) and trying to make a reverse match. So this is something very easy to spoil and almost hopeless to correct. *:-(*

-- Bináris

Bináris

1:21 p.m.

Colouring was a second class idea, sorry: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:… -- Bináris
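The ambiguity described in the previous message can be reproduced outside the wiki. The sketch below is an approximation, not MediaWiki's own code: `encode_anchor` assumes the anchor is built by turning spaces into underscores, percent-encoding the rest, keeping colons, and swapping `%` for `.`, which matches the URLs quoted in this thread. Both function names are hypothetical, and `naive_decode` is deliberately simplistic to show what goes wrong.

```python
import re
from urllib.parse import quote, unquote


def encode_anchor(section_title):
    """Approximate the dot-encoded anchor form seen in this thread (an assumption)."""
    # Spaces become underscores, then everything unsafe is percent-encoded.
    percent = quote(section_title.replace(" ", "_"), safe="")
    # Colons stay readable; every remaining '%' is written as '.'.
    return percent.replace("%3A", ":").replace("%", ".")


def naive_decode(anchor):
    """Blindly treat every .XX pair as an encoded byte (deliberately naive)."""
    as_percent = re.sub(r"\.([0-9A-F]{2})", r"%\1", anchor)
    return unquote(as_percent, errors="replace")


title = '"Dátum": 2012.03.22'
anchor = encode_anchor(title)
print(anchor)
print(repr(naive_decode(anchor)))
```

The date part survives encoding untouched because `.` is not escaped, so on the way back `.03` and `.22` are indistinguishable from encoded bytes and the naive decoder corrupts the date.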

Merlijn van Deen

1 Apr

3:25 p.m.

On 28 March 2012 13:20, Bináris <wikiposta(a)gmail.com> wrote:


> While this was a theoretical problem, I created a practical one. There are characters with a shorter code, such as the quotation mark (.22) and parentheses (.28, .29). Have a look at this section title: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:… You will see that the first two .22's (marked here with red, excuse me if this causes a problem for someone) are encoded quotation marks, while the last (blue) one is a literal .22 as part of a date (Hungarian date order is yyyy. mm. dd.). I simply don't see any chance to tell the difference by bot, other than searching for all section titles in question (as well as anchor templates) and trying to make a reverse match. So this is something very easy to spoil and almost hopeless to correct.

Another example is listed in the bug report at: http://sourceforge.net/tracker/?func=detail&group_id=93107&atid=603… - #802.11n becomes Page{[[Page#IEEE 802n]]} (because \x11 is a non-printable character).

However, I think we might be able to work around these two problems, as only characters outside of ASCII are escaped /and/ it has to be a correct UTF-8 string. Again: check the mediawiki source.

Best, Merlijn
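A rough sketch of the validity check suggested above, under stated assumptions: a run of `.XX` pairs is decoded only if its bytes form valid UTF-8, and any resulting non-printable character is put back in its literal form. The names `decode_anchor` and `_decode_run` are hypothetical and this is not pywikipedia's code. Unlike the assumption in the message, ASCII punctuation such as `.28` is decoded here too, because the examples in this thread show it being encoded.

```python
import re

# One or more consecutive .XX pairs with uppercase hex digits.
_RUN = re.compile(r"(?:\.[0-9A-F]{2})+")


def _decode_run(match):
    run = match.group(0)
    raw = bytes.fromhex(run.replace(".", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return run  # not valid UTF-8, so treat the whole run as literal text
    pieces = []
    for ch in text:
        if ch.isprintable():
            pieces.append(ch)
        else:
            # Control characters are implausible in a heading: restore the literal form.
            pieces.append("".join(".%02X" % b for b in ch.encode("utf-8")))
    return "".join(pieces)


def decode_anchor(anchor):
    """Heuristically turn a dot-encoded anchor into readable text."""
    return _RUN.sub(_decode_run, anchor).replace("_", " ")


print(decode_anchor("Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29"))
print(decode_anchor("802.11n"))
print(decode_anchor(".22D.C3.A1tum.22:_2012.03.22"))
```

This keeps `802.11n` intact, but the date example still comes out wrong because a literal `.22` is valid, printable UTF-8. The heuristic narrows the problem without removing the ambiguity.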

Dr. Trigon

12 Jun

10:18 p.m.

If you have the whole page content accessible, you can try to use 'Page.getSections()'... might help?

Greetings
DrTrigon

On 28.03.2012 13:20, Bináris wrote:

> 2012/3/28 Bináris <wikiposta(a)gmail.com>
>
> Another issue: *á* is encoded as .C3.A1. However, a literal .C3.A1 in section title will also appear the same. Is there any way to decide if .C3.A1 stands for *á* or for .C3.A1? I guess the likelihood of someone writing a literal .C3.A1 into the section title is very small, so this question may be theoretical, but I am a theoretical man. :-)
>
> While this was a theoretical problem, I created a practical one. There are characters with a shorter code, such as quotation mark (.22) and parentheses (.28, .29). Have a look at this section title: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/semmi#.22D.C3.A1tum.22:…
>
> You will see that the first two .22's (marked here with red, excuse me if this causes a problem for someone) are encoded quotation marks, while the last (blue) one a literal .22 as part of a date (Hungarian date order is yyyy. mm. dd.). I simply don't see any chance to make the difference by bot unless searching for all section titles in question (as well as anchor templates) and try to make a reverse match. So this is something very easy to spoil and almost hopeless to correct. *:-(*
>
> -- Bináris

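If the real headings of the target page are available (for example via the 'Page.getSections()' call mentioned above, whose return format is not shown in this thread), the "reverse match" can be sketched without any decoding at all: encode each known heading and compare. The encoder below is an approximation of the anchor format seen in the thread's URLs, and both `encode_anchor` and `readable_anchor` are hypothetical names. The list of headings is assumed to be supplied by the caller.

```python
from urllib.parse import quote


def encode_anchor(section_title):
    """Approximate the dot-encoded anchor form seen in this thread (an assumption)."""
    percent = quote(section_title.replace(" ", "_"), safe="")
    return percent.replace("%3A", ":").replace("%", ".")


def readable_anchor(encoded_anchor, section_titles):
    """Return the known heading whose encoded form equals the anchor, else None."""
    # Wikitext links may contain spaces where the URL has underscores.
    wanted = encoded_anchor.replace(" ", "_")
    for title in section_titles:
        if encode_anchor(title) == wanted:
            return title
    return None


headings = ["Partraszállás Szicíliában (Huskey hadművelet)", '"Dátum": 2012.03.22']
print(readable_anchor(
    "Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29",
    headings,
))
print(readable_anchor(".22D.C3.A1tum.22:_2012.03.22", headings))
```

Because encoding is deterministic, the literal `.22` in the date causes no trouble in this direction. The cost is that the target page has to be fetched first.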

Merlijn van Deen

1 Apr

3:19 p.m.

2012/3/28 Bináris <wikiposta(a)gmail.com>:


For me they return still encoded titles. :-(

...they do for me, too. Strangely enough. Considering it did work back in August (which you can check by fetching r9429), someone broke it. Bisecting indicates this revision:

------------------------------------------------------------------------
r9489 | xqt | 2011-09-03 12:07:18 +0200 (Sat, 03 Sep 2011) | 1 line

revert r5856 due to bug #2989218
------------------------------------------------------------------------

https://www.mediawiki.org/w/index.php?title=Special:Code/pywikipedia/9489
http://sourceforge.net/tracker/?func=detail&group_id=93107&atid=603…


> Oh, and I think the fix is already in cosmetic_changes.py, too. Check > def cleanUpLinks (line 314).

It seems I was wrong about that. It indeed does not fix this, even though it /does/ fix %xx urls.


Another issue: á is encoded as .C3.A1. However, a literal .C3.A1 in section title will also appear the same. Is there any way to decide if .C3.A1 stands for á or for .C3.A1?

Check the mediawiki source, as it's mediawiki that does this transformation. Best, Merlijn

Bináris

13 Sep

7:59 a.m.

Hi folks,

the last mail in this thread is only 6 years old, but I am answering an older one, because it shows more of the theme, and some of you may have forgotten what we used to talk about. :-)

The problem was that people in Wikipedia link from an article to a section title of another article terribly, somehow copying the link from the URL or whatnot instead of the source of the target page, which results in an unreadable source and several false positives during spelling corrections with replace.py. An example is in the quoted text. The secondary problem was that there is no direct function to "reverse engineer" these urlencoded section titles, and decoding is not always unambiguous.

By now I have suffered enough to solve this issue anyway, so I made a workaround as follows:

- I use replace.py, which has the basic infrastructure. I search for [[A#B|C]] format links. (As a matter of fact, I only search for [[A#B|, not regarding C]].)
- The second parameter of the replacement is a function.
- The function tries to get the section titles of the target page, as Dr. Trigon suggested. Upon failure it sends me a coloured message about the missing page, then returns the original text, so I get a "No changes were necessary" message from replace.py, and I can treat the issue manually, if necessary. It also tries to follow the redirect, if the target was renamed meanwhile.
- If the function has the target sections, it begins to replace encoded characters one by one. For example section = section.replace('.C3.A1', u'á'). Several replacements like this follow each other. So the trick is that I gave up on a general solution that works smoothly on every wiki and every possible text. I also let go of the idea of the nicest code in the world. I just enumerate the characters which actually occur in huwiki, and extend the list if necessary.
- If the decoded section title is among the titles fetched above, we are happy and replace it. If not, the function sends me a message in another colour and returns the original again.

This is efficient enough for me for now. Maybe I will implement some caching in order to make it faster. When we collect candidates, a mock replacement instead of the function makes it faster. Another possibility is to collect candidates for the spelling correction task, and run this replacement on that collection before the spelling corrections.

2011-08-10 10:28 GMT+02:00 Merlijn van Deen <valhallasw(a)arctus.nl>:

> Hi Binaris,
>
> 2011/8/10 Bináris <wikiposta(a)gmail.com>
>
> > No, it is not really; if you edit the section <http://hu.wikipedia.org/w/index.php?title=M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA&action=edit&section=26> you will see the original section title:
>
> Editing the section is not really the relevant point. The relevant point (which I missed) is the following: The wikitext parser understands
>
> [[Második világháború#Partraszállás Szicíliában (Huskey hadművelet)]]
>
> should be a link to
>
> http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29
>
> Note that the section title is in the semi-urlencoded format. This is the only correct format; the following will not work:
>
> http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partra… Szicíliában (Huskey hadművelet) <http://hu.wikipedia.org/wiki/M%C3%A1sodik_vil%C3%A1gh%C3%A1bor%C3%BA#Partraszállás%20Szicíliában%20(Huskey%20hadművelet)>
>
> So I agree with
>
> > This one is *technically* correct but useless if you want to read the link during edit
>
> but not with
>
> > or when you move the mouse on it.
>
> > So the question is still valid: do we have a tool for unencoding them or I have to write it myself?
>
> >>> import wikipedia
> >>> p = wikipedia.Page('hu', u'wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29")')
> >>> p
> Page{[[hu:Wikipedia.Page("hu", u"Második világháború#Partraszállás Szicíliában (Huskey hadművelet)")]]}
>
> however, this seems broken:
>
> >>> p.title()
> Wikipedia.Page("hu", u"Második világháború#Partrasz.C3.A1ll.C3.A1s_Szic.C3.ADli.C3.A1ban_.28Huskey_hadm.C5.B1velet.29.22.29
>
> So -- yes, the code is already there (as pwb is able to decode the section title, as indicated by the representation (returned by Page.__repr__ or Page.__str__). However, title() seems to have some bug.
>
> Oh, and I think the fix is already in cosmetic_changes.py, too. Check def cleanUpLinks (line 314).
>
> Best, Merlijn

-- Bináris
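The workaround described at the top of this message can be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually used: `REPLACEMENTS`, `LINK`, and `fix_section_links` are hypothetical names, the replacement table holds only the characters from this thread's example, and the `sections_by_page` dictionary stands in for fetching section titles from the wiki. Redirect handling and the coloured messages are left out.

```python
import re

# Hypothetical hand-maintained table; extend it as new characters turn up.
REPLACEMENTS = [
    (".C3.A1", "á"),
    (".C3.AD", "í"),
    (".C5.B1", "ű"),
    (".22", '"'),
    (".28", "("),
    (".29", ")"),
]

# Matches the "[[A#B|" prefix of a piped section link; the label is ignored.
LINK = re.compile(r"\[\[([^\[\]|#]+)#([^\[\]|]+)\|")


def fix_section_links(text, sections_by_page):
    """Decode section anchors, but only when the result is a known heading."""

    def repl(match):
        page, section = match.group(1), match.group(2)
        known = sections_by_page.get(page)
        if known is None:
            return match.group(0)  # target unknown: leave it for manual review
        decoded = section
        for encoded, char in REPLACEMENTS:
            decoded = decoded.replace(encoded, char)
        decoded = decoded.replace("_", " ")
        if decoded in known:
            return "[[%s#%s|" % (page, decoded)
        return match.group(0)  # no matching heading: keep the original text

    return LINK.sub(repl, text)


source = ("[[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban "
          ".28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]")
sections = {"Második világháború": ["Partraszállás Szicíliában (Huskey hadművelet)"]}
print(fix_section_links(source, sections))
```

Checking the decoded text against the real headings is what makes the blunt replacement table safe: a wrongly decoded literal such as the `.22` in a date simply fails the lookup and the link is left untouched.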


pywikibot@lists.wikimedia.org



participants (3)

Bináris
Dr. Trigon
Merlijn van Deen