Hi,
when I download a page as HTML that contains article titles, the titles are encoded somewhat like urlencode() would do it, but not quite: characters like "(", ")", "!", "," and ":" appear without encoding.
For example:
<li><a href="/w/index.php?title=Avant_l%27aurore_(court-m%C3%A9trage)&action=edit&redlink=1" class="new" title="Avant l'aurore (court-métrage) (page does not exist)">Avant l'aurore (court-métrage)</a></li>
Is there a function in pywiki to handle this, or is a full list of the non-encoded characters available somewhere? I used urlencode() plus a dict of known exceptions, but this is not the best solution.
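For illustration, a minimal sketch of such a "half-encoding" with Python 2's urllib.quote(); the safe characters listed here are only an assumption inferred from hrefs like the one above, not a confirmed list of MediaWiki's rules:

import urllib

title = u"Avant l'aurore (court-métrage)".replace(u' ', u'_')
# ASSUMPTION: "(", ")", "!", "," and ":" stay raw, as observed in the
# href above; this is not a verified list of what MediaWiki skips.
half = urllib.quote(title.encode('utf-8'), safe="()!,:/")
print half  # Avant_l%27aurore_(court-m%C3%A9trage)

(In Python 3 the same call would be urllib.parse.quote().)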
Why are you downloading HTML?
I always forget to reply to all. :-)
2011/3/7 John phoenixoverride@gmail.com
Why are you downloading HTML?
Because I have a long list of articles and I have to decide which of them exist and which do not. Page.exists() would be a more comfortable solution, but it downloads each page to decide, and is therefore *extremely* slow at this quantity (4 to 11 thousand titles!). So I analyze the HTML source instead, which takes one minute. I tried both, and processing the HTML is worth it.
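For comparison, a sketch of doing the same check through the MediaWiki web API instead of HTML scraping: action=query accepts up to 50 titles per request and marks nonexistent pages with a "missing" key, so a few thousand titles cost only a couple of hundred requests. The API URL is an assumption (French Wikipedia), and this is plain Python 2 urllib, not a pywikipedia call:

import json
import urllib

API = 'https://fr.wikipedia.org/w/api.php'  # assumed target wiki

def missing_titles(titles):
    # Ask about up to 50 titles in one request, joined by "|".
    params = urllib.urlencode({
        'action': 'query',
        'format': 'json',
        'titles': u'|'.join(titles[:50]).encode('utf-8'),
    })
    data = json.load(urllib.urlopen(API, params))  # POST keeps the URL short
    # Pages that do not exist come back with a "missing" key.
    return [p['title'] for p in data['query']['pages'].values()
            if 'missing' in p]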
2011/3/7 Andre Engels andreengels@gmail.com
>>> page = wikipedia.Page(wikipedia.getSite(), "Avant_l%27aurore_(court-m%C3%A9trage)")
>>> page.urlname()
'Avant_l%27aurore_%28court-m%C3%A9trage%29'
Hi, Andre,
I am afraid I was not clear enough. I don't want to encode titles; I want to recognize encoded titles. I wrote urlencode() erroneously; I meant urlname(), sorry for that.
*sourcelines* contains the HTML, and I look in it for the article called *page*. Don't be afraid of the for loop: the requested title will appear near the beginning, so it is fast. The function returns True for "blue" articles and False for "red" ones. My code is:
def exists(page):
    replacements = [
        ('%28', '('),
        ('%29', ')'),
        ('%21', '!'),
        ('%2C', ','),
        ('%3A', ':'),
        # If false positives appear,
        # this list should be expanded.
    ]
    u = page.urlname()
    for x, y in replacements:
        u = u.replace(x, y)
    t = 'title=' + u + '&action=edit'
    for line in sourcelines:
        if t in line:
            sourcelines.remove(line)  # Makes the bot run MUCH faster!
            return False
    return True

I would like either a more complete list of replacements, or a function for this "half-encoding", if one exists in pywikibot. Or a web page that explains which characters are left unencoded. Thanks,
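A sketch of a variant that would drop the replacements list entirely, under the same assumptions (sourcelines holds the downloaded HTML, red-link hrefs contain "&action=edit"): percent-decode every linked title once with urllib.unquote(), which reverses each %XX escape and leaves raw characters alone, so it works no matter which characters MediaWiki chose not to encode:

import re
import urllib

# Build the lookup table once: decode the title of every red-link
# href found in the downloaded HTML.
red_titles = set()
for line in sourcelines:
    for m in re.finditer(r'title=([^&"]+)&action=edit', line):
        red_titles.add(urllib.unquote(m.group(1)))  # UTF-8 byte string

def exists(page):
    # Compare decoded text with decoded text; set membership is O(1),
    # so the line-removal speed trick is no longer needed.
    name = page.title().replace(u' ', u'_').encode('utf-8')
    return name not in red_titles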
For those using gmail: half of my last letter disappears as a quotation, so you need to expand it. :-)