Re: [Pywikipedia-l] Encoding in HTML source

7 Mar 2011

I always forget to reply to all. :-)

2011/3/7 John <phoenixoverride(a)gmail.com>

...
  why are you downloading HTML?

Because I have a long list of articles and I have to decide which of them
exist and which do not. Page.exists() would be a more comfortable
solution, but it downloads each page to find out, and is thus *extremely*
slow at this quantity (4 to 11 thousand titles!). So I analyze the HTML
source instead, which takes 1 minute. I tried both, and processing the
HTML is worth it.

2011/3/7 Andre Engels <andreengels(a)gmail.com>

...
  >>> page = wikipedia.Page(wikipedia.getSite(), "Avant_l%27aurore_(court-m%C3%A9trage)")
  >>> page.urlname()
  'Avant_l%27aurore_%28court-m%C3%A9trage%29'

Hi Andre, I am afraid I was not clear enough. I don't want to encode
titles; I want to recognize encoded titles.
I wrote urlencode erroneously; I meant urlname(), sorry for that.

*sourcelines* contains the HTML, and I look for the article called *page*.
Don't be afraid of the for loop: the requested title appears near the
beginning, so it is fast. This function returns True for "blue" articles and
False for "red" ones. My code is:

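    def exists(page):
        """Return True if page is a "blue" (existing) link in sourcelines."""
        # Characters that stay unencoded in the HTML source.
        replacements = [
            ('%28', '('),
            ('%29', ')'),
            ('%21', '!'),
            ('%2C', ','),
            ('%3A', ':'),
            # If false positives appear,
            # this list should be expanded.
        ]
        u = page.urlname()
        for x, y in replacements:
            u = u.replace(x, y)
        # Red links point to the edit form of the missing page.
        t = 'title=' + u + '&amp;action=edit'
        for line in sourcelines:
            if t in line:
                # Dropping the matched line makes the bot run MUCH faster!
                sourcelines.remove(line)
                return False
        return True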

I would like either a more complete list of replacements, or a function for
this "half-encoding" if one exists in pywikibot. Or a webpage that explains
which characters are left unencoded.
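
The "half-encoding" asked for above could be sketched with the standard
library instead of a replacement table. This is only a sketch under
assumptions: the name half_encode and the constant MEDIAWIKI_SAFE are made up
for illustration, and the set of characters left unencoded is my
understanding of MediaWiki's behaviour, which should be checked against the
HTML your wiki actually produces. It is written for Python 3, where
urllib.parse.quote encodes text as UTF-8 by default.

```python
from urllib.parse import quote

# Assumption: characters that MediaWiki leaves unencoded in link URLs.
# Verify this set against real HTML output before relying on it.
MEDIAWIKI_SAFE = ";@$!*(),/~:"


def half_encode(title):
    """Percent-encode a title, keeping the assumed 'safe' characters as-is."""
    # Titles use underscores instead of spaces in URLs.
    return quote(title.replace(" ", "_"), safe=MEDIAWIKI_SAFE)


print(half_encode("Avant l'aurore (court-métrage)"))
# prints: Avant_l%27aurore_(court-m%C3%A9trage)
```

With such a helper, the search string could be built from the plain title
directly, without calling urlname() and undoing part of its encoding
afterwards.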
Thanks,

-- 
Bináris
