Because I have a long list of articles and I have to decide which of
them exist and which do not. Page.exists() would be a more comfortable
solution, but it downloads each page to decide, and is therefore
*extremely* slow at this quantity (4 to 11 thousand titles!). So instead
I analyze the HTML source, which takes 1 minute. I tried both, and
processing the HTML is worth it.
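For context, sourcelines is the rendered HTML, fetched once per run and split
into lines; roughly like this (the URL is only a placeholder for the page I
actually scan):

import urllib.request

# Sketch only: fetch the rendered HTML a single time and keep its lines in
# memory; every later existence check only searches this list, so there is
# no per-title request.
LIST_URL = 'https://xx.wikipedia.org/wiki/Some_list_page'  # placeholder
with urllib.request.urlopen(LIST_URL) as response:
    sourcelines = response.read().decode('utf-8').splitlines()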
2011/3/7 Andre Engels <andreengels@gmail.com>
>>> page = wikipedia.Page(wikipedia.getSite(), "Avant_l%27aurore_(court-m%C3%A9trage)")
>>> page.urlname()
'Avant_l%27aurore_%28court-m%C3%A9trage%29'
Hi, Andre,

I am afraid I was not clear enough: I don't want to encode titles; I want to
recognize encoded titles. I wrote urlencode erroneously, I meant urlname(),
sorry for that. sourcelines contains the HTML source, and I look for the
article referred to by page. Don't be afraid of the for loop: the requested
title appears near the beginning, so it is fast. The function returns True
for "blue" articles and False for "red" ones. My code is:
def exists(page):
    # A 'title=...&action=edit' link in the HTML means a red link,
    # i.e. the page does not exist.
    replacements = [
        ('%28', '('),
        ('%29', ')'),
        ('%21', '!'),
        ('%2C', ','),
        ('%3A', ':'),
        # If false positives appear,
        # this list should be expanded.
    ]
    u = page.urlname()
    for x, y in replacements:
        u = u.replace(x, y)
    t = 'title=' + u + '&action=edit'
    for line in sourcelines:
        if t in line:
            sourcelines.remove(line)  # Makes the bot run MUCH faster!
            return False
    return True
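Just to show how I drive it, roughly (the titles list below is only a
stand-in for my real list of several thousand names):

import wikipedia  # the same (old) framework as in the example above

# Placeholder for my real list of 4 to 11 thousand titles.
titles = ["Avant l'aurore (court-métrage)", 'Another title']

missing = []
for title in titles:
    page = wikipedia.Page(wikipedia.getSite(), title)
    if not exists(page):   # False = the edit link was found, i.e. a red link
        missing.append(title)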
I would like either a more complete list of replacements, or a function
for this "half-encoding", if one exists in pywikibot. Or a webpage that
explains which characters are left unencoded.
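To show what I mean by "half-encoding": something along these lines would do,
except that the set of characters left unencoded below is only my guess, not
the real MediaWiki rule:

from urllib.parse import quote, unquote

def mw_urlencode(title):
    # Decode whatever urlname() produced, then re-encode it the way the
    # rendered HTML seems to: spaces as underscores, and the characters in
    # 'safe' left alone. The safe set here is only an assumption on my part.
    decoded = unquote(title)
    return quote(decoded.replace(' ', '_'), safe="!$()*,/:;@~")

# With such a function the replacement loop above would become simply:
#     t = 'title=' + mw_urlencode(page.urlname()) + '&action=edit'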
Thanks,