I had a similar issue with updating templates in WLM template lists. This
function might be good inspiration:
https://github.com/rowiki/wikiro/blob/master/robots/python/pwb/monumente/co…
Strainu
On Friday, 3 February 2023, Roy Smith <roy(a)panix.com> wrote:
Thanks.
Sadly, I think treating this as flat text will end up being the most
straightforward way to do it.
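
A minimal sketch of that flat-text approach, using only the standard library. The function name extract_hooks and the sample text are made up for illustration, and it assumes each hook sits on a single line starting with "* ... that" and that the two marker comments are on lines of their own:

```python
import re

# Marker comments, tolerant of spacing and case (assumed format).
HOOKS_START = re.compile(r"<!--\s*Hooks\s*-->", re.I)
HOOKS_END = re.compile(r"<!--\s*HooksEnd\s*-->", re.I)


def extract_hooks(text):
    """Return the hook lines found between the two marker comments."""
    hooks = []
    inside = False
    for line in text.splitlines():
        if not inside:
            # Skip everything up to and including the start marker.
            if HOOKS_START.search(line):
                inside = True
            continue
        if HOOKS_END.search(line):
            break
        if line.startswith("* ... that"):
            hooks.append(line)
    return hooks


# Hypothetical, shortened page text.
SAMPLE = "\n".join([
    "intro",
    "<!--Hooks-->",
    "{{main page image/DYK|image=X.webp|caption=Y}}",
    "* ... that '''[[A]]''' is b?",
    "* ... that [[C]] is d?",
    "<!--HooksEnd-->",
    "outro",
])

hooks = extract_hooks(SAMPLE)
```

Non-hook lines inside the markers, such as the image template, are simply skipped; a hook that spans several lines would need extra handling.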
On Feb 2, 2023, at 7:03 PM, JJMC89 <jjmc89.wikimedia(a)gmail.com> wrote:
For similar cases, I have used a regex to find the part marked by comments
and then parse the part between.
import re
import mwparserfromhell

START_END = re.compile(
    r"^(.*?<!--\s*Hooks\s*-->)(.*?)(<!--\s*HooksEnd\s*-->.*)$",
    flags=re.I | re.S,
)
m = START_END.search(page.text)
if m:  # m is None when the page lacks the marker comments
    wikicode = mwparserfromhell.parse(m.group(2))
    # do stuff with wikicode
You may be able to do it with the parser.
# Assume start and end are the Comment objects you found via
# wikicode.filter_comments().
start_index = wikicode.index(start)
end_index = wikicode.index(end)
# Skip the start comment itself; the slice already excludes the end comment.
inside = wikicode.nodes[start_index + 1:end_index]
On Thu, Feb 2, 2023 at 3:39 PM Roy Smith <roy(a)panix.com> wrote:
>
> I'm trying to parse DYK prep area templates, for example Template:Did you
> know/Preparation area 3. Unfortunately, these are more like flat text files
> than any kind of nicely structured data. The stuff of interest is everything
> between two HTML comments:
>
> <!--Hooks-->
> {{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong, commonly replicated by the Step Chickens<!--the caption length is intentional, it highlights that this image is there for a specific purpose and isn't just any image of Ong – please don't shorten it! Same for the ''(shown)'' –leek -->}}
> * ... that "Step Chickens" on TikTok replace their profile pictures with an image ''(shown)'' of '''[[Melissa Ong]]''', whom they call "Mother Hen"?
> * ... that '''[[interfaith greetings in Indonesia]]''' include phrases from Islam, Christianity, Hinduism, Buddhism, and Confucianism?
> * ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
> * ... that the [[Pulitzer Prize for Fiction|Pulitzer Prize]]-winning novel '''''[[All the Light We Cannot See]]''''' contains a sympathetic [[Nazism|Nazi]]?
> * ... that a {{Convert|10|ft|m|adj=mid|-tall|0}} '''[[Lady Rainier|statue of a woman]]''' in [[Seattle]] was commissioned by a local brewery in 1903?
> * ... that ...
> * ... that prior to entering politics, '''[[Herbert Salvatierra]]''' led a troupe of [[carnival]] ''[[comparsa]]s''?
> * ... that [[Winston Churchill]] published '''[[Are There Men on the Moon?|an essay on extraterrestrial life]]''' during the Second World War?
> <!--HooksEnd-->
>
> I can find the comments with Wikicode.filter_comments(). But once I've found
> the two delimiting comments, how do I grab the text between them? Or is the
> parser the wrong tool? Would I do better to treat the content of the page as
> flat text and just iterate over it line by line, teasing it apart with
> regexes?
>
> _______________________________________________
> pywikibot mailing list -- pywikibot(a)lists.wikimedia.org
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/pywikibot@lists.wikimedia.org/m…
> To unsubscribe send an email to pywikibot-leave(a)lists.wikimedia.org