[pywikibot] Re: Text between two comments?

3 Feb 2023

When I was parsing similar text, I did a .split on the header part and
then parsed the resulting sections.
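
A minimal sketch of that split approach, assuming the section is delimited by the literal comments `<!--Hooks-->` and `<!--HooksEnd-->` (the names come from the regex quoted further down). The helper name `hooks_section` and the sample text are made up for illustration:

```python
def hooks_section(text, start="<!--Hooks-->", end="<!--HooksEnd-->"):
    """Return the text between the two markers, or None if either is missing."""
    # Split once on the start marker; the second piece is everything after it.
    after_start = text.split(start, 1)
    if len(after_start) != 2:
        return None
    # Split once on the end marker; the first piece is the section body.
    before_end = after_start[1].split(end, 1)
    if len(before_end) != 2:
        return None
    return before_end[0]


# Hypothetical sample, not real page content.
sample = (
    "intro text\n"
    "<!--Hooks-->\n"
    "* ... that first hook?\n"
    "* ... that second hook?\n"
    "<!--HooksEnd-->\n"
    "trailing text\n"
)

section = hooks_section(sample)
hooks = [line for line in section.splitlines() if line.startswith("*")]
```

An exact-string split like this fails if the comments contain extra whitespace or different capitalisation, so it only fits pages where the markers are written consistently.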

On Thu, Feb 2, 2023 at 7:26 PM Roy Smith <roy(a)panix.com> wrote:

...
  Thanks.

 Sadly, I think treating this as flat text will end up being the most
 straightforward way to do it.

 On Feb 2, 2023, at 7:03 PM, JJMC89 <jjmc89.wikimedia(a)gmail.com> wrote:

 For similar cases, I have used a regex to find the part marked by comments
 and then parse the part between.

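 import re
 import mwparserfromhell

 # Groups: (1) everything up to and including the start comment,
 # (2) the text between the comments, (3) the end comment and the rest.
 START_END = re.compile(
     r"^(.*?<!--\s*Hooks\s*-->)(.*?)(<!--\s*HooksEnd\s*-->.*)$",
     flags=re.I | re.S,
 )
 m = START_END.search(page.text)
 wikicode = mwparserfromhell.parse(m.group(2))
 # do stuff with wikicode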
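
The regex idea can be tried without a live page. The sketch below is a trimmed variant that captures only the middle part and uses the standard library alone. The name `between_comments` and the sample string are hypothetical. It also guards against the case where the markers are absent, since `search` returns `None` then:

```python
import re

# Case-insensitive, and "." also matches newlines so the section may span lines.
HOOKS = re.compile(
    r"<!--\s*Hooks\s*-->(.*?)<!--\s*HooksEnd\s*-->",
    flags=re.I | re.S,
)


def between_comments(text):
    """Return the text between the two comments, or None if they are not found."""
    m = HOOKS.search(text)
    return m.group(1) if m else None


# Hypothetical sample; note the extra spaces and lower case in the start comment.
sample = "header\n<!-- hooks -->\n* ... that A?\n<!--HooksEnd-->\nfooter\n"
inside = between_comments(sample)
```

The returned string can then be handed to a wikitext parser or processed directly.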

 You may be able to do it with the parser.
 # Assume start and end are the comment nodes you found via
 # wikicode.filter_comments().
 start_index = wikicode.index(start)
 end_index = wikicode.index(end)
 # Skip the start comment itself; the slice already excludes the end comment.
 inside = wikicode.nodes[start_index + 1:end_index]

 On Thu, Feb 2, 2023 at 3:39 PM Roy Smith <roy(a)panix.com> wrote:

  I'm trying to parse DYK prep area templates, for example
 Template:Did you know/Preparation area 3
 <https://en.wikipedia.org/wiki/Template:Did_you_know/Preparation_area_3>.
 Unfortunately, these are more like flat text files than any kind of nicely
 structured data.  The stuff of interest is everything between two HTML
 comments:

 {{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong, commonly replicated by the Step Chickens}}
 * ... that "Step Chickens" on TikTok replace their profile pictures with an image ''(shown)'' of '''[[Melissa Ong]]''', whom they call "Mother Hen"?
 * ... that '''[[interfaith greetings in Indonesia]]''' include phrases from Islam, Christianity, Hinduism, Buddhism, and Confucianism?
 * ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
 * ... that the [[Pulitzer Prize for Fiction|Pulitzer Prize]]-winning novel '''''[[All the Light We Cannot See]]''''' contains a sympathetic [[Nazism|Nazi]]?
 * ... that a {{Convert|10|ft|m|adj=mid|-tall|0}} '''[[Lady Rainier|statue of a woman]]''' in [[Seattle]] was commissioned by a local brewery in 1903?
 * ... that ...
 * ... that prior to entering politics, '''[[Herbert Salvatierra]]''' led a troupe of [[carnival]] ''[[comparsa]]s''?
 * ... that [[Winston Churchill]] published '''[[Are There Men on the Moon?|an essay on extraterrestrial life]]''' during the Second World War?

 I can find the comments with Wikicode.filter_comments().  But once I've
 found the two delimiting comments, how do I grab the text between them?  Or
 is the parser the wrong tool?  Would I do better to treat the content of
 the page as flat text and just iterate over it line by line, teasing it
 apart with regexes?
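
For comparison, here is a rough sketch of the flat-text, line-by-line route. It assumes the markers are `<!--Hooks-->` and `<!--HooksEnd-->` on lines of their own and that each hook is a single line starting with `*`. The function name `bolded_targets` and the sample are invented for illustration, and the bold-link regex is deliberately simple:

```python
import re

START = re.compile(r"<!--\s*Hooks\s*-->", re.I)
END = re.compile(r"<!--\s*HooksEnd\s*-->", re.I)
# Matches '''[[Target]]''' or '''[[Target|label]]''' and captures Target.
BOLD_LINK = re.compile(r"'''\[\[([^\]|]+)(?:\|[^\]]*)?\]\]'''")


def bolded_targets(text):
    """Return the bolded link target of each hook line between the markers."""
    targets = []
    in_hooks = False
    for line in text.splitlines():
        if not in_hooks:
            # Ignore everything until the start comment is seen.
            in_hooks = bool(START.search(line))
            continue
        if END.search(line):
            break
        if line.lstrip().startswith("*"):
            m = BOLD_LINK.search(line)
            if m:
                targets.append(m.group(1).strip())
    return targets


# Two hooks taken from the example above, plus a made-up line after the end marker.
sample = "\n".join([
    "preamble",
    "<!--Hooks-->",
    "* ... that '''[[Kimmo Leinonen]]''' helped establish both the "
    "[[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?",
    "* ... that [[Winston Churchill]] published '''[[Are There Men on the "
    "Moon?|an essay on extraterrestrial life]]''' during the Second World War?",
    "<!--HooksEnd-->",
    "* ... that '''[[Outside The Markers]]''' should be ignored?",
])

targets = bolded_targets(sample)
```

Hooks without a bolded link, such as the placeholder `* ... that ...`, are simply skipped by this sketch.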

 _______________________________________________
 pywikibot mailing list -- pywikibot(a)lists.wikimedia.org
 Public archives at
 https://lists.wikimedia.org/hyperkitty/list/pywikibot@lists.wikimedia.org/m…
 To unsubscribe send an email to pywikibot-leave(a)lists.wikimedia.org
