Text between two comments?


Roy Smith

2 Feb 2023

11:39 p.m.

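I'm trying to parse DYK prep area templates, for example Template:Did you know/Preparation area 3 <https://en.wikipedia.org/wiki/Template:Did_you_know/Preparation_area_3>. Unfortunately, these are more like flat text files than any kind of nicely structured data. The stuff of interest is everything between two HTML comments:

{{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong, commonly replicated by the Step Chickens}}
* ... that "Step Chickens" on TikTok replace their profile pictures with an image ''(shown)'' of '''[[Melissa Ong]]''', whom they call "Mother Hen"?
* ... that '''[[interfaith greetings in Indonesia]]''' include phrases from Islam, Christianity, Hinduism, Buddhism, and Confucianism?
* ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
* ... that the [[Pulitzer Prize for Fiction|Pulitzer Prize]]-winning novel '''''[[All the Light We Cannot See]]''''' contains a sympathetic [[Nazism|Nazi]]?
* ... that a {{Convert|10|ft|m|adj=mid|-tall|0}} '''[[Lady Rainier|statue of a woman]]''' in [[Seattle]] was commissioned by a local brewery in 1903?
* ... that ...
* ... that prior to entering politics, '''[[Herbert Salvatierra]]''' led a troupe of [[carnival]] ''[[comparsa]]s''?
* ... that [[Winston Churchill]] published '''[[Are There Men on the Moon?|an essay on extraterrestrial life]]''' during the Second World War?

I can find the comments with Wikicode.filter_comments(). But once I've found the two delimiting comments, how do I grab the text between them? Or is the parser the wrong tool? Would I do better to treat the content of the page as flat text and just iterate over it line by line, teasing it apart with regexes?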


JJMC89

3 Feb 2023

12:03 a.m.

For similar cases, I have used a regex to find the part marked by comments and then parse the part between.

    import re
    import mwparserfromhell

    START_END = re.compile(r"^(.*?)(.*?)(.*)$", flags=re.I | re.S)
    m = START_END.search(page.text)
    wikicode = mwparserfromhell.parse(m.group(2))
    # do stuff with wikicode

You may be able to do it with the parser.

    # assume start and end represent comment objects you found from wikicode.filter_comments()
    start_index = wikicode.index(start)
    end_index = wikicode.index(end)
    # start_index + 1 keeps the start comment itself out of the slice
    inside = wikicode.nodes[start_index + 1:end_index]

On Thu, Feb 2, 2023 at 3:39 PM Roy Smith <roy(a)panix.com> wrote:

> I'm trying to parse DYK prep area templates, for example Template:Did you know/Preparation area 3 <https://en.wikipedia.org/wiki/Template:Did_you_know/Preparation_area_3>. Unfortunately, these are more like flat text files than any kind of nicely structured data. The stuff of interest is everything between two HTML comments:
>
> [...]
>
> I can find the comments with Wikicode.filter_comments(). But once I've found the two delimiting comments, how do I grab the text between them? Or is the parser the wrong tool? Would I do better to treat the content of the page as flat text and just iterate over it line by line, teasing it apart with regexes?
_______________________________________________
pywikibot mailing list -- pywikibot(a)lists.wikimedia.org
Public archives at https://lists.wikimedia.org/hyperkitty/list/pywikibot@lists.wikimedia.org/m…
To unsubscribe send an email to pywikibot-leave(a)lists.wikimedia.org

Roy Smith

12:26 a.m.

Thanks. Sadly, I think treating this as flat text will end up being the most straightforward way to do it.
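A flat-text version, using only the standard library, might look something like the sketch below. It assumes the delimiting comments are `<!--Hooks-->` and `<!--HooksEnd-->` and that every hook line starts with `* ... that`; the function name `extract_hooks` and the sample page are hypothetical.

```python
import re

# Assumed delimiters; the marker names are not confirmed by the thread.
SECTION = re.compile(r"<!--\s*Hooks\s*-->(.*?)<!--\s*HooksEnd\s*-->", flags=re.I | re.S)
# A hook line is assumed to begin with "* ... that".
HOOK = re.compile(r"^\*\s*\.\.\.\s*that\b")


def extract_hooks(page_text):
    """Return the hook lines found between the two comments (hypothetical helper)."""
    section = SECTION.search(page_text)
    if section is None:
        return []
    hooks = []
    # Walk the section line by line and keep only lines that look like hooks.
    for line in section.group(1).splitlines():
        line = line.strip()
        if HOOK.match(line):
            hooks.append(line)
    return hooks


# Invented sample text, shaped like a prep area page.
sample = (
    "{{DYKbox}}\n"
    "<!--Hooks-->\n"
    "{{main page image/DYK|image=Example.jpg|caption=Example}}\n"
    "* ... that '''[[Foo]]''' is an example?\n"
    "* ... that [[Bar]] is another?\n"
    "<!--HooksEnd-->\n"
    "footer text\n"
)
hooks = extract_hooks(sample)
```

Lines that do not match the hook pattern, such as the image template, are simply skipped here; a real bot would likely want to handle them separately.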


John

12:48 a.m.

When I was parsing similar text, I did a .split on the header part and then parsed the sections.

On Thu, Feb 2, 2023 at 7:26 PM Roy Smith <roy(a)panix.com> wrote:

> Thanks. Sadly, I think treating this as flat text will end up being the most straightforward way to do it.
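The split-based approach mentioned above could be sketched like this. The marker strings are assumptions (the thread does not name them), `split_on_markers` is a hypothetical helper, and the sample page is invented. Splitting into three pieces keeps the surrounding text intact, so an edited body can be reassembled into a full page.

```python
# Assumed marker strings; not confirmed by the thread.
START = "<!--Hooks-->"
END = "<!--HooksEnd-->"


def split_on_markers(page_text, start_marker=START, end_marker=END):
    """Split a page into (header, body, footer) around two markers (hypothetical helper)."""
    parts = page_text.split(start_marker, 1)
    if len(parts) != 2:
        raise ValueError("start marker not found")
    header, rest = parts
    parts = rest.split(end_marker, 1)
    if len(parts) != 2:
        raise ValueError("end marker not found")
    body, footer = parts
    return header, body, footer


# Invented sample text, shaped like a prep area page.
sample = "top\n<!--Hooks-->\n* ... that a thing exists?\n<!--HooksEnd-->\nbottom\n"
header, body, footer = split_on_markers(sample)
# Putting the pieces back together with the markers restores the original page.
rebuilt = header + START + body + END + footer
```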

Strainu

6:03 p.m.

I had a similar issue with updating templates in WLM template lists. This function might be a good source of inspiration: https://github.com/rowiki/wikiro/blob/master/robots/python/pwb/monumente/co…

Strainu

On Friday, 3 February 2023, Roy Smith <roy(a)panix.com> wrote:

> Thanks. Sadly, I think treating this as flat text will end up being the most straightforward way to do it.