Text between two comments?


Roy Smith

3 Feb 2023

3:39 a.m.

I'm trying to parse DYK prep area templates, for example Template:Did you know/Preparation area 3 (https://en.wikipedia.org/wiki/Template:Did_you_know/Preparation_area_3). Unfortunately, these are more like flat text files than any kind of nicely structured data. The stuff of interest is everything between two HTML comments:

...

{{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong, commonly replicated by the Step Chickens}}

... that "Step Chickens" on TikTok replace their profile pictures with an image ''(shown)'' of '''[[Melissa Ong]]''', whom they call "Mother Hen"?

... that '''[[interfaith greetings in Indonesia]]''' include phrases from Islam, Christianity, Hinduism, Buddhism, and Confucianism?

... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?

... that the [[Pulitzer Prize for Fiction|Pulitzer Prize]]-winning novel '''''[[All the Light We Cannot See]]''''' contains a sympathetic [[Nazism|Nazi]]?

... that a {{Convert|10|ft|m|adj=mid|-tall|0}} '''[[Lady Rainier|statue of a woman]]''' in [[Seattle]] was commissioned by a local brewery in 1903?

... that ...

... that prior to entering politics, '''[[Herbert Salvatierra]]''' led a troupe of [[carnival]] ''[[comparsa]]s''?

... that [[Winston Churchill]] published '''[[Are There Men on the Moon?|an essay on extraterrestrial life]]''' during the Second World War?

I can find the comments with Wikicode.filter_comments(). But once I've found the two delimiting comments, how do I grab the text between them? Or is the parser the wrong tool? Would I do better to treat the content of the page as flat text and just iterate over it line by line, teasing it apart with regexes?


JJMC89

3 Feb

4:03 a.m.

For similar cases, I have used a regex to find the part marked by comments and then parse the part between.

    START_END = re.compile(r"^(.*?)(.*?)(.*)$", flags=re.I | re.S)
    m = START_END.search(page.text)
    wikicode = mwparserfromhell.parse(m.group(2))
    # do stuff with wikicode
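A minimal sketch of the regex approach, assuming the two delimiting comments are `<!--Hooks-->` and `<!--HooksEnd-->`. Those marker names are an assumption rather than something stated in this message, so substitute whatever comments the page really uses. The helper name `text_between_comments` and the sample text are hypothetical as well.

```python
import re

# ASSUMPTION: the delimiting comments are <!--Hooks--> and <!--HooksEnd-->.
# Optional whitespace inside the comments is tolerated; matching is
# case-insensitive and "." spans newlines because of re.S.
START_END = re.compile(
    r"<!--\s*Hooks\s*-->(.*?)<!--\s*HooksEnd\s*-->",
    flags=re.I | re.S,
)


def text_between_comments(page_text):
    """Return the text between the two marker comments, or None if absent."""
    m = START_END.search(page_text)
    return m.group(1) if m else None


# Hypothetical sample standing in for page.text
sample = (
    "intro\n"
    "<!--Hooks-->\n"
    "* ... that A?\n"
    "* ... that B?\n"
    "<!--HooksEnd-->\n"
    "outro\n"
)
print(text_between_comments(sample))
```

The returned string can then be handed to mwparserfromhell.parse() as in the snippet above. Checking for a failed match first avoids an AttributeError on pages that lack one of the comments.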

You may be able to do it with the parser.

    # assume start and end represent comment objects you found from wikicode.filter_comments()
    start_index = wikicode.index(start)
    end_index = wikicode.index(end)
    # start_index + 1 skips the start comment itself, leaving only the nodes in between
    inside = wikicode.nodes[start_index + 1:end_index]


pywikibot mailing list -- pywikibot@lists.wikimedia.org Public archives at https://lists.wikimedia.org/hyperkitty/list/pywikibot@lists.wikimedia.org/me... To unsubscribe send an email to pywikibot-leave@lists.wikimedia.org

Roy Smith

4:26 a.m.

Thanks.

Sadly, I think treating this as flat text will end up being the most straightforward way to do it.
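For illustration, a rough sketch of what the flat-text route could look like: walk the page line by line, start collecting at the opening comment, and stop at the closing one. The marker strings `<!--Hooks-->` and `<!--HooksEnd-->`, the function name `extract_hooks`, and the sample page are all assumptions made for this example, and it further assumes each hook sits on a single line beginning with "... that", optionally preceded by a "*" bullet.

```python
def extract_hooks(page_text, start="<!--Hooks-->", end="<!--HooksEnd-->"):
    """Collect hook lines found between the start and end marker lines.

    ASSUMPTION: the markers each occupy their own line, and every hook is a
    single line starting with "... that" (after an optional "*" bullet).
    """
    hooks = []
    inside = False
    for line in page_text.splitlines():
        stripped = line.strip()
        if stripped == start:
            inside = True
        elif stripped == end:
            break  # nothing after the closing comment is of interest
        elif inside:
            candidate = stripped.lstrip("*").strip()
            if candidate.startswith("... that"):
                hooks.append(candidate)
    return hooks


# Hypothetical sample page
sample = "\n".join([
    "header text",
    "<!--Hooks-->",
    "{{main page image/DYK|image=Example.webp|caption=Example}}",
    "* ... that '''[[Foo]]''' is a thing?",
    "* ... that '''[[Bar]]''' is another?",
    "<!--HooksEnd-->",
    "* ... that this line is outside the markers?",
])

for hook in extract_hooks(sample):
    print(hook)
```

Non-hook lines between the markers, such as the image template, are skipped here. A real script would probably want to capture that template separately.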


John

4:48 a.m.

When I was parsing similar text, I did a .split on the header part and then parsed the resulting sections.
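A small sketch of how a split-based version might look, assuming the text is split once on an opening marker and once on a closing marker. The marker strings and the name `split_between` are hypothetical, chosen only for this example; the actual code behind the message above is not shown in the thread.

```python
def split_between(text, start="<!--Hooks-->", end="<!--HooksEnd-->"):
    """Return the text between start and end using str.split, or None.

    ASSUMPTION: the marker strings are illustrative placeholders.
    """
    head_and_rest = text.split(start, 1)
    if len(head_and_rest) != 2:
        return None  # start marker not found
    body_and_tail = head_and_rest[1].split(end, 1)
    if len(body_and_tail) != 2:
        return None  # end marker not found
    return body_and_tail[0]


# Hypothetical sample text
sample = "before<!--Hooks-->\n* ... that A?\n<!--HooksEnd-->after"
print(split_between(sample))
```

Passing maxsplit=1 keeps later occurrences of a marker from producing extra pieces. The section that comes back can be parsed further on its own.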


Strainu

10:03 p.m.

I had a similar issue with updating templates in WLM template lists. This function might be good inspiration: https://github.com/rowiki/wikiro/blob/master/robots/python/pwb/monumente/cor...

Strainu

