I'm trying to parse DYK prep area templates, for example Template:Did you know/Preparation area 3 (https://en.wikipedia.org/wiki/Template:Did_you_know/Preparation_area_3). Unfortunately, these are more like flat text files than any kind of nicely structured data. The stuff of interest is everything between two HTML comments:
<!--Hooks-->
{{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong, commonly replicated by the Step Chickens<!--the caption length is intentional, it highlights that this image is there for a specific purpose and isn't just any image of Ong – please don't shorten it! Same for the ''(shown)'' –leek -->}}
- ... that "Step Chickens" on TikTok replace their profile pictures with an image ''(shown)'' of '''[[Melissa Ong]]''', whom they call "Mother Hen"?
- ... that '''[[interfaith greetings in Indonesia]]''' include phrases from Islam, Christianity, Hinduism, Buddhism, and Confucianism?
- ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
- ... that the [[Pulitzer Prize for Fiction|Pulitzer Prize]]-winning novel '''''[[All the Light We Cannot See]]''''' contains a sympathetic [[Nazism|Nazi]]?
- ... that a {{Convert|10|ft|m|adj=mid|-tall|0}} '''[[Lady Rainier|statue of a woman]]''' in [[Seattle]] was commissioned by a local brewery in 1903?
- ... that ...
- ... that prior to entering politics, '''[[Herbert Salvatierra]]''' led a troupe of [[carnival]] ''[[comparsa]]s''?
- ... that [[Winston Churchill]] published '''[[Are There Men on the Moon?|an essay on extraterrestrial life]]''' during the Second World War?
<!--HooksEnd-->
I can find the comments with Wikicode.filter_comments(). But once I've found the two delimiting comments, how do I grab the text between them? Or is the parser the wrong tool? Would I do better to treat the content of the page as flat text and just iterate over it line by line, teasing it apart with regexes?
For similar cases, I have used a regex to find the part marked by comments and then parse the part between.
    import re
    import mwparserfromhell

    START_END = re.compile(
        r"^(.*?<!--\s*Hooks\s*-->)(.*?)(<!--\s*HooksEnd\s*-->.*)$",
        flags=re.I | re.S)
    m = START_END.search(page.text)
    wikicode = mwparserfromhell.parse(m.group(2))
    # do stuff with wikicode
You may be able to do it with the parser.

    # assume start and end are the comment nodes you found with wikicode.filter_comments()
    start_index = wikicode.index(start)
    end_index = wikicode.index(end)
    # start_index + 1 skips the opening comment node itself
    inside = wikicode.nodes[start_index + 1:end_index]
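For anyone who wants to try the regex variant without a live page, here is a minimal standard-library sketch. SAMPLE is a made-up, shortened stand-in for page.text, and extract_hooks_section is a hypothetical helper name, not something from pywikibot or mwparserfromhell. The pattern is reduced to the single group that is actually needed.

```python
import re

# Only the text between the two marker comments is captured.
# re.S lets "." span newlines; re.I tolerates case differences in the markers.
HOOKS = re.compile(r"<!--\s*Hooks\s*-->(.*?)<!--\s*HooksEnd\s*-->", flags=re.I | re.S)

# Hypothetical, shortened stand-in for page.text.
SAMPLE = """\
text before the hooks
<!--Hooks-->
{{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong<!--a nested comment-->}}
- ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
- ... that ...
<!--HooksEnd-->
text after the hooks
"""


def extract_hooks_section(text):
    """Return the text between the Hooks and HooksEnd comments, or None."""
    m = HOOKS.search(text)
    return m.group(1).strip() if m else None


section = extract_hooks_section(SAMPLE)
if section is not None:
    for line in section.splitlines():
        print(line)
```

The comment nested inside the template does not confuse the pattern, because only comments whose entire content is Hooks or HooksEnd are treated as delimiters. The returned string could then be handed to mwparserfromhell.parse() as in the snippet above.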
On Thu, Feb 2, 2023 at 3:39 PM Roy Smith roy@panix.com wrote:
[...]
pywikibot mailing list -- pywikibot@lists.wikimedia.org Public archives at https://lists.wikimedia.org/hyperkitty/list/pywikibot@lists.wikimedia.org/me... To unsubscribe send an email to pywikibot-leave@lists.wikimedia.org
Thanks.
Sadly, I think treating this as flat text will end up being the most straightforward way to do it.
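A rough sketch of what the flat-text route might look like, using only the standard library. It assumes each hook sits on a single line that starts with a bullet followed by "... that"; the bullet is accepted as either "*" or "-" because the thread shows "-" while wikitext lists normally use "*". iter_hooks and SAMPLE are hypothetical names for illustration.

```python
import re

START = re.compile(r"<!--\s*Hooks\s*-->", re.I)
END = re.compile(r"<!--\s*HooksEnd\s*-->", re.I)
# Assumption: one hook per line, "<bullet> ... that <rest>".
HOOK = re.compile(r"^[*-]\s*\.\.\.\s*that\b(.*)$")


def iter_hooks(text):
    """Yield the text after '... that' for each hook line between the markers."""
    inside = False
    for line in text.splitlines():
        if not inside:
            # Skip everything up to and including the opening marker line.
            inside = bool(START.search(line))
            continue
        if END.search(line):
            break
        m = HOOK.match(line.strip())
        if m:
            yield m.group(1).strip()


# Hypothetical, shortened stand-in for the page text.
SAMPLE = """\
text before the hooks
<!--Hooks-->
{{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong}}
- ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
- ... that ...
<!--HooksEnd-->
- ... that this line is outside the markers?
"""

hooks = list(iter_hooks(SAMPLE))
for hook in hooks:
    print(hook)
```

The image template line is skipped because it does not look like a hook, and the empty placeholder hook comes back as "...", so callers that want to ignore unfilled slots would need to filter for that themselves.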
On Feb 2, 2023, at 7:03 PM, JJMC89 jjmc89.wikimedia@gmail.com wrote:
[...]
When I was parsing similar text, I did a .split() on the header part and parsed the resulting sections.
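A minimal sketch of the split idea, applied to the two marker comments from the original question rather than to a header. It assumes the markers appear literally as "<!--Hooks-->" and "<!--HooksEnd-->" with no extra whitespace inside them; between is a hypothetical helper name.

```python
def between(text, start="<!--Hooks-->", end="<!--HooksEnd-->"):
    """Return the text between two literal markers, or None if either is missing."""
    head_and_rest = text.split(start, 1)
    if len(head_and_rest) != 2:
        return None  # opening marker not found
    body_and_tail = head_and_rest[1].split(end, 1)
    if len(body_and_tail) != 2:
        return None  # closing marker not found
    return body_and_tail[0]


# Hypothetical stand-in for the page text.
sample = "head\n<!--Hooks-->\nline one\nline two\n<!--HooksEnd-->\ntail\n"

body = between(sample)
lines = [ln for ln in body.splitlines() if ln.strip()]
print(lines)
```

Unlike a regex with \s*, a literal split is sensitive to spacing and case in the markers, so it only fits pages where the comments are written consistently.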
On Thu, Feb 2, 2023 at 7:26 PM Roy Smith roy@panix.com wrote:
[...]
I had a similar issue with updating templates in WLM template lists. This function might be good inspiration: https://github.com/rowiki/wikiro/blob/master/robots/python/pwb/monumente/cor...
Strainu
On Friday, 3 February 2023, Roy Smith roy@panix.com wrote:
[...]