Hi,
I'm trying to build a dataset that pairs summaries with full text bodies,
for training automatic text summarization models.
I was looking at the online API for retrieving the summary of a page, so I
could recreate it in my Spark code for parsing wiki dumps. Specifically, I
was looking at the regex in:
https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/Api…
$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';
With section marker start filled in:
$regexp = '/^(.*?)(?=' . \1\2 . ')/s';
However, when I plug that expression into an online tester (regex101.com),
I get this error for \2: "This token references a non-existent or invalid
subpattern".
Is this a bug, or am I substituting the marker incorrectly?
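For context, here is a minimal Python sketch of what I'm trying to reproduce
on the Spark side. It assumes the marker is a literal string rather than a
pair of backreferences. The value "\x01\x02" is only my guess at how "\1\2"
might be read as a PHP double-quoted string, and the function name
extract_lead and the fallback to the full text when no marker is found are my
own choices, not taken from the extension.

```python
import re

# Assumption: the section marker is a literal two-character string, not a
# pair of regex backreferences. "\x01\x02" is a placeholder for illustration.
SECTION_MARKER_START = "\x01\x02"

# Lazily capture everything from the start of the text up to (but not
# including) the first section marker. re.DOTALL mirrors the PHP /s flag.
# re.escape keeps the marker from being interpreted as regex syntax.
LEAD_RE = re.compile(
    r"^(.*?)(?=" + re.escape(SECTION_MARKER_START) + r")",
    re.DOTALL,
)


def extract_lead(text: str) -> str:
    """Return the text before the first section marker.

    Assumption: if no marker is present, the whole text is returned.
    """
    match = LEAD_RE.match(text)
    return match.group(1) if match else text
```

With this, extract_lead("Intro.\n" + SECTION_MARKER_START + "History") would
give back only the "Intro.\n" part, which is the behaviour I'm hoping matches
the online API.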
Also, the alternative branch is taken when plaintext is set to false. Am I
right that it is only for parsing HTML, and so not applicable to the XML in
wiki dumps?
Thanks for your help,
Dan Kramer