I don't know if it helps you, but the cirrussearch dumps contain the
opening text (the text before the first section header) broken out into
plain text. The dumps only contain the current (as of dump time) revision
of each article, with no historical data. The dumps themselves are
newline-delimited JSON, so they are not too hard to parse.
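For example, a minimal Python sketch of pulling the opening text out of one
of those dump files (the file name here is only a placeholder, and the
opening_text field name is what the content dumps use, so check both
against a current dump):

import gzip
import json

# Minimal sketch: iterate a CirrusSearch content dump (newline-delimited
# JSON) and yield the opening text of each page.  The file name below is
# a placeholder.
def iter_opening_text(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # Every other line is elasticsearch index metadata with no
            # page fields; skip anything without an opening_text.
            if "opening_text" in doc:
                yield doc.get("title"), doc["opening_text"]

for title, opening in iter_opening_text("cirrussearch-content.json.gz"):
    print(title, opening[:80])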
The cirrusbuilddoc property of the API returns roughly the same format as
the dumps.
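Something like the following should fetch that document for a single page
(a rough sketch; the exact response layout may differ, so inspect it before
relying on specific keys):

import requests

# Rough sketch: ask the query API for the CirrusSearch build document of
# one page and print its opening_text field (if present).
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "cirrusbuilddoc",
        "titles": "Albert Einstein",
        "format": "json",
        "formatversion": 2,
    },
).json()

for page in resp["query"]["pages"]:
    doc = page.get("cirrusbuilddoc", {})
    print(page["title"], doc.get("opening_text", "")[:80])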
On Fri, Oct 12, 2018 at 2:22 PM Platonides <platonides(a)gmail.com> wrote:
Are you sure you are getting HTML from the XML, and not *wikitext*?
Assuming you are working with wikitext, want everything up to the first
heading, and are handwaving things like a section heading set by a
template, you could break at the first line matching
/^(={1,6})[ \t]*(.+?)[ \t]*\1\s*$/m (see the function Parser::doHeadings
below).
In practice, splitting at "\n==" will give you the right result on 99% of
articles.
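In Python, a sketch of both approaches (the same heading pattern as
doHeadings, plus the cruder "\n==" fallback):

import re

# Matches a wikitext heading line: 1-6 '=' signs, the title, and the
# same number of '=' signs again.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1\s*$", re.MULTILINE)

def opening_section(wikitext):
    m = HEADING_RE.search(wikitext)
    if m:
        return wikitext[:m.start()]
    # Cruder fallback: everything before the first line starting with "==".
    return wikitext.split("\n==", 1)[0]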
If the library is really giving you HTML, it's even easier: split the
HTML at the first <h[1-6]> tag.
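A rough sketch of that, treating the HTML as a plain string rather than
parsing it properly:

import re

# Keep everything before the first <h1>..<h6> tag.
def opening_html(html):
    return re.split(r"<h[1-6][^>]*>", html, maxsplit=1)[0]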
Note that the wikitext will contain a lot of non-textual content, like
templates, tables, wikitext formatting, references... that you'd need to
clean up before applying your models.
However, other projects have done this in the past (sorry, I have no links
to them), so I would either do some very basic cleaning or reuse what
others have made.
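As an illustration of what "very basic cleaning" could mean (these regexes
ignore nested templates and tables, so treat them only as a rough first
pass):

import re

# Very basic wikitext cleaning: drop refs, innermost templates and
# tables, reduce links to their label, and strip bold/italic quotes.
def basic_clean(wikitext):
    text = re.sub(r"<ref[^>/]*/>", "", wikitext)                      # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # paired refs
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                        # innermost {{templates}}
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.DOTALL)          # {| tables |}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)     # [[link|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                 # ''italic'' / '''bold'''
    return text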
Best regards
===============================
public function doHeadings( $text ) {
    for ( $i = 6; $i >= 1; --$i ) {
        $h = str_repeat( '=', $i );
        // Trim non-newline whitespace from headings
        // Using \s* will break for: "==\n===\n" and parse as <h2>=</h2>
        $text = preg_replace( "/^(?:$h)[ \\t]*(.+?)[ \\t]*(?:$h)\\s*$/m",
            "<h$i>\\1</h$i>", $text );
    }
    return $text;
}
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/p…