Hi, 

I'm trying to build a dataset that pairs article summaries with their full text bodies, for training automatic text summarization models. 

I was looking at how the online API retrieves the summary of a page, so that I can recreate it in my Spark code for parsing wiki dumps. Specifically, I was looking at the regex in: https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/ApiQueryExtracts.php;012b89e966edf20834f0e551a66fbb4ebfd185cd$210

$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';

With the value of ExtractFormatter::SECTION_MARKER_START (\1\2) substituted in, I get:

$regexp = '/^(.*?)(?=\1\2)/s';

However, when I plug that expression into an online tester (regex101.com), it flags \2 with the error: "This token references a non-existent or invalid subpattern".

Is this a bug in the regex, or am I substituting the marker incorrectly?
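
To make the question concrete, here is a minimal Python sketch of what I would like to reproduce on the Spark side. It rests on an assumption I have not confirmed: that the marker is meant as the two literal control characters 0x01 and 0x02 (which is what a double-quoted PHP string "\1\2" would produce) rather than regex backreferences. The names `first_section` and `INTRO_RE` are my own, and returning the whole text when no marker is found is also just my guess at the intended behaviour.

```python
import re

# Assumption: the section marker is the literal two-character string
# 0x01 0x02, not the regex backreferences \1 and \2.
SECTION_MARKER_START = "\x01\x02"

# re.escape makes sure the marker is matched as literal text.
# re.DOTALL plays the role of the /s flag in the PHP pattern.
INTRO_RE = re.compile(
    r"^(.*?)(?=" + re.escape(SECTION_MARKER_START) + r")",
    re.DOTALL,
)


def first_section(text):
    """Return everything before the first section marker.

    If no marker is present, the text is returned unchanged
    (an assumption, not confirmed against the extension).
    """
    match = INTRO_RE.match(text)
    return match.group(1) if match else text
```

Written this way the pattern compiles without the backreference error, which is why I suspect the problem is in how I filled in the marker.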

The alternative branch is taken when plaintext is set to false. Am I right that this branch is only for parsing HTML, and so is not applicable to the XML in the wiki dumps?

Thanks for your help,
Dan Kramer