[Mediawiki-api] Question about summary regex in api for ML dataset

12 Oct 2018

Hi,

I'm trying to create a dataset of summaries vs full text bodies for
automatic text summarization models.

I was looking at the online API for retrieving the summary of a page, so that
I can recreate it in my Spark code for parsing wiki dumps. Specifically, I
was looking at the regex in:
https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/Api…

$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';

With section marker start filled in:

$regexp = '/^(.*?)(?=' . \1\2 . ')/s';

However, when I plug that expression into an online tester (regex101.com),
it reports the following for \2: "This token references a non-existent or
invalid subpattern".

Is this a bug, or am I substituting the marker incorrectly?
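
For reference, here is a minimal sketch of how I would try to reproduce the
pattern in Python. It rests on an assumption I have not confirmed: that the
marker is meant as the two literal control characters U+0001 and U+0002 (what
a PHP double-quoted "\1\2" would produce), not as backreferences. The names
extract_intro and INTRO_RE are my own, and returning the whole text when no
marker is found is my choice for the sketch.

```python
import re

# Assumption: the marker is the two control characters U+0001 U+0002,
# i.e. what a PHP double-quoted "\1\2" (octal escapes) would evaluate to.
SECTION_MARKER_START = "\x01\x02"

# Lazy match from the start of the text up to, but not including, the first
# marker. re.DOTALL mirrors the /s flag so "." also matches newlines.
INTRO_RE = re.compile(
    r"^(.*?)(?=" + re.escape(SECTION_MARKER_START) + r")",
    re.DOTALL,
)


def extract_intro(text: str) -> str:
    """Return the text before the first section marker.

    If no marker is present, the whole text is returned unchanged
    (a choice made for this sketch).
    """
    match = INTRO_RE.search(text)
    return match.group(1) if match else text
```

With that reading the lookahead contains no \1 or \2 tokens at all, so the
tester's complaint would not apply. Text from a dump would only match if the
same marker had been inserted before each section heading first.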

The alternative branch is taken when plaintext is set to false. Am I right
that that branch is for parsing HTML, and so is not applicable to the XML in
wiki dumps?

Thanks for your help,
Dan Kramer
