Hi Platonides,

Thanks so much for your reply.

That makes a lot more sense. Unfortunately, I can't find section names as elements in the XML schema (https://www.mediawiki.org/xml/export-0.10.xsd). Do you have any recommendations for parsing the intro section out of the XML dumps? I'm trying to avoid parsing HTML or querying the API, because I already have Cloud9's wiki XML reader for processing the XML dumps in Spark.

Thanks again,

Dan

On Fri, Oct 12, 2018 at 1:00 PM Platonides <platonides@gmail.com> wrote:

Those \1\2 are literal bytes. You would do:

$regexp = '/^(.*?)(?=\x01\x02)/s';

But those bytes are not present in the original wikitext; they are set
by ExtractFormatter:

    $html = preg_replace( '/\s*(<h([1-6])\b)/i',
        "\n\n" . self::SECTION_MARKER_START . '$2' .
        self::SECTION_MARKER_END . '$1',
        $html );
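
To make the mechanism above concrete, here is a minimal Python sketch of the same two steps: inserting the marker bytes in front of each HTML heading, then cutting everything before the first marker. It is only an illustration, not the extension's code. The start marker value is taken from the regexp above; the end marker value and the function names mark_sections and intro are assumptions made for this example.

```python
import re

SECTION_MARKER_START = "\x01\x02"  # bytes used in the regexp quoted above
SECTION_MARKER_END = "\x02\x01"    # assumption: value not stated in this thread


def mark_sections(html: str) -> str:
    # Python rendering of the preg_replace call quoted above: put the
    # markers, wrapped around the heading level, in front of each <h1>..<h6>.
    return re.sub(
        r"\s*(<h([1-6])\b)",
        lambda m: "\n\n" + SECTION_MARKER_START + m.group(2)
        + SECTION_MARKER_END + m.group(1),
        html,
        flags=re.IGNORECASE,
    )


def intro(marked: str) -> str:
    # Same idea as the PHP regexp with the /s flag: lazily match from the
    # start of the text up to the first start marker.
    m = re.match(r"(.*?)(?=\x01\x02)", marked, flags=re.DOTALL)
    # Hypothetical choice: with no marker at all, treat the whole text as intro.
    return m.group(1) if m else marked


html = "<p>Lead text.</p>\n<h2>History</h2><p>More.</p>"
marked = mark_sections(html)
lead = intro(marked).strip()
```

Applying intro to the raw wikitext from a dump would return the text unchanged, since the markers only exist after mark_sections (or the PHP equivalent) has run on rendered HTML.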

Best regards

PS: These are section names, not edit summaries.

_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api