Hi Platonides,
Thanks so much for your reply.
That makes a lot more sense. Unfortunately, I can't find section
names as elements in the XML schema
(https://www.mediawiki.org/xml/export-0.10.xsd). Do you have any
recommendations for parsing the intro section out of the XML dumps? I'm
trying to avoid parsing HTML or querying the API because I already have
Cloud9's wiki XML reader for processing the XML dumps in Spark.
Thanks again,
Dan
On Fri, Oct 12, 2018 at 1:00 PM Platonides <platonides(a)gmail.com> wrote:
Those \1\2 are literal bytes. You would do:
$regexp = '/^(.*?)(?=\x01\x02)/s';
But those bytes are not present in the original wikitext; they are
inserted into the generated HTML by ExtractFormatter:
$html = preg_replace( '/\s*(<h([1-6])\b)/i',
    "\n\n" .
    self::SECTION_MARKER_START . '$2' .
    self::SECTION_MARKER_END . '$1',
    $html );
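To illustrate the two steps together (marking each heading, then cutting at the first marker), here is a minimal standalone sketch. The marker values and the helper names markSections and firstSection are assumptions made for this example, not something taken from the extension; check ExtractFormatter in your TextExtracts version for the real constants.

```php
<?php
// Assumed marker bytes (octal escapes for 0x01 and 0x02); verify against
// ExtractFormatter before relying on them.
const SECTION_MARKER_START = "\1\2";
const SECTION_MARKER_END = "\2\1";

// Hypothetical helper: prefix every <h1>..<h6> with a marker that carries
// the heading level, mirroring the preg_replace call quoted above.
function markSections(string $html): string {
    return preg_replace(
        '/\s*(<h([1-6])\b)/i',
        "\n\n" . SECTION_MARKER_START . '$2' . SECTION_MARKER_END . '$1',
        $html
    );
}

// Hypothetical helper: return everything before the first section marker.
function firstSection(string $marked): string {
    if (preg_match('/^(.*?)(?=\x01\x02)/s', $marked, $m)) {
        return $m[1];
    }
    // No heading was marked, so the whole text is the intro.
    return $marked;
}

$html = "<p>Intro text.</p>\n<h2>History</h2><p>More text.</p>";
$marked = markSections($html);
$intro = rtrim(firstSection($marked));

echo $intro, "\n"; // prints <p>Intro text.</p>
```

This only works on HTML that has already been through the marking step. Raw wikitext from the dumps never contains these bytes, so the same regexp applied there falls through to the "no heading" branch.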
Best regards
PS: These are section names, not edit summaries.