I don't know if it helps you, but the CirrusSearch dumps contain the opening text (the text before the first section header) broken out as plain text. These dumps only contain the current (as of the time of the dump) version of each article, with no historical data. The dumps themselves are lines of JSON, so they're not too hard to parse.
https://dumps.wikimedia.org/other/cirrussearch/current/
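Pulling the opening text out of one of those files could look something like this (a rough sketch: the filename is made up, and I'm going from memory that the dump is in Elasticsearch bulk format, with action lines alternating with document lines):

<?php
// Rough sketch: stream a (gunzipped) cirrussearch dump and print each
// page's opening_text. Lines without a "title" field are assumed to be
// the bulk-format action lines and are skipped.
$fh = fopen( 'enwiki-20181001-cirrussearch-content.json', 'r' ); // made-up filename
while ( ( $line = fgets( $fh ) ) !== false ) {
    $doc = json_decode( $line, true );
    if ( !isset( $doc['title'] ) ) {
        continue; // bulk action line, not a document
    }
    echo $doc['title'], "\t", ( $doc['opening_text'] ?? '' ), "\n";
}
fclose( $fh );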
The cirrusbuilddoc property of the API returns roughly the same format as the dumps:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=jso...
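For a single page, something like this should work (untested sketch; I'm going from memory on the response layout, with query.pages keyed by page id):

<?php
// Untested sketch: fetch the cirrus document for one page and print
// its opening_text.
$url = 'https://en.wikipedia.org/w/api.php?action=query&format=json'
    . '&prop=cirrusbuilddoc&titles=' . urlencode( 'Albert Einstein' );
$data = json_decode( file_get_contents( $url ), true );
foreach ( $data['query']['pages'] as $page ) {
    echo $page['cirrusbuilddoc']['opening_text'] ?? '', "\n";
}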
On Fri, Oct 12, 2018 at 2:22 PM Platonides platonides@gmail.com wrote:
Are you sure you are getting HTML from the XML and not *wikitext*?
Assuming you are working with wikitext, want everything up to the first heading, and are handwaving things like a section heading set by a template, you could break at the first line matching /^(={2,6})[ \t]*(.+?)[ \t]*\1\s*$/m (see the function Parser::doHeadings below). In practice, splitting at "\n==" will give you the right result on 99% of articles.
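For example (a minimal sketch; getLeadSection is just a name I made up):

<?php
// Everything before the first heading line. With a limit of 2,
// preg_split returns the lead section as element 0.
function getLeadSection( $wikitext ) {
    $parts = preg_split( '/^(={2,6})[ \t]*.+?[ \t]*\1\s*$/m', $wikitext, 2 );
    return $parts[0];
}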
If the library is really giving you HTML, it's even easier: split the HTML at the first <h[1-6]>.
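e.g. (sketch, where $html is whatever the library gave you):

<?php
// The lead is everything before the first heading tag.
$lead = preg_split( '/<h[1-6][\s>]/i', $html, 2 )[0];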
Note that the wikitext will contain a lot of non-textual content like templates, tables, wikitext formatting, references... that you'd need to clean up before applying your models. However, other projects have done this in the past (sorry, I have no links to them), so I would either do a very basic cleaning or reuse what others made.
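A very basic cleaning could be as simple as this (sketch only; roughCleanWikitext is a made-up name, and nested templates, parser functions, etc. will slip through):

<?php
// Naive cleanup: strips the most common non-textual markup.
function roughCleanWikitext( $text ) {
    // Strip templates repeatedly so simple nesting unwinds.
    do {
        $text = preg_replace( '/\{\{[^{}]*\}\}/s', '', $text, -1, $count );
    } while ( $count > 0 );
    $text = preg_replace( '/\{\|.*?\|\}/s', '', $text );            // tables
    $text = preg_replace( '/<ref[^>]*\/>/', '', $text );            // <ref name="x"/>
    $text = preg_replace( '/<ref[^>]*>.*?<\/ref>/s', '', $text );   // <ref>...</ref>
    $text = preg_replace( '/\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]/', '$1', $text ); // links
    $text = preg_replace( "/'{2,}/", '', $text );                   // bold/italic quotes
    return $text;
}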
Best regards
===============================
public function doHeadings( $text ) {
    for ( $i = 6; $i >= 1; --$i ) {
        $h = str_repeat( '=', $i );
        // Trim non-newline whitespace from headings
        // Using \s* will break for: "==\n===\n" and parse as <h2>=</h2>
        $text = preg_replace(
            "/^(?:$h)[ \\t]*(.+?)[ \\t]*(?:$h)\\s*$/m",
            "<h$i>\\1</h$i>",
            $text
        );
    }
    return $text;
}
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/pa...