I don't know if it helps you, but the cirrussearch dumps contain the
opening text (the text before the first section header) broken out into
plain text. The dumps only contain the current (as of dump time) revision
of each article, with no historical data. The dumps themselves are
newline-delimited JSON, so they are not too hard to parse.
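For example, a minimal Python sketch of pulling the opening text out of one
of those dump files (the file name here is only a placeholder, and the
opening_text field name is what the content dumps use, so check both
against a current dump):

import gzip
import json

# Minimal sketch: iterate a CirrusSearch content dump (newline-delimited
# JSON) and yield the opening text of each page.  The file name below is
# a placeholder.
def iter_opening_text(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # Every other line is elasticsearch index metadata with no
            # page fields; skip anything without an opening_text.
            if "opening_text" in doc:
                yield doc.get("title"), doc["opening_text"]

for title, opening in iter_opening_text("cirrussearch-content.json.gz"):
    print(title, opening[:80])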
The cirrusbuilddoc property of the API returns roughly the same format as
the dumps.
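Something like the following should fetch that document for a single page
(a rough sketch; the exact response layout may differ, so inspect it before
relying on specific keys):

import requests

# Rough sketch: ask the query API for the CirrusSearch build document of
# one page and print its opening_text field (if present).
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "cirrusbuilddoc",
        "titles": "Albert Einstein",
        "format": "json",
        "formatversion": 2,
    },
).json()

for page in resp["query"]["pages"]:
    doc = page.get("cirrusbuilddoc", {})
    print(page["title"], doc.get("opening_text", "")[:80])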
On Fri, Oct 12, 2018 at 2:22 PM Platonides <platonides(a)gmail.com> wrote:
Are you sure you are getting HTML from the XML, and not *wikitext*?
Assuming you are working with wikitext, want everything up to the first
heading, and are handwaving things like a section heading set by a
template, you could break at the first line matching
/^(={1,6})[ \t]*(.+?)[ \t]*\1\s*$/m (see the function Parser::doHeadings
below).
In practice, splitting at "\n==" will give you the right result on 99% of
articles.
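In Python, a sketch of both approaches (the same heading pattern as
doHeadings, plus the cruder "\n==" fallback):

import re

# Matches a wikitext heading line: 1-6 '=' signs, the title, and the
# same number of '=' signs again.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1\s*$", re.MULTILINE)

def opening_section(wikitext):
    m = HEADING_RE.search(wikitext)
    if m:
        return wikitext[:m.start()]
    # Cruder fallback: everything before the first line starting with "==".
    return wikitext.split("\n==", 1)[0]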
If the library is really giving you HTML, it's even easier: split the
HTML at the first <h[1-6]> tag.
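A rough sketch of that, treating the HTML as a plain string rather than
parsing it properly:

import re

# Keep everything before the first <h1>..<h6> tag.
def opening_html(html):
    return re.split(r"<h[1-6][^>]*>", html, maxsplit=1)[0]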
Note that the wikitext will contain a lot of non-textual content, like
templates, tables, wikitext formatting, references... that you'd need to
clean up before applying your models.
However, other projects have done this in the past (sorry, I have no links
to them), so I would either do some very basic cleaning or reuse what
others have made.
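As an illustration of what "very basic cleaning" could mean (these regexes
ignore nested templates and tables, so treat them only as a rough first
pass):

import re

# Very basic wikitext cleaning: drop refs, innermost templates and
# tables, reduce links to their label, and strip bold/italic quotes.
def basic_clean(wikitext):
    text = re.sub(r"<ref[^>/]*/>", "", wikitext)                      # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # paired refs
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                        # innermost {{templates}}
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.DOTALL)          # {| tables |}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)     # [[link|label]] -> label
    text = re.sub(r"'{2,}", "", text)                                 # ''italic'' / '''bold'''
    return text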
Best regards
===============================
public function doHeadings( $text ) {
    for ( $i = 6; $i >= 1; --$i ) {
        $h = str_repeat( '=', $i );
        // Trim non-newline whitespace from headings
        // Using \s* will break for: "==\n===\n" and parse as <h2>=</h2>
        $text = preg_replace( "/^(?:$h)[ \\t]*(.+?)[ \\t]*(?:$h)\\s*$/m",
            "<h$i>\\1</h$i>", $text );
    }
    return $text;
}
https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/p…