I am writing a Java program to extract the abstract of the wikipedia page
given the title of the wikipedia page. I have done some research and found
out that the abstract with be in rvsection=0
So for example if I want the abstract of 'Eiffel Tower" wiki page then I am
querying using the api in the following way.
and parse the XML data which we get and take the wikitext in the tag <rev
xml:space="preserve"> which represents the abstract of the wikipedia page.
But this wiki text also contains the infobox data which I do not need. I
would like to know if there is anyway in which I can remove the infobox data
and get only the wikitext related to the page's abstract Or if there is any
alternative method by which I can get the abstract of the page directly.
Looking forward to your help.
Thanks in Advance
When list=allusers is used with auactiveusers, a property 'recenteditcount'
is returned in the result. In bug 67301 it was pointed out that this
property is including various other logged actions, and so should really be
named something like "recentactions".
Gerrit change 130093, merged today, adds the "recentactions" result
property. "recenteditcount" is also returned for backwards compatability,
but will be removed at some point during the MediaWiki 1.25 development
Any clients using this property should be updated to use the new property
name. The new property will be available on WMF wikis with 1.24wmf12, see
https://www.mediawiki.org/wiki/MediaWiki_1.24/Roadmap for the schedule.
Brad Jorsch (Anomie)
Mediawiki-api-announce mailing list
I'm trying to create a dataset of summaries vs full text bodies for
automatic text summarization models.
I was looking at the online api for retrieving the summary of a page, so I
could recreate it in my Spark code for parsing wiki dumps. Specifically, I
was looking at the regex in:
$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';
With section marker start filled in:
$regexp = '/^(.*?)(?=' . \1\2 . ')/s';
However, when I plug that expression into an online tester (regex101.com),
I see that: \2 This token references a non-existent or invalid subpattern
I am wondering if this is a bug or if I'm placing it incorrectly?
The alternative branch is when plaintext is set to false - that's for
parsing HTML correct / not applicable for the xml in wiki dumps?
Thanks for your help,