Mediawiki-api October 2018

mediawiki-api@lists.wikimedia.org

3 participants
3 discussions

Need to extract abstract of a wikipedia page

by aditya srinivas

Hello,

I am writing a Java program to extract the abstract of a Wikipedia page given its title. I have done some research and found that the abstract will be in rvsection=0. For example, if I want the abstract of the "Eiffel Tower" wiki page, I query the API in the following way:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Eiffel…

I then parse the XML data that is returned and take the wikitext in the tag <rev xml:space="preserve">, which represents the abstract of the Wikipedia page. But this wikitext also contains the infobox data, which I do not need.

I would like to know if there is any way to remove the infobox data and get only the wikitext related to the page's abstract, or if there is an alternative method by which I can get the abstract of the page directly.

Looking forward to your help.

Thanks in advance,
Aditya Uppu
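One alternative to parsing section 0 wikitext is the prop=extracts query module, which is provided by the TextExtracts extension and returns the lead section without the infobox. The following is a minimal sketch in Python rather than Java, assuming the extension is enabled on the target wiki and that the response uses the default JSON layout, with pages keyed by page ID. The helper names build_intro_url and intro_from_response are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumption: English Wikipedia endpoint with the TextExtracts extension enabled.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_intro_url(title):
    """Build a query URL asking for the plain-text lead section of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the content before the first section heading
        "explaintext": 1,   # plain text instead of HTML
        "redirects": 1,     # follow redirects to the target page
        "format": "json",
        "titles": title,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def intro_from_response(body):
    """Return the first extract found in a JSON response body, or None."""
    pages = json.loads(body).get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None
```

The URL from build_intro_url("Eiffel Tower") can be fetched with any HTTP client, including one written in Java, and the response body passed to a parser equivalent to intro_from_response.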

5 months

[Mediawiki-api-announce] Deprecation of list=allusers 'recenteditcount' result property

by Brad Jorsch (Anomie)

When list=allusers is used with auactiveusers, a property 'recenteditcount' is returned in the result. In bug 67301[1] it was pointed out that this property includes various other logged actions, and so should really be named something like "recentactions".

Gerrit change 130093,[2] merged today, adds the "recentactions" result property. "recenteditcount" is also returned for backwards compatibility, but will be removed at some point during the MediaWiki 1.25 development cycle.

Any clients using this property should be updated to use the new property name. The new property will be available on WMF wikis with 1.24wmf12; see https://www.mediawiki.org/wiki/MediaWiki_1.24/Roadmap for the schedule.

[1]: https://bugzilla.wikimedia.org/show_bug.cgi?id=67301
[2]: https://gerrit.wikimedia.org/r/#/c/130093/

--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
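A client that has to work both before and after the old property is removed could read the new name first and fall back to the old one. This is a minimal sketch, assuming each user entry from list=allusers has already been decoded into a Python dict. The helper name recent_actions is hypothetical.

```python
def recent_actions(user):
    """Return the recent action count for one decoded allusers entry.

    Prefers the new 'recentactions' property and falls back to the
    deprecated 'recenteditcount' on wikis that do not return the new one.
    Returns None if neither property is present.
    """
    if "recentactions" in user:
        return user["recentactions"]
    return user.get("recenteditcount")
```

Checking for the new name first means the client keeps working unchanged once "recenteditcount" is dropped.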

5 years

Question about summary regex in api for ML dataset

by Daniel Kramer

Hi,

I'm trying to create a dataset of summaries vs. full text bodies for automatic text summarization models. I was looking at the online API for retrieving the summary of a page, so that I could recreate it in my Spark code for parsing wiki dumps. Specifically, I was looking at the regex in:

https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/Api…

$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';

With the section marker start filled in:

$regexp = '/^(.*?)(?=' . \1\2 . ')/s';

However, when I plug that expression into an online tester (regex101.com), I see:

\2 This token references a non-existent or invalid subpattern

Is this a bug, or am I placing it incorrectly?

The alternative branch is taken when plaintext is set to false. Is that branch for parsing HTML, and therefore not applicable to the XML in wiki dumps?

Thanks for your help,
Dan Kramer
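One possible reading, offered as an assumption rather than a confirmed fact about the extension source: if SECTION_MARKER_START is defined as the PHP double-quoted string "\1\2", PHP interprets \1 and \2 as octal escapes, so the marker is the two control bytes 0x01 and 0x02 rather than a pair of regex backreferences. Pasting the literal characters \1\2 into a regex tester would then produce exactly the reported backreference error. The sketch below reproduces the lookahead under that assumption. The sample text and the name intro_before_first_section are hypothetical.

```python
import re

# Assumption: the marker is the two control bytes 0x01 0x02,
# not the regex backreferences \1 and \2.
SECTION_MARKER_START = "\x01\x02"

# Lazily capture everything before the first section marker.
# re.S makes '.' match newlines, like the /s modifier in the PHP pattern.
INTRO_PATTERN = re.compile(
    "^(.*?)(?=" + re.escape(SECTION_MARKER_START) + ")", re.S
)


def intro_before_first_section(text):
    """Return the text before the first section marker, or None if absent."""
    match = INTRO_PATTERN.search(text)
    return match.group(1) if match else None
```

Like the PHP pattern, this finds no match when the text contains no marker at all, so a caller would need to decide how to treat pages without section headings.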

5 years, 6 months
