Mediawiki-api November 2023

mediawiki-api@lists.wikimedia.org

4 participants
2 discussions

Extract more than 10.000 Files from search call

by rzissoldt＠gmail.com

Hello,

I am currently gathering image data for my master's thesis. I am using the QLabels from Wikidata to crawl specific image classes (such as axe, car, etc.). I am using the Action API for my requests, and here is my problem: the QLabel Q870 (train) has around 21k images. I use the sroffset parameter and the "continue" parameter from the response to search for 500 images at a time. The script works until I reach the 10k limit (the message is something like "you request exceeded the limit of 10000 items ..."). Is there any option that lets me crawl more than 10k items/images from one search query?

My search query looks like this:

    params = {
        'action': 'query',
        'format': 'json',
        'list': 'search',
        'srsearch': search_query,
        'srnamespace': '0|6|12|14|100|106',  # Namespace filter based on the provided URL
        'srlimit': batch_size,               # Number of images per batch
        'sroffset': start,                   # Offset for pagination
        'prop': 'info|imageinfo',            # Request additional information about the pages (images)
        'inprop': 'url'                      # Include the URL information
    }

The 'sroffset' parameter is always updated with the value from the "continue" param of the response I get.

It would be great if somebody could help me! Thank you!

Kind regards,
Ruben
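For reference, the pagination described in this message can be sketched as below. This is only a minimal illustration, not the poster's actual script: the endpoint in `API_URL` is an assumption (the message does not say which wiki is queried), the names `http_fetch` and `iter_search` are hypothetical, and `max_offset=10000` simply mirrors the limit quoted in the error message. The 'prop' and 'inprop' keys from the message are left out to keep the focus on the offset handling. The sketch stops cleanly at the limit; it does not get around it.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the message does not name the wiki; Commons is used for illustration.
API_URL = "https://commons.wikimedia.org/w/api.php"


def http_fetch(params):
    """Send one GET request to the Action API and return the decoded JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(
        url, headers={"User-Agent": "search-pagination-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_search(search_query, batch_size=500, fetch=http_fetch, max_offset=10000):
    """Yield search hits batch by batch, following the 'continue' block.

    max_offset mirrors the 10000-item limit quoted in the message (assumption).
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": search_query,
        "srnamespace": "0|6|12|14|100|106",  # namespace filter copied from the message
        "srlimit": batch_size,
        "sroffset": 0,
    }
    while True:
        data = fetch(params)
        if "error" in data:
            raise RuntimeError(data["error"].get("info", "API error"))
        for hit in data.get("query", {}).get("search", []):
            yield hit
        cont = data.get("continue")
        if not cont:
            break  # no more results
        if int(cont.get("sroffset", 0)) >= max_offset:
            break  # the next request would exceed the reported limit
        params.update(cont)  # carries the new 'sroffset' into the next request
```

Passing `fetch` as a parameter makes it possible to try the loop against canned responses before running it against the live API.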


Need to extract abstract of a wikipedia page

by aditya srinivas

Hello,

I am writing a Java program to extract the abstract of a Wikipedia page given its title. I have done some research and found that the abstract will be in rvsection=0.

So, for example, if I want the abstract of the "Eiffel Tower" wiki page, I query the API in the following way:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Eiffel…

I then parse the XML data that comes back and take the wikitext in the tag <rev xml:space="preserve">, which represents the abstract of the Wikipedia page. But this wikitext also contains the infobox data, which I do not need.

I would like to know if there is any way to remove the infobox data and get only the wikitext related to the page's abstract, or if there is an alternative method by which I can get the abstract of the page directly.

Looking forward to your help.

Thanks in advance,
Aditya Uppu
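One possible alternative to parsing the wikitext of section 0 is sketched below. It assumes the TextExtracts extension is available on the target wiki (it is on English Wikipedia), which exposes `prop=extracts` with the `exintro` and `explaintext` options to return the lead section as plain text. The sketch is in Python rather than Java to match the other code in this archive, and the function names `intro_params`, `parse_intro`, and `fetch_intro` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def intro_params(title):
    """Build query parameters asking for the plain-text lead section of one page."""
    return {
        "action": "query",
        "format": "json",
        "prop": "extracts",   # provided by the TextExtracts extension (assumed installed)
        "exintro": 1,         # only the content before the first section heading
        "explaintext": 1,     # plain text instead of HTML
        "redirects": 1,       # follow redirects to the target page
        "titles": title,
    }


def parse_intro(data):
    """Return the extract of the first page that has one, or None."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_intro(title):
    """Request the lead section of `title` from the live API."""
    url = API_URL + "?" + urllib.parse.urlencode(intro_params(title))
    req = urllib.request.Request(
        url, headers={"User-Agent": "intro-extract-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(req) as resp:
        return parse_intro(json.load(resp))
```

Calling `fetch_intro("Eiffel Tower")` would send a single request. Because the extract is produced from the rendered page rather than from raw wikitext, this route avoids stripping the infobox template by hand.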

