I don't know if it's an option for you, but we make a full replica of the
production search indices available in WMF cloud. The full Elasticsearch
DSL can be used to query these instances. See
https://wikitech.wikimedia.org/wiki/Help:CirrusSearch_elasticsearch_replicas
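
In case it helps, here is a rough sketch of how paging past 10k results
could look against such a replica, using the Elasticsearch `search_after`
mechanism. It is only an illustration under assumptions: the sort field
`page_id` is a guess at a unique per-document field, and `fetch_page` is a
hypothetical callable that you supply to POST the body to the `_search`
endpoint of whichever replica and index you use (see the page above).

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_search_body(query: Dict[str, Any], page_size: int,
                      search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build the request body for one page of a search_after-paginated search."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": query,
        # search_after needs a deterministic sort order.
        # "page_id" is an ASSUMED field name; use any unique field in the index.
        "sort": [{"page_id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iter_hits(fetch_page: Callable[[Dict[str, Any]], Dict[str, Any]],
              query: Dict[str, Any], page_size: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield every hit, resuming each page from the sort values of the last hit."""
    cursor: Optional[List[Any]] = None
    while True:
        response = fetch_page(build_search_body(query, page_size, cursor))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Each hit carries its sort values when the request specifies a sort.
        cursor = hits[-1]["sort"]
```

Because each page resumes from the last hit's sort values rather than from
a numeric offset, the offset-based result window does not come into play.
`fetch_page` can be a thin wrapper around an HTTP client that returns the
parsed JSON response.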
Erik B.
On Mon, Nov 27, 2023 at 12:02 PM <rzissoldt(a)gmail.com> wrote:
Hello,
I am currently gathering image data for my master's thesis. I am using the
QLabels from Wikidata to crawl specific image classes (such as axe, car, etc.).
I am using the Action API for my requests, and here is my problem:
The QLabel Q870 (train) has around 21k images. I use the sroffset
parameter and the "continue" value from the response to fetch 500
images at a time. The script works until I reach the 10k limit (the
message is something like: "your request exceeded the limit of 10000
items ..."). Is there any way to crawl more than 10k items/images from a
single search query?
My search query looks like this:
params = {
    'action': 'query',
    'format': 'json',
    'list': 'search',
    'srsearch': search_query,
    'srnamespace': '0|6|12|14|100|106',  # Namespace filter based on the provided URL
    'srlimit': batch_size,  # Number of images per batch
    'sroffset': start,  # Offset for pagination
    'prop': 'info|imageinfo',  # Request additional information about the pages (images)
    'inprop': 'url'  # Include the URL information
}
The 'sroffset' parameter is always updated with the value from the
"continue" parameter of the response I get.
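
For reference, the continuation handling described above can be sketched
roughly as follows. This is an illustration, not the actual script:
`fetch` is a hypothetical callable that sends the parameters to the Action
API and returns the decoded JSON, and the loop merges the whole "continue"
block into the next request. It still stops at the same 10k limit.

```python
from typing import Any, Callable, Dict, Iterator


def iter_search_results(fetch: Callable[[Dict[str, Any]], Dict[str, Any]],
                        base_params: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield list=search results, following the API's "continue" block."""
    params = dict(base_params)  # copy so the caller's dict is not modified
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("search", [])
        cont = data.get("continue")
        if not cont:
            return  # no "continue" block means there are no further pages
        # Merge every continuation value (e.g. sroffset) into the next request.
        params.update(cont)
```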
It would be great if somebody could help me!
Thank you!
Kind regards
Ruben
_______________________________________________
Mediawiki-api mailing list -- mediawiki-api(a)lists.wikimedia.org
To unsubscribe send an email to mediawiki-api-leave(a)lists.wikimedia.org