I don't know if it's an option for you, but we make a full replica of the production search indices available in WMF cloud. These instances can be queried with the full Elasticsearch DSL. See https://wikitech.wikimedia.org/wiki/Help:CirrusSearch_elasticsearch_replicas
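
As a rough sketch of how paging past 10k could look with the Elasticsearch DSL: the `search_after` parameter is not subject to the `from` + `size` result window, so it may be an option here. Everything specific below is an assumption, not something I have checked against the replicas: the `search` callable stands in for whatever HTTP call you make to the `_search` endpoint, and the sort field name (`page_id` in the usage note) is a placeholder that must be a unique, sortable field in the index you query.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_body(query: Dict[str, Any], sort_field: str, size: int,
               search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build an Elasticsearch search body that pages with search_after."""
    body: Dict[str, Any] = {
        "query": query,
        "size": size,
        # A stable, unique sort key is required for search_after paging.
        "sort": [{sort_field: "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iter_hits(search: Callable[[Dict[str, Any]], Dict[str, Any]],
              query: Dict[str, Any], sort_field: str,
              size: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield every hit, one page at a time, until a page comes back empty.

    `search` is a hypothetical callable: it takes a request body and
    returns the parsed JSON response of a `_search` request.
    """
    after: Optional[List[Any]] = None
    while True:
        response = search(build_body(query, sort_field, size, after))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit mark where the next page starts.
        after = hits[-1]["sort"]
```

Usage would be along the lines of `for hit in iter_hits(my_search, {"match_all": {}}, "page_id"): ...`, where `my_search` posts the body to the replica and returns the decoded JSON.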

Erik B.

On Mon, Nov 27, 2023 at 12:02 PM <rzissoldt@gmail.com> wrote:
Hello,

I am currently gathering image data for my master's thesis. I am using the QLabels from Wikidata to crawl specific image classes (like axe, car, etc.).

I am using the Action API for my requests, and here is my problem:

The QLabel Q870 (train) has around 21k images. I am using the sroffset parameter and the "continue" parameter from the response to search for 500 images at a time. The script works until I reach the 10k limit (the message is something like: "your request exceeded the limit of 10000 items ..."). Is there any way to crawl more than 10k items/images from one search query?

My search query looks like this:
params = {
    'action': 'query',
    'format': 'json',
    'list': 'search',
    'srsearch': search_query,
    'srnamespace': '0|6|12|14|100|106',  # Namespace filter based on the provided URL
    'srlimit': batch_size,  # Number of images per batch
    'sroffset': start,  # Offset for pagination
    'prop': 'info|imageinfo',  # Request additional information about the pages (images)
    'inprop': 'url'  # Include the URL information
}
The 'sroffset' parameter is updated on every request with the value from the "continue" parameter of the previous response.
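
Roughly, the continuation loop looks like the following simplified sketch. The names `iter_search_results` and `fetch` are only illustrative: `fetch` stands in for the real HTTP request and is assumed to return the decoded JSON response. This loop still stops at the 10k limit; it only shows how the "continue" values are fed back into the next request.

```python
from typing import Any, Callable, Dict, Iterator


def iter_search_results(fetch: Callable[[Dict[str, Any]], Dict[str, Any]],
                        base_params: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield search results batch by batch, following the "continue" block."""
    params = dict(base_params)  # copy so the caller's dict is not modified
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("search", [])
        cont = data.get("continue")
        if not cont:
            # No "continue" block in the response: this was the last batch.
            return
        # Merge the continuation values (including sroffset) into the next request.
        params.update(cont)
```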

It would be great if somebody could help me!

Thank you!
Kind regards
Ruben
_______________________________________________
Mediawiki-api mailing list -- mediawiki-api@lists.wikimedia.org