Hello! Would you be able to explain a little more about the access pattern? Is this going to be a bulk operation across, for example, all articles on a wiki? Would you mind posting your reply to this list and also to the discovery list ( https://lists.wikimedia.org/postorius/lists/discovery.lists.wikimedia.org/ ), signing up there first if you haven't already?
Thanks! -Adam
On Mon, Oct 13, 2025 at 9:24 AM delahera@gmail.com wrote:
Hi! I'm trying to get articletopic predictions for a bunch of Wikipedia articles.[1] This value is cached in Elasticsearch indices,[2] under the WeightedTags field.[3]
Because using CirrusSearch through the Action API would return at most 500 results,[4] I was thinking of querying the CirrusSearch database directly.
I've seen there is the CloudElastic replica,[5] but I'm not able to use it from PAWS. Is it only available from Cloud VPS and Toolforge?
Otherwise, can you suggest an alternative for what I'm trying to accomplish? Thank you!
[1] https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outl...
[2] https://wikitech.wikimedia.org/wiki/Search/articletopic
[3] https://wikitech.wikimedia.org/wiki/Search/WeightedTags
[4] https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bsearch
[5] https://wikitech.wikimedia.org/wiki/Help:CirrusSearch_OpenSearch_replicas
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
For now, I'm doing this in a personal PAWS notebook and I plan to run it on around 800 enwiki and 200 eswiki articles. But I would like to share the notebook with others in the future, so they can use it for their own list of articles, and I may try to make it into a Toolforge tool eventually.
While accessing the data via the CloudElastic replicas would certainly be more performant, for a set of about 1k articles, requesting them sequentially through the public MediaWiki APIs should be doable. A query such as this will return the weighted tags: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=jso...
That API does report that it is an internal format and subject to change, but that internal format is exactly what we would see when talking to CloudElastic directly.
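For a notebook, the sequential approach described above could look roughly like the sketch below. The ApiSandbox link is truncated, so the exact query is not known. This sketch assumes the CirrusSearch `prop=cirrusdoc` module and assumes each page comes back with a `cirrusdoc` list whose entries carry `source.weighted_tags`. The batch size of 50 titles, the User-Agent string, and all function names are placeholders for illustration only. Because the format is internal, the parsing may need adjusting.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, Iterator, List

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder: replace with a User-Agent that identifies you and your tool.
USER_AGENT = "weighted-tags-sketch/0.1 (replace with contact details)"
# Assumption: 50 titles per request, the usual limit for regular API users.
BATCH_SIZE = 50


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_params(titles: List[str]) -> Dict[str, str]:
    """Build Action API parameters for one batch of titles."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "cirrusdoc",  # assumption: the module behind the truncated link
        "titles": "|".join(titles),
    }


def extract_weighted_tags(response: dict) -> Dict[str, List[str]]:
    """Map each returned page title to its raw weighted_tags strings.

    Assumes the internal layout page -> cirrusdoc[] -> source -> weighted_tags.
    Pages without an indexed document map to an empty list.
    """
    result: Dict[str, List[str]] = {}
    for page in response.get("query", {}).get("pages", []):
        tags: List[str] = []
        for doc in page.get("cirrusdoc", []):
            tags.extend(doc.get("source", {}).get("weighted_tags", []))
        result[page["title"]] = tags
    return result


def fetch_weighted_tags(titles: List[str], api_url: str = API_URL) -> Dict[str, List[str]]:
    """Request the titles batch by batch, one request at a time."""
    result: Dict[str, List[str]] = {}
    for batch in chunked(titles, BATCH_SIZE):
        url = api_url + "?" + urllib.parse.urlencode(build_params(batch))
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as reply:
            result.update(extract_weighted_tags(json.load(reply)))
    return result
```

Calling `fetch_weighted_tags(["Earth", "Moon"])` would issue a single request. For eswiki, pass `api_url="https://es.wikipedia.org/w/api.php"`. The API may normalize titles, so the keys in the result can differ from the titles that were passed in.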