Hello guys,

I recently came across the wonderful Wikipedia project that publishes CirrusSearch dumps, where the entire Wikipedia is given in a neat JSON format. I was trying to extract the abstracts (the opening_text field). I downloaded the English wiki's content.json file and have been running a script to extract the abstracts for all articles.
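
In case it's useful, here is a minimal sketch of the kind of extraction loop I mean (the gzipped file name is just my local copy of the content dump, and treating any line with a top-level "index" key as metadata is my reading of the dump format, not something the schema page guarantees):

import gzip
import json

DUMP = "enwiki-20240819-cirrussearch-content.json.gz"  # my local copy (name assumed)

with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        # Metadata lines look like {"index": {...}}; the article documents
        # carry the actual fields, including opening_text when present.
        if "index" in entry:
            continue
        abstract = entry.get("opening_text")
        if abstract:
            print(abstract[:100])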

I realised that every odd entry is a data point and every even entry is metadata. Whether the dump covers the 6.8M Wikipedia articles or all ~61M pages (I'm not sure which is present), that puts the maximum at a ballpark of around 122M entries. But I have already processed 128M entries from the JSON, and that is after checking and testing my script before running it over the entire file.
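
As a sanity check on my count, something like the following should tally the metadata and document lines separately (same assumptions as the sketch above; if the two line types strictly alternate, the two counters should come out equal):

import gzip
import json

meta_count = doc_count = 0
with gzip.open("enwiki-20240819-cirrussearch-content.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        if "index" in json.loads(line):
            meta_count += 1
        else:
            doc_count += 1

print(f"metadata: {meta_count:,}  documents: {doc_count:,}  total: {meta_count + doc_count:,}")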

That's why I wanted to ask: how many entries are there in the CirrusSearch dump JSON file?

https://dumps.wikimedia.org/other/cirrussearch/20240819/
https://www.mediawiki.org/wiki/Extension:CirrusSearch/Schema

Thanks
tawsif