Hello guys,
I recently came across the wonderful Wikipedia project around CirrusSearch dumps, where the entire Wikipedia is provided in a neat JSON format. I am trying to extract the abstracts (the opening_text field). I downloaded the English wiki's content.json file and have been running a script to extract the abstracts for all articles.
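
For context, my extraction logic is roughly the sketch below (the file name is a placeholder for the actual content.json dump, and I'm assuming the metadata lines are Elasticsearch bulk-style {"index": ...} actions, with opening_text sitting on the document lines):

import json

def extract_abstracts(path):
    # Stream the dump line by line; each page should be a pair of lines:
    # one metadata line and one document line.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Metadata lines carry an "index" key; skip them and keep
            # only the document lines, which (I assume) hold opening_text.
            if "index" in record:
                continue
            abstract = record.get("opening_text")
            if abstract:
                yield abstract

# Quick smoke test: print the start of the first abstract found.
for abstract in extract_abstracts("enwiki-content.json"):
    print(abstract[:200])
    break
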
I realised that every odd entry is a data point and every even entry is metadata. That makes sense for either the ~6.8M English Wikipedia articles or the ~61M total pages (I'm not sure which is present in the dump), and with two entries per page that gives a ballpark maximum of around 122M entries. But I have already processed 128M entries from the JSON, and that includes checking my script and testing it before running it over the entire file.
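
To double-check the counts, I was thinking of tallying the two kinds of lines separately instead of relying on the odd/even position, along these lines (same assumptions as the sketch above, and the file name is again a placeholder):

import json

meta_count = 0
doc_count = 0

# Count metadata lines vs. document lines rather than assuming a
# strict odd/even alternation throughout the file.
with open("enwiki-content.json", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if "index" in record:
            meta_count += 1
        else:
            doc_count += 1

print(f"metadata lines: {meta_count:,}")
print(f"document lines: {doc_count:,}")
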
Thanks
tawsif