Hello guys,
I recently came across the wonderful Wikipedia project that publishes CirrusSearch dumps, where the whole of Wikipedia is available in neat JSON format. I am trying to extract the abstracts (the opening_text field). I downloaded the English wiki's content.json file and have been running a script to extract the abstracts for all articles.
I realised that every odd entry is a data point and every even entry is metadata. That makes sense for 6.8M Wikipedia articles or 61M Wikipedia projects (I'm not sure which is present), but even then the total should be at most roughly 122M entries (2 × 61M). However, I have already processed 128M entries from the JSON, and I checked and tested my script before running it on the entire file.
That's why I would like to know: how many entries are there in the CirrusSearch dump JSON file?
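A minimal sketch of how the entries could be counted, assuming the dump is newline-delimited JSON in which each metadata line is followed by one document line, and that the abstract sits in the document's opening_text field. The function name `summarize` and the command-line handling are hypothetical, chosen only for illustration:

```python
import gzip
import json
import sys


def summarize(lines):
    """Count lines, documents, and documents with a non-empty opening_text.

    Assumes the lines alternate: one metadata line, then one document line.
    """
    n_lines = n_docs = n_abstracts = 0
    it = iter(lines)
    for meta_line in it:
        if not meta_line.strip():
            continue  # tolerate blank lines between pairs
        doc_line = next(it, None)
        if doc_line is None:
            n_lines += 1  # dangling metadata line at the end of the file
            break
        n_lines += 2
        doc = json.loads(doc_line)
        n_docs += 1
        if doc.get("opening_text"):
            n_abstracts += 1
    return {"lines": n_lines, "documents": n_docs, "with_opening_text": n_abstracts}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass the path to the gzipped dump as the first argument.
    with gzip.open(sys.argv[1], "rt", encoding="utf-8") as fh:
        print(summarize(fh))
```

Under that two-lines-per-page assumption, 128M processed lines would correspond to 64M documents, so comparing the "documents" count with the "lines" count shows whether the script is counting pages or raw lines.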
https://dumps.wikimedia.org/other/cirrussearch/20240819/
https://www.mediawiki.org/wiki/Extension:CirrusSearch/Schema
Thanks,
Tawsif
Hi Tawsif - thanks for sending the message. Would you be okay re-posting this onto the 'discovery' mailing list? If you're not signed up for that list, you'll probably want to do so at https://lists.wikimedia.org/postorius/lists/discovery.lists.wikimedia.org/ before re-posting.
Thanks again. -Adam
On Sun, Aug 25, 2024 at 10:50 AM TAWSIF AHMED sleeping4cat@gmail.com wrote:
[...]
_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimed...