Hello guys,
I recently came across the wonderful Wikipedia project around CirrusSearch dumps, where the entire Wikipedia is provided in a neat JSON format. I am trying to extract the abstracts (the opening_text field). I downloaded the English wiki's content.json file and have been running a script to extract the abstracts for all articles.
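
For context, my extraction logic is roughly the sketch below (the file name is a placeholder for the actual content.json dump, and I'm assuming the metadata lines are Elasticsearch bulk-style {"index": ...} actions, with opening_text sitting on the document lines):

import json

def extract_abstracts(path):
    # Stream the dump line by line; each page should be a pair of lines:
    # one metadata line and one document line.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Metadata lines carry an "index" key; skip them and keep
            # only the document lines, which (I assume) hold opening_text.
            if "index" in record:
                continue
            abstract = record.get("opening_text")
            if abstract:
                yield abstract

# Quick smoke test: print the start of the first abstract found.
for abstract in extract_abstracts("enwiki-content.json"):
    print(abstract[:200])
    break
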
I realised that every odd entry is a data point and every even entry is metadata. That makes sense for either the ~6.8M English Wikipedia articles or the ~61M total pages (I'm not sure which is present in the dump), and with two entries per page that gives a ballpark maximum of around 122M entries. But I have already processed 128M entries from the JSON, and that includes checking my script and testing it before running it over the entire file.
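
To double-check the counts, I was thinking of tallying the two kinds of lines separately instead of relying on the odd/even position, along these lines (same assumptions as the sketch above, and the file name is again a placeholder):

import json

meta_count = 0
doc_count = 0

# Count metadata lines vs. document lines rather than assuming a
# strict odd/even alternation throughout the file.
with open("enwiki-content.json", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if "index" in record:
            meta_count += 1
        else:
            doc_count += 1

print(f"metadata lines: {meta_count:,}")
print(f"document lines: {doc_count:,}")
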
Thanks
tawsif