On 01/21/2014 09:47 PM, Amir Ladsgroup wrote:
One of the things I can't understand is why we are extracting summary of pages for Yahoo? Is it our job to do it? The dumps are really huge, e.g. for wikidata: http://dumps.wikimedia.org/wikidatawiki/20140106/
wikidatawiki-20140106-abstract.xml <http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml> 14.1 GB
Compare it to:
full history: wikidatawiki-20140106-pages-meta-history.xml.bz2 <http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2> 8.8 GB
That's because the Yahoo one isn't compressed.
I'm not sure if Yahoo still uses those abstracts, but I wouldn't be surprised at all if other people are.
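To illustrate the compression point, here is a minimal Python sketch. It is not a measurement of the actual dumps: the record layout and the `compression_ratio` helper are made up for illustration, and the real abstract file may compress better or worse. It only shows that repetitive XML tends to shrink a lot under bz2, which is why an uncompressed file can end up larger than a compressed full-history dump.

```python
import bz2


def compression_ratio(data: bytes) -> float:
    """Return the uncompressed size divided by the bz2-compressed size."""
    return len(data) / len(bz2.compress(data))


# Hypothetical stand-in for an abstract dump: many near-identical XML records.
# The element names here are assumptions, not the real dump schema.
record = (
    "<doc><title>Wikidata: Q{0}</title>"
    "<url>http://www.wikidata.org/wiki/Q{0}</url>"
    "<abstract></abstract></doc>\n"
)
sample = "".join(record.format(i) for i in range(10000)).encode("utf-8")

ratio = compression_ratio(sample)
print(f"uncompressed: {len(sample)} bytes, bz2 ratio: {ratio:.1f}x")
```

So comparing the 14.1 GB .xml against the 8.8 GB .xml.bz2 says little about how much data each one really holds.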
Matt Flaschen