You can source that from the cirrussearch dumps, which contain the article
text already cleaned up. The Python looks something like this:
import codecs
import json
import zlib
from itertools import zip_longest

import requests


def get_gzip_stream(url):
    # Stream the gzip'd dump over HTTP and yield decompressed text chunks.
    with requests.get(url, stream=True) as res:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        # An incremental decoder avoids errors when a multi-byte UTF-8
        # character is split across two chunks.
        decoder = codecs.getincrementaldecoder('utf8')()
        for data in res.iter_content(chunk_size=65536):
            yield decoder.decode(d.decompress(data))


def decode_lines(stream):
    # Reassemble the text chunks into newline-delimited JSON objects.
    buf = []
    for data in stream:
        buf.append(data)
        if '\n' in data:
            *lines, tail = ''.join(buf).split('\n')
            buf = [tail]
            for line in lines:
                yield json.loads(line)
    tail = ''.join(buf)
    if tail:
        yield json.loads(tail)


def pair_up_lines(lines):
    # The dump alternates index-metadata lines and document lines; pair them.
    return zip_longest(*([iter(lines)] * 2))


# File name truncated in the original message; pick the full file name from
# the dump directory listing.
url = 'https://dumps.wikimedia.org/other/cirrussearch/20180723/enwiki-20180723-cir…'

stream = get_gzip_stream(url)
stream = decode_lines(stream)
stream = pair_up_lines(stream)

for meta, doc in stream:
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])
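
As far as I know the cirrussearch dumps only carry the already-rendered text
of the current version of each page. If you also need plain text for older
revisions out of the XML dumps, a parser library such as mwparserfromhell can
strip the markup for you. A minimal sketch (the helper name and the sample
string are just for illustration, and templates are stripped rather than
expanded):

import mwparserfromhell

def wikitext_to_plain(wikitext):
    # Parse the markup and strip links, formatting, templates, etc.
    # Templates are removed, not expanded, so transcluded text is lost.
    return mwparserfromhell.parse(wikitext).strip_code()

print(wikitext_to_plain("'''Foo''' is a [[metasyntactic variable]]."))
# -> Foo is a metasyntactic variable.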
On Tue, Jul 24, 2018 at 7:23 AM Nikhil Prakash <nikhil07prakash(a)gmail.com>
wrote:
Hi There,
I'm searching for an efficient way to convert the WikiText of the
downloaded data dumps (in XML) to plain text. I basically need the plain text
of each and every revision of Wikipedia articles.
Therefore, it would be very helpful if you could tell me about a library
or some piece of code (a bunch of regexes) to convert WikiText to plain text.
BTW, I write my code in Python!
Thanks.