Hi There,
I'm searching for an efficient way to convert the WikiText in the downloaded data dumps (in XML) to plain text. I basically need the plain text of each and every revision of Wikipedia articles.
Therefore, it would be very helpful if you could point me to a library or some piece of code (a bunch of regexes) that converts WikiText to plain text. BTW, I write my code in Python!
Thanks.
You can source that from the cirrussearch dumps, which contain the text already cleaned up. The Python looks something like:
import json
import zlib
from itertools import zip_longest
from pprint import pprint

import requests


def get_gzip_stream(url):
    # Stream the dump over HTTP and decompress the gzip data on the fly.
    with requests.get(url, stream=True) as res:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        for data in res.iter_content():
            yield d.decompress(data).decode('utf8')


def decode_lines(stream):
    # Re-assemble the decompressed chunks into newline-delimited JSON objects.
    buf = []
    for data in stream:
        buf.append(data)
        if '\n' in data:
            line, tail = ''.join(buf).split('\n', 1)
            buf = [tail]
            yield json.loads(line)
    if buf:
        yield json.loads(''.join(buf))


def pair_up_lines(lines):
    # Each page is two JSON lines: an index/metadata line followed by the document.
    return zip_longest(*([iter(lines)] * 2))


url = 'https://dumps.wikimedia.org/other/cirrussearch/20180723/enwiki-20180723-cirr...'
stream = get_gzip_stream(url)
stream = decode_lines(stream)
stream = pair_up_lines(stream)

for meta, doc in stream:
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])
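If you have already downloaded a dump file locally instead of streaming it, a minimal sketch along the same lines (the file name below is just a placeholder, not a real dump name) can read it with the standard gzip module, since the dump is newline-delimited JSON with a metadata line followed by a content line for each page:

import gzip
import json
from itertools import zip_longest


def read_local_dump(path):
    # The dump is gzip-compressed, one JSON object per line.
    with gzip.open(path, 'rt', encoding='utf8') as f:
        lines = (json.loads(line) for line in f)
        # Pair each metadata line with the content line that follows it.
        for meta, doc in zip_longest(*([iter(lines)] * 2)):
            yield meta, doc


for meta, doc in read_local_dump('enwiki-20180723-cirrussearch-content.json.gz'):
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])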