Hi There,
I'm searching for an efficient way to convert the WikiText in the downloaded data dumps (in XML) to plain text. I basically need the plain text of each and every revision of Wikipedia articles.
Therefore, it would be very helpful if you could point me to a library or some piece of code (a bunch of regexes) that converts WikiText to plain text. BTW, I write my code in Python!
Thanks.
You can source that from the cirrussearch dumps, which contain the text already cleaned up. The Python looks something like:
import json
import zlib
from itertools import zip_longest
from pprint import pprint

import requests


def get_gzip_stream(url):
    # Stream the dump over HTTP and decompress the gzip data on the fly.
    with requests.get(url, stream=True) as res:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        for data in res.iter_content():
            yield d.decompress(data).decode('utf8')


def decode_lines(stream):
    # Re-assemble the decompressed chunks into newline-delimited JSON objects.
    buf = []
    for data in stream:
        buf.append(data)
        if '\n' in data:
            line, tail = ''.join(buf).split('\n', 1)
            buf = [tail]
            yield json.loads(line)
    if buf:
        yield json.loads(''.join(buf))


def pair_up_lines(lines):
    # Each page is two JSON lines: an index/metadata line followed by the document.
    return zip_longest(*([iter(lines)] * 2))


url = 'https://dumps.wikimedia.org/other/cirrussearch/20180723/enwiki-20180723-cirr...'
stream = get_gzip_stream(url)
stream = decode_lines(stream)
stream = pair_up_lines(stream)

for meta, doc in stream:
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])
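If you have already downloaded a dump file locally instead of streaming it, a minimal sketch along the same lines (the file name below is just a placeholder, not a real dump name) can read it with the standard gzip module, since the dump is newline-delimited JSON with a metadata line followed by a content line for each page:

import gzip
import json
from itertools import zip_longest


def read_local_dump(path):
    # The dump is gzip-compressed, one JSON object per line.
    with gzip.open(path, 'rt', encoding='utf8') as f:
        lines = (json.loads(line) for line in f)
        # Pair each metadata line with the content line that follows it.
        for meta, doc in zip_longest(*([iter(lines)] * 2)):
            yield meta, doc


for meta, doc in read_local_dump('enwiki-20180723-cirrussearch-content.json.gz'):
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])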