[MediaWiki-l] How to convert WikiText to Plain Text

Erik Bernhardson ebernhardson at wikimedia.org
Tue Jul 24 17:06:58 UTC 2018


You can source that from the cirrussearch dumps, which contain the text
already cleaned up. The Python looks something like:

import json
from itertools import zip_longest
import requests
import zlib

def get_gzip_stream(url):
    # Stream the dump over HTTP and decompress it on the fly;
    # 16 + MAX_WBITS tells zlib to expect a gzip wrapper.
    with requests.get(url, stream=True) as res:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        for data in res.iter_content(chunk_size=64 * 1024):
            yield d.decompress(data)
        yield d.flush()

def decode_lines(stream):
    # Reassemble the decompressed chunks into newline-delimited JSON
    # records. Buffering bytes (rather than decoding each chunk) avoids
    # splitting multi-byte UTF-8 sequences at chunk boundaries, and the
    # while loop handles chunks that contain more than one newline.
    buf = b''
    for data in stream:
        buf += data
        while b'\n' in buf:
            line, buf = buf.split(b'\n', 1)
            yield json.loads(line)
    if buf.strip():
        yield json.loads(buf)

def pair_up_lines(lines):
    # The dump is in elasticsearch bulk format: each document line is
    # preceded by an {"index": ...} metadata line, so pair them up.
    return zip_longest(*([iter(lines)] * 2))

url = ('https://dumps.wikimedia.org/other/cirrussearch/20180723/'
       'enwiki-20180723-cirrussearch-content.json.gz')

stream = get_gzip_stream(url)
stream = decode_lines(stream)
stream = pair_up_lines(stream)

for meta, doc in stream:
    print(meta['index']['_id'])
    print(doc['title'])
    print(doc['text'])
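
That only gets you the current text of each page, though; the cirrussearch
dumps don't include history. If you really need plain text for each and
every revision, a minimal (untested) sketch against the XML
pages-meta-history dumps, using the third-party mwxml and mwparserfromhell
libraries, might look like this; the file name below is a placeholder:

# Untested sketch: iterate the revisions in an XML history dump and
# strip the wiki markup from each one.
# Needs: pip install mwxml mwparserfromhell
import mwparserfromhell
import mwxml

dump = mwxml.Dump.from_file(open('enwiki-pages-meta-history.xml'))
for page in dump:
    for revision in page:
        if revision.text is None:
            continue  # deleted or suppressed revision text
        plain = mwparserfromhell.parse(revision.text).strip_code()
        print(page.title, revision.id)
        print(plain)

Note that strip_code() removes templates rather than expanding them, so
the output is rougher than the pre-rendered text in the cirrussearch
dumps, but it works on any revision.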



On Tue, Jul 24, 2018 at 7:23 AM Nikhil Prakash <nikhil07prakash at gmail.com>
wrote:

> Hi there,
>
> I'm searching for an efficient way to convert the WikiText of the
> downloaded data dumps (in XML) to plain text. I basically need the plain
> text of each and every revision of Wikipedia articles.
>
> It would therefore be very helpful if you could tell me about a library
> or some piece of code (a bunch of regexes) to convert WikiText to plain
> text. BTW, I write my code in Python!
>
> Thanks.

