Re: [Wikitech-l] API vs data dumps

7 Nov 2010


      On 14 October 2010 09:37, Alex Brollo alex.brollo@gmail.com wrote:
...
2010/10/13 Paul Houle paul@ontology2.com
...
Don't be intimidated by working with the data dumps. If you've got
an XML API that does streaming processing (I used .NET's XmlReader) and
you use the old Unix trick of piping the output of bunzip2 into your
program, it's really pretty easy.
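
A minimal sketch of the streaming approach described above, in Python rather than .NET, assuming the usual MediaWiki export layout (a <page> element containing <title> and, inside <revision>, <text>). The function names and the script name read_dump.py are hypothetical, chosen only for illustration.

```python
import sys
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> in a MediaWiki XML dump stream."""
    title, text = None, None
    # iterparse reads the stream incrementally, so the whole dump is never
    # held in memory at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children


# Only read stdin when asked to with "-", e.g.:
#   bunzip2 -c dump.xml.bz2 | python read_dump.py -
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    for page_title, page_text in iter_pages(sys.stdin.buffer):
        print(page_title, len(page_text))
```

Reading from sys.stdin.buffer gives the parser raw bytes, so it can apply the encoding declared in the XML itself; the decompressed dump never has to be written to disk.
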
When I worked on it.source (a small dump! something like 300 MB uncompressed),
I used a simple do-it-yourself Python string search routine, and I found it
much faster than the Python XML routines. I presume that my scripts are
too rough to deserve sharing, but I encourage programmers to write a "simple
dump reader" that uses the speed of string search. My personal trick was to
build an "index", i.e. a list of article names and pointers to the articles
within the XML file, so that it was simple and fast to retrieve their content.
I used it mainly because I didn't understand the API at all. ;-)
Alex
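
The index Alex describes might look something like the sketch below. It is not his script (which was not shared); it assumes an uncompressed dump in which each <page> tag and each <title>...</title> pair sits on its own line, and the names build_index and read_page are hypothetical.

```python
def build_index(path):
    """Map each page title to the byte offset of its <page> line."""
    index = {}
    page_offset = None
    with open(path, "rb") as f:
        while True:
            offset = f.tell()          # byte position of the line about to be read
            line = f.readline()
            if not line:
                break
            stripped = line.strip()
            if stripped.startswith(b"<page>"):
                page_offset = offset
            elif stripped.startswith(b"<title>") and page_offset is not None:
                end = stripped.find(b"</title>")
                if end != -1:
                    # Titles are kept as they appear in the file, so XML
                    # entities such as &amp; are not unescaped here.
                    title = stripped[len(b"<title>"):end].decode("utf-8")
                    index[title] = page_offset
                page_offset = None
    return index


def read_page(path, offset):
    """Return the raw XML of the page that starts at byte `offset`."""
    chunks = []
    with open(path, "rb") as f:
        f.seek(offset)                 # jump straight to the page
        for line in f:
            chunks.append(line)
            if b"</page>" in line:
                break
    return b"".join(chunks).decode("utf-8")
```

Building the index takes one pass over the file using plain byte-string matching; after that, read_page(path, index["Some title"]) seeks directly to the article without parsing anything before it.
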
Hi Alex. I have been doing something similar in Perl for a few years for the
English Wiktionary. I've never been sure of the best way to store all the index
files I create, especially in code that I would like to share with other people.
If you, or anyone else for that matter, would like to collaborate, it would be
pretty cool.
You'll find my stuff on the Toolserver:
https://fisheye.toolserver.org/browse/enwikt
Andrew Dunbar (hippietrail)
-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net
