Re: [Wikitech-l] WikiDump Parsing

17 Jan 2007


Jeff wrote:
> ...
> You will need at least 16K buffer as many lines
> read with fgets can exceed 8192 bytes in size.
That shouldn't really be needed. You are parsing tags delimited by < and >.
The problem is that some tags can be split across two reads. You get
"..long long line</te" and on the next read "xt>", and *if* you're looking
for "</text>", you have problems.
</text> is tricky, because most tags start on their own line, but
</text> doesn't (unless the article ends with its own blank line).
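
One way to cope with a split tag without enlarging the buffer is to carry the last few bytes of each read over to the front of the next one. The following is only a minimal sketch of that idea, not code from any existing dump parser. The function name count_tag, the CHUNK size and the 64-byte slack are assumptions made up for illustration, and it assumes the dump contains no NUL bytes.

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 8192   /* hypothetical read size; each fgets call returns at most CHUNK-1 bytes */
#define SLACK 64     /* room for the carried-over tail; tags must be shorter than this */

/* Count occurrences of `tag` in `in`, including tags that straddle two
 * fgets reads. Returns -1 if `tag` is empty or too long for the slack. */
long count_tag(FILE *in, const char *tag)
{
    size_t taglen = strlen(tag);
    char buf[CHUNK + SLACK];
    size_t carried = 0;   /* bytes kept from the previous read */
    long count = 0;

    if (taglen == 0 || taglen > SLACK)
        return -1;

    /* New data is appended after the carried tail, so a tag split as
     * "</te" + "xt>" becomes contiguous in buf. */
    while (fgets(buf + carried, CHUNK, in) != NULL) {
        size_t len = carried + strlen(buf + carried);
        const char *p = buf;
        const char *last_end = buf;   /* end of the last complete match */

        while ((p = strstr(p, tag)) != NULL) {
            count++;
            p += taglen;
            last_end = p;
        }

        /* Keep at most taglen-1 unmatched bytes: the longest prefix of the
         * tag that could still be completed by the next read. Bytes before
         * last_end are already counted and are never carried. */
        size_t tail = len - (size_t)(last_end - buf);
        if (tail > taglen - 1)
            tail = taglen - 1;
        memmove(buf, buf + len - tail, tail);
        carried = tail;
    }
    return count;
}
```

Calling count_tag(fp, "</text>") on an open dump file would then count closing tags regardless of where the read boundaries fall. A real parser would act on each match instead of just counting, but the carry-over step would stay the same.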
