Hey Guys,
Thank you for the responses. My further queries are within the individual responses below:
Jeff V. Merkey:
...
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
This is a big problem for me, because when I do a regular expression match on "Category :", I also get lines within the article that are just references to other categories. I only want the category that the current article belongs to. It's worse because the spacing is inconsistent: sometimes it's " Category :" and at other times "Category :".
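To make the ambiguity concrete, here is the kind of match I am attempting in Ruby. The pattern rests on my understanding of the wikitext syntax (an assumption, not something from the dump spec): a plain [[Category:Name]] assigns the article to a category, while a leading colon, [[:Category:Name]], is just a link to the category page, and spacing varies in both.

    # Loose match for category assignments, tolerating "Category :",
    # " Category:" etc.; the leading-colon convention for mere links
    # is an assumption about wikitext syntax on my part.
    CATEGORY_RE = /\[\[\s*category\s*:\s*([^|\]]+)(?:\|[^\]]*)?\]\]/i

    def categories(wikitext)
      wikitext.scan(CATEGORY_RE).map { |(name)| name.strip }
    end

    puts categories("...[[Category: Physics]] and a link to [[:Category:Chemistry]]...")
    # prints only "Physics"; the colon-prefixed link never matches [[

If the colon convention holds, this would pick up only the article's own categories.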
Tags are <TAGNAME> start and </TAGNAME> end.
True, but as you mentioned above, not everything I want is in a separate tag.
Jeff
------------------------------------------------------------------------------------------------
From: Brion Vibber brion@pobox.com
Harish TM wrote:
I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
/mediawiki/page/title
It's harder to link article titles to the article content if the sources are different, isn't it?
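In case it is clearer in code, here is how I am thinking of pulling both out of a single pass with Ruby's bundled REXML stream parser, so the title and text always come from the same <page>. The element names (page, title, text) are my reading of the dump schema, so treat those paths as assumptions:

    require 'rexml/parsers/streamparser'
    require 'rexml/streamlistener'

    class PageListener
      include REXML::StreamListener

      def initialize(&handler)
        @handler = handler
        @buffer  = nil
      end

      def tag_start(name, _attrs)
        # Start buffering when entering a <title> or <text> element.
        @buffer = '' if name == 'title' || name == 'text'
      end

      def text(data)
        @buffer << data if @buffer
      end

      def tag_end(name)
        case name
        when 'title' then @title = @buffer
        when 'text'  then @body  = @buffer
        when 'page'  then @handler.call(@title, @body)  # title and text paired
        end
        @buffer = nil
      end
    end

    listener = PageListener.new { |title, _body| puts title }
    File.open('pages-articles.xml') do |f|   # placeholder dump file name
      REXML::Parsers::StreamParser.new(f, listener).parse
    end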
2. The article content (without links to articles in other languages, external links and so on)
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
YES
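Specifically, the elements I want to strip are the interlanguage links and the external links. Here is a rough first pass in Ruby, where both patterns are my guesses at the syntax ([[fr:Titre]] style for interlanguage links, [http://... label] for external ones) rather than a full grammar:

    def strip_links(wikitext)
      text = wikitext.dup
      # Interlanguage links such as [[de:Beispiel]]: a short language
      # code before the colon (assumed shape).
      text.gsub!(/\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]+\]\]/i, '')
      # External links: keep the visible label, drop bare URLs.
      text.gsub!(%r{\[https?://\S+\s+([^\]]+)\]}, '\1')
      text.gsub!(%r{\[https?://\S+\]}, '')
      text
    end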
3. The category.
Again, that's part of article text.
True - my problem with extracting this is the one described above.
Also, I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
Can't seem to find it. Searching for it just gives me Wikipedia articles on parsing!
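In the meantime, one workaround I am considering is letting Wikipedia's own installation do the rendering remotely. This leans on the api.php endpoint and its action=parse module as I understand them from the API documentation, so the parameter names below are assumptions on my part:

    require 'net/http'
    require 'uri'
    require 'json'

    # Ask the live site's parser to render a snippet of wikitext to HTML.
    def render(wikitext)
      uri = URI('https://en.wikipedia.org/w/api.php')
      res = Net::HTTP.post_form(uri, 'action' => 'parse',
                                     'text'   => wikitext,
                                     'prop'   => 'text',
                                     'format' => 'json')
      JSON.parse(res.body)['parse']['text']['*']   # the rendered HTML
    end

    puts render("'''Hello''' [[world]]")

For a whole dump this would obviously be far too slow; it is only a way to check what the "right" rendering of a given article looks like.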
-------------------------------------------------------------------
From: "Jeff V. Merkey" jmerkey@wolfmountaingroup.com
This works too, but it's slower than molasses on a cold Utah day .... :-)
I'm working on a reasonably fast machine (64-bit, 3.something GHz processor with 4 GB RAM) and using Ruby to code the parser.
---------------------------------------------------------------------
From: Platonides Platonides@gmail.com
Jeff wrote:
You will need at least a 16K buffer, as many lines read with fgets can exceed 8192 bytes in size.
Shouldn't really be needed. You parse < and > tags. The problem is that some tags can be split: you get "..long long line</te" and on the next line "xt>", and *if* you're looking for "</text>", you have problems. </text> is tricky, because most tags start on their own line, but </text> doesn't (unless the article ends with its own blank line).
Thanks for that!!! Are there some tags that are never split? That way I could look for those, merge all the lines between them into a single string, and run a regex on it.
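Here is that plan in Ruby, resting on exactly that assumption: that <page> and </page> always sit on their own lines (which I have not verified against the whole dump). Everything between them gets merged, so a split </text> can no longer break the match:

    page = nil
    File.foreach('pages-articles.xml') do |line|   # placeholder file name
      stripped = line.strip
      if stripped == '<page>'
        page = ''                      # start collecting a new page
      elsif stripped == '</page>'
        # One complete page in a single string; a </text> split across
        # physical lines is harmless now.
        body = page[%r{<text[^>]*>(.*?)</text>}m, 1]
        puts body[0, 80] if body       # do something with the article text
        page = nil
      elsif page
        page << line
      end
    end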
Just to further clarify what I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data?
Thanks again,
Harish