Hey Guys,
Thank you for the responses. My further queries are inline
under each of the individual responses below:
Jeff V. Merkey:
...
Look for the beginning tags for each section. Category
links are
embedded in the articles themselves.
This is a big problem for me, because when I do a regular
expression match on "Category :", I also get lines within the article
that are references to other categories. I just want the category that
the current article belongs to. It's worse because the spacing varies:
sometimes it's " Category :" and at times "Category :".
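For what it's worth, in MediaWiki markup a bare [[Category:X]] link is what
places the article *in* a category, while a link that merely points to a
category page is written with a leading colon, [[:Category:X]]. A regex that
requires the [[ prefix and tolerates stray whitespace should therefore skip
most of the false positives. A rough Ruby sketch (the sample line is made up):

```ruby
# Rough sketch: match only [[Category:...]] membership links, allowing
# stray whitespace around "Category" and the colon. A leading colon
# ([[:Category:...]], a mere link to the category page) will not match.
CATEGORY_RE = /\[\[\s*Category\s*:\s*([^\]|]+)/i

line = 'Some text [[ Category : Programming languages ]] and a link to [[:Category:Stubs]].'
categories = line.scan(CATEGORY_RE).flatten.map(&:strip)
# categories => ["Programming languages"]
```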
tags are <TAGNAME> start and </TAGNAME>
end.
True, but as you mentioned above, not everything I want is in a separate tag.
Jeff
------------------------------------------------------------------------------------------------
From: Brion Vibber <brion(a)pobox.com>
Harish TM wrote:
> I was trying to parse the Wikipedia dumps but
unfortunately I find the XML
> file that can be downloaded a little hard to parse. I was wondering if there
> is a neat way to extract:
> 1. The article title
/mediawiki/page/title
It's harder to link article titles to the article content
if they come from different sources, isn't it?
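In case it helps: the title and the text sit inside the same <page> element,
so if you pull both out of one page block at a time they stay linked. A rough
Ruby sketch against a toy input (the XML below is a made-up miniature of the
real dump schema):

```ruby
# Rough sketch: extract title and text together from each <page> block,
# so each title stays paired with its own content. The dump here is a
# made-up miniature of the real schema.
dump = <<~XML
  <mediawiki>
    <page>
      <title>Ruby (programming language)</title>
      <revision>
        <text>Ruby is a language. [[Category:Programming languages]]</text>
      </revision>
    </page>
  </mediawiki>
XML

pages = dump.scan(%r{<page>.*?</page>}m).map do |page|
  title = page[%r{<title>(.*?)</title>}m, 1]
  text  = page[%r{<text[^>]*>(.*?)</text>}m, 1]
  [title, text]
end
# pages.first => ["Ruby (programming language)", "Ruby is a language. [[Category:Programming languages]]"]
```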
> 2. The article content (
without links to articles
> in other languages, external links and so on )
The article content *contains* those links, so I guess
you mean you want
to parse the text and remove certain elements of it?
YES
> 3. The category.
Again, that's part of article text.
True - my problem with extracting this is as described above.
> Also I find that there are a large number of tools
that allow one to convert
> plain text to media wiki text. What if I want to go the other way and
> extract information exactly the way it appears on the wikipedia site.
Run the wiki parser on it.
Can't seem to find it. Searching for it just gives me Wikipedia
articles on parsing!
-------------------------------------------------------------------
From: "Jeff V. Merkey" <jmerkey(a)wolfmountaingroup.com>
This works too, but it's slower than molasses on a cold
Utah day ....
:-)
I'm working on a reasonably fast machine (64-bit, 3.something GHz processor
with 4 GB RAM), and using Ruby to write the parser.
---------------------------------------------------------------------
Platonides <Platonides(a)gmail.com>
Jeff wrote:
>You will need at least 16K buffer as many lines
> read with fgets can exceed 8192 bytes in size.
Shouldn't really be needed. You parse <
&& > tags. The problem is that
some tags can be split. You get "..long long line</te"
and on the next line "xt>", and *if* you're looking for
"</text>", you have
problems.
</text> is tricky, because most tags start on their own line, but
</text> doesn't (unless article ends with its own blank line).
Thanks for that! Are there some tags that are never split? That way
I could look for those, merge all the lines between them into a single
line, and run a regex.
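One way around the split-tag problem that doesn't depend on any tag being
line-aligned: read the file in fixed-size chunks into an accumulation buffer,
and only consume a <text>...</text> region once the closing tag has fully
arrived. A rough Ruby sketch (StringIO stands in for the dump file, and the
tiny chunk size is just there to force a split):

```ruby
require 'stringio'

# Rough sketch: accumulate fixed-size chunks in a buffer so a closing
# tag split across reads ("</te" + "xt>") is still found once complete.
# StringIO stands in for the dump file; chunk size 8 just forces splits.
input  = StringIO.new("<text>long long line</text>\n")
buffer = ""
texts  = []
while (chunk = input.read(8))
  buffer << chunk
  while (m = buffer.match(%r{<text[^>]*>(.*?)</text>}m))
    texts << m[1]
    buffer = buffer[m.end(0)..-1]  # drop the consumed region, keep the tail
  end
end
# texts => ["long long line"]
```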
Just to further clarify what I am looking for: let's say I want to
PRINT out a copy of Wikipedia (I know that's insane, but I need the
text to be as clean as if I were printing it out), with the articles
indexed by title and category. How would I get that data?
Thanks again
Harish