Hey Guys,
Thank you for the responses. My further queries are within the individual responses below:
Jeff V. Merkey:
...
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
This is a big problem for me, because when I do a regular expression match on "Category :", I also get lines within the article that are just references to other categories. I only want the category that the current article belongs to. It's worse because the spacing is inconsistent: sometimes it's " Category :" and at other times "Category :".
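To make the ambiguity concrete, here is the kind of match I am attempting in Ruby. The pattern rests on my understanding of the wikitext syntax (an assumption, not something from the dump spec): a plain [[Category:Name]] assigns the article to a category, while a leading colon, [[:Category:Name]], is just a link to the category page, and spacing varies in both.

    # Loose match for category assignments, tolerating "Category :",
    # " Category:" etc.; the leading-colon convention for mere links
    # is an assumption about wikitext syntax on my part.
    CATEGORY_RE = /\[\[\s*category\s*:\s*([^|\]]+)(?:\|[^\]]*)?\]\]/i

    def categories(wikitext)
      wikitext.scan(CATEGORY_RE).map { |(name)| name.strip }
    end

    puts categories("...[[Category: Physics]] and a link to [[:Category:Chemistry]]...")
    # prints only "Physics"; the colon-prefixed link never matches [[

If the colon convention holds, this would pick up only the article's own categories.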
Tags are <TAGNAME> start and </TAGNAME> end.
True, but as you mentioned above, not everything I want is in a separate tag.
Jeff
------------------------------------------------------------------------------------------------
From: Brion Vibber brion@pobox.com
Harish TM wrote:
I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
/mediawiki/page/title
It's harder to link article titles to the article content if the sources are different, isn't it?
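In case it is clearer in code, here is how I am thinking of pulling both out of a single pass with Ruby's bundled REXML stream parser, so the title and text always come from the same <page>. The element names (page, title, text) are my reading of the dump schema, so treat those paths as assumptions:

    require 'rexml/parsers/streamparser'
    require 'rexml/streamlistener'

    class PageListener
      include REXML::StreamListener

      def initialize(&handler)
        @handler = handler
        @buffer  = nil
      end

      def tag_start(name, _attrs)
        # Start buffering when entering a <title> or <text> element.
        @buffer = '' if name == 'title' || name == 'text'
      end

      def text(data)
        @buffer << data if @buffer
      end

      def tag_end(name)
        case name
        when 'title' then @title = @buffer
        when 'text'  then @body  = @buffer
        when 'page'  then @handler.call(@title, @body)  # title and text paired
        end
        @buffer = nil
      end
    end

    listener = PageListener.new { |title, _body| puts title }
    File.open('pages-articles.xml') do |f|   # placeholder dump file name
      REXML::Parsers::StreamParser.new(f, listener).parse
    end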
2. The article content (without links to articles in other languages, external links and so on)
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
YES
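Specifically, the elements I want to strip are the interlanguage links and the external links. Here is a rough first pass in Ruby, where both patterns are my guesses at the syntax ([[fr:Titre]] style for interlanguage links, [http://... label] for external ones) rather than a full grammar:

    def strip_links(wikitext)
      text = wikitext.dup
      # Interlanguage links such as [[de:Beispiel]]: a short language
      # code before the colon (assumed shape).
      text.gsub!(/\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]+\]\]/i, '')
      # External links: keep the visible label, drop bare URLs.
      text.gsub!(%r{\[https?://\S+\s+([^\]]+)\]}, '\1')
      text.gsub!(%r{\[https?://\S+\]}, '')
      text
    end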
3. The category.
Again, that's part of article text.
True - my problem with extracting this is the one described above.
Also, I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
Can't seem to find it. Searching for it just gives me Wikipedia articles on parsing!
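In the meantime, one workaround I am considering is letting Wikipedia's own installation do the rendering remotely. This leans on the api.php endpoint and its action=parse module as I understand them from the API documentation, so the parameter names below are assumptions on my part:

    require 'net/http'
    require 'uri'
    require 'json'

    # Ask the live site's parser to render a snippet of wikitext to HTML.
    def render(wikitext)
      uri = URI('https://en.wikipedia.org/w/api.php')
      res = Net::HTTP.post_form(uri, 'action' => 'parse',
                                     'text'   => wikitext,
                                     'prop'   => 'text',
                                     'format' => 'json')
      JSON.parse(res.body)['parse']['text']['*']   # the rendered HTML
    end

    puts render("'''Hello''' [[world]]")

For a whole dump this would obviously be far too slow; it is only a way to check what the "right" rendering of a given article looks like.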
-------------------------------------------------------------------
From: "Jeff V. Merkey" jmerkey@wolfmountaingroup.com
This works too, but it's slower than molasses on a cold Utah day .... :-)
I'm working on a reasonably fast machine (64-bit, 3.something GHz processor with 4 GB RAM) and using Ruby to code the parser.
---------------------------------------------------------------------
From: Platonides Platonides@gmail.com
Jeff wrote:
You will need at least a 16K buffer, as many lines read with fgets can exceed 8192 bytes in size.
Shouldn't really be needed. You parse < and > tags. The problem is that some tags can be split: you get "..long long line</te" and on the next line "xt>", and *if* you're looking for "</text>", you have problems. </text> is tricky, because most tags start on their own line, but </text> doesn't (unless the article ends with its own blank line).
Thanks for that!!! Are there some tags that are never split? That way I could look for those, merge all the lines between them into a single string, and run a regex on it.
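Here is that plan in Ruby, resting on exactly that assumption: that <page> and </page> always sit on their own lines (which I have not verified against the whole dump). Everything between them gets merged, so a split </text> can no longer break the match:

    page = nil
    File.foreach('pages-articles.xml') do |line|   # placeholder file name
      stripped = line.strip
      if stripped == '<page>'
        page = ''                      # start collecting a new page
      elsif stripped == '</page>'
        # One complete page in a single string; a </text> split across
        # physical lines is harmless now.
        body = page[%r{<text[^>]*>(.*?)</text>}m, 1]
        puts body[0, 80] if body       # do something with the article text
        page = nil
      elsif page
        page << line
      end
    end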
Just to further clarify what I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data?
Thanks again,
Harish