I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
2. The article content (without links to articles in other languages, external links and so on)
3. The category.
Also I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Harish
I have written a complete set of tools that do all of this, but they are not open sourced. I would suggest a simple C or C++ program reading from stdin and looking for just the tags you want. Be careful, as the buffering required is LARGE to parse these files. You will need at least a 16K buffer, as many lines read with fgets can exceed 8192 bytes in size.
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
Tags are <TAGNAME> at the start and </TAGNAME> at the end.
Jeff
Jeff wrote:
You will need at least 16K buffer as many lines read with fgets can exceed 8192 bytes in size.
Shouldn't really be needed. You parse on the < and > characters. The problem is that some tags can be split across lines: you get "..long long line</te" and on the next line "xt>", and *if* you're looking for "</text>", you have problems. </text> is tricky, because most tags start on their own line, but </text> doesn't (unless the article happens to end with its own blank line).
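For illustration, a minimal sketch of one way to deal with that in C, assuming line-based reads with fgets(): keep appending lines to a growing buffer until the closing </text> tag shows up in the joined text, then process the whole element at once (the processing step is left as a stub):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Accumulate lines into one growing buffer until </text> appears, so a
   closing tag split across two fgets() calls is still found once the
   pieces are joined.  Sketch only. */
int main(void)
{
    char line[0x10000];
    size_t cap = 1 << 20, len = 0;
    char *elem = malloc(cap);
    int intext = 0;

    if (!elem)
        return 1;

    while (fgets(line, sizeof line, stdin)) {
        if (!intext) {
            if (strstr(line, "<text"))
                intext = 1;              /* opening tag seen: start collecting */
            else
                continue;
        }
        size_t n = strlen(line);
        if (len + n + 1 > cap) {         /* grow the element buffer as needed */
            char *tmp = realloc(elem, cap *= 2);
            if (!tmp) { free(elem); return 1; }
            elem = tmp;
        }
        memcpy(elem + len, line, n + 1);
        len += n;
        if (strstr(elem, "</text>")) {   /* even a split "</te" + "xt>" is whole now */
            /* ... process the complete <text> element in elem here ... */
            len = 0;
            intext = 0;
        }
    }
    free(elem);
    return 0;
}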
Hey Guys,
Thank you for the responses. My further queries are inline, within the individual responses below:
Jeff V. Merkey:
...
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
This is a big problem for me, because when I do a regular expression match on "Category :", I get those lines within the article that are references to other categories as well. I just want the category that the current article belongs to. It's worse because the spacing around "Category" and the colon varies from article to article.
Tags are <TAGNAME> at the start and </TAGNAME> at the end.
True, but as you mentioned above, not everything I want is in a separate tag.
Jeff
------------------------------------------------------------------------------------------------
From: Brion Vibber brion@pobox.com
Harish TM wrote:
I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract: 1. The article title
/mediawiki/page/title
It's harder to link article titles to the article content if the sources are different, isn't it?
2. The article content ( without links to articles
in other languages, external links and so on )
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
YES
3. The category.
Again, that's part of article text.
True - My problem with extracting this is as described above.
Also I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
Can't seem to find it. Searching for it seems to give me Wikipedia articles on parsing!!!
-------------------------------------------------------------------
From: "Jeff V. Merkey" jmerkey@wolfmountaingroup.com
This works too, but it's slower than molasses on a cold Utah day.... :-)
Working on a reasonably fast machine (64-bit, 3.something GHz processor with 4 GB RAM). Using Ruby to code the parser.
---------------------------------------------------------------------
From: Platonides Platonides@gmail.com
Jeff wrote:
You will need at least 16K buffer as many lines read with fgets can exceed 8192 bytes in size.
Shouldn't really be needed. You parse on the < and > characters. The problem is that some tags can be split across lines: you get "..long long line</te" and on the next line "xt>", and *if* you're looking for "</text>", you have problems. </text> is tricky, because most tags start on their own line, but </text> doesn't (unless the article happens to end with its own blank line).
Thanks for that!!! Are there some tags that are never split? That way I could look for those, merge all the lines between them into a single line and do a regex.
Just to further clarify what it is that I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data??
Thanks again Harish
On 1/17/07, Harish TM harish.tmh@gmail.com wrote:
This is a big problem for me, because when I do a regular expression match on "Category :", I get those lines within the article that are references to other categories as well. I just want the category that the current article belongs to. It's worse because the spacing around "Category" and the colon varies from article to article.
Any category link without a leading colon will add the category. With a leading colon, it links. So [[Category:Foo]] categorizes, [[:Category:Foo]] links. You can use that to your advantage.
You may wish to look over the parser at http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Parser.php?v.... Be aware that the parser is not short, simple, or clean, because (as you've discovered) neither is the markup language.
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,-./0-9:;=?@A-Z\^_`a-z~\x80-\xFF+]+)\]\]/i
Which would give you the category name. Hopefully. In most cases. Because other things can make a title invalid. Basically, it's a mess, and the only surefire way around it is using MediaWiki itself to do the parsing.
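For illustration, a rough sketch of driving such a pattern from C with the PCRE library. The character class here is deliberately simplified to "anything except ] or |" instead of the full legal-title class, so treat it as an approximation:

#include <stdio.h>
#include <string.h>
#include <pcre.h>   /* classic PCRE1 API; link with -lpcre */

/* Print every [[Category:...]] target found in one article's wikitext. */
static void print_categories(const char *wikitext)
{
    const char *err;
    int erroff;
    pcre *re = pcre_compile("\\[\\[\\s*Category\\s*:\\s*([^\\]|]+)",
                            PCRE_CASELESS, &err, &erroff, NULL);
    if (!re) {
        fprintf(stderr, "pcre_compile failed at %d: %s\n", erroff, err);
        return;
    }

    int ovec[6];
    int start = 0, len = (int)strlen(wikitext);
    while (pcre_exec(re, NULL, wikitext, len, start, 0, ovec, 6) >= 0) {
        /* capture group 1 is the category name (a sort key after '|' is cut off) */
        printf("%.*s\n", ovec[3] - ovec[2], wikitext + ovec[2]);
        start = ovec[1];             /* continue searching after this match */
    }
    pcre_free(re);
}

int main(void)
{
    /* hypothetical sample text, just to exercise the function */
    print_categories("Text. [[Category:Free software]] [[:Category:Linked only]]");
    return 0;
}

The PCRE_CASELESS option covers the case-insensitivity of "Category", and the leading-colon form ([[:Category:Foo]]) is skipped because the extra colon prevents the match.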
Can't seem to find it. Searching for it seems to give me Wikipedia articles on parsing!!!
Basically: install MediaWiki and hack it to do what you want. The parser is not well-defined or application-independent. You can make your own simplified parser, but it *will* fail in corner-cases (unless your simplifications consist of hacking out stuff you don't need and inlining stuff from other files).
Just to further clarify what it is that I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data??
Use your own script to parse the XML part of the dumps and reformat them how you like. Then run each article's text through the parser, and grab the category list it spits back. It will not be very easy.
Thanks a ton.
I will try hacking MediaWiki, will post back if there are issues
Harish
Simetrical <Simetrical+wikitech@...> writes:
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,-./0-9:;=?@A-Z\^_`a-z~\x80-\xFF+]+)\]\]/i
Note that one also needs to cope with leading and trailing spaces, and the "Category" is case-insensitive. So probably something more like:
[[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
Which would give you the category name. Hopefully. In most cases. Because other things can make a title invalid. Basically, it's a mess, and the only surefire way around it is using MediaWiki itself to do the parsing.
Or, download the categorylinks.sql file from the same dump and use the entries from that. A bit redundant if one needs to parse the markup anyway, though...
A good first step is probably to use an XML toolkit/parser library of some kind for whatever PL one might be using for the task. Unless one happens to be using Haskell, I can't offer any specific advice on that... (Did the original querent specify?) Not a complete solution, since the markup is something of an XML/HTML mish-mash, but it'll still cope with much of the structure.
Slan, Alai.
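To illustrate that suggestion, here is a minimal sketch in C using the Expat library (one streaming XML parser among several; the choice is an assumption, not something from the thread) that pulls each page's <title> out of a dump fed on stdin:

#include <stdio.h>
#include <string.h>
#include <expat.h>   /* SAX-style streaming parser; link with -lexpat */

/* Collect character data while inside a <title> element and print it
   when the element closes.  Streaming, so the whole dump never has to
   fit in memory. */
static int in_title = 0;
static char title[4096];
static size_t title_len = 0;

static void XMLCALL start_el(void *ud, const XML_Char *name, const XML_Char **attrs)
{
    (void)ud; (void)attrs;
    if (strcmp(name, "title") == 0) {
        in_title = 1;
        title_len = 0;
    }
}

static void XMLCALL end_el(void *ud, const XML_Char *name)
{
    (void)ud;
    if (strcmp(name, "title") == 0) {
        title[title_len] = '\0';
        printf("%s\n", title);
        in_title = 0;
    }
}

static void XMLCALL chars(void *ud, const XML_Char *s, int len)
{
    (void)ud;
    /* character data for one element can arrive in several pieces */
    if (in_title && title_len + (size_t)len < sizeof title - 1) {
        memcpy(title + title_len, s, (size_t)len);
        title_len += (size_t)len;
    }
}

int main(void)
{
    char buf[1 << 16];
    int done = 0;
    XML_Parser p = XML_ParserCreate(NULL);

    XML_SetElementHandler(p, start_el, end_el);
    XML_SetCharacterDataHandler(p, chars);

    while (!done) {
        size_t n = fread(buf, 1, sizeof buf, stdin);
        done = (n < sizeof buf);
        if (XML_Parse(p, buf, (int)n, done) == XML_STATUS_ERROR) {
            fprintf(stderr, "XML error: %s\n",
                    XML_ErrorString(XML_GetErrorCode(p)));
            break;
        }
    }
    XML_ParserFree(p);
    return 0;
}

Compile with something like gcc titles.c -lexpat and run it as ./titles < enwiki<date>.xml. Entity references in titles (&amp;, &quot; and so on) are decoded by the library before the handler ever sees them.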
"Alai" AlaiWiki@gmail.com wrote in message news:loom.20070118T080426-624@post.gmane.org...
Simetrical <Simetrical+wikitech@...> writes:
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,-./0-9:;=?@A-Z\^_`a-z~\x80-\xFF+]+)\]\]/i
Note that one also needs to cope with leading and trailing spaces, and the "Category" is case-insensitive. So probably something more like:
[[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
The /i at the end of the regex makes it case insensitive.
- Mark Clements (HappyDog)
Alai wrote:
Note that one also needs to cope with leading and trailing spaces, and the "Category" is case-insensitive. So probably something more like:
[[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
Or just use an "ignore case" option like /i (in PCRE)...
-- chris
On 1/17/07, Harish TM harish.tmh@gmail.com wrote:
Just to further clarify what it is that I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data??
The easiest way would probably be:
1) download the dumps
2) install MediaWiki
3) have a bot scrape from *your own* MediaWiki installation (printable version; see the libcurl sketch below)
4) use the categorylinks table to sort by category (or scrape the categories at the bottom of each article)
On the other hand, you could always write your own parser :).
Anthony
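For step 3, here is a rough sketch in C using libcurl to fetch the printable rendering of a single page from a local MediaWiki install. The host, script path, and example title are placeholders; printable=yes is the query parameter that asks MediaWiki for its print view:

#include <stdio.h>
#include <curl/curl.h>   /* link with -lcurl */

/* Write callback: append whatever libcurl hands us to the given FILE*. */
static size_t write_to_file(void *data, size_t size, size_t nmemb, void *userp)
{
    return fwrite(data, size, nmemb, (FILE *)userp);
}

/* Fetch the printable rendering of one article from a local MediaWiki
   installation and save it to a file.  Assumes a default install
   reachable at http://localhost/wiki/; adjust to taste. */
static int fetch_printable(const char *title, const char *outfile)
{
    char url[1024];
    FILE *out = fopen(outfile, "wb");
    CURL *curl = curl_easy_init();
    CURLcode rc = CURLE_FAILED_INIT;

    if (out && curl) {
        snprintf(url, sizeof url,
                 "http://localhost/wiki/index.php?title=%s&printable=yes", title);
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_to_file);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
        rc = curl_easy_perform(curl);
    }
    if (curl)
        curl_easy_cleanup(curl);
    if (out)
        fclose(out);
    return rc != CURLE_OK;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_ALL);
    /* one hard-coded page just for the example; a real bot would loop
       over titles pulled from the dump */
    int err = fetch_printable("Main_Page", "Main_Page.html");
    curl_global_cleanup();
    return err;
}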
This is actually quite a cool solution!!!
It is a little roundabout but it's good to know I have alternatives!!
Harish
Harish TM wrote:
I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract: 1. The article title
/mediawiki/page/title
2. The article content ( without links to articles
in other languages, external links and so on )
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
3. The category.
Again, that's part of article text.
Also I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
-- brion vibber (brion @ pobox.com)
This works too, but it's slower than molasses on a cold Utah day....
:-)
Jeff
Brion Vibber wrote:
Run the wiki parser on it.
Or download it already parsed: http://static.wikipedia.org/downloads/November_2006/en/
Matthew Flaschen
You could also just modify this code (released under GPLv3) and use it to strip out titles.
Stuff it into a file under Linux called "parsetitle.c" and type:
gcc parsetitle.c -o parsetitle
./parsetitle < enwiki<date>.xml > titles.txt
Jeff
#include "platform.h"
#ifdef WINDOWS
#define strncasecmp strnicmp
#include "windows.h" #include "winioctl.h" #include "winuser.h" #include "stdarg.h" typedef UCHAR BYTE; typedef USHORT WORD; #include "stdio.h" #include "stdlib.h" #include "ctype.h" #include "conio.h"
#endif
#ifdef LINUX
#include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <ctype.h> #include <string.h> //#include <ncurses.h> #include <termios.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <pthread.h> #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #include <net/if.h> #include <stdio.h> #include <errno.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sched.h> #include <ctype.h> #include <openssl/md5.h>
#endif
char buffer[0x10000]; char title[4096];
int main(int argc, char *argv[]) { register char *s, *p; register int intitle = 0, i, f, inpage = 0; register int titlefound = 0, revision = 0, inrev = 0;
while (s = fgets(buffer, 0x10000, stdin)) { while (*s) { if (!*s || *s == '\n') { if (*s) { // putc(*s, stdout); s++; } break; }
if (!memcmp(s, "<page>", 6)) { s += 6; inpage++; titlefound = 0; revision = 0; continue; }
if (!memcmp(s, "</page>", 7)) { s += 7;
if (!titlefound) fprintf(stdout, "no article title?\n");
if (!revision) fprintf(stdout, "no revision?\n");
titlefound = 0; revision = 0; if (inpage) inpage--; continue; }
if (!memcmp(s, "</revision>", 11)) { if (inrev) inrev--; s += 11; continue; }
if (!memcmp(s, "<revision>", 10)) { inrev++; revision = 1; s += 10; continue; }
if (!memcmp(s, "<title>", 7)) { intitle++; s += 7;
p = strstr(s, "</title>"); if (p) { if (intitle) intitle--;
if (p - s) { strncpy(title, s, p - s); title[p - s] = '\0'; s += (p - s);
for (f=i=0; i < (p - s); i++) { if (!isspace(*p++)) f = 1; } if (f) fprintf(stdout, "[%s] SPACES?\n", title); else fprintf(stdout, "[%s]\n", title); } else fprintf(stdout, "[%s] NULL?\n", s);
titlefound = 1; continue; }
if (intitle) { intitle--; printf("state error: title spanned segments [%s]\n", s); continue; } } // putc(*s, stdout); s++; } } return 0; }
This C program will also run and complete about 50 times faster than the Java and PHP code previously mentioned.
Jeff
Jeffrey V. Merkey wrote:
You could also just modify this code (released under GPLv3) and use it to strip out titles.
Stuff it into a file under linux called "parsetitle.c" and type:
gcc parsetitle.c -o parsetitle
./parsetitle < enwiki<date>.xml > titles.txt
Jeff
You'll do much better with a purpose-built lexical analyzer. If nothing else, this'll work when titles and tags span across buffers, something which the previous program will choke on.
Released into the public domain as a trivial example program. Save as 'titleparse.l' and 'make titleparse'.
%option noyywrap

%{
#include <stdio.h>
#include <stdlib.h>
%}

%x TITLE

%%

<INITIAL>"<title>"    { BEGIN TITLE; }
<TITLE>"</title>"     { BEGIN INITIAL; putchar('\n'); }
    /* decode the XML entities that can appear inside <title> */
<TITLE>"&lt;"         { putchar('<'); }
<TITLE>"&gt;"         { putchar('>'); }
<TITLE>"&quot;"       { putchar('"'); }
<INITIAL>.|\n         /* ignored */

%%

int main(int argc, char *argv[])
{
    yylex();
    exit(0);
}
zetawoof wrote:
You'll do much better with a purpose-built lexical analyzer. If nothing else, this'll work when titles and tags span across buffers -
If they span buffers, the XML parsing libs choke as well, just for your information. I know, I've seen them do it. I have a version that does not choke on buffer spanning; it buffers underneath. I just posted that as an example.
Jeff
Thanks a ton guys...
Harish