I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
2. The article content (without links to articles in other languages, external links and so on)
3. The category.
Also I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Harish
I have written a complete set of tools that do all of this, but they are not open sourced. I would suggest a simple C or C++ program reading from stdin and looking for just the tags you want. Be careful, as the buffering required is LARGE to parse these files. You will need at least a 16K buffer, as many lines read with fgets can exceed 8192 bytes in size.
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
Tags are <TAGNAME> at the start and </TAGNAME> at the end.
Jeff
Jeff wrote:
You will need at least 16K buffer as many lines read with fgets can exceed 8192 bytes in size.
Shouldn't really be needed. You parse on the < and > characters. The problem is that some tags can be split across lines: you get "..long long line</te" and on the next line "xt>", and *if* you're looking for "</text>", you have problems. </text> is tricky, because most tags start on their own line, but </text> doesn't (unless the article happens to end with its own blank line).
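For illustration, a minimal sketch of one way to deal with that in C, assuming line-based reads with fgets(): keep appending lines to a growing buffer until the closing </text> tag shows up in the joined text, then process the whole element at once (the processing step is left as a stub):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Accumulate lines into one growing buffer until </text> appears, so a
   closing tag split across two fgets() calls is still found once the
   pieces are joined.  Sketch only. */
int main(void)
{
    char line[0x10000];
    size_t cap = 1 << 20, len = 0;
    char *elem = malloc(cap);
    int intext = 0;

    if (!elem)
        return 1;

    while (fgets(line, sizeof line, stdin)) {
        if (!intext) {
            if (strstr(line, "<text"))
                intext = 1;              /* opening tag seen: start collecting */
            else
                continue;
        }
        size_t n = strlen(line);
        if (len + n + 1 > cap) {         /* grow the element buffer as needed */
            char *tmp = realloc(elem, cap *= 2);
            if (!tmp) { free(elem); return 1; }
            elem = tmp;
        }
        memcpy(elem + len, line, n + 1);
        len += n;
        if (strstr(elem, "</text>")) {   /* even a split "</te" + "xt>" is whole now */
            /* ... process the complete <text> element in elem here ... */
            len = 0;
            intext = 0;
        }
    }
    free(elem);
    return 0;
}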
Hey Guys,
Thank you for the responses. My further queries are inline, within the individual responses below:
Jeff V. Merkey:
...
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
This is a big problem for me, because when I do a regular expression match on "Category :", I get those lines within the article that are references to other categories as well. I just want the category that the current article belongs to. It's worse because the spacing around "Category" and the colon varies from article to article.
Tags are <TAGNAME> at the start and </TAGNAME> at the end.
True, but as you mentioned above, not everything I want is in a separate tag.
Jeff
------------------------------------------------------------------------------------------------
From: Brion Vibber brion@pobox.com
Harish TM wrote:
I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract: 1. The article title
/mediawiki/page/title
It's harder to link article titles to the article content if the sources are different, isn't it?
2. The article content ( without links to articles
in other languages, external links and so on )
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
YES
3. The category.
Again, that's part of article text.
True - My problem with extracting this is as described above.
Also I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
Can't seem to find it. Searching for it seems to give me Wikipedia articles on parsing!!!
-------------------------------------------------------------------
From: "Jeff V. Merkey" jmerkey@wolfmountaingroup.com
This works too, but it's slower than molasses on a cold Utah day.... :-)
Working on a reasonably fast machine (64-bit, 3.something GHz processor with 4 GB RAM). Using Ruby to code the parser.
---------------------------------------------------------------------
From: Platonides Platonides@gmail.com
Jeff wrote:
You will need at least 16K buffer as many lines read with fgets can exceed 8192 bytes in size.
Shouldn't really be needed. You parse on the < and > characters. The problem is that some tags can be split across lines: you get "..long long line</te" and on the next line "xt>", and *if* you're looking for "</text>", you have problems. </text> is tricky, because most tags start on their own line, but </text> doesn't (unless the article happens to end with its own blank line).
Thanks for that!!! Are there some tags that are never split? That way I could look for those, merge all the lines between them into a single line and do a regex.
Just to further clarify what it is that I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data??
Thanks again Harish
On 1/17/07, Harish TM harish.tmh@gmail.com wrote:
This is a big problem for me, because when I do a regular expression match on "Category :", I get those lines within the article that are references to other categories as well. I just want the category that the current article belongs to. It's worse because the spacing around "Category" and the colon varies from article to article.
Any category link without a leading colon will add the category. With a leading colon, it links. So [[Category:Foo]] categorizes, [[:Category:Foo]] links. You can use that to your advantage.
You may wish to look over the parser at http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Parser.php?v.... Be aware that the parser is not short, simple, or clean, because (as you've discovered) neither is the markup language.
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,-./0-9:;=?@A-Z\^_`a-z~\x80-\xFF+]+)\]\]/i
Which would give you the category name. Hopefully. In most cases. Because other things can make a title invalid. Basically, it's a mess, and the only surefire way around it is using MediaWiki itself to do the parsing.
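For illustration, a rough sketch of driving such a pattern from C with the PCRE library. The character class here is deliberately simplified to "anything except ] or |" instead of the full legal-title class, so treat it as an approximation:

#include <stdio.h>
#include <string.h>
#include <pcre.h>   /* classic PCRE1 API; link with -lpcre */

/* Print every [[Category:...]] target found in one article's wikitext. */
static void print_categories(const char *wikitext)
{
    const char *err;
    int erroff;
    pcre *re = pcre_compile("\\[\\[\\s*Category\\s*:\\s*([^\\]|]+)",
                            PCRE_CASELESS, &err, &erroff, NULL);
    if (!re) {
        fprintf(stderr, "pcre_compile failed at %d: %s\n", erroff, err);
        return;
    }

    int ovec[6];
    int start = 0, len = (int)strlen(wikitext);
    while (pcre_exec(re, NULL, wikitext, len, start, 0, ovec, 6) >= 0) {
        /* capture group 1 is the category name (a sort key after '|' is cut off) */
        printf("%.*s\n", ovec[3] - ovec[2], wikitext + ovec[2]);
        start = ovec[1];             /* continue searching after this match */
    }
    pcre_free(re);
}

int main(void)
{
    /* hypothetical sample text, just to exercise the function */
    print_categories("Text. [[Category:Free software]] [[:Category:Linked only]]");
    return 0;
}

The PCRE_CASELESS option covers the case-insensitivity of "Category", and the leading-colon form ([[:Category:Foo]]) is skipped because the extra colon prevents the match.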
Can't seem to find it. Searching for it seems to give me Wikipedia articles on parsing!!!
Basically: install MediaWiki and hack it to do what you want. The parser is not well-defined or application-independent. You can make your own simplified parser, but it *will* fail in corner-cases (unless your simplifications consist of hacking out stuff you don't need and inlining stuff from other files).
Just to further clarify what it is that I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data??
Use your own script to parse the XML part of the dumps and reformat them how you like. Then run each article's text through the parser, and grab the category list it spits back. It will not be very easy.
Thanks a ton.
I will try hacking MediaWiki, will post back if there are issues
Harish
Simetrical <Simetrical+wikitech@...> writes:
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,-./0-9:;=?@A-Z\^_`a-z~\x80-\xFF+]+)\]\]/i
Note that one also needs to cope with leading and trailing spaces, and the "Category" is case-insensitive. So probably something more like:
[[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
Which would give you the category name. Hopefully. In most cases. Because other things can make a title invalid. Basically, it's a mess, and the only surefire way around it is using MediaWiki itself to do the parsing.
Or, download the categorylinks.sql file from the same dump and use the entries from that. A bit redundant if one needs to parse the markup anyway, though...
A good first step is probably to use an XML toolkit/parser library of some kind for whatever PL one might be using for the task. Unless one happens to be using Haskell, I can't offer any specific advice on that... (Did the original querent specify?) Not a complete solution, since the markup is something of an XML/HTML mish-mash, but it'll still cope with much of the structure.
Slan, Alai.
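To illustrate that suggestion, here is a minimal sketch in C using the Expat library (one streaming XML parser among several; the choice is an assumption, not something from the thread) that pulls each page's <title> out of a dump fed on stdin:

#include <stdio.h>
#include <string.h>
#include <expat.h>   /* SAX-style streaming parser; link with -lexpat */

/* Collect character data while inside a <title> element and print it
   when the element closes.  Streaming, so the whole dump never has to
   fit in memory. */
static int in_title = 0;
static char title[4096];
static size_t title_len = 0;

static void XMLCALL start_el(void *ud, const XML_Char *name, const XML_Char **attrs)
{
    (void)ud; (void)attrs;
    if (strcmp(name, "title") == 0) {
        in_title = 1;
        title_len = 0;
    }
}

static void XMLCALL end_el(void *ud, const XML_Char *name)
{
    (void)ud;
    if (strcmp(name, "title") == 0) {
        title[title_len] = '\0';
        printf("%s\n", title);
        in_title = 0;
    }
}

static void XMLCALL chars(void *ud, const XML_Char *s, int len)
{
    (void)ud;
    /* character data for one element can arrive in several pieces */
    if (in_title && title_len + (size_t)len < sizeof title - 1) {
        memcpy(title + title_len, s, (size_t)len);
        title_len += (size_t)len;
    }
}

int main(void)
{
    char buf[1 << 16];
    int done = 0;
    XML_Parser p = XML_ParserCreate(NULL);

    XML_SetElementHandler(p, start_el, end_el);
    XML_SetCharacterDataHandler(p, chars);

    while (!done) {
        size_t n = fread(buf, 1, sizeof buf, stdin);
        done = (n < sizeof buf);
        if (XML_Parse(p, buf, (int)n, done) == XML_STATUS_ERROR) {
            fprintf(stderr, "XML error: %s\n",
                    XML_ErrorString(XML_GetErrorCode(p)));
            break;
        }
    }
    XML_ParserFree(p);
    return 0;
}

Compile with something like gcc titles.c -lexpat and run it as ./titles < enwiki<date>.xml. Entity references in titles (&amp;, &quot; and so on) are decoded by the library before the handler ever sees them.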
"Alai" AlaiWiki@gmail.com wrote in message news:loom.20070118T080426-624@post.gmane.org...
Simetrical <Simetrical+wikitech@...> writes:
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,-./0-9:;=?@A-Z\^_`a-z~\x80-\xFF+]+)\]\]/i
Note that one also needs to cope with leading and trailing spaces, and the "Category" is case-insensitive. So probably something more like:
[[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
The /i at the end of the regex makes it case insensitive.
- Mark Clements (HappyDog)
Alai wrote:
Note that one also needs to cope with leading and trailing spaces, and the "Category" is case-insensitive. So probably something more like:
[[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
Or just use an "ignore case" option like /i (in PCRE)...
-- chris
On 1/17/07, Harish TM harish.tmh@gmail.com wrote:
Just to further clarify what it is that I am looking for: let's say I want to PRINT out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data??
The easiest way would probably be:
1) download the dumps
2) install MediaWiki
3) have a bot scrape from *your own* MediaWiki installation (printable version; see the libcurl sketch below)
4) use the categorylinks table to sort by category (or scrape the categories at the bottom of each article)
On the other hand, you could always write your own parser :).
Anthony
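For step 3, here is a rough sketch in C using libcurl to fetch the printable rendering of a single page from a local MediaWiki install. The host, script path, and example title are placeholders; printable=yes is the query parameter that asks MediaWiki for its print view:

#include <stdio.h>
#include <curl/curl.h>   /* link with -lcurl */

/* Write callback: append whatever libcurl hands us to the given FILE*. */
static size_t write_to_file(void *data, size_t size, size_t nmemb, void *userp)
{
    return fwrite(data, size, nmemb, (FILE *)userp);
}

/* Fetch the printable rendering of one article from a local MediaWiki
   installation and save it to a file.  Assumes a default install
   reachable at http://localhost/wiki/; adjust to taste. */
static int fetch_printable(const char *title, const char *outfile)
{
    char url[1024];
    FILE *out = fopen(outfile, "wb");
    CURL *curl = curl_easy_init();
    CURLcode rc = CURLE_FAILED_INIT;

    if (out && curl) {
        snprintf(url, sizeof url,
                 "http://localhost/wiki/index.php?title=%s&printable=yes", title);
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_to_file);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
        rc = curl_easy_perform(curl);
    }
    if (curl)
        curl_easy_cleanup(curl);
    if (out)
        fclose(out);
    return rc != CURLE_OK;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_ALL);
    /* one hard-coded page just for the example; a real bot would loop
       over titles pulled from the dump */
    int err = fetch_printable("Main_Page", "Main_Page.html");
    curl_global_cleanup();
    return err;
}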
This is actually quite a cool solution!!!
It is a little roundabout but it's good to know I have alternatives!!
Harish
Harish TM wrote:
I was trying to parse the Wikipedia dumps but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract: 1. The article title
/mediawiki/page/title
2. The article content ( without links to articles
in other languages, external links and so on )
The article content *contains* those links, so I guess you mean you want to parse the text and remove certain elements of it?
3. The category.
Again, that's part of article text.
Also I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Run the wiki parser on it.
-- brion vibber (brion @ pobox.com)
This works too, but it's slower than molasses on a cold Utah day....
:-)
Jeff
Brion Vibber wrote:
Run the wiki parser on it.
Or download it already parsed: http://static.wikipedia.org/downloads/November_2006/en/
Matthew Flaschen
You could also just modify this code (released under GPLv3) and use it to strip out titles.
Stuff it into a file under Linux called "parsetitle.c" and type:
gcc parsetitle.c -o parsetitle
./parsetitle < enwiki<date>.xml > titles.txt
Jeff
#include "platform.h"
#ifdef WINDOWS
#define strncasecmp strnicmp
#include "windows.h" #include "winioctl.h" #include "winuser.h" #include "stdarg.h" typedef UCHAR BYTE; typedef USHORT WORD; #include "stdio.h" #include "stdlib.h" #include "ctype.h" #include "conio.h"
#endif
#ifdef LINUX
#include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <ctype.h> #include <string.h> //#include <ncurses.h> #include <termios.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <pthread.h> #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #include <net/if.h> #include <stdio.h> #include <errno.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sched.h> #include <ctype.h> #include <openssl/md5.h>
#endif
char buffer[0x10000]; char title[4096];
int main(int argc, char *argv[]) { register char *s, *p; register int intitle = 0, i, f, inpage = 0; register int titlefound = 0, revision = 0, inrev = 0;
while (s = fgets(buffer, 0x10000, stdin)) { while (*s) { if (!*s || *s == '\n') { if (*s) { // putc(*s, stdout); s++; } break; }
if (!memcmp(s, "<page>", 6)) { s += 6; inpage++; titlefound = 0; revision = 0; continue; }
if (!memcmp(s, "</page>", 7)) { s += 7;
if (!titlefound) fprintf(stdout, "no article title?\n");
if (!revision) fprintf(stdout, "no revision?\n");
titlefound = 0; revision = 0; if (inpage) inpage--; continue; }
if (!memcmp(s, "</revision>", 11)) { if (inrev) inrev--; s += 11; continue; }
if (!memcmp(s, "<revision>", 10)) { inrev++; revision = 1; s += 10; continue; }
if (!memcmp(s, "<title>", 7)) { intitle++; s += 7;
p = strstr(s, "</title>"); if (p) { if (intitle) intitle--;
if (p - s) { strncpy(title, s, p - s); title[p - s] = '\0'; s += (p - s);
for (f=i=0; i < (p - s); i++) { if (!isspace(*p++)) f = 1; } if (f) fprintf(stdout, "[%s] SPACES?\n", title); else fprintf(stdout, "[%s]\n", title); } else fprintf(stdout, "[%s] NULL?\n", s);
titlefound = 1; continue; }
if (intitle) { intitle--; printf("state error: title spanned segments [%s]\n", s); continue; } } // putc(*s, stdout); s++; } } return 0; }
This C program will also run and complete about 50 times faster than the Java and PHP code previously mentioned.
Jeff
Jeffrey V. Merkey wrote:
You could also just modify this code (released under GPLv3) and use it to strip out titles.
Stuff it into a file under linux called "parsetitle.c" and type:
gcc parsetitle.c -o parsetitle
./parsetitle < enwiki<date>.xml > titles.txt
Jeff
You'll do much better with a purpose-built lexical analyzer. If nothing else, this'll work when titles and tags span across buffers, something which the previous program will choke on.
Released into the public domain as a trivial example program. Save as 'titleparse.l' and 'make titleparse'.
%option noyywrap

%{
#include <stdio.h>
#include <stdlib.h>
%}

%x TITLE

%%

<INITIAL>"<title>"    { BEGIN TITLE; }
<TITLE>"</title>"     { BEGIN INITIAL; putchar('\n'); }
    /* decode the XML entities that can appear inside <title> */
<TITLE>"&lt;"         { putchar('<'); }
<TITLE>"&gt;"         { putchar('>'); }
<TITLE>"&quot;"       { putchar('"'); }
<INITIAL>.|\n         /* ignored */

%%

int main(int argc, char *argv[])
{
    yylex();
    exit(0);
}
zetawoof wrote:
You'll do much better with a purpose-built lexical analyzer. If nothing else, this'll work when titles and tags span across buffers -
If they span buffers, the XML parsing libs choke as well, just for your information. I know, I've seen them do it. I have a version that does not choke on buffer spanning; it buffers underneath. I just posted that as an example.
Jeff
Thanks a ton guys...
Harish