Hi!
I am looking for static HTML dumps of the English Wikipedia, but I have found only this page: http://static.wikipedia.org/ The dumps on that page are almost three years old!
Do you know of other pages where I could download recent HTML dumps, or do you have any idea how I could get them?
Thank you for your help, Samat
Hello,
The HTML dump process isn't running anymore, according to the message on the downloads page. The most recent en.wiki dump is in XML and is here: http://dumps.wikimedia.org/enwiki/20110317/
Best,
Huib Laurens
Hi Huib,
Thank you for your prompt answer.
I know about the XML dumps, but I am looking for HTML dumps for linguistic (and related) research. The research needs the full text of Wikipedia articles without any other content. The researcher asked me to help find a solution that avoids writing a program to generate this from the XML. (I am not a programmer, but there could be problems with transcluded pages, templates, and other special content.)
Best regards, Samat
On Tue, Apr 5, 2011 at 14:55, Huib Laurens sterkebak@gmail.com wrote:
(...)
On 4/5/2011 9:38 AM, Samat wrote:
(...)
I did a substantial project that worked from the XML dumps. I designed a recursive descent parser in C# that, with a few tricks, almost decodes Wikipedia markup correctly. Getting it right is tricky for a number of reasons; however, my approach preserved some semantics that would have been lost in the HTML dumps.
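To give a flavour of the recursive descent idea (a toy Python sketch, not my C# code; it only tracks nested {{templates}} and [[links]] and ignores tables, tags and all the hard cases):

# Toy recursive-descent walk over wikitext that tracks the nesting of
# {{templates}} and [[links]]. The nesting is exactly why a single
# regex pass over the markup falls short.

def parse_nodes(text, pos=0, closer=None):
    """Return (nodes, new_pos); a node is plain text or a
    ('template'|'link', children) tuple."""
    nodes = []
    buf = []
    while pos < len(text):
        if closer and text.startswith(closer, pos):
            break
        if text.startswith("{{", pos):
            if buf:
                nodes.append("".join(buf)); buf = []
            children, pos = parse_nodes(text, pos + 2, "}}")
            nodes.append(("template", children))
            pos += 2  # skip the closing }}
        elif text.startswith("[[", pos):
            if buf:
                nodes.append("".join(buf)); buf = []
            children, pos = parse_nodes(text, pos + 2, "]]")
            nodes.append(("link", children))
            pos += 2  # skip the closing ]]
        else:
            buf.append(text[pos])
            pos += 1
    if buf:
        nodes.append("".join(buf))
    return nodes, pos

if __name__ == "__main__":
    sample = "Foo {{Infobox|name=[[Bar]]}} baz [[Qux|quux]]."
    tree, _ = parse_nodes(sample)
    print(tree)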
If I had to do it all again, I'd probably download a copy of the MediaWiki software, load Wikipedia into it, and then add some 'hooks' into the template-handling code that would let me put traps on important templates and other parts of the markup handling, so template nesting gets handled correctly.
Old static dumps, going back to June 2008, are available:
In your case, I'd do the following: install a copy of the MediaWiki software,
http://lifehacker.com/#!163707/geek-to-live--set-up-your-personal-wikipedia
get a list of all the pages in the wiki by running a database query, and then write a script that makes HTTP requests for all the pages and saves them in files. This is programming of the simplest type, but getting good speed could be a challenge. I'd seriously consider using Amazon EC2 for this kind of thing: renting a big DB server and a big web server, then writing a script that does the download in parallel.
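To sketch what I mean, something like the following Python script would do the fetch-and-save in parallel. It assumes a titles.txt with one page title per line (the output of the database query) and a local MediaWiki answering at BASE_URL; both names are placeholders you'd adjust.

# Minimal parallel "fetch every page over HTTP and save it" sketch.
import os
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost/w/index.php"  # adjust to your install
OUT_DIR = "html_dump"
WORKERS = 8                                # tune for your servers

def fetch(title):
    # Render the page through the wiki and write the HTML to a file.
    url = BASE_URL + "?" + urllib.parse.urlencode({"title": title})
    out_path = os.path.join(OUT_DIR, urllib.parse.quote(title, safe="") + ".html")
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            html = resp.read()
        with open(out_path, "wb") as f:
            f.write(html)
    except Exception as exc:
        print("failed:", title, exc)

def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    with open("titles.txt", encoding="utf-8") as f:
        titles = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        pool.map(fetch, titles)

if __name__ == "__main__":
    main()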
Paul Houle wrote:
(...)
He could just as well generate the static HTML dumps from that: http://www.mediawiki.org/wiki/Extension:DumpHTML
I think he is better off parsing the articles, though.
For linguistic research you don't need things such as the contents of templates, so a simple wikitext-stripping pass would do. And it will be much, much faster than parsing the whole wiki.
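Something along these lines would already get you most of the way. This is only a rough Python sketch; it throws away templates and refs with regular expressions, so it only approximates real parsing and will miss tables, parser tags and plenty of corner cases.

# Crude wikitext stripper: drop templates, refs and markup, keep the text.
import re

def strip_wikitext(text):
    # HTML comments and <ref>...</ref> footnotes
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Templates: strip innermost {{...}} repeatedly to cope with nesting
    prev = None
    while prev != text:
        prev = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # File/image and category links
    text = re.sub(r"\[\[(?:File|Image|Category):[^\[\]]*\]\]", "", text)
    # Internal links: keep the label ([[Target|label]] or [[Target]])
    text = re.sub(r"\[\[(?:[^|\[\]]*\|)?([^\[\]]*)\]\]", r"\1", text)
    # External links: keep any label, otherwise drop them
    text = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", text)
    text = re.sub(r"\[https?://[^\]\s]+\]", "", text)
    # Bold/italic quotes, heading markers, remaining tags
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    text = re.sub(r"<[^>]+>", "", text)
    return text

if __name__ == "__main__":
    sample = ("{{Infobox foo|x={{y}}}}\n"
              "== History ==\n"
              "'''Paris''' is the capital of [[France]].<ref>A source</ref> "
              "See [http://example.org the site].")
    print(strip_wikitext(sample))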
On 4/5/2011 4:00 PM, Platonides wrote:
I think he is better off parsing the articles, though.
For linguistic research you don't need things such as the contents of templates, so a simple wikitext-stripping pass would do. And it will be much, much faster than parsing the whole wiki.
Could be true, but what's fascinating for me about Wikipedia is all of the unscrambled eggs that can be found in the middle of otherwise unstructured text.