Hello, I need an html dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion.
Regards Ben Sidi Ahmed
On Fri, Dec 2, 2011 at 10:33 PM, Khalida BEN SIDI AHMED send.to.khalida@gmail.com wrote:
Hello, I need an html dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion.
Does http://dumps.wikimedia.org/ not have the data you need?
Roan
* Roan Kattouw wrote:
On Fri, Dec 2, 2011 at 10:33 PM, Khalida BEN SIDI AHMED send.to.khalida@gmail.com wrote:
Hello, I need an html dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion.
Does http://dumps.wikimedia.org/ not have the data you need?
That links to http://static.wikipedia.org/ which is offline and would only have offered dumps that are 3.4 years old.
I need static HTML dumps. On the webpage you mentioned, when I click the static HTML link, it is not accessible.
Truly yours
I need an HTML dump of Wikipedia because I have written Java code that extracts text from HTML content, and I would like to apply it to this dump. In fact, I need to extract the first sentence of each of a list of articles (fewer than 200), and I don't know how to do that with the other dumps. If you have any idea of other solutions, I would be pleased if you shared them with me.
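For what it's worth, a minimal Jsoup sketch of that first-sentence extraction could look like the following; the article URL, the user-agent string, and the `#mw-content-text p` selector are illustrative assumptions and may need adjusting, and the same parsing would work on pages from a static HTML dump loaded with `Jsoup.parse(File, charset)`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstSentence {
    public static void main(String[] args) throws Exception {
        // Any rendered Wikipedia article works the same way, whether fetched
        // online or parsed from a file in a static HTML dump.
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/MediaWiki")
                .userAgent("first-sentence-extractor/0.1 (contact: you@example.org)") // assumption: use your own contact
                .timeout(30000)
                .get();

        // The article body sits in the div with id "mw-content-text";
        // the lead sentence is usually in its first non-empty <p>.
        for (Element p : doc.select("#mw-content-text p")) {
            String text = p.text().trim();
            if (text.isEmpty()) continue;
            // Naive sentence split: cut at the first period followed by a space.
            int end = text.indexOf(". ");
            System.out.println(end >= 0 ? text.substring(0, end + 1) : text);
            break;
        }
    }
}
```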
* Khalida BEN SIDI AHMED wrote:
I need an HTML dump of Wikipedia because I have written Java code that extracts text from HTML content, and I would like to apply it to this dump. In fact, I need to extract the first sentence of each of a list of articles (fewer than 200), and I don't know how to do that with the other dumps. If you have any idea of other solutions, I would be pleased if you shared them with me.
If you just need a few articles, you can simply use the online version. There are any number of tools that would help you batch the requests without hitting the server too much; `wget` and `curl` are popular ones.
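Since the rest of the thread is about Java, a rough sketch along these lines would do the same kind of polite batch fetching with Jsoup, pausing between requests; the titles, the two-second delay, and the user-agent string are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.Arrays;
import java.util.List;

public class PoliteFetcher {
    public static void main(String[] args) throws Exception {
        // Illustrative titles; in practice, read your ~200 titles from a file.
        List<String> titles = Arrays.asList("MediaWiki", "Wikipedia", "Java_(programming_language)");

        for (String title : titles) {
            Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/" + title)
                    .userAgent("article-fetcher/0.1 (contact: you@example.org)") // assumption: use your own contact
                    .timeout(30000)
                    .get();
            System.out.println(title + " -> " + doc.title());

            // Pause between requests so the batch stays gentle on the servers.
            Thread.sleep(2000);
        }
    }
}
```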
Currently, I'm using the online version with the Java API Jsoup. It does not work perfectly: after extracting fewer than 10 articles, my project throws a set of exceptions. Could you please tell me the approximate number of articles I can get with these tools?
If you just need a few articles, you can simply use the online version. There are any number of tools that would help you batch the requests without hitting the server too much; `wget` and `curl` are popular ones.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
* Khalida BEN SIDI AHMED wrote:
Currently, I'm using the online version with the Java API Jsoup. It does not work perfectly: after extracting fewer than 10 articles, my project throws a set of exceptions. Could you please tell me the approximate number of articles I can get with these tools?
There is no limit on the number, but you should wait somewhere around a couple of seconds between requests. It's quite normal for ordinary users to request half a dozen articles simultaneously, so I doubt you are hitting a rate limit; your problem is likely elsewhere.
I just wonder if the problem could be due to the speed of my connection. The text of the exception is:

Grave: null
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:150)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    ...
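A read timeout like that often just means the connection timeout is too short for a slow link. A minimal sketch, assuming Jsoup, that raises the timeout and retries a few times might look like this; the timeout values, retry count, and the `fetchWithRetry` helper are made-up illustrations, not an official recipe.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.SocketTimeoutException;

public class RetryingFetch {
    // Fetch a page with a generous timeout and a few retries; the helper name is hypothetical.
    static Document fetchWithRetry(String url, int attempts) throws Exception {
        for (int i = 1; i <= attempts; i++) {
            try {
                return Jsoup.connect(url)
                        .userAgent("article-fetcher/0.1 (contact: you@example.org)") // assumption: use your own contact
                        .timeout(60000)          // give slow connections more time than the default
                        .get();
            } catch (SocketTimeoutException e) {
                if (i == attempts) throw e;      // give up after the last attempt
                Thread.sleep(5000);              // back off before retrying
            }
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) throws Exception {
        Document doc = fetchWithRetry("https://en.wikipedia.org/wiki/MediaWiki", 3);
        System.out.println(doc.title());
    }
}
```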
Hi Khalida,
In a previous message, you mentioned that the speed of your Internet connection and the storage capacity of your computer were giving you trouble.
I know this is not directly on-topic on this list, but since you seem to have tried and exhausted many options already, perhaps you would consider running your code on a cloud-based computer, such as a server from Amazon EC2? Doing so would allow you to get around both connection and storage issues, and perhaps allow you to run your Java code successfully, or perhaps to run JWPL successfully.
Using EC2 is not entirely simple, but it is technically straightforward and not too hard, either. The Getting Started Guide is here:
http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
The pricing is quite inexpensive if you limit the number of hours you use the servers. There is even a free tier for new users (http://aws.amazon.com/free/), although with your needs you might choose to pay (again, a small amount) for server(s) with larger capacity.
I can offer a small bit of EC2 guidance off-list if you need other pointers in getting started.
Pete
On 12/2/11 14:03, Khalida BEN SIDI AHMED wrote:
I need an HTML dump of Wikipedia because I have written Java code that extracts text from HTML content, and I would like to apply it to this dump. In fact, I need to extract the first sentence of each of a list of articles (fewer than 200), and I don't know how to do that with the other dumps. If you have any idea of other solutions, I would be pleased if you shared them with me.
Hi Kaminski
I appreciate your help, thank you very much indeed. I will try the options that were given to me today. If my attempts fail, I will contact you for help. Many thanks also to Hoehrmann: I'll immediately see whether I can succeed with curl or wget.
Regards Khalida Ben Sidi Ahmed
Another important question, to which I have been seeking an answer for days: if I download WikiTaxi and have Wikipedia offline, can I query this offline version using Java?
On 02/12/11 22:33, Khalida BEN SIDI AHMED wrote:
Hello, I need an html dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion.
Regards Ben Sidi Ahmed
Why do you need an html dump of Wikipedia?
On 03/12/11 08:58, Platonides wrote:
On 02/12/11 22:33, Khalida BEN SIDI AHMED wrote:
Hello, I need an html dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion.
Regards Ben Sidi Ahmed
Why do you need an html dump of Wikipedia?
It's a huge task to set up MediaWiki in precisely the same way as it is on Wikimedia, to import an XML dump and to generate HTML. It takes a serious amount of hardware and software development resources. That's why I spent so much time making HTML dump scripts. It's just a pity that nobody cared enough about it to keep the project going.
-- Tim Starling
Tim Starling wrote:
On 03/12/11 08:58, Platonides wrote:
On 02/12/11 22:33, Khalida BEN SIDI AHMED wrote:
I need an html dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion.
Why do you need an html dump of Wikipedia?
It's a huge task to set up MediaWiki in precisely the same way as it is on Wikimedia, to import an XML dump and to generate HTML. It takes a serious amount of hardware and software development resources. That's why I spent so much time making HTML dump scripts. It's just a pity that nobody cared enough about it to keep the project going.
This may be a stupid question as I don't understand the mechanics particularly well, but... as far as I understand it, there's a Squid cache layer that contains the HTML output of parsed and rendered wikitext pages. This stored HTML is what most anonymous viewers receive when they access the site. Why can't that be dumped into an output file rather than running expensive and time-consuming HTML dump generation scripts?
In other words, it's not as though the HTML doesn't exist already. It's served millions and millions of times each day. Why is it so painful to make it available as a dump?
MZMcBride
On 04/12/11 12:32, MZMcBride wrote:
This may be a stupid question as I don't understand the mechanics particularly well, but... as far as I understand it, there's a Squid cache layer that contains the HTML output of parsed and rendered wikitext pages. This stored HTML is what most anonymous viewers receive when they access the site. Why can't that be dumped into an output file rather than running expensive and time-consuming HTML dump generation scripts?
In other words, it's not as though the HTML doesn't exist already. It's served millions and millions of times each day. Why is it so painful to make it available as a dump?
Most of the code would be the same; it's just a bit more flexible to do the parsing in the extension: it makes it easier to change some details of the generated HTML and lets you avoid polluting the caches with rarely-viewed pages. It's not especially painful either way.
-- Tim Starling
Tim Starling wrote:
On 04/12/11 12:32, MZMcBride wrote:
This may be a stupid question as I don't understand the mechanics particularly well, but... as far as I understand it, there's a Squid cache layer that contains the HTML output of parsed and rendered wikitext pages. This stored HTML is what most anonymous viewers receive when they access the site. Why can't that be dumped into an output file rather than running expensive and time-consuming HTML dump generation scripts?
In other words, it's not as though the HTML doesn't exist already. It's served millions and millions of times each day. Why is it so painful to make it available as a dump?
Most of the code would be the same; it's just a bit more flexible to do the parsing in the extension: it makes it easier to change some details of the generated HTML and lets you avoid polluting the caches with rarely-viewed pages. It's not especially painful either way.
So the reason that there hasn't been an HTML dump of Wikimedia wikis in years is that no Wikimedia sysadmin can be bothered to run a maintenance script?
MZMcBride
On 03/12/2011 13:18, Tim Starling wrote:
On 03/12/11 08:58, Platonides wrote:
On 02/12/11 22:33, Khalida BEN SIDI AHMED wrote:
Hello, I need an html dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion.
Regards Ben Sidi Ahmed
Why do you need an html dump of Wikipedia?
It's a huge task to set up MediaWiki in precisely the same way as it is on Wikimedia, to import an XML dump and to generate HTML. It takes a serious amount of hardware and software development resources. That's why I spent so much time making HTML dump scripts. It's just a pity that nobody cared enough about it to keep the project going.
The DumpHTML Mediawiki extension is an essential piece of software: https://www.mediawiki.org/wiki/Extension:DumpHTML
This is IMO the right approach and the only way to produce high-quality static dumps. I have been using it for many years, and all the ZIM files I have made were created with Tim's MediaWiki DumpHTML extension: http://download.kiwix.org/zim/0.9/
At Kiwix we currently focus mostly on the end-user software, but we still want to do everything necessary to have an open, efficient, and handy toolchain for creating static dumps from MediaWiki instances (in particular in the ZIM format).
That is why we have a small action plan to improve DumpHTML: http://www.kiwix.org/index.php/Mediawiki_DumpHTML_extension_improvement
Any comments or criticism are welcome.
If hackers are interested in working on DumpHTML, please let me know; we are currently working to get a grant for that, and it is well on its way.
Emmanuel