Hi everyone, we need your help.
We are from Python Argentina, and we are working on adapting our cdpedia project to make a DVD, together with educ.ar and the Wikimedia Foundation, holding the entire Spanish Wikipedia, which will soon be sent to Argentinian schools.
Hernán and Diego are the two interns tasked with updating the data that cdpedia uses to make the CD (it currently uses a static HTML dump dated June 2008), but they are running into problems while trying to make an up-to-date static HTML dump of the Spanish Wikipedia.
I'm CCing this list of people because I'm sure you've faced similar issues when making your offline Wikipedias, or maybe you know someone who can help us.
Following is an email from Hernán describing the problems he's found.
thanks!
Hi Alejandro
On Fri, Apr 30, 2010 at 11:43 PM, Alejandro J. Cura alecura@gmail.com wrote:
[snip]
We're doing this XML to HTML conversion as one of the steps in our process of rendering Wikipedia for our WikiReader device. We can build Spanish without issues.
All of our source code is here:
http://github.com/wikireader/wikireader
The specific portion you would need is the offline-renderer located here:
http://github.com/wikireader/wikireader/tree/master/host-tools/offline-rende...
You'll probably need to modify the HTML output for your specific needs. Just let me know if you get stuck.
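In case it helps to get started, a rough sketch of grabbing the source and locating that tool (repository URL as given above; the directory layout may have changed since):

  # Clone the WikiReader sources and list the host tools,
  # which include the offline-renderer mentioned above.
  git clone http://github.com/wikireader/wikireader.git
  ls wikireader/host-tools/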
Sean
2010/4/30 Sean Moss-Pultz sean@openmoko.com
[snip]
I'll take a look. Thanks!
Hey,
have a look at Kiwix:
http://www.kiwix.org/index.php/Main_Page/es
As far as I know Emmanuel (maintainer of Kiwix) has made ZIM files for es:wp.
Alternatively, here is a description of how to make them: http://www.kiwix.org/index.php/Tools/en (http://www.kiwix.org/index.php/Tools/es - not complete)
/Manuel
On 30.04.2010 17:43, Alejandro J. Cura wrote:
[snip]
Hello,
you could also test our solution: http://www.okawix.com/
If you want, we could make a ready-to-use .iso.
----- Original Message ----- From: "Alejandro J. Cura" alecura@gmail.com Sent: Friday, April 30, 2010 5:43 PM Subject: [openZIM dev-l] a DVD with the Spanish Wikipedia (was [Argentina] WikiBrowse improvements)
[snip]
You are all making me very happy with this important work. I am sad that I'm not able to personally roll up my sleeves and help you. :) But I am excited to see progress, thank you so much!
Hi everyone. We need your help again.
We finally have a working mirror for generating the static HTML version of eswiki that we need for cdpedia, using the DumpHTML extension. But it seems that the process will take about 3000 hours of processing on our little Sempron server (4 months!).
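As a rough cross-check of that estimate (the page count is taken from the dumpHTML range quoted further down in this thread):

  1,824,046 pages / 3,000 h  ≈ 610 pages/h  ≈ 6 s per page
  3,000 h / 24  ≈ 125 days  ≈ 4 months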
How much time would it take on Wikimedia's servers?
Thanks
PS: excuse me if you receive this e-mail twice
2010/5/1 Jimmy Wales jwales@wikia-inc.com
[snip]
I can think of a few different server farms that would be glad to run your process :-) Which script is it that will take 3000 (!) hours to run?
SJ
On Wed, Jul 7, 2010 at 3:35 AM, Hernan Olivera lholivera@gmail.com wrote:
[snip]
On July 7, 2010, at 04:38, Samuel Klein meta.sj@gmail.com wrote:
I can think of a few different server farms that would be glad to run your process :-) Which script is it that will take 3000 (!) hours to run?
'Dump the Spanish Wikipedia in HTML':

  php dumpHTML.php -d /home/hernan/html3 -s 1 -e 1824046 --checkpoint /home/hernan/html3/check.txt --show-titles
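If several cores or machines are available, a minimal sketch of splitting that single run into independent page-id ranges; the chunk size, the output paths and the assumption that parallel dumpHTML.php instances can safely read the same wiki are all placeholders to check:

  #!/bin/bash
  # Sketch: run dumpHTML.php over the full page-id range in fixed-size
  # chunks, several chunks in parallel, each with its own output
  # directory and checkpoint file. Adjust CHUNK and paths as needed.
  START=1
  END=1824046
  CHUNK=200000
  for ((s=START; s<=END; s+=CHUNK)); do
    e=$((s + CHUNK - 1))
    ((e > END)) && e=$END
    mkdir -p /home/hernan/html3/part_$s
    php dumpHTML.php -d /home/hernan/html3/part_$s -s $s -e $e \
        --checkpoint /home/hernan/html3/part_$s/check.txt --show-titles &
  done
  wait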
[snip]
-- Hernan Olivera
I will also look into possible options. I just got to Gdansk for WikiSym/Wikimania, so give me a little time, but if anyone else has ideas, feel free to chime in.
--Kul
On 7/7/10 12:49 AM, Hernan Olivera wrote:
[snip]
Hernan,
We have some contacts that may be able to give you access to processing power. Tomasz and I need to know more about your specific needs.
Just contact me and Tomasz directly and we'll go from there.
Kul
On 7/7/10 12:49 AM, Hernan Olivera wrote:
[snip]
I have released an up-to-date ZIM file with all main namespace articles with thumbnails: http://tmp.kiwix.org/zim/0.9/wikipedia_es_all_09_2010_beta1.zim
You may extract the content into files using zimdump -D: http://openzim.org/Zimdump
You can have a look at it online (served by kiwix-serve, an HTTP server which reads ZIM files) at: http://library.kiwix.org:4214/
For Windows users who simply want an all-in-one solution (ZIM reader Kiwix + ZIM file + full search index + installer + autorun): http://download.kiwix.org/portable/wikipedia_es_all.zip
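For anyone who wants to try the file locally first, a minimal sketch (the port is arbitrary, and the exact kiwix-serve options may differ per build; check kiwix-serve --help):

  # Fetch the ZIM file and serve it over HTTP with kiwix-serve,
  # then browse to http://localhost:8000/
  wget http://tmp.kiwix.org/zim/0.9/wikipedia_es_all_09_2010_beta1.zim
  kiwix-serve --port=8000 wikipedia_es_all_09_2010_beta1.zim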
Emmanuel
On 11/08/2010 22:40, Kul Takanao Wadhwa wrote:
[snip]
On 30/04/10 17:43, Alejandro J. Cura wrote:
[snip]
thanks!
-- alecu - Python Argentina

2010/4/30 Hernan Olivera lholivera@gmail.com:

Hi everybody,

I've been working on making an up-to-date static HTML dump of the Spanish Wikipedia, to use as a basis for the DVD. I've followed the procedures detailed in the pages below, which were used to generate the current (and out-of-date) static HTML dumps:

1) installing and setting up a MediaWiki instance
2) importing the XML from [6] with mwdumper
3) exporting the static HTML with MediaWiki's tool

The procedure finishes without throwing any errors, but the XML import produces malformed HTML pages with visible wiki markup. We would really need a successful import of the Spanish XMLs into a MediaWiki instance so we can produce the up-to-date static HTML dump.

Links to the info I used:
[0] http://www.mediawiki.org/wiki/Manual:Installation_guide/es
[1] http://www.mediawiki.org/wiki/Manual:Running_MediaWiki_on_Ubuntu
[2] http://en.wikipedia.org/wiki/Wikipedia_database
[3] http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
[4] http://meta.wikimedia.org/wiki/Importing_a_Wikipedia_database_dump_into_Medi...
[5] http://meta.wikimedia.org/wiki/Data_dumps
[6] http://dumps.wikimedia.org/eswiki/20100331/
[7] http://www.mediawiki.org/wiki/Alternative_parsers
(among others)

Cheers,
--
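For step 2, a rough sketch of the mwdumper import as documented on mediawiki.org (the dump file name is assumed from the 20100331 listing in [6]; the database name and user are placeholders):

  # Convert the pages-articles XML dump to SQL and pipe it into MySQL.
  java -jar mwdumper.jar --format=sql:1.5 eswiki-20100331-pages-articles.xml.bz2 \
      | mysql -u wikiuser -p wikidb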
Hello Hernán,
You may have used one of the corrupted dumps. See https://bugzilla.wikimedia.org/show_bug.cgi?id=18694 https://bugzilla.wikimedia.org/show_bug.cgi?id=23264
Otherwise, did you install ParserFunctions and the other extensions needed?
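Visible {{#if: ...}} and similar markup in the generated pages is the typical symptom of ParserFunctions not being installed. A sketch of enabling it, assuming a MediaWiki 1.15-era setup with the extension already unpacked under extensions/ (run from the MediaWiki root):

  # Add the ParserFunctions include line to LocalSettings.php.
  echo 'require_once( "$IP/extensions/ParserFunctions/ParserFunctions.php" );' >> LocalSettings.php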
Hi everyone. We need your help again.
We finally have a working mirror for generating the static HTML version of eswiki that we need for cdpedia, using the DumpHTML extension. But it seems that the process will take about 3000 hours of processing on our little Sempron server (4 months!).
How much time would it take on Wikimedia's servers?
Thanks
(This is intentional top-posting to quickly update the situation)
On June 1, 2010, at 18:53, Ángel González keisial@gmail.com wrote:
[snip]