Hi all,
I am helping the charity Volunteer Uganda set up an offline eLearning computer system with 15 Raspberry Pis and a cheap desktop computer as a server. Server specs:
- 2 TB disk
- 8 GB DDR3 RAM
- 3 GHz quad-core i5
I am trying to import enwiki-20130403-pages-articles-multistream.xml.bz2 using mwdumper-1.16.jar, but I have a few questions.
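For reference, I'm running the import more or less as documented on the MediaWiki page, i.e. something along these lines (the database name and user below are just placeholders for my setup):

    java -jar mwdumper-1.16.jar --format=sql:1.5 enwiki-20130403-pages-articles-multistream.xml.bz2 | mysql -u wikiuser -p wikidb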
1. I was originally using a GUI version of mwdumper-1.16.jar, but it errored out a few times with duplicate pages, so I switched to the pre-built one recommended on the MediaWiki page. The stats on Wikipedia show roughly 30 million pages, but this morning I found that mwdumper-1.16.jar had finished (with no errors) at roughly 13.3 million pages. Since there were no errors I assumed it had completed, but that leaves me about 17 million pages short?

2. The pages that have been imported are missing templates. Is there another XML file I can import that will add the missing templates? As the screenshot below shows, the articles are almost unreadable without them.
Many thanks in advance for your help.
Kind regards, Richard Ive
[screenshot: imported article rendered without templates]
On Sun, 02-06-2013, at 11:38 +0100, Richard Ive wrote:
...
1. I was originally using a GUI version of mwdumper-1.16.jar, but it errored out a few times with duplicate pages, so I switched to the pre-built one recommended on the MediaWiki page. The stats on Wikipedia show roughly 30 million pages, but this morning I found that mwdumper-1.16.jar had finished (with no errors) at roughly 13.3 million pages. Since there were no errors I assumed it had completed, but that leaves me about 17 million pages short?
Some pages are not included in this dump, e.g. user and user talk pages. You don't need those for display of the articles. A quick count of the titles in the stubs file shows 13 million or so pages so you're all good.
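If you want to reproduce that count yourself, something like this works (I'm assuming the stub-articles file for the same dump date here, which covers the same set of pages as pages-articles):

    zcat enwiki-20130403-stub-articles.xml.gz | grep -c '<title>'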
2. The pages that have been imported are missing templates. Is there another XML file I can import that will add the missing templates? As the screenshot below shows, the articles are almost unreadable without them.
The templates are included in the pages-articles dump. Are you sure you have the ParserFunctions extension enabled?
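For the MediaWiki versions current as of this dump, enabling it is usually just a matter of unpacking the extension under extensions/ and adding a line like this to LocalSettings.php (a sketch; check the exact path for your version):

    require_once "$IP/extensions/ParserFunctions/ParserFunctions.php";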
Ariel
Thanks for confirming that 13 million is enough.
I didn't have the extension installed. It looks a lot better now. Thank you!
There are still a few weird-looking things around templates, but I'm importing enwiki-20130403-templatelinks.sql now, so hopefully that will fix them.
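For anyone following along, I'm loading it with a plain mysql import, roughly like this (database name and user are placeholders for my setup):

    mysql -u wikiuser -p wikidb < enwiki-20130403-templatelinks.sql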
Thanks again.
Richard Ive, 02/06/2013 12:38:
Hi all,
I am helping the charity Volunteer Uganda set up an offline eLearning computer system with 15 Raspberry Pis and a cheap desktop computer as a server.
Why aren't you using Kiwix? Reportedly, it even runs standalone on a Raspberry Pi without problems.
Nemo
In all honesty, I didn't know about it until now.
Everything else we are using is web-based (Khan Academy Lite, ebooks and e-media), so for our model the Wikipedia website works best. I would also guess it is cheaper to buy a £300 desktop with a 2 TB disk for the wiki MySQL database than to get 15 SD cards larger than 16 GB (I'm guessing at the disk usage)?
Thanks for pointing this out though! I'll definitely consider using it in the future.
Kind regards, Richard Ive
Richard Ive, 03/06/2013 12:06:
In all honesty, I didn't know about it until now.
Everything else we are using is web-based (Khan Academy Lite, ebooks and e-media), so for our model the Wikipedia website works best. I would also guess it is cheaper to buy a £300 desktop with a 2 TB disk for the wiki MySQL database than to get 15 SD cards larger than 16 GB (I'm guessing at the disk usage)?
You can also use the Kiwix server feature to serve content from a central machine to all the machines connected via LAN, if disk space is more of an issue than connectivity; WMFR has a whole African program built on this.
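A minimal sketch of that setup, assuming you have downloaded an English Wikipedia ZIM file (the file name and port here are just examples):

    kiwix-serve --port=8000 wikipedia_en_all.zim

Then each Raspberry Pi just points its browser at http://<server-ip>:8000/.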
Nemo