Hi,
I've been tasked with building an offline copy of the Wikipedia website. The main goal is to have the database and images stored locally so that we can run a Wikipedia website on a local server. Our ultimate goal is to have Mediawiki and the Wikipedia database and images stored on a single hard drive in servers in schools in Zimbabwe where they have no Internet access.
I've made good progress so far but I can see that some templates are missing from my Offline Wikipedia. I only found that these templates were missing because I was checking random pages and noticed that the text "Template:abcdef" was displayed in red text. It has the alt text "Template:abcdef (page does not exist)". A couple of examples of missing templates are 'Template:Citation/make link' and 'Template:Gaps'.
My Offline Wikipedia data was imported into the MySQL database using the Mwdumper program. The source data came from the enwiki-latest-pages-articles.xml file. I imported the page links into the database using the enwiki-latest-pagelinks.sql file.
Can anyone give me some guidance on how to troubleshoot this problem?
Thanks,
Kevin Clark Connection Software
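One way to narrow down a problem like the missing templates described above is to check whether the template pages are present in the XML dump at all. If they are absent, the download or the dump is suspect; if they are present, the import step is. What follows is only a minimal Python sketch, not something taken from this thread. The function names find_titles and find_titles_in_dump are hypothetical, and the sketch assumes the dump uses the usual MediaWiki export layout, in which each <page> element contains a <title>.

```python
import bz2
import xml.etree.ElementTree as ET


def find_titles(stream, wanted):
    """Return the subset of `wanted` page titles found in a MediaWiki XML export stream."""
    wanted = set(wanted)
    found = set()
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        # Tags carry the export namespace, e.g. "{http://...}title"; keep the local name.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "title" and elem.text in wanted:
            found.add(elem.text)
            if found == wanted:
                break  # nothing left to look for
        elif local == "page":
            root.clear()  # drop pages already scanned so memory use stays bounded
    return found


def find_titles_in_dump(path, wanted):
    """Scan a .xml or .xml.bz2 dump file (the path is supplied by the caller)."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as stream:
        return find_titles(stream, wanted)
```

Calling find_titles_in_dump("enwiki-latest-pages-articles.xml.bz2", {"Template:Gaps", "Template:Citation/make link"}) would return the set of those titles that actually occur in the file. A full scan of a dump of this size can be expected to take a long time, since the whole file has to be decompressed and parsed.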
On Wed, 20 Jul 2011 13:42:23 +0100, Kevin Clark wrote:
I've been tasked with building an offline copy of the Wikipedia website. The main goal is to have the database and images stored locally so that we can run a Wikipedia website on a local server. Our ultimate goal is to have Mediawiki and the Wikipedia database and images stored on a single hard drive in servers in schools in Zimbabwe where they have no Internet access.
There are people who have been doing such things for many years. I'm one of them. What you try to achieve already exists IMO. Have a look at kiwix-serve to serve the Wikipedia 1.0 selection or the Wikipedia for Schools selection/ZIMs. Why reinvent the wheel?
I'm also currently involved in a project whose purpose is to deliver Wikipedia offline in Zimbabwe (many thousands of DVDs).
Regards Emmanuel
What you try to achieve already exists IMO.
I beg to differ. The existing solutions I've seen provide only a small subset of the data and force the user to use a completely different interface. We want the user experience to be as close to the real thing as possible, i.e. all the Wikipedia pages are available and the user can choose which client (browser) they use to access the data.
Kevin
On Wed, Jul 20, 2011 at 1:51 PM, Kevin Clark kevin.clark@csoft.co.uk wrote:
What you try to achieve already exists IMO.
I beg to differ. The existing solutions I've seen provide only a small subset of the data and force the user to use a completely different interface. We want the user experience to be as close to the real thing as possible, i.e. all the Wikipedia pages are available and the user can choose which client (browser) they use to access the data.
Kevin, I think you are going about this the wrong way.
Many of us have been trying to do this for years. My first version of the full French WP was deployed in 2006. This has been thought through by a lot of people and the conclusion is: it's too difficult to maintain it alone, especially if you want good quality.
This is why we created the OpenZIM standard, which provides a format and tools to create "resource files", i.e. distributions of Wikipedia (or other content). By using this format, we can safely invest time and resources into a factory of ZIM files for the needs of all offline users.
There are already a lot of full-WP ZIM files, and also subsets in many languages. Using such a file will give you a complete dump of Wikipedia with images et al., everything working (hopefully, haha).
A lot of offline users are using Kiwix, a desktop client which reads ZIM files. I understand you don't want this. That's why we also have kiwix-serve, an HTTP server which serves the ZIM content, including search.
The appearance is not exactly the same as on the real WP: we only keep content in the ZIM, not the top and left frames. If this is a pressing need (I don't see why you would want that though), you could still hack kiwix-serve to use a WP-like template for serving the content.
Now, if the content you want to provide isn't available as a ZIM file yet, just request it and we'll do our best to create a polished ZIM file for it.
Please, don't waste the time you have to start yet another barely working solution that no one will ever maintain.
renaud
Please, don't waste the time you have to start yet another barely working solution that no one will ever maintain.
Hi Renaud,
Thank you for your comments. For my part, I do not want to cause offence when I say that we've looked at Kiwix-serve and the OpenZIM format and feel that they do not meet our needs at this time.
I've no doubt that we've not thought through all the issues. But after a considerable effort I have an Offline Wikipedia site running on a server here, and it looks pretty close to the finished article. The template issue I reported might be icing on the cake but if it's easy to fix I might as well do it.
Our belief is that we can maintain the Offline Wikipedia by shipping a replacement hard drive containing the database and images. The user plugs it in, restarts MySQL, and hey presto! Okay, so that's overly simplistic - clearly the hard drive containing the data is the product of updating the images and re-importing the database but that's mostly CPU time, not human effort. Our main dependency is only that the database dumps and images are made available, and there's no sign of those being withdrawn any time soon.
Can anyone on this list give me some pointers as to how I might fix the template problem I reported?
Thanks,
Kevin
On Wed, Jul 20, 2011 at 17:17, Kevin Clark kevin.clark@csoft.co.uk wrote:
Please, don't waste the time you have to start yet another barely working solution that no one will ever maintain.
Hi Renaud,
Thank you for your comments. For my part, I do not want to cause offence when I say that we've looked at Kiwix-serve and the OpenZIM format and feel that they do not meet our needs at this time.
what are your needs which cannot be met?
rupert.
what are your needs which cannot be met?
We evaluated it a year ago so apologies if things have moved on since then. I recorded its weaknesses for the Wikipedia project we were planning as:
1) No links between articles
2) Requires its own browser, not a standard desktop browser
3) Doesn't run under Apache
Looking ahead we also noted that our Offline Wikipedia offered advantages because
A) Our solution could run on Linux or Windows - Linux provides a key part of our infrastructure but our clients might not want to support it
B) Our solution could be extended to support other Wikimedia sites, e.g. Wiktionary - database dumps are readily available for Wiktionary but I don't see that same availability for OpenZIM files
Regards,
Kevin
Hi Kevin,
On 20.07.2011 18:08, Kevin Clark wrote:
what are your needs which cannot be met?
We evaluated it a year ago so apologies if things have moved on since then. I recorded its weaknesses for the Wikipedia project we were planning as:
- No links between articles
Sorry, the articles have always been perfect HTML, including all links; otherwise it wouldn't make sense at all.
- Requires its own browser, not a standard desktop browser
ZIM files can be read by different applications, just as a TXT file can be edited in many editors.
Kiwix is one of them; kiwix-serve, in contrast, is a server component - you will need to use your favourite web browser to access / browse the content of the ZIM file. As said in 1) already, the content of ZIM files is perfect HTML anyway. Kiwix-serve just provides the HTTP interface between the ZIM reader and the browser.
- Doesn't run under Apache
True.
A) Our solution could run on Linux or Windows - Linux provides a key part of our infrastructure but our clients might not want to support it
Kiwix runs on Windows, Linux and MacOS X
WikiOnBoard runs on Symbian (and Android)
vido and qvido run on extremely small embedded Linux devices (less than 16 MB of RAM)
zimreader-java runs on anything that supports Java
B) Our solution could be extended to support other Wikimedia sites, e.g. Wiktionary - database dumps are readily available for Wiktionary but I don't see that same availability for OpenZIM files
as stated in 1), ZIM is a mere storage format for HTML content. You can put anything in it, even non-MediaWiki stuff.
I am pretty sure that there are Wiktionary ZIM files around, and if not, they can be made easily.
Did you know that the "print a book" feature (Collections Extension) in Wikipedia creates ZIM files?
I have even heard about a project making combined ZIM files holding both Wikipedia and Wiktionary in the same file.
The Wikimedia Foundation has been working for more than a year now to provide regular ZIM files from all Wikimedia wikis, just as they already do with XML and SQL files. The ZIM support in Collections Extension was one of the steps to reach this goal.
/Manuel
Hi,
Speaking from a Kenyan perspective, as someone who undertook the Wikipedia for Schools pilot in Kenya: www.wikimedia.or.ke/Wikipedia_for_Schools
One of the major challenges that I faced was that although the Kiwix version was helpful, it could have been better if we had a custom-made version for the Kenyan curriculum, since the one that was available was based on the British curriculum, and therefore there were some topics that Kenyan students needed but weren't there.
Secondly, I find the use of external hard drives expensive and unreliable. Expensive, since one HDD costs around €50. Unreliable because I distrust students, and how sure am I that in a month's time the students or teachers wouldn't have taken the hard disks and used them for their own personal use? Also, the content can be easily deleted, unlike when it's on DVD.
Also, some of the computers were so old and had Win XP and somehow, the ZIM files refused to index. I wouldn't be surprised if you get the same problem in Zimbabwe.
Abbas.
Date: Wed, 20 Jul 2011 18:32:42 +0200 From: manuel.schneider@wikimedia.ch To: offline-l@lists.wikimedia.org Subject: Re: [Offline-l] Possible template problem
Regards Manuel Schneider
Wikimedia CH - Verein zur Förderung Freien Wissens Wikimedia CH - Association for the advancement of free knowledge www.wikimedia.ch
Offline-l mailing list Offline-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/offline-l
On Wed, Jul 20, 2011 at 5:33 PM, Abbas Mahmood abbasjnr@hotmail.com wrote:
One of the major challenges that I faced was that although the Kiwix version was helpful, it could have been better if we had a custom-made version for the Kenyan curriculum, since the one that was available was based on the British curriculum, and therefore there were some topics that Kenyan students needed but weren't there.
Agreed. I have the same problem, as many others do. That's why we should invest more effort in tools to help people create their own ZIM files.
Secondly, I find the use of external hard drives expensive and unreliable. Expensive, since one HDD costs around €50. Unreliable because I distrust students, and how sure am I that in a month's time the students or teachers wouldn't have taken the hard disks and used them for their own personal use? Also, the content can be easily deleted, unlike when it's on DVD.
Not sure what your point is here. Kiwix runs on everything from DVDs to USB sticks, HDDs or servers. Here in West Africa, CDs and DVDs are single-use at most, due to dust and the quality of the hardware.
Also, some of the computers were so old and had Win XP and somehow, the ZIM files refused to index. I wouldn't be surprised if you get the same problem in Zimbabwe.
Yes, this has been a long-standing request, which has been fixed now. We can embed the index along with the ZIM file.
renaud
Date: Wed, 20 Jul 2011 17:38:10 +0000 From: rgaudin@gmail.com To: offline-l@lists.wikimedia.org Subject: Re: [Offline-l] Possible template problem
Secondly, I find the use of external hard drives expensive and unreliable. Expensive, since one HDD costs around €50. Unreliable because I distrust students, and how sure am I that in a month's time the students or teachers wouldn't have taken the hard disks and used them for their own personal use? Also, the content can be easily deleted, unlike when it's on DVD.
Not sure what your point is here. Kiwix runs on everything from DVDs to USB sticks, HDDs or servers. Here in West Africa, CDs and DVDs are single-use at most, due to dust and the quality of the hardware.
What I meant to say here is that using external HDD as opposed to DVDs is expensive. And yes, most DVDs are DVD-R format -- but that is not a problem because you can use the same DVD to install the offline Wikipedia from one school to another. Abbas.
On Wed, Jul 20, 2011 at 6:28 PM, Abbas Mahmood abbasjnr@hotmail.com wrote:
What I meant to say here is that using external HDD as opposed to DVDs is expensive.
OK.
And yes, most DVDs are DVD-R format -- but that is not a problem because you can use the same DVD to install the offline Wikipedia from one school to another.
What I meant by single use was that the medium is so fragile that you can't use it twice, mainly because of dust, but that depends on the environment.
Main advantage I see to DVD is that it's by far the cheapest option if your content fits on it.
renaud
Date: Wed, 20 Jul 2011 18:33:27 +0000 From: rgaudin@gmail.com To: offline-l@lists.wikimedia.org Subject: Re: [Offline-l] Possible template problem
What I meant by single use was that the medium is so fragile that you can't use it twice, mainly because of dust, but that depends on the environment.
Main advantage I see to DVD is that it's by far the cheapest option if your content fits on it.
Agreed :) Abbas.
Just a note on Apache: one can of course use mod_proxy to put kiwix-serve behind Apache.
On Wed, Jul 20, 2011 at 3:17 PM, Kevin Clark kevin.clark@csoft.co.uk wrote:
Please, don't waste the time you have to start yet another barely working solution that no one will ever maintain.
Thank you for your comments. For my part, I do not want to cause offence when I say that we've looked at Kiwix-serve and the OpenZIM format and feel that they do not meet our needs at this time.
Not taken, don't worry. Now I don't want to argue indefinitely with you about the disadvantages I see in maintaining something parallel, but please do me a favor and tell me why Kiwix (kiwix-serve in that case) does not meet your needs. This is not for the sake of trolling, but if there's a use case we don't match, we'd better know about it and see if it can be addressed. You may also say that you've spent too much time on your solution not to want to stick with it anyway. I know how it feels.
Anyway thanks for your contribution and good luck,
renaud
Kevin,
You're right that our focus has often been on DVD-sized collections, but bigger collections are available! In some ways bigger is easier because you bypass the whole (very complex!) issue of content selection - but curation also becomes a problem as you scale up.
Are you planning on publishing the entire English Wikipedia? I believe there are such ZIM collections available, but some may be a few months old. More important, they have not been checked against vandalism, which means that some articles may be pure obscenities, etc. (I found some disgusting examples when preparing the Version 0.7 collection.) How are you doing your revisionID selection? Are you using the WikiTrust system we use on the 1.0 collections? If you are producing a vandalism-checked version of the entire English Wikipedia, or you have developed your own tools, we would very much like to share these things with others working in the area.
With large storage becoming cheaper, it should now be feasible to put the entire English Wikipedia onto a hard drive or similar; our subset collections are designed for cases where distribution will be via DVD or flash drive (as with the Version 0.7 and 0.8 collections). In the case of Wikipedia for Schools, the producers wanted a collection that was hand-checked for vandalism and child-appropriateness, which is why it is only around 6000 articles (but also why it's so popular!).
To avoid the template problems, you should definitely consider the ZIM format, which is designed for this purpose (making content readable offline). As I understand it, you don't need to tie that to a particular reader (Kiwix, Okawix, etc), though these systems are storage-efficient and each represents several years of optimisation work.
If you're looking for HTML versions, I believe these are available, and I think Emmanuel has produced such things for Wizzy to help him with his work: http://blog.wizzy.com/post/Kiwix-install-at-Kwena-Malapo-school-Johannesberg Let us know specifically the end format you prefer. (BTW, I'm a collections curator, not a tech person, so please forgive me if I've made any technical errors!)
Hope this helps. Good luck! Martin (User:Walkerma, English Wikipedia 1.0 team)
Martin A. Walker Department of Chemistry SUNY College at Potsdam Potsdam, NY 13676 USA +1 (315) 267-2271
I've made good progress so far but I can see that some templates are missing from my Offline Wikipedia. I only found that these templates were missing because I was checking random pages and noticed that the text "Template:abcdef" was displayed in red text. It has the alt text "Template:abcdef (page does not exist)". A couple of examples of missing templates are 'Template:Citation/make link' and 'Template:Gaps'.
It seems that I downloaded a corrupt enwiki-latest-pages-articles.xml.bz2 file in June, and this may be the reason for the missing templates. I hope to report back with a positive outcome once I have re-imported the data.
Kevin
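A suspected corrupt or truncated .bz2 download, as described above, can be checked before spending time on a re-import by decompressing the whole file and seeing whether an error is raised. The following is a minimal Python sketch under that assumption; the function name bz2_is_intact is hypothetical and is not part of any tool mentioned in this thread. It only detects damage that breaks the bzip2 stream itself, so comparing the file against a published checksum, where one is available, remains the stronger test.

```python
import bz2


def bz2_is_intact(path, chunk_size=1 << 20):
    """Return True if the file at `path` decompresses to the end without error."""
    try:
        with bz2.open(path, "rb") as stream:
            # Read and discard the decompressed data one chunk at a time.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, ValueError):
        # Invalid data raises OSError; a truncated file raises EOFError.
        return False
    return True
```

For example, bz2_is_intact("enwiki-latest-pages-articles.xml.bz2") would return False for a download that was cut off part-way through.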