On Thu 24/12/09 07:56, "Manuel Schneider" manuel.schneider(a)wikimedia.ch wrote:
> Is it possible to run the same benchmark on the Ben NanoNote? 1.5 G
> should fit on the memory card and, as far as I understood you, the ZIM
> software has already been ported to the NN?
Yes, this would be really interesting.
The same test, with the ZIM on a DVD, would also be interesting.
> Because I wonder how the caches impact the result on the NN and what the
> optimal settings would be. Since you say that the caches don't have
> a real impact on "big" hardware, we could just go with the optimal
> settings for the NN as defaults in the zimlib.
Yes.
I have also multiplied the dirent cache by 1000 without seeing any noticeable improvement.
But here again, the test should be done with a DVD drive to help determine the best default value.
In any case, the results are really encouraging, although it is a disappointment for me that we cannot
save more disk space with LZMA. It would be interesting to run the same benchmarks with a 2 MB cluster size.
Emmanuel
On Thu 24/12/09 10:31, "Tommi Mäkitalo" tommi(a)tntnet.org wrote:
> By the way what do you think; should we drop zlib and bzip2 compression
> completely? We do not depend on zlib and bzip2 libraries any more. We have
> already dropped compatibility.
>
> Lzma is the fastest and compresses as well as bzip2. The disadvantage is
> that we really depend on a very new and not yet released lzma library?
I cannot really find arguments against abandoning gzip and bzip2... but I want to be careful.
I prefer to be "conservative" and keep them a little bit longer in the zimlib (just in case...).
But we can comment out the corresponding options in the zimwriter.
So, if we find for any reason that we need to keep bzip2 and/or gzip, we simply have to uncomment those options again.
Emmanuel
Hi,
I've done some benchmarking. I have created 2 zim files from my collection of
640000 articles. One with bzip2 and one with lzma. I burnt both files on a DVD.
A zim benchmark program (it can be found at zimlib/zimDump/zimBench) gives
interesting results. The benchmark program measures both linear reads and random
access. The linear results are not that interesting, but the random-access results are.
Reading the bzip2-compressed file gives me about 12 articles per second, lzma
about 38. So decompressing lzma is roughly three times faster.
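For reference, the random-access figure boils down to a loop like this (just a
sketch, not the actual zimBench code; readRandomArticle stands in for whatever
read call the reader exposes):

    #include <chrono>
    #include <functional>
    #include <random>

    // Time `samples` random article reads and report articles per second.
    // `readRandomArticle` is a placeholder for the real read call.
    double benchRandom(const std::function<void(unsigned)>& readRandomArticle,
                       unsigned articleCount, unsigned samples)
    {
        std::mt19937 rng(42);  // fixed seed so runs stay comparable
        std::uniform_int_distribution<unsigned> pick(0, articleCount - 1);

        auto start = std::chrono::steady_clock::now();
        for (unsigned i = 0; i < samples; ++i)
            readRandomArticle(pick(rng));
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;

        return samples / elapsed.count();
    }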
Creating the files took 2:09 with bzip2 and 3:25 with lzma.
The size is almost identical (both 1.5G).
Zimlib manages 2 caches: one for directory entries and one for uncompressed
data. Varying them makes no big difference; it looks like the OS cache already
does a good job. This may of course look different on other hardware. I had a
fast CPU and a slow device.
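The caches themselves are nothing fancy; the idea is roughly this (a sketch
with made-up names, not zimlib's actual classes), with one instance keyed by
directory entry index and one keyed by cluster number:

    #include <cstddef>
    #include <list>
    #include <optional>
    #include <unordered_map>
    #include <utility>

    // Minimal bounded LRU cache: most recently used entries at the front,
    // the oldest entry is evicted when the capacity is exceeded.
    template <typename Key, typename Value>
    class LruCache
    {
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        std::optional<Value> get(const Key& key)
        {
            auto it = map_.find(key);
            if (it == map_.end())
                return std::nullopt;
            order_.splice(order_.begin(), order_, it->second); // mark as recent
            return it->second->second;
        }

        void put(const Key& key, Value value)
        {
            auto it = map_.find(key);
            if (it != map_.end()) {
                it->second->second = std::move(value);
                order_.splice(order_.begin(), order_, it->second);
                return;
            }
            order_.emplace_front(key, std::move(value));
            map_[key] = order_.begin();
            if (map_.size() > capacity_) {                     // evict oldest
                map_.erase(order_.back().first);
                order_.pop_back();
            }
        }

    private:
        using Entry = std::pair<Key, Value>;
        std::size_t capacity_;
        std::list<Entry> order_;                               // front = newest
        std::unordered_map<Key, typename std::list<Entry>::iterator> map_;
    };

    // e.g. LruCache<unsigned, DirectoryEntry> direntCache(512);
    // (DirectoryEntry being whatever struct holds a parsed entry)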
Tommi
Hi,
a few days ago I realized that I was not done with the zim format update. The
mime types were still hard-coded. But now I'm through. I double-checked that
against my own minutes of our developers meeting.
Now the zim file contains a list of contained mime types and the directory
entry has an index to that list. Since the index is 16 bit and the mime type
0xffff specifies redirect, a zim file can now have up to 65535 distinct mime
types.
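Resolving a mime type in a reader then reduces to an index lookup; roughly
like this (a sketch, the names are made up):

    #include <cstdint>
    #include <string>
    #include <vector>

    const uint16_t redirectMimeType = 0xffff;  // reserved: entry is a redirect

    // Resolve a directory entry's 16-bit mime type index against the
    // file-wide mime type list read when the zim file is opened.
    // Returns nullptr for redirects (and for out-of-range indices).
    const std::string* resolveMimeType(const std::vector<std::string>& mimeList,
                                       uint16_t mimeIndex)
    {
        if (mimeIndex == redirectMimeType || mimeIndex >= mimeList.size())
            return nullptr;
        return &mimeList[mimeIndex];
    }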
In zimwriter the database source has changed, so that the mime type in the
article table is no longer a number but a text string.
Next task I have to do is to document the changes.
Manuel already installed the xz-utils package on our server so that I can
start implementing lzma compression then.
Tommi
Hi,
the changes we discussed at our developers meeting are implemented in zimlib
and zimwriter. I was already able to create wikipedia-de.zim and the full
text index wikipedia-de-x.zim. Both files are slightly larger due to the
additional title and the index for it.
The zint compression has also changed. As announced, it is similar to UTF-8. The
full text index data is slightly smaller.
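For the curious, a UTF-8-style variable-length integer works roughly like
this (only a sketch of the principle for values below 2^31; the exact bit
layout of zint may differ):

    #include <cstdint>
    #include <vector>

    // Values below 0x80 take a single byte. Larger values get a first byte
    // whose count of leading 1 bits equals the total length, followed by
    // 0x80|xxxxxx continuation bytes - just like UTF-8. n bytes carry
    // 5n+1 payload bits.
    std::vector<uint8_t> zintEncode(uint32_t value)
    {
        std::vector<uint8_t> out;
        if (value < 0x80) {
            out.push_back(static_cast<uint8_t>(value));
            return out;
        }
        unsigned len = 2;
        for (uint32_t limit = 1u << 11; len < 6 && value >= limit; limit <<= 5)
            ++len;
        uint8_t cont[5];
        for (unsigned i = 0; i + 1 < len; ++i) {    // low 6-bit groups first
            cont[i] = static_cast<uint8_t>(0x80 | (value & 0x3f));
            value >>= 6;
        }
        out.push_back(static_cast<uint8_t>(0xff << (8 - len))  // length marker
                      | static_cast<uint8_t>(value));          // high payload bits
        for (unsigned i = len - 1; i-- > 0; )
            out.push_back(cont[i]);
        return out;
    }

For example, 300 encodes to the two bytes 0xC4 0xAC, exactly as UTF-8 would
encode the code point 300.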
I did not use this zint compression in the directory entries since it is really
not worth the trouble. I preferred simpler directory entries to make the
implementation of alternatives (Java, C#) simpler.
As Manuel already mentioned, I changed the way zim files are opened. The zimlib
used to read the whole pointer list into memory when the file was opened. This is
no longer done. Memory usage is now reduced quite significantly, so that I
can do a full text search even on the NanoNote with 32MB RAM. I implemented
this in trunk as well.
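The change is essentially to seek instead of slurping the list at open time;
roughly like this (a sketch, assuming a list of 8-byte little-endian file
offsets starting at ptrListOffset - with 640000 articles and 8-byte pointers
that is about 5 MB that no longer has to stay resident):

    #include <cstdint>
    #include <fstream>

    // Read the i-th entry of an on-disk pointer list on demand instead of
    // keeping the whole list in memory.
    uint64_t readPointer(std::ifstream& zimFile, uint64_t ptrListOffset,
                         uint64_t i)
    {
        zimFile.seekg(static_cast<std::streamoff>(ptrListOffset + i * 8));
        unsigned char buf[8];
        zimFile.read(reinterpret_cast<char*>(buf), 8);
        uint64_t value = 0;
        for (int b = 7; b >= 0; --b)
            value = (value << 8) | buf[b];           // little-endian decode
        return value;
    }

    // The file has to be opened in binary mode, e.g.
    // std::ifstream f("wikipedia-de.zim", std::ios::binary);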
Currently I'm working on porting zimreader to the new library. Porting is
necessary since we decided to drop the qunicode feature, but it is still very
simple since I just need to replace all references to qunicode strings with
plain strings.
I'm also looking at the API of lzma. The documentation is quite confusing, but
I'm making some progress here too. I promised to implement the file format changes
by the end of this year, and since I'm actually through, maybe I can add lzma
compression as well.
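In case others run into the same confusing documentation: the one-shot buffer
functions are the gentlest entry point into liblzma. A minimal compression
sketch (error handling trimmed; whether zimlib will use exactly this container
format is a separate decision):

    #include <lzma.h>

    #include <cstdint>
    #include <vector>

    // Compress a buffer with liblzma's one-shot "easy" encoder.
    // Preset 6 is the library's default speed/ratio trade-off.
    std::vector<uint8_t> lzmaCompress(const std::vector<uint8_t>& in)
    {
        std::vector<uint8_t> out(lzma_stream_buffer_bound(in.size()));
        size_t outPos = 0;
        lzma_ret rc = lzma_easy_buffer_encode(
            6, LZMA_CHECK_CRC32, nullptr,   // preset, integrity check, allocator
            in.data(), in.size(),
            out.data(), &outPos, out.size());
        if (rc != LZMA_OK)
            return {};                      // a real caller would report rc
        out.resize(outPos);
        return out;
    }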
I have also decided to drop support for zlib and bzip2 compression as soon as lzma
is working. This reduces the external dependencies, and I see no advantage in
supporting multiple compression methods. What do you think?
Tommi
On Wed 02/12/09 21:32, "Madeleine Price Ball" meprice(a)fas.harvard.edu wrote:
> > We think that the specific knowledge of the publishers should be how to
> > select the content - which content goes where in which form - and not
> > technical questions such as compression, storage or retrieving the data on
> > the user's end.
>
> OK, if I shouldn't be talking to you guys, tell me who to talk to.
>
> Yes, selecting content is very difficult. I couldn't get Peru or SJ to
> contribute meaningfully to generating a simple blacklist of articles
> that should NOT be included on the OLPC activity. (Recall it is being
> given to young children!) I ended up making the blacklist myself based
> on my own gut feelings. If Peru's board of education or OLPC's
> "director of content" couldn't get their act together for this simple
> task, expecting others to do this task for you will be a huge
> roadblock to getting content out.
>
> Traffic-based content is simple and effective and it doesn't involve a
> lot of opinions on what should or should not be included.
Yes, by compiling such stats with an incoming-link counter and an interwiki counter,
it is possible to get a pretty accurate and "neutral" selection quickly for any Wikipedia.
I have scripts to do that automatically:
http://kiwix.svn.sourceforge.net/viewvc/kiwix/selection_tools/
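The heart of those scripts fits in a few lines; a sketch of the link-counting
part (written in C++ here just for illustration - the real tools are scripts),
reading "source target" link pairs and ranking articles by incoming links.
The real selection also weights the interwiki counts, as said above:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Read "source target" pairs from stdin, count incoming links per
    // target, and print the articles sorted by that count, best first.
    int main()
    {
        std::unordered_map<std::string, unsigned> incoming;
        std::string source, target;
        while (std::cin >> source >> target)
            ++incoming[target];

        std::vector<std::pair<std::string, unsigned>> ranked(incoming.begin(),
                                                             incoming.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        for (const auto& [title, count] : ranked)
            std::cout << count << '\t' << title << '\n';
    }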
Emmanuel
On Sun 29/11/09 18:31, "Tommi Mäkitalo" tommi(a)tntnet.org wrote:
> We already discussed using it in the directory entry also to save
> some bytes. I'm not sure if it is worth the trouble. It makes alternative
> implementations more difficult. If alternative implementations do not care
> about the full text index or categories, they don't need to implement the zint
> compression. On the other hand the code for it is straightforward and
> should be easy to port.
I have no particular opinion about that.
> The branch is not working yet. It may take some time until it will since I
> have to implement the changes in the reader as well as the writer before I
> can even start testing (other than verify, that it compiles). It was easier
> with zeno when I started, since I had already working zeno files.
OK, it is good to have done it like that, so we have an svn trunk which continues to compile and run.
Thx
Emmanuel
FYI.
As a member of the Wikimedia community I am very pleased to read that.
And as Wikimedia CH is sponsoring openZIM I say thank you on behalf of
the openZIM team.
/Manuel
-------- Original Message --------
Subject: [Wikimedia CH Board] Wikiwix and Linterweb support the current
fundraising campaigns of the local Wikimedia chapters:
Date: Fri, 4 Dec 2009 08:23:36 +0100
From: Nando Stöcklin <nando.stoecklin(a)gmail.com>
Reply-To: Board list <board(a)wikimedia.ch>
To: Board list <board(a)wikimedia.ch>
http://blog.wikiwix.com/de/2009/12/03/wikiwix-et-linterweb-soutiennent-les-…
Regards,
Nando
--
Regards
Manuel Schneider
Wikimedia CH - Verein zur Förderung Freien Wissens
Wikimedia CH - Association for the advancement of free knowledge
www.wikimedia.ch
Hi Madeleine,
Madeleine Price Ball schrieb:
> I was curious if you include images? If not, are you considering doing
> so, and what's stopping you? If so, how do you pick them?
we haven't done this on our "Test DVD" this summer, even though it is
easily possible.
Well, easily means: it is easy for the format, but the problem is choosing
the images. Emmanuel Engelhart (Kiwix, he is also part of the
openZIM team) has made some perl scripts for that.
We didn't do it for two reasons:
Lack of time, because even though the tools exist, it's a lot of work:
searching through the articles, getting all the image URLs, fetching the
images, deciding at which size to resize them, etc...
And the openZIM project is not a publisher of offline content. We are
developing a stable, efficient format that allows free interchange of
content between reader applications and devices, and we provide a GPL'ed
sample implementation of it.
> I did the work for picking which articles & images went into the XO
> activity, based on traffic stats. (We only had 100MB, we got 24k
> articles in 80MB and spent the other 20MB on highly compressed
> images.) OLPC has other more critical things to worry about these
> days, but some of the volunteers who worked on that project might be
> interested in helping others.
Well, we had an "Offline Meeting" at Wikimania in Buenos Aires this
summer in which Samuel Klein also participated. Our goal is to
contribute the right technology to enable all the offline projects to
collaborate. Currently everyone is reinventing the wheel when it comes
to storing the content.
We think that the specific knowledge of the publishers should be how to
select the content - which content goes where in which form - and not
technical questions such as compression, storage or retrieving the data
on the user's end.
The Wikimedia Foundation is supporting us insofar as they share our goal and
work on a regular export of all Wikimedia wikis into ZIM format (like the
SQL and XML dumps you already get on download.wikimedia.org). They also
have lots of contacts with publishers and other projects working on
Wikipedia offline, and they connect them with us, so we can improve ZIM to fit
all the Wikipedia offline projects.
That's why we defined a new format version at the last Developers Meeting.
Greets,
Manuel
--
Regards
Manuel Schneider
Wikimedia CH - Verein zur Förderung Freien Wissens
Wikimedia CH - Association for the advancement of free knowledge
www.wikimedia.ch
Jimbo - thanks for the spur to clean up the existing work.
All - Let's start by cleaning up the mailing lists and setting a few
short-term goals :-) It's a good sign that we have both charity and love
converging to make something happen.
* For all-platform all-purpose wikireaders, let's use
offline-l(a)lists.wikimedia, as we discussed a month ago in the aftermath of
Wikimania (Erik, were you going to set this up? I think we agreed to
deprecate wiki-offline-reader-l and replace it with offline-l.)
* For wikireaders such as WikiBrowse and Infoslicer on the XO, please
continue to use wikireader(a)lists.laptop
I would like to see WikiBrowse become the 'sugarized' version of a reader
that combines the best of that and the openZim work. A standalone DVD or
USB drive that comes with its own search tools would be another version of
the same. As far as merging codebases goes, I don't think the WikiBrowse
developers are invested in the name.
I think we have a good first cut at selecting articles, weeding out stubs,
and including thumbnail images. Maybe someone working on openZim can
suggest how to merge the search processes, and that file format seems
unambiguously better.
Kul - perhaps part of the work you've been helping along for standalone
usb-key snapshots would be useful here.
Please continue to update this page with your thoughts and progress!
http://meta.wikimedia.org/wiki/Offline_readers
SJ
2009/10/23 Iris Fernández <irisfernandez(a)gmail.com>
> On Fri, Oct 23, 2009 at 1:37 PM, Jimmy Wales <jwales(a)wikia-inc.com> wrote:
> >
> > My dream is quite simple: a DVD that can be shipped to millions of people
> > with an all-free-software solution for reading Wikipedia in Spanish. It
> > should have a decent search solution, doesn't have to be perfect, but it
> > should be full-text. It should be reasonably fast, but super-perfect is not
> > a consideration.
> >
>
> Hello! I am an educator, not a programmer. I can help selecting
> articles or developing categories related to school issues.
>
Iris - you know the main page of WikiBrowse that you see when the reader
first loads? You could help with a new version of that page. Madeleine
(copied here) worked on the first one, but your thoughts on improving it
would be welcome.