On Thu 24/12/09 07:56, "Manuel Schneider" manuel.schneider(a)wikimedia.ch wrote:
> Is it possible to run the same benchmark on the Ben NanoNote? 1.5 G
> should fit on the memory card and, as far as I understood you, the ZIM
> software has already been ported to the NN?
Yes, this would be really interesting.
The same test, with the ZIM on a DVD, would also be interesting.
> Because I wonder how the caches impact the result on the NN and what the
> optimal settings would be. Since you say that the caches don't have
> a real impact on "big" hardware, we could just go with the optimal
> settings for the NN as defaults in the zimlib.
Yes.
I have also multiplied the dirent cache by 1000 without seeing any noticeable improvement.
But here again, the test should be done with a DVD drive to help determine the best default value.
In any case, the results are really encouraging, although it is a disappointment for me that we cannot
save more disk space with LZMA. It would be interesting to run the same benchmarks with a 2 MB cluster size.
Emmanuel
On Thu 24/12/09 10:31, "Tommi Mäkitalo" tommi(a)tntnet.org wrote:
> By the way what do you think; should we drop zlib and bzip2 compression
> completely? We do not depend on zlib and bzip2 libraries any more. We have
> already dropped compatibility.
>
> Lzma is the fastest and compresses as well as bzip2. The disadvantage is
> that we really depend on a very new and not yet released lzma library?
I cannot really find arguments against abandoning gzip and bzip2... but I want to be careful.
I prefer to be "conservative" and keep them a little bit longer in the zimlib (just in case...).
But we can comment out the corresponding options in the zimwriter.
So, if we find for any reason that we need to keep bzip2 and/or gzip, we simply have to uncomment those options again.
Emmanuel
Hi,
I've done some benchmarking. I have created 2 zim files from my collection of
640000 articles. One with bzip2 and one with lzma. I burnt both files on a DVD.
A zim benchmark program (it can be found at zimlib/zimDump/zimBench) gives
interesting results. The benchmark program measures both linear reads and random
access. The linear results are not that interesting, but the random-access results are.
Reading the bzip2-compressed file gives me about 12 articles per second, lzma
about 38. So decompressing lzma is roughly three times faster.
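For reference, the random-access figure boils down to a loop like this (just a
sketch, not the actual zimBench code; readRandomArticle stands in for whatever
read call the reader exposes):

    #include <chrono>
    #include <functional>
    #include <random>

    // Time `samples` random article reads and report articles per second.
    // `readRandomArticle` is a placeholder for the real read call.
    double benchRandom(const std::function<void(unsigned)>& readRandomArticle,
                       unsigned articleCount, unsigned samples)
    {
        std::mt19937 rng(42);  // fixed seed so runs stay comparable
        std::uniform_int_distribution<unsigned> pick(0, articleCount - 1);

        auto start = std::chrono::steady_clock::now();
        for (unsigned i = 0; i < samples; ++i)
            readRandomArticle(pick(rng));
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;

        return samples / elapsed.count();
    }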
Creating the files took 2:09 with bzip2 and 3:25 with lzma.
The size is almost identical (both 1.5G).
Zimlib manages 2 caches: one for directory entries and one for uncompressed
data. Varying them makes no big difference; it looks like the OS cache already
does a good job. This may of course look different on other hardware. I had a
fast CPU and a slow device.
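The caches themselves are nothing fancy; the idea is roughly this (a sketch
with made-up names, not zimlib's actual classes), with one instance keyed by
directory entry index and one keyed by cluster number:

    #include <cstddef>
    #include <list>
    #include <optional>
    #include <unordered_map>
    #include <utility>

    // Minimal bounded LRU cache: most recently used entries at the front,
    // the oldest entry is evicted when the capacity is exceeded.
    template <typename Key, typename Value>
    class LruCache
    {
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        std::optional<Value> get(const Key& key)
        {
            auto it = map_.find(key);
            if (it == map_.end())
                return std::nullopt;
            order_.splice(order_.begin(), order_, it->second); // mark as recent
            return it->second->second;
        }

        void put(const Key& key, Value value)
        {
            auto it = map_.find(key);
            if (it != map_.end()) {
                it->second->second = std::move(value);
                order_.splice(order_.begin(), order_, it->second);
                return;
            }
            order_.emplace_front(key, std::move(value));
            map_[key] = order_.begin();
            if (map_.size() > capacity_) {                     // evict oldest
                map_.erase(order_.back().first);
                order_.pop_back();
            }
        }

    private:
        using Entry = std::pair<Key, Value>;
        std::size_t capacity_;
        std::list<Entry> order_;                               // front = newest
        std::unordered_map<Key, typename std::list<Entry>::iterator> map_;
    };

    // e.g. LruCache<unsigned, DirectoryEntry> direntCache(512);
    // (DirectoryEntry being whatever struct holds a parsed entry)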
Tommi
Hi,
a few days ago I realized that I was not done with the zim format update. The
mime types were still hard-coded. But now I'm through. I double-checked that
against my own minutes of our developers meeting.
Now the zim file contains a list of contained mime types and the directory
entry has an index to that list. Since the index is 16 bit and the mime type
0xffff specifies redirect, a zim file can now have up to 65535 distinct mime
types.
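Resolving a mime type in a reader then reduces to an index lookup; roughly
like this (a sketch, the names are made up):

    #include <cstdint>
    #include <string>
    #include <vector>

    const uint16_t redirectMimeType = 0xffff;  // reserved: entry is a redirect

    // Resolve a directory entry's 16-bit mime type index against the
    // file-wide mime type list read when the zim file is opened.
    // Returns nullptr for redirects (and for out-of-range indices).
    const std::string* resolveMimeType(const std::vector<std::string>& mimeList,
                                       uint16_t mimeIndex)
    {
        if (mimeIndex == redirectMimeType || mimeIndex >= mimeList.size())
            return nullptr;
        return &mimeList[mimeIndex];
    }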
In zimwriter the database source has changed, so that the mime type in the
article table is no longer a number but a text string.
Next task I have to do is to document the changes.
Manuel already installed the xz-utils package on our server so that I can
start implementing lzma compression then.
Tommi
Hi,
the changes we discussed at our developers meeting are implemented in zimlib
and zimwriter. I was already able to create wikipedia-de.zim and the full
text index wikipedia-de-x.zim. Both files are slightly larger due to the
additional title and the index for it.
The zint compression has also changed. As announced, it is similar to UTF-8. The
full text index data is slightly smaller.
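For the curious, a UTF-8-style variable-length integer works roughly like
this (only a sketch of the principle for values below 2^31; the exact bit
layout of zint may differ):

    #include <cstdint>
    #include <vector>

    // Values below 0x80 take a single byte. Larger values get a first byte
    // whose count of leading 1 bits equals the total length, followed by
    // 0x80|xxxxxx continuation bytes - just like UTF-8. n bytes carry
    // 5n+1 payload bits.
    std::vector<uint8_t> zintEncode(uint32_t value)
    {
        std::vector<uint8_t> out;
        if (value < 0x80) {
            out.push_back(static_cast<uint8_t>(value));
            return out;
        }
        unsigned len = 2;
        for (uint32_t limit = 1u << 11; len < 6 && value >= limit; limit <<= 5)
            ++len;
        uint8_t cont[5];
        for (unsigned i = 0; i + 1 < len; ++i) {    // low 6-bit groups first
            cont[i] = static_cast<uint8_t>(0x80 | (value & 0x3f));
            value >>= 6;
        }
        out.push_back(static_cast<uint8_t>(0xff << (8 - len))  // length marker
                      | static_cast<uint8_t>(value));          // high payload bits
        for (unsigned i = len - 1; i-- > 0; )
            out.push_back(cont[i]);
        return out;
    }

For example, 300 encodes to the two bytes 0xC4 0xAC, exactly as UTF-8 would
encode the code point 300.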
I did not use this zint compression in the directory entries since it is really
not worth the trouble. I preferred simpler directory entries to make the
implementation of alternatives (Java, C#) simpler.
As Manuel already mentioned, I changed the way zim files are opened. The zimlib
used to read the whole pointer list into memory when the file was opened. This is
no longer done. Memory usage is now reduced quite significantly, so that I
can do a full text search even on the NanoNote with 32MB RAM. I implemented
this in trunk as well.
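The change is essentially to seek instead of slurping the list at open time;
roughly like this (a sketch, assuming a list of 8-byte little-endian file
offsets starting at ptrListOffset - with 640000 articles and 8-byte pointers
that is about 5 MB that no longer has to stay resident):

    #include <cstdint>
    #include <fstream>

    // Read the i-th entry of an on-disk pointer list on demand instead of
    // keeping the whole list in memory.
    uint64_t readPointer(std::ifstream& zimFile, uint64_t ptrListOffset,
                         uint64_t i)
    {
        zimFile.seekg(static_cast<std::streamoff>(ptrListOffset + i * 8));
        unsigned char buf[8];
        zimFile.read(reinterpret_cast<char*>(buf), 8);
        uint64_t value = 0;
        for (int b = 7; b >= 0; --b)
            value = (value << 8) | buf[b];           // little-endian decode
        return value;
    }

    // The file has to be opened in binary mode, e.g.
    // std::ifstream f("wikipedia-de.zim", std::ios::binary);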
Currently I'm working on porting zimreader to the new library. Porting is
necessary since we decided to drop the qunicode feature, but it is still very
simple since I just need to replace all references to qunicode strings with
plain strings.
I'm also looking at the API of lzma. The documentation is quite confusing, but
I'm making some progress here too. I promised to implement the file format changes
by the end of this year, and since I'm actually through, maybe I can add lzma
compression as well.
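In case others run into the same confusing documentation: the one-shot buffer
functions are the gentlest entry point into liblzma. A minimal compression
sketch (error handling trimmed; whether zimlib will use exactly this container
format is a separate decision):

    #include <lzma.h>

    #include <cstdint>
    #include <vector>

    // Compress a buffer with liblzma's one-shot "easy" encoder.
    // Preset 6 is the library's default speed/ratio trade-off.
    std::vector<uint8_t> lzmaCompress(const std::vector<uint8_t>& in)
    {
        std::vector<uint8_t> out(lzma_stream_buffer_bound(in.size()));
        size_t outPos = 0;
        lzma_ret rc = lzma_easy_buffer_encode(
            6, LZMA_CHECK_CRC32, nullptr,   // preset, integrity check, allocator
            in.data(), in.size(),
            out.data(), &outPos, out.size());
        if (rc != LZMA_OK)
            return {};                      // a real caller would report rc
        out.resize(outPos);
        return out;
    }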
I have also decided to drop support for zlib and bzip2 compression as soon as lzma
is working. This reduces the external dependencies, and I see no advantage in
supporting multiple compression methods. What do you think?
Tommi
On Wed 02/12/09 21:32, "Madeleine Price Ball" meprice(a)fas.harvard.edu wrote:
> > We think that the specific knowledge of the publishers should be how to
> > select the content - which content goes where in which form - and not
> > technical questions such as compression, storage or retrieving the data on
> > the user's end.
>
> OK, if I shouldn't be talking to you guys, tell me who to talk to.
>
> Yes, selecting content is very difficult. I couldn't get Peru or SJ to
> contribute meaningfully to generating a simple blacklist of articles
> that should NOT be included on the OLPC activity. (Recall it is being
> given to young children!) I ended up making the blacklist myself based
> on my own gut feelings. If Peru's board of education or OLPC's
> "director of content" couldn't get their act together for this simple
> task, expecting others to do this task for you will be a huge
> roadblock to getting content out.
>
> Traffic-based content is simple and effective and it doesn't involve a
> lot of opinions on what should or should not be included.
Yes, by compiling such stats with an incoming-link counter and an interwiki counter,
it is possible to get a pretty accurate and "neutral" selection quickly for any Wikipedia.
I have scripts to do that automatically:
http://kiwix.svn.sourceforge.net/viewvc/kiwix/selection_tools/
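The heart of those scripts fits in a few lines; a sketch of the link-counting
part (written in C++ here just for illustration - the real tools are scripts),
reading "source target" link pairs and ranking articles by incoming links.
The real selection also weights the interwiki counts, as said above:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Read "source target" pairs from stdin, count incoming links per
    // target, and print the articles sorted by that count, best first.
    int main()
    {
        std::unordered_map<std::string, unsigned> incoming;
        std::string source, target;
        while (std::cin >> source >> target)
            ++incoming[target];

        std::vector<std::pair<std::string, unsigned>> ranked(incoming.begin(),
                                                             incoming.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        for (const auto& [title, count] : ranked)
            std::cout << count << '\t' << title << '\n';
    }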
Emmanuel
On Sun 29/11/09 18:31, "Tommi Mäkitalo" tommi(a)tntnet.org wrote:
> We already discussed using it in the directory entry also to save
> some bytes. I'm not sure if it is worth the trouble. It makes alternative
> implementations more difficult. If alternative implementations do not care
> about the full text index or categories, they don't need to implement the zint
> compression. On the other hand the code for it is straightforward and
> should be easy to port.
I have no particular opinion about that.
> The branch is not working yet. It may take some time until it will since I
> have to implement the changes in the reader as well as the writer before I
> can even start testing (other than verify, that it compiles). It was easier
> with zeno when I started, since I had already working zeno files.
OK, it is good to have done it like that, so we have an svn trunk which continues to compile and run.
Thx
Emmanuel
FYI.
As a member of the Wikimedia community I am very pleased to read that.
And as Wikimedia CH is sponsoring openZIM I say thank you on behalf of
the openZIM team.
/Manuel
-------- Original Message --------
Subject: [Wikimedia CH Board] Wikiwix and Linterweb support the current
fundraising campaigns of the local Wikimedia chapters:
Date: Fri, 4 Dec 2009 08:23:36 +0100
From: Nando Stöcklin <nando.stoecklin(a)gmail.com>
Reply-To: Board list <board(a)wikimedia.ch>
To: Board list <board(a)wikimedia.ch>
http://blog.wikiwix.com/de/2009/12/03/wikiwix-et-linterweb-soutiennent-les-…
Regards,
Nando
--
Regards
Manuel Schneider
Wikimedia CH - Verein zur Förderung Freien Wissens
Wikimedia CH - Association for the advancement of free knowledge
www.wikimedia.ch
Hi Madeleine,
Madeleine Price Ball schrieb:
> I was curious if you include images? If not, are you considering doing
> so, and what's stopping you? If so, how do you pick them?
we haven't done this on our "Test DVD" this summer, even though it is
easily possible.
Well, easily means: it is easy for the format, but the problem is choosing
the images. Emmanuel Engelhart (Kiwix, he is also part of the
openZIM team) has made some perl scripts for that.
We didn't do it for two reasons:
Lack of time, because even though the tools exist, it's a lot of work:
searching through the articles, getting all the image URLs, fetching the
images, deciding at which size to resize them, etc...
And the openZIM project is not a publisher of offline content. We are
developing a stable, efficient format that allows free interchange of
content between reader applications and devices, and we provide a GPL'ed
sample implementation of it.
> I did the work for picking which articles & images went into the XO
> activity, based on traffic stats. (We only had 100MB, we got 24k
> articles in 80MB and spent the other 20MB on highly compressed
> images.) OLPC has other more critical things to worry about these
> days, but some of the volunteers who worked on that project might be
> interested in helping others.
Well, we had an "Offline Meeting" at Wikimania in Buenos Aires this
summer in which Samuel Klein also participated. Our goal is to
contribute the right technology to enable all the offline projects to
collaborate. Currently everyone is reinventing the wheel when it comes
to storing the content.
We think that the specific knowledge of the publishers should be how to
select the content - which content goes where in which form - and not
technical questions such as compression, storage or retrieving the data
on the user's end.
The Wikimedia Foundation is supporting us insofar as they share our goal and
work on a regular export of all Wikimedia wikis into ZIM format (like the
SQL and XML dumps you already get on download.wikimedia.org). They also
have lots of contacts with publishers and other projects working on
Wikipedia offline, and they connect them with us, so we can improve ZIM to fit
all the Wikipedia offline projects.
That's why we defined a new format version at the last Developers Meeting.
Greets,
Manuel
--
Regards
Manuel Schneider
Wikimedia CH - Verein zur Förderung Freien Wissens
Wikimedia CH - Association for the advancement of free knowledge
www.wikimedia.ch
Jimbo - thanks for the spur to clean up the existing work.
All - Let's start by cleaning up the mailing lists and setting a few
short-term goals :-) It's a good sign that we have both charity and love
converging to make something happen.
* For all-platform all-purpose wikireaders, let's use
offline-l(a)lists.wikimedia, as we discussed a month ago in the aftermath of
Wikimania (Erik, were you going to set this up? I think we agreed to
deprecate wiki-offline-reader-l and replace it with offline-l.)
* For wikireaders such as WikiBrowse and Infoslicer on the XO, please
continue to use wikireader(a)lists.laptop
I would like to see WikiBrowse become the 'sugarized' version of a reader
that combines the best of that and the openZim work. A standalone DVD or
USB drive that comes with its own search tools would be another version of
the same. As far as merging codebases goes, I don't think the WikiBrowse
developers are invested in the name.
I think we have a good first cut at selecting articles, weeding out stubs,
and including thumbnail images. Maybe someone working on openZim can
suggest how to merge the search processes, and that file format seems
unambiguously better.
Kul - perhaps part of the work you've been helping along for standalone
usb-key snapshots would be useful here.
Please continue to update this page with your thoughts and progress!
http://meta.wikimedia.org/wiki/Offline_readers
SJ
2009/10/23 Iris Fernández <irisfernandez(a)gmail.com>
> On Fri, Oct 23, 2009 at 1:37 PM, Jimmy Wales <jwales(a)wikia-inc.com> wrote:
> >
> > My dream is quite simple: a DVD that can be shipped to millions of people
> > with an all-free-software solution for reading Wikipedia in Spanish. It
> > should have a decent search solution, doesn't have to be perfect, but it
> > should be full-text. It should be reasonably fast, but super-perfect is not
> > a consideration.
> >
>
> Hello! I am an educator, not a programmer. I can help selecting
> articles or developing categories related to school issues.
>
Iris - you know the main page of WikiBrowse that you see when the reader
first loads? You could help with a new version of that page. Madeleine
(copied here) worked on the first one, but your thoughts on improving it
would be welcome.