Dear Ariel,
1) Profiling
WP-MIRROR 0.6 saw the introduction of a `--profile' command-line
option, which provides a detailed breakdown of where time is spent
during a mirror build. Unfortunately, WP-MIRROR 0.5 did not have this
feature, so only aggregate comparisons are possible.
2) Performance studies
Much of the winter and spring was devoted to time trials. These are
all documented in the WP-MIRROR 0.6 Reference Manual, Appendix G. The
following may be of particular interest to you:
2.1) G.8 Experiments with InnoDB data compression
Currently, InnoDB offers two on-disk storage formats, Antelope and
Barracuda. The latter offers data compression. I performed experiments
on many of the largest wiki tables (e.g. categorylinks, image,
langlinks, pagelinks, templatelinks, text) to determine the space
savings and time penalty. For a summary of results, please take a look
at Table G.1, `Database table size v. ROW_FORMAT and KEY_BLOCK_SIZE'.
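To make the experimental setup concrete, here is a minimal sketch of
the statements involved (the schema name `wikidb' is only an
illustration; substitute the per-wiki database, e.g. commonswiki):

  -- Barracuda and file-per-table are prerequisites for compression:
  SET GLOBAL innodb_file_format = 'Barracuda';
  SET GLOBAL innodb_file_per_table = 1;
  -- Rebuild a table with compressed pages; KEY_BLOCK_SIZE is the
  -- compressed page size in KB (valid values: 1, 2, 4, 8, 16):
  ALTER TABLE wikidb.pagelinks
    ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;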
2.2) G.9 Experiments with commonswiki.image
For several reasons, I wanted to import commonswiki.image as part of
the mirror build process. That table is large and takes time to
import. However, once it is imported, WP-MIRROR 0.6 saves far more
time elsewhere (e.g. I no longer have to scrape the XML dumps for
image file names).
Because of the size of commonswiki.image, I performed a lengthy series
of experiments to determine the fastest way of importing it. Many of
the best methods make use of features first offered in MySQL 5.5 and
InnoDB 1.1 (e.g. fast index creation). For a summary of results,
please take a look at Figure G.1, `InnoDB Experiments Importing
commonswiki.image Table'.
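As a rough sketch of the load-first-index-later strategy that fast
index creation enables (the index name and columns here are taken
from MediaWiki's tables.sql and may differ between versions):

  -- 1) Drop a secondary index before the bulk load:
  ALTER TABLE commonswiki.image DROP INDEX img_usertext_timestamp;
  -- 2) Load the table data (e.g. source the SQL dump with the mysql
  --    client), paying only the clustered-index maintenance cost.
  -- 3) Rebuild the secondary index afterwards; with InnoDB 1.1 this
  --    is a single sorted build rather than row-by-row insertion:
  ALTER TABLE commonswiki.image
    ADD INDEX img_usertext_timestamp (img_user_text, img_timestamp);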
3) Documentation
The above-mentioned WP-MIRROR 0.6 Reference Manual may be found at
<http://www.nongnu.org/wp-mirror/manual/>. It is also included with
the DEB package.
4) Questions
4.1) mwxml2sql
While reading the code for `mwxml2sql.c', I noticed the `-n,
--nodrop' option, which sets KEY_BLOCK_SIZE=16 for the `text' table.
From my experiments (see Appendix G.8), I do not think that any
compression takes place. KEY_BLOCK_SIZE=4 might be a better choice.
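To make the suggestion concrete, this is the sort of table definition
I have in mind (columns abridged from MediaWiki's tables.sql):

  CREATE TABLE text (
    old_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    old_text  MEDIUMBLOB NOT NULL,
    old_flags TINYBLOB NOT NULL,
    PRIMARY KEY (old_id)
  ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;
  -- From my Appendix G.8 experiments, KEY_BLOCK_SIZE=16 appears to
  -- yield no actual compression; KEY_BLOCK_SIZE=4 might do better.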
4.2) INSERT IGNORE vs. REPLACE INTO
For the initial mirror build, INSERT INTO commands get the job done.
However, for updates to an existing mirror, I would prefer not to DROP
TABLE every time. For this reason, I rewrite the INSERT INTO commands
into REPLACE INTO commands. This rewrite works fine.
What I would like to know from you is this: for which tables is
INSERT IGNORE a better choice than REPLACE INTO? I can see that the
`revision' and `text' tables are candidates because they are never
updated, only added to. But what about the other tables? Any advice
would be appreciated.
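For concreteness, here is the behavioral difference on a duplicate
key, using `text' as the example (the values are made up):

  -- REPLACE INTO deletes the conflicting row and inserts the new
  -- one, so every secondary index is modified twice:
  REPLACE INTO text (old_id, old_text, old_flags)
    VALUES (12345, 'revision text here', 'utf-8');
  -- INSERT IGNORE keeps the existing row and silently discards the
  -- new one, which is cheaper for tables that are never updated:
  INSERT IGNORE INTO text (old_id, old_text, old_flags)
    VALUES (12345, 'revision text here', 'utf-8');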
Sincerely Yours,
Kent
On 6/3/13, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> On 03-06-2013, Mon, at 10:22 -0400, wp mirror wrote:
>> Dear list members,
>>
>> I am pleased to announce the release of WP-MIRROR 0.6.
>>
>> The main design objective was this: PERFORMANCE. WP-MIRROR 0.6 now
>> builds the `enwiki' (which is the most demanding case) with 80% less
>> time and 75% less memory than v0.5.
>>
> Sounds great! Can you give us some benchmarks? I'm particularly
> interested in the length of time with the old and new versions of your
> package for the various stages of setting up a dump of current pages for
> the English language Wikipedia, on whatever hardware you are using for
> testing.
>
> Ariel
>
>
Dear list members,
I am pleased to announce the release of WP-MIRROR 0.6.
The main design objective was this: PERFORMANCE. WP-MIRROR 0.6 now
builds the `enwiki' (which is the most demanding case) with 80% less
time and 75% less memory than v0.5.
Feature: One new feature was added. WP-MIRROR 0.6 can now mirror
wikis from most other WMF projects (e.g. wikibooks, wiktionary).
Reliability: Downloads are now performed with the aid of `wget',
which has an automatic restart feature. This virtually eliminates the
problem of partial downloads.
Images: WP-MIRROR 0.6 makes use of the image dump tarballs found at
<http://ftpmirror.your.org/>. It then does a thorough job of
identifying image files missing from the tarballs, and downloads
them efficiently using HTTP/1.1 persistent connections.
Packaging: The DEB package for WP-MIRROR 0.6 should work
`out-of-the-box' with no user configuration for the following
distributions:
o Debian GNU/Linux 7.0 (wheezy)
o Ubuntu 12.10 (quantal)
Virtual Hosts: Browsing of mirrored wikis is done via virtual hosts
with names like <http://simple.wikipedia.site/> and
<http://simple.wiktionary.site/>. Simply take the URL that WMF
offers, and replace `.org' with `.site'.
Project Home Page: <http://www.nongnu.org/wp-mirror/> has been updated.
Feedback is welcome.
Sincerely Yours,
Dr. Kent L. Miller
xowa is a new, open-source, offline wiki application. It imports directly
from the Wikimedia data dumps, and shows articles in an HTML browser
window. It can also download and display images.
v0.5.0 is a general rollup release. It is intended to be stable.
These are the changes since the last announcement (v0.4.0):
* Wikidata #property tag (Phase 2)
* Wikidata JSON structured data formatter {contributed by Schnark}
* Score extension for music transcription through LilyPond
* Improved Scribunto support for 2013-04 / 2013-05 English Wikipedia
* MediaWiki-like Categories (command line install only)
* JavaScript injection prevention
The files are here: https://sourceforge.net/projects/xowa/files/v0.5.0.1/
As always, any feedback is appreciated.
Hi,
A first version of Kiwix for Android was released a month ago. The app
was warmly welcomed, with around 2,000 total installations and 1,000
active ones.
An average rating of 4.25/5 was given across 25 reviews. Almost no bugs
were detected, and people simply want more features.
A few hours ago we released a new version, fixing the only bug we
have detected and providing a few new features:
https://play.google.com/store/apps/details?id=org.kiwix.kiwixmobile
We need new Java developers to implement features like tabs, bookmarks,
navigation history... More details in the bug tracker:
https://sourceforge.net/p/kiwix/feature-requests/search/?q=status%3Aopen+%2…
Beginners are welcome; stepping in is almost trivial, as everything is
explained step by step in the COMPILE file:
https://sourceforge.net/p/kiwix/kiwix/ci/master/tree/
Regards
Emmanuel
Dear upstream-tracker.org,
My name is Micah Roth. I am a contributor for Fedora working on getting
Kiwix, an offline Wikipedia reader, packaged for our repos. I would like to
register zimlib, which is the standard implementation of the OpenZIM
library, for tracking on upstream-tracker.org. zimlib is required for Kiwix
to function.
The project's website is http://www.openzim.org/wiki/Main_Page
Please let me know if you need me to do anything else. I have notified
upstream about this registration in IRC. They requested that I cc their
mailing list.
Thank you,
~Micah
fyi
-------- Original Message --------
Subject: Offline Wikipedia (was Re: GSoC Project)
Date: Tue, 30 Apr 2013 09:07:53 -0700
From: Quim Gil <qgil(a)wikimedia.org>
Organization: Wikimedia Foundation
To: wikitech-l(a)lists.wikimedia.org
Thank you for this reply, Emmanuel. GSoC / OPW candidates learn a lot
from emails like this! (and the rest of us too)
On 04/29/2013 03:48 PM, Emmanuel Engelhart wrote:
> let me thank:
> * Quim for having renamed this thread... I wouldn't have got a chance to
> read it otherwise.
Then please make sure "Offline Wikipedia" is in the subject of your
replies. ;)
> If you or someone else is interested, we would probably be able to find
> a tutor.
Please find one. Or even better two co-mentors, as we are requesting at
http://lists.wikimedia.org/pipermail/wikitech-l/2013-April/068873.html
There is not much time left.
> PS: Wikimedia has an offline centric mailing list, let me add it in CC:
> https://lists.wikimedia.org/mailman/listinfo/offline-l
Cross-posting to the offline list since this is urgent.
--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
Dear Kiran,
Before commenting on your proposal, let me thank:
* Quim for having renamed this thread... I wouldn't have got a chance to
read it otherwise.
* Gnosygnu and Sumana for their previous answers.
Your email points to three problems:
(1) The size of the offline dumps
(2) The server mode of the offline solution
(3) The need for incremental updates
Regarding (1), I disagree. We have the ZIM format, which is open, has
an extremely efficient standard implementation, provides high
compression rates and fast random access: http://www.openzim.org
Regarding (2), Kiwix, which is a ZIM reader, already does it: you can
either share Kiwix on a network disk or use Kiwix's HTTP-compatible
daemon, kiwix-serve: http://www.kiwix.org/wiki/Kiwix-serve
Regarding (3), I agree. This is an old feature request in the openZIM
project. It's both on the roadmap and in the bug tracker:
* http://www.openzim.org/wiki/Roadmap
* https://bugzilla.wikimedia.org/show_bug.cgi?id=47406
However, I also think the solution you propose isn't adapted to the
problem. Setting up a MediaWiki instance is not easy; it is
resource-intensive, and you don't need all that power (of the
software setup) for the usage you have in mind.
On the other hand, with ZIM you have a format which provides all that
you need, runs on devices which cost only a few dozen USD, and we
will make this incremental update trivial for the end user (it's just
a matter of time ;).
So to fix that problem, here is my approach: we should implement two
tools, which I call "zimdiff" and "zimpatch":
* zimdiff is a tool able to compute the difference between two ZIM files
* zimpatch is a tool able to patch a ZIM file with a ZIM diff file
The incremental update process would be:
* Compute a ZIM diff file (done by the ZIM provider)
* Download and patch the "old" ZIM file with the ZIM diff file (done by
the user)
We could implement two modes for zimpatch, "lazy" and "normal":
* lazy mode: a simple merge of the files and a rewrite of the index
(fast, but needs a lot of mass storage)
* normal mode: recompute a new file (slow, but needs less mass storage)
Regarding the ZIM diff file format... the discussion is open, but it
looks like we could simply reuse the ZIM format, and zimpatch would
work like a "zimmerge" (which does not exist; the name is just for
the explanation).
Everything could be done, IMO, in "only" a few hundred smart lines of
C++. I would be really surprised if this needed more than 2000 lines.
But to do that, we need a pretty talented C++ developer. Maybe you?
If you or someone else is interested, we would probably be able to
find a tutor.
Kind regards
Emmanuel
PS: Wikimedia has an offline centric mailing list, let me add it in CC:
https://lists.wikimedia.org/mailman/listinfo/offline-l
On 26/04/2013 22:27, Kiran Mathew Koshy wrote:
> Hi guys,
>
> I have an idea of my own for my GSoC project that I'd like to share with
> you. It's not a perfect one, so please forgive any mistakes.
>
> The project is related to the existing GSoC project "Incremental Data
> dumps", but is in no way a replacement for it.
>
>
> *Offline Wikipedia*
>
> For a long time, a lot of offline solutions for Wikipedia have sprung up
> on the internet. All of these have been unofficial solutions, and have
> limitations. A major problem is the increasing size of the data dumps,
> and the problem of updating the local content.
>
> Consider the situation in a place where internet is costly or
> unavailable. (For the purpose of discussion, let's consider a school in a
> 3rd-world country.) Internet speeds are extremely slow, and accessing
> Wikipedia directly from the web is out of the question.
> Such a school would greatly benefit from an instance of Wikipedia on a
> local server. Now up to here, the school can use any of the freely
> available offline Wikipedia solutions to make a local instance. The problem
> arises when the database in the local instance becomes obsolete. The client
> is then required to download an entire new dump (approx. 10 GB in size)
> and load it into the database.
> Another problem is that most 3rd-party programs do not allow network
> access, so a new instance of the database (approx. 40 GB) is required on
> each installation. For instance, in a school with around 50 desktops,
> each desktop would require a 40 GB database. Plus, updating them becomes
> even more difficult.
>
> So here's my idea:
> Modify the existing MediaWiki software and add a few PHP/Python scripts
> which will automatically update the database and will run in the
> background. (Details on how the update is done are described later.)
> Initially, the (modified) MediaWiki will take an XML dump / SQL dump (SQL
> dump preferred) as input and will create the local instance of Wikipedia.
> Later on, the updates will be added to the database automatically by the
> script.
>
> The installation process is extremely easy; it just requires a server
> package like XAMPP and the MediaWiki bundle.
>
>
> Process of updating:
>
> There will be two methods of updating the server. Both will be
> implemented in the MediaWiki bundle. Method 2 requires the functionality of
> incremental data dumps, so it can be completed only after the functionality
> is available. Perhaps I can collaborate with the student selected for
> incremental data dumps.
>
> Method 1 (online update): A list of all pages is made and published by
> Wikipedia. This can be in an XML format. The only information in the XML
> file will be the page IDs and the last-touched date. This file will be
> downloaded by the MediaWiki bundle, and the page IDs will be compared with
> the pages of the existing local database.
>
> Case 1: A new page ID in the XML file denotes a new page added.
> Case 2: A page which is present in the local database but not among the
> page IDs denotes a deleted page.
> Case 3: A page in the local database has a different 'last touched' date
> compared to the one in the XML file; this denotes an edited page.
>
> In each case, the change is made in the local database, and if the new
> page data is required, the data is obtained using the MediaWiki API.
> These offline instances of Wikipedia will only be used in cases where
> internet speeds are very low, so they won't cause much load on the
> servers.
>
> Method 2 (offline update): (Requires the functionality of the existing
> project "Incremental data dumps".)
> In this case, the incremental data dumps are downloaded by the
> user (admin) and fed to the MediaWiki installation the same way the
> original dump is fed (as a normal file), and the corresponding changes
> are made by the bundle. Since I'm not aware of the XML format used in
> incremental updates, I cannot describe it now.
>
> Advantages: An offline solution can be provided for regions where
> internet access is a scarce resource. This would greatly benefit
> developing nations, and would help in making the world's information
> more free and openly available to everyone.
>
> All comments are welcome !
>
> PS, about me: I'm a 2nd-year undergraduate student at the Indian
> Institute of Technology, Patna. I code for fun.
> Languages: C/C++, Python, PHP, etc.
> Hobbies: CUDA programming, robotics, etc.
>
Hi
The openZIM project finally released a first version of its standard
implementation code, the zimlib:
http://www.openzim.org/download/zimlib-1.0.tar.gz
This tarball contains the C++ dev files:
* to read a ZIM file,
* to write a ZIM file,
* a few tools,
* a few example programs.
Releasing this was mandatory to:
* help GNU/Linux packagers.
* have a stable code base for new bindings, like phpzim.
More information about the openZIM project at:
http://www.openzim.org
Regards
Emmanuel
Hi,
On 2013-04-18 10:33, Emmanuel Engelhart wrote:
> Wikimedia CH has sponsored a long Kiwix Hackathon (and the WMF two
> tablets), and here is the result: the first version of Kiwix (and also
> the first ZIM reader) for Android:
> https://play.google.com/store/apps/details?id=org.kiwix.kiwixmobile
this is so great! I just wanted to install it but it doesn't seem to be
available for my phone. It's just two years old and was one of the best
phones when I bought it (HTC Desire Z).
Anyway, I am very happy to see another Android ZIM reader. Whatever
happened to the WikiOnBoard reader, by the way?
Regards,
Manuel
--
Manuel Schneider
Wikimedia CH - Gesellschaft zur Förderung freien Wissens
www.wikimedia.ch