Hi
I would like to announce the publication of new datasets that make it
easy to select Wikipedia articles. Any developer or tech-savvy person
can use this data to create a subset of Wikipedia. You can find the
data here: http://download.kiwix.org/wp1/ (or via FTP).
This data repository will be kept up-to-date every month thanks to a few
scripts which are published here:
https://github.com/openzim/wp1_selection_tools. Of course, everything is
free software.
For each of the hundreds of Wikipedias, you will find TSV tables
containing the usual indicators of importance for each article, such as
the number of interlanguage links, the number of links pointing to the
article, pageviews, and so on, all gathered in one file. For the English
Wikipedia you additionally get the WikiProject importance/quality
assessments.
If you are really lazy, there is a "score" file which mixes all these
indicators to give a single score per article. The methodology is
described here: https://github.com/openzim/wp1_selection_tools. For
example, if you want the top 1000 articles of a Wikipedia, just take the
first thousand lines of the "score" file to get your list of articles.
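Taking the first lines of the "score" file could look like the sketch
below. It is only an illustration: it assumes the file is tab-separated,
sorted best score first, with the article title in the first column.
The function name and the sample rows are made up, so check the real
file layout before relying on it.

```python
import csv
import io
from itertools import islice


def top_articles(score_file, limit=1000):
    """Return the titles found in the first `limit` rows of a score file.

    Assumption: one article per line, best score first, title in the
    first tab-separated column.
    """
    reader = csv.reader(score_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # islice stops after `limit` rows, so a huge file is never fully read.
    return [row[0] for row in islice(reader, limit) if row]


# Tiny in-memory stand-in for a downloaded "score" file (made-up values).
sample = io.StringIO("Earth\t9500\nWater\t9100\nSun\t8700\n")
top_two = top_articles(sample, limit=2)
print(top_two)
```

With a real download you would pass an open file instead of the sample,
for example open(path, encoding="utf-8", newline="").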
All this work has been done to allow the creation of ZIM files of the
top Wikipedia articles. It was also done to make possible the creation
of ZIM extension files, a concept we want to develop to improve our
WikiMed Android apps. Both will appear before the end of the year.
Stay tuned!
Regards
Emmanuel
--
Kiwix - Wikipedia Offline & more
* Web: http://www.kiwix.org
* Twitter: https://twitter.com/KiwixOffline
* more: http://www.kiwix.org/wiki/Communication
Hi
We rarely write here specifically about mwoffliner, even though the tool
is mentioned in passing from time to time in threads. But over the last
few months we have made many interesting improvements to this important
tool, and I thought it would be valuable to report briefly on them.
As a reminder, mwoffliner is a script designed to build a ZIM file from
any (recent) online MediaWiki. It scrapes a snapshot of the online wiki
(HTML/JS/pictures/...) to your local disk.
Here is the list of recent improvements:
* We have introduced Parsoid as a local dependency, which means that even
if a MediaWiki does not have Parsoid/VisualEditor installed, mwoffliner
should now have a chance to build its ZIM file by running Parsoid
locally.
* We have introduced support for the Parsoid mobile layout, which allows
building ZIM files with a layout similar to the Wikipedia mobile version.
This is still very much in beta and we plan to use it first only for
Wikipedia.org.
* We have introduced support for audio/video, which means that, like
pictures, they are now mirrored too. Our first tests show that for
Wikipedia it tends to multiply the size of the ZIM file by a factor of
four. As a consequence we won't enable it everywhere right away. That
said, the feature is there and we will introduce video step by step in
the ZIM files we generate with mwoffliner.
* We have published mwoffliner (and mwmatrixoffliner) to the npm
registry: https://www.npmjs.com/package/mwoffliner. Now everybody can
install it easily (but you still need to take care of the dependencies).
* We have made the script a bit more modular: you can call it like any
other program, but you can now also use it as a library in your own
JavaScript/Node.js scripts.
* We have moved the git repository to the openZIM organization on
GitHub: https://github.com/openzim/mwoffliner. By moving all our scrapers
to the openZIM organization we hope to clarify the respective duties of
Kiwix and openZIM. Have a look at all the other scrapers we have
migrated to openZIM: https://github.com/openzim
mwoffliner is not a tool for everybody, but it is really important to
keep improving it in order to provide quality ZIM files of Wikipedia,
Wiktionary, ... So if you have JavaScript skills, please come and help us
prepare the next big steps forward:
https://github.com/openzim/mwoffliner/issues
Regards
Emmanuel
Hi
After a year and a half of effort, we are proud to announce our first
delivery of ZIM files of the Stack Exchange websites.
All the ZIM files are freely available for download via the Kiwix
software or directly from the Kiwix download server:
http://download.kiwix.org/zim/stack_exchange/
Stack Exchange is a network of question-and-answer websites on topics in
varied fields, each site covering a specific topic, where questions,
answers, and users are subject to a reputation award process. This
includes famous websites like Stackoverflow.com, AskUbuntu.com or
Superuser.com. More information about Stack Exchange and its more than
100 web sites is available here:
https://en.wikipedia.org/wiki/Stack_Exchange.
These ZIM files are built from the regularly updated archives provided
on archive.org, using ad hoc software our team developed specially for
that purpose. This software is called "Sotoki" and is of course open
source. You can have a look at the source code here
https://github.com/openzim/sotoki or install it directly with the Python
pip package manager:
https://pypi.python.org/pypi/sotoki
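To give an idea of the raw material, the Stack Exchange archives are XML
dumps in which each post is a "row" element. The sketch below is not
Sotoki's code: it is a minimal illustration that assumes a Posts.xml-style
file where questions carry PostTypeId="1" and a Title attribute. The
function name and the sample rows are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_questions(posts_xml):
    """Yield (id, title) for each question row of a Posts.xml-style stream.

    Assumption: posts are <row .../> elements and questions are marked
    with PostTypeId="1".
    """
    # iterparse streams the file, so large dumps are not loaded at once.
    for _event, elem in ET.iterparse(posts_xml, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            yield int(elem.get("Id")), elem.get("Title")
        elem.clear()  # drop the attributes of elements already handled


# Made-up two-row sample standing in for a real dump file.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Title="How do I exit Vim?" />'
    b'<row Id="2" PostTypeId="2" ParentId="1" />'
    b'</posts>'
)
questions = list(iter_questions(sample))
print(questions)
```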
We plan to release updates of these ZIM files each time new archives
are published.
Regards
Emmanuel
We're holding a hackathon on August 13-18, right after Wikimania, in Potsdam, New York (about a 2.5-hour drive from Montreal). The focus is on writing code to help produce offline collections of Wikipedia content along with other medical & educational materials, mostly for use in libraries, schools and clinics. The main wiki page can be found at http://OFF.NETWORK. We can probably help you with transportation from Montreal, and accommodation in Potsdam if you need it.
Some of the attendees will come from familiar offline groups such as Kiwix, mainly to write code. Others focus on end use and come from the non-profit/educational community, such as Computers for Kids, Internet-in-a-Box and KA Lite. These people will want to ensure that the technical needs of the groups can be met, and that resources are shared. You can see the sort of thing Internet-in-a-Box does in this recent article on opensource.com: https://opensource.com/article/17/5/internet-in-a-box-raspberry-pi
If you're interested in attending, it's probably best to contact me directly. Thanks!
Martin A. Walker (walkerma on Wikipedia)
Professor of Chemistry, SUNY Potsdam
walkerma@potsdam.edu
Hi
Last month a dozen Kiwix developers attended our Spring 2017
hackathon in Lyon, France. These developers represent the broad
spectrum of Kiwix/openZIM activities, from scraping to the mobile
apps, through core library development.
For a whole week we all lived and worked together to help the
whole project take a significant step forward. Here is the essence of
what we achieved:
*First Chrome extension*: this is a pure Web tech Kiwix reader. Still
limited in features but working and already published on the Chrome
Store
https://chrome.google.com/webstore/detail/kiwix/donaljnlmapmngakoipdmehbfci….
Give it a try and report any problem here:
https://github.com/kiwix/kiwix-html5/issues
*First Firefox extension*: sharing the same code base as the Chrome
extension, it is still in validation but should be publicly available
soon. Impatient users can install a "nightly" from here
http://download.kiwix.org/nightly/. Both the Firefox and Chrome
extensions were initiated by the Evopedia project, which merged with
Kiwix last year.
*Gutenberg scraper* improved: developed 3 years ago, the original
scraper has been polished and the (big) list of tickets has been
cleared. The code has been published on the Python Package Index
https://pypi.python.org/pypi/gutenberg2zim and a Docker image has been
created https://hub.docker.com/r/openzim/gutenberg/. New ZIM files are
currently being built and will be published soon.
*First release of Sotoki*: Sotoki is our Stack Exchange scraper, which
means that we can now offer ZIM files of all the SE websites
https://stackexchange.com/sites, the most famous of them being
Stack Overflow. The script has been published on the Python Package
Index https://pypi.python.org/pypi/sotoki and a Docker image created
https://hub.docker.com/r/openzim/sotoki/. New ZIM files are currently
being built and will be published soon.
*YouTube scraper improved*: our YouTube scraper has been improved a
bit and published on the Python Package Index, and a Docker image has
been created https://hub.docker.com/r/openzim/youtube/.
*openedx scraper stub*: the first steps have been taken; we are
currently looking for a sponsor to work on this next summer and release
a first version in the fall.
*PhET scraper UI improved*: the UI of the PhET ZIM file has been greatly
improved. You can check it with the latest versions of the files
http://download.kiwix.org/zim/phet/. A Docker image has been created
https://hub.docker.com/r/openzim/phet/. The PhET Android app will be
updated in the coming weeks
https://play.google.com/store/apps/details?id=org.kiwix.kiwixcustomphet
*Android apps*: many improvements have been made to the apps, the
most visible concerning continuous integration and testing. But
other major changes have been made to achieve a better MVC approach and
bookmarking system. Many of these improvements should be visible in
Kiwix for Android 2.3, to be released in June.
*First Kiwix Apache module* has been created as an alternative to running
kiwix-serve behind a reverse proxy. So far it does not provide as many
features as kiwix-serve, but it is definitely easier to configure and
really promising. Have a look at the git repository here
https://github.com/kiwix/kiwix-apache
*MWoffliner improved*: it now has an option to scrape remote
MediaWikis even if they do not have Parsoid installed. This is an
opportunity to create many new ZIM files, and we will. MWoffliner
(as a script and a library) is now also available as a Node.js package
https://www.npmjs.com/package/mwoffliner and we have created a Docker
image https://hub.docker.com/r/openzim/mwoffliner/. We have also created
a Docker image for zimwriterfs
https://hub.docker.com/r/openzim/zimwriterfs/.
*pibox-installer*: our intern, who is working on our future solution to
allow everyone (including non-technical people) to build their own
customized offline Wi-Fi hotspot, also took part in the hackathon. He
made some progress on his project during that week. Follow his daily
work at
https://framagit.org/ideascube/pibox-installer
*Khan Academy video scraper almost finished*: this effort was
started in January last year and was almost completed during the
hackathon. A few more commits and we will be able to generate ZIM files
for Khan Academy videos. Please be a bit patient if you need the ZIM
files; here is the git repo https://github.com/openzim/kalite
*libzim improvements*: we have moved the Xapian full-text search engine
from kiwix-lib to libzim (as an option). On this occasion many
improvements were made to search and to the overall integration between
libzim, libkiwix and the multiple ports.
*First release of zip2zim*, a small service able to convert a
custom ZIP (with HTML/JS/CSS/pictures inside) into the corresponding ZIM
file. The code is available here https://github.com/openzim/zip2zim and
we have already created a Docker image
https://hub.docker.com/r/openzim/zip2zim/. We plan to set up an online
service soon.
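For readers wondering what a "custom ZIP" for zip2zim might look like,
here is a minimal sketch that packs a small static site into a ZIP with
Python's standard library. It does not call zip2zim itself. The helper
name, the file names and the idea of an index.html entry page are
assumptions made for the example, not documented requirements of the
tool.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_site(site_dir, zip_path):
    """Pack every file under site_dir into zip_path, keeping relative paths."""
    site_dir = Path(site_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(site_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the site root, with "/" separators.
                name = path.relative_to(site_dir).as_posix()
                archive.write(path, name)
                names.append(name)
    return names


# Build a throwaway two-file site (hypothetical content) and pack it.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "site"
    (site / "css").mkdir(parents=True)
    (site / "index.html").write_text("<h1>Hello</h1>", encoding="utf-8")
    (site / "css" / "style.css").write_text("h1 { color: teal; }", encoding="utf-8")
    packed = zip_site(site, Path(tmp) / "site.zip")
    print(packed)
```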
Last but not least, the hackathon concluded a ~6-month effort
to reorganize all our git repositories. All the code we have ever
produced and still produce is now available on GitHub in two organisations:
* openZIM for the low-level ZIM-related readers and scrapers
https://github.com/openzim/
* Kiwix for all the "high-level" Kiwix-specific ports and solutions
https://github.com/kiwix
This was a long list, thank you for reading it to the end, but a dozen
talented and committed developers can really do a lot of work in
a whole week. If you want a few more details about the hackathon,
have a look at this page: wiki.kiwix.org/wiki/Hackathon_Spring_2017
The next one will be this summer at Wikimania in Canada/US; here is the
organisation page: http://wiki.kiwix.org/wiki/Hackathon_Wikimania_2017
Regards
Emmanuel