Re: [Wikitech-l] GSoC Project

26 Apr 2013


      On 04/26/2013 04:27 PM, Kiran Mathew Koshy wrote:
...
Hi guys,
I have an idea of my own for my GSoC project that I'd like to share with you.
It's not a perfect one, so please forgive any mistakes.
The project is related to the existing GSoC project "*Incremental Data dumps*",
but is in no way a replacement for it.
*Offline Wikipedia*
Over the years, many offline solutions for Wikipedia have sprung up on
the internet. All of these have been unofficial solutions, and they have
limitations. The major problems are the *increasing size of the data dumps*
and the difficulty of *updating the local content*.
Consider the situation in a place where internet access is costly or
unavailable. (For the purpose of discussion, let's consider a school in a
developing country.) Internet speeds are extremely slow, and accessing
Wikipedia directly from the web is out of the question.
Such a school would greatly benefit from an instance of Wikipedia on a
local server. Up to this point, the school can use any of the freely
available offline Wikipedia solutions to make a local instance. The problem
arises when the database in the local instance becomes obsolete. The client
is then required to download an entire new dump (approx. 10 GB in size) and
load it into the database.
Another problem is that most third-party programs *do not allow
network access*, so a new instance of the database (approx. 40 GB) is
required on each installation. For instance, in a school with around 50
desktops, each desktop would require a 40 GB database. Plus, *updating* them
becomes even more difficult.
So here's my *idea*:
Modify the existing MediaWiki software and add a few PHP/Python scripts
which will run in the background and automatically update the database.
(Details on how the update is done are described later.)
Initially, the modified MediaWiki will take an XML dump or SQL dump (SQL
dump preferred) as input and will create the local instance of Wikipedia.
Later on, the updates will be added to the database automatically by the
script.
The installation process is extremely easy: it just requires a server
package like XAMPP and the MediaWiki bundle.
Process of updating:
There will be two methods of updating the server. Both will be implemented
into the MediaWiki bundle. Method 2 requires the functionality of
incremental data dumps, so it can be completed only after the functionality
is available. Perhaps I can collaborate with the student selected for
incremental data dumps.
Method 1 (online update): A list of all pages is made and published by
Wikipedia. This can be in an XML format. The only information in the XML
file will be the page IDs and the last-touched dates. This file will be
downloaded by the MediaWiki bundle, and the page IDs will be compared with
the pages of the existing local database.
Case 1: A new page ID in the XML file denotes a newly added page.
Case 2: A page which is present in the local database but is not among the
page IDs in the XML file denotes a deleted page.
Case 3: A page in the local database with a different 'last touched' date
compared to the one in the XML file denotes an edited page.
In each case, the change is made in the local database, and if new page
data is required, it is obtained using the MediaWiki API.
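
The three cases above can be sketched in Python as follows. This is only an
illustration under stated assumptions: the XML layout (a <pages> root with
<page id="..." touched="..."/> elements) and the function names
parse_page_list and diff_pages are hypothetical, since no such list format
has been defined yet.

```python
import xml.etree.ElementTree as ET


def parse_page_list(xml_text):
    """Return {page_id: last_touched} from the page-list XML.

    The element and attribute names are assumptions for this sketch.
    """
    root = ET.fromstring(xml_text)
    return {int(p.get("id")): p.get("touched") for p in root.iter("page")}


def diff_pages(local, remote):
    """Classify page IDs into (new, deleted, edited) lists.

    local and remote both map page ID -> last-touched timestamp.
    """
    # Case 1: ID only in the published list -> new page.
    new = sorted(remote.keys() - local.keys())
    # Case 2: ID only in the local database -> deleted page.
    deleted = sorted(local.keys() - remote.keys())
    # Case 3: ID in both, but timestamps differ -> edited page.
    edited = sorted(
        pid for pid in local.keys() & remote.keys() if local[pid] != remote[pid]
    )
    return new, deleted, edited


# Hypothetical sample of the published list.
SAMPLE = """<pages>
  <page id="1" touched="2013-04-20T10:00:00Z"/>
  <page id="2" touched="2013-04-25T08:30:00Z"/>
  <page id="4" touched="2013-04-26T12:00:00Z"/>
</pages>"""

# Hypothetical state of the local database.
local = {
    1: "2013-04-20T10:00:00Z",
    2: "2013-04-01T00:00:00Z",
    3: "2013-03-15T09:00:00Z",
}

new, deleted, edited = diff_pages(local, parse_page_list(SAMPLE))
print(new, deleted, edited)  # [4] [3] [2]
```

Only the pages in the "new" and "edited" lists would need to be fetched, so
the amount of data transferred depends on how much has changed rather than on
the size of the whole dump.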
These offline instances of Wikipedia will only be used in cases where the
internet speeds are very low, so they *won't cause much load on the servers*.
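
As a rough sketch of how Method 1 might fetch changed pages through the
MediaWiki API while keeping the number of requests small, page IDs could be
grouped into batches. The helper name build_fetch_urls and the batch size of
50 are assumptions for illustration; the query parameters are the standard
action=query / prop=revisions ones, and rvslots applies to recent MediaWiki
versions.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Assumed batch size: the API caps the number of pageids per request,
# and 50 is the usual cap for ordinary (non-bot) clients.
BATCH_SIZE = 50


def build_fetch_urls(page_ids, batch_size=BATCH_SIZE):
    """Return one API URL per batch of page IDs (no network access here)."""
    urls = []
    for start in range(0, len(page_ids), batch_size):
        batch = page_ids[start:start + batch_size]
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content|timestamp",
            "rvslots": "main",
            # Multiple IDs are separated by "|" in a single parameter.
            "pageids": "|".join(str(pid) for pid in batch),
            "format": "json",
        }
        urls.append(API_ENDPOINT + "?" + urlencode(params))
    return urls


# 120 changed pages need 3 requests instead of 120.
print(len(build_fetch_urls(list(range(1, 121)))))  # 3
```

The sketch only builds the request URLs; actually downloading and storing the
page content is left out.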
Method 2 (offline update; requires the functionality of the existing
project "Incremental data dumps"):
In this case, the incremental data dumps are downloaded by the
user (admin) and fed to the MediaWiki installation the same way the original
dump is fed (as a normal file), and the corresponding changes are made by
the bundle. Since I'm not aware of the XML format used in incremental
updates, I cannot describe it now.
Advantages: An offline solution can be provided for regions where internet
access is a scarce resource. This would greatly benefit developing nations,
and would help in making the world's information more free and openly
available to everyone.
All comments are welcome!
PS: About me: I'm a 2nd-year undergraduate student at the Indian Institute
of Technology, Patna. I code for fun.
Languages: C/C++, Python, PHP, etc.
Hobbies: CUDA programming, robotics, etc.
Thanks for your ideas, Kiran!  So, a few comments:
* In the future, please use a more descriptive email subject line.  As
you can see in
http://lists.wikimedia.org/pipermail/wikitech-l/2013-April/ there's a
lot of mail on this list, especially mail about Google Summer of Code
proposals.  A subject line like "GSoC proposal: supplementing
incremental data dumps with indexes" or something like that helps people
decide to read it.
* You probably want to try to look at some statistics and research to
make sure you are solving the right problem, and see how educators and
school pupils are actually interacting with Wikipedia in
low-connectivity environments.  Check out the presentation "A Snapshot
of Open Source in West Africa" http://opensourcebridge.org/sessions/884
-- some people burn DVDs regularly and drive them around to local
schools to be copied onto lab computers, for instance.
https://meta.wikimedia.org/wiki/Research:Data and the archives of the
offline and data dumps discussion lists will be useful:
https://lists.wikimedia.org/mailman/listinfo/offline-l
and https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l .
* Thank you for aiming to work on improving Wikimedia access in places
with bad net access.  We care about this, a lot.  I hope you're able to
help out with this issue, whether it's under GSoC or not!
Thanks,
Sumana
-- 
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
