[Wikitech-l] Wikipedia for Schools

22 May 2007


Okay friends,
In 48 hours we are going public on the 2007 Wikipedia Schools Selection. 
See http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection for 
details and http://schools-wikipedia.org to browse. Brad has kindly 
given consent for the use of the Wikipedia logo, as this non-commercial 
(free, no adverts) venture is basically WMF-aligned. It contains all 
Good & Featured content (except adult content) and about 1200 other 
articles tagged {{WPCD}} by en editors (except rubbish: thanks guys). We 
think the end point looks good, and given Mr Sanger's recent comments in 
the UK press about the unsuitability of Wikipedia for UK schools, the 
timing couldn't be better.
The full version is available as a 3.5 GB download, which we are getting 
set up with the Torrent people, and as a 3.5 GB DVD, free from our 
offices. A thumbnails-only version of about 1 GB is available as a 
straight old-fashioned download.
A few tech comments:
1) This selection is generated by a Perl script that diffs a list of 
historical versions of articles, of the form Arthropod=131728434 etc., 
fetches any changed articles with their associated images and image 
pages, and then runs the HTML through a cleanup script. The other 
route, from the database, looked tougher. However, the hardest part of 
this route is the Perl cleanup: removing red links is trivial, but 
identifying sentences whose sole purpose is to link to unincluded 
content, and inline editorial comments, is not. We think we are at over 
95%, near 98%, on this (small sampling only).
2) The manual check for graffiti (all 4625 articles were checked by 
hand) found that about 1% of articles had graffiti at any given moment: 
about 5 times more than a year ago. We all know this is getting worse, 
and remember these are pretty core WP articles. Redirect vandalism, 
image vandalism and template vandalism were also found. The various bad 
word scripts didn't really help in finding vandalism.
3) We also took out other content unsuitable for children. For example 
we judged 
http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm 
as historically important but the section on "Inmates" as too graphic 
for an eight-year-old girl to read. We also took out references/external 
links sections, since it looks like the community doesn't want to vouch 
for these. Largely this is done using a simple "exclude section xyz" on 
the entry data.
4) Categories are in a neat Ruby database which is very easy to use. We 
have reworked them around a schools curriculum and filled in gaps.
5) We now have an option to run this as a continuous project rather than 
a versioned one. In principle we could pick up updates (say from a page 
on en which was protected) and new approved articles every few days, and 
regenerate both the downloads and the browsable copy. Whether people 
want this is a good question: eventually it might collide with the 
Stable versions project, I guess.
6) There may be 1.7 million articles on Wikipedia, but the quality falls 
off quickly after the first 10,000-20,000. However, there are lots of 
signs of improvement at this level.
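
For anyone curious about the diffing step in point 1, here is a rough 
sketch of the idea in Python. It is not the actual Perl script: the 
function names are made up for illustration, and every title and 
revision number other than Arthropod=131728434 is invented. It only 
shows how two lists of Title=revision lines can be compared to decide 
which articles need fetching again.

```python
def parse_pins(text):
    """Parse lines like 'Arthropod=131728434' into {title: revision_id}."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the last '=' so a title containing '=' still parses.
        title, _, rev = line.rpartition("=")
        pins[title] = int(rev)
    return pins


def changed_titles(old, new):
    """Titles that are new, or whose pinned revision differs from the old list."""
    return sorted(t for t, rev in new.items() if old.get(t) != rev)


def removed_titles(old, new):
    """Titles present in the old list but dropped from the new one."""
    return sorted(t for t in old if t not in new)


# Hypothetical example lists (only the Arthropod line comes from the post).
old = parse_pins("Arthropod=131728434\nBeetle=100\nDodo=3\n")
new = parse_pins("Arthropod=131728434\nBeetle=105\nCrab=7\n")

to_fetch = changed_titles(old, new)   # articles to download and clean again
to_drop = removed_titles(old, new)    # articles to delete from the selection
print(to_fetch, to_drop)
```

Only the titles in to_fetch would then need to go through the fetch and 
cleanup stages, which is presumably what keeps a regeneration run cheap.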
Andrew aka BozMo@en
