[WikiEN-l] Fwd: [Wikitech-l] Wikipedia for Schools

David Gerard dgerard at gmail.com
Tue May 22 12:01:38 UTC 2007


FYI. Interesting that the quality seems to fall off around 10,000-20,000 articles.


- d.



---------- Forwarded message ----------
From: Andrew Cates <andrew at catesfamily.org.uk>
Date: 22-May-2007 11:01
Subject: [Wikitech-l] Wikipedia for Schools
To: Wikimedia developers <wikitech-l at lists.wikimedia.org>


Okay friends,

In 48 hours we are going public on the 2007 Wikipedia Schools Selection.
See http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection for
details and http://schools-wikipedia.org to browse. Brad has kindly
given consent for the use of the Wikipedia logo, as this non-commercial
(free, no adverts) venture is basically WMF-aligned. It contains all
Good & Featured content (except adult content) and about 1200 other
articles tagged {{WPCD}} by en editors (except rubbish: thanks guys). We
think the end result looks good and, given Mr Sanger's recent comments in
the UK press about the unsuitability of Wikipedia for UK schools, the
timing couldn't be better.

The full version is available as a 3.5 Gig download, which we are getting
set up with the Torrent people, and as a 3.5 Gig DVD free from our offices.
A thumbnails-only version of about 1 Gig is available as a straight
old-fashioned download.

A few tech comments:

1) This selection is generated by a Perl script diffing a list of
historical versions of articles (entries of the form
Arthropod=131728434 etc.), fetching any changed articles and associated
images and image pages, and then running the HTML through a cleanup
script. The other route, from the database, looked tougher. However, the
hardest part of this route is the Perl cleanup: removing red links is
trivial, but identifying inline editorial comments, and sentences whose
sole purpose is to link to content that is not included, is not. We think
we are at over 95%, near 98%, on this (small sampling only).

2) The manual check for graffiti (all 4625 articles were checked by
hand) found that about 1% of articles had graffiti at any given moment:
about 5 times more than a year ago. We all know this is getting worse; and
remember these are pretty core WP articles. Redirect vandalism, image
vandalism and template vandalism were also found. The various bad-word
scripts didn't really help in finding vandalism.

3) We also took out other content unsuitable for children. For example
we judged
http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm
as historically important but the section on "Inmates" too graphic for
an eight-year-old girl to read. We also took out references/external
links sections, since it looks like the community doesn't want to vouch
for these. Largely this is done using a simple "exclude section xyz" entry
in the entry data.

4) Categories are held in a neat Ruby database which is very easy to
use. We have reworked them around a schools curriculum and filled in gaps.

5) We now have an option to run this as a continuous project rather than
a versioned one. In principle we could pick up updates (say from a page on
en which was protected) and new approved articles every few days and
regenerate the downloads and also regenerate the browsable copy. Whether
people want this is a good question: eventually it might collide with
the Stable versions project I guess.

6) There may be 1.7 million articles on Wikipedia but the quality falls
off quickly at around 10,000-20,000. However, there are lots of signs of
improvement at this level.
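
For anyone wanting to picture the version-list diff in point 1, here is a
minimal sketch. It is in Python rather than Perl and is not the actual
script. It assumes two lists of Title=revision lines (a previous one and a
current one) are compared. The function names, the splitting rule and every
revision ID other than the Arthropod one are made up for illustration.

```python
def parse_versions(text):
    """Parse lines like 'Arthropod=131728434' into {title: revision_id}."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Assumption: split on the LAST '=' in case a title contains '='.
        title, _, rev = line.rpartition("=")
        versions[title] = int(rev)
    return versions


def changed_articles(old, new):
    """Titles that are new, or whose revision ID differs from the old list."""
    return sorted(t for t, rev in new.items() if old.get(t) != rev)


def removed_articles(old, new):
    """Titles present in the old list but dropped from the new one."""
    return sorted(t for t in old if t not in new)


# Illustrative data only: the Moon and Sun revision IDs are invented.
previous = parse_versions("Arthropod=131728434\nMoon=100\n")
current = parse_versions("Arthropod=131728434\nMoon=101\nSun=5\n")

to_fetch = changed_articles(previous, current)
print(to_fetch)
```

Only the titles returned by changed_articles would need to be re-fetched
(with their images and image pages) and run through the cleanup again.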
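
The "exclude section xyz" step in point 3 could look roughly like the
sketch below. Again this is Python, not the real Perl cleanup. It assumes
that each top-level section in the fetched HTML starts with an <h2>
heading and runs until the next <h2>. The function name and the sample
markup are hypothetical.

```python
import re

# Assumption: top-level sections are introduced by <h2> headings.
HEADING = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def exclude_sections(html, excluded):
    """Drop every <h2> section whose heading text is listed in `excluded`.

    A section runs from its <h2> to the next <h2>, or to the end of the
    input, so a real page footer would need handling separately.
    """
    wanted_out = {name.strip().lower() for name in excluded}
    headings = list(HEADING.finditer(html))
    kept = []
    pos = 0
    for i, match in enumerate(headings):
        # Strip any inner tags (spans, anchors) to get the plain heading text.
        title = TAG.sub("", match.group(1)).strip().lower()
        if title in wanted_out:
            if i + 1 < len(headings):
                end = headings[i + 1].start()
            else:
                end = len(html)
            kept.append(html[pos:match.start()])  # keep text before the section
            pos = end                             # skip the section itself
    kept.append(html[pos:])
    return "".join(kept)


sample = (
    "<p>Intro</p><h2>History</h2><p>a</p>"
    "<h2><span>Inmates</span></h2><p>b</p>"
    "<h2>References</h2><p>c</p>"
)
print(exclude_sections(sample, ["Inmates", "References"]))
```

A regex pass like this is fragile on messy markup; a proper HTML parser
would be the safer choice for anything beyond a quick cleanup.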

Andrew aka BozMo at en


_______________________________________________
Wikitech-l mailing list
Wikitech-l at lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l


