It looks very professional, so great work. I don't much like the idea of blanket-filtering out "adult" subjects, especially if this is done under an official Wikipedia brand. What is or isn't "adult" or age-appropriate is very much dependent on cultural & subcultural preferences.
Would it be possible to tag the exclusions with reasons, so that the script can be run in different configurations (e.g. exclude violence, sexually explicit content, etc.)?
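A rough sketch of what I mean, in Python rather than the Perl the real script uses. The entry format, the reason tags, the placeholder article and the function name are all made up for illustration, and the "graphic-violence" tag on the Mauthausen section is my guess at how it might be labelled:

```python
# Hypothetical exclusion list: each entry names an article, a section,
# and the reason it was excluded. This does not mirror the real entry data.
EXCLUSIONS = [
    {"article": "Mauthausen-Gusen concentration camp",
     "section": "Inmates", "reason": "graphic-violence"},
    {"article": "Example article",
     "section": "Example section", "reason": "sexually-explicit"},
    {"article": "Example article",
     "section": "External links", "reason": "unvetted-links"},
]


def sections_to_exclude(article, enabled_reasons, exclusions=EXCLUSIONS):
    """Return the sections of `article` to drop for a given configuration.

    `enabled_reasons` is the set of reason tags this run should filter on.
    """
    return [
        entry["section"]
        for entry in exclusions
        if entry["article"] == article and entry["reason"] in enabled_reasons
    ]
```

A schools build might pass every reason, while a general-audience build could pass only "unvetted-links" and keep the rest of the content.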
On 5/22/07, Andrew Cates <andrew@catesfamily.org.uk> wrote:
Okay friends,
In 48 hours we are going public on the 2007 Wikipedia Schools Selection. See http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection for details and http://schools-wikipedia.org to browse. Brad has kindly given consent for the use of the Wikipedia logo, as this non-commercial (free, no adverts) venture is basically WMF-aligned. It contains all Good & Featured content (except adult content) and about 1200 other articles tagged {{WPCD}} by en editors (except rubbish: thanks guys). We think the end point looks good and, given Mr Sanger's recent comments in the UK press about the unsuitability of Wikipedia for UK schools, the timing couldn't be better.
The full version is available as a 3.5 Gig download, which we are getting set up with the Torrent people, and as a 3.5 Gig DVD free from our offices. A thumbnails-only version of about 1 Gig is available as a straight old-fashioned download.
A few tech comments:
- This selection is generated by a Perl script diffing a list of historical versions of articles of the form Arthropod=131728434 etc., fetching any changed articles and associated images and image pages, and then running the HTML through a cleanup script. The other route, from the database, looked tougher. However, the hardest part of this route is the Perl cleanup: removing red links is trivial, but identifying sentences whose sole purpose is to link to unincluded content, and inline editorial comments, is not. We think we are at over 95%, near 98%, on this (small sampling only).
- The manual check for graffiti (all 4625 articles were checked by hand) found that about 1% of articles had graffiti at any given instant: about 5 times more than a year ago. We all know this is getting worse, and remember these are pretty core WP articles. Redirect vandalism, image vandalism and template vandalism were also found. The various bad-word scripts didn't really help in finding vandalism.
- We also took out other content unsuitable for children. For example, we judged http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm as historically important but the section on "Inmates" too graphic for an eight-year-old girl to read. We also took out references/external links sections, since it looks like the community doesn't want to vouch for these. Largely this is done using a simple "exclude section xyz" on the entry data.
- Categories are on a neat Ruby database which is very easy to use. We have reworked them around a schools curriculum and filled in gaps.
- We now have an option to run this as a continuous project rather than a versioned one. In principle we could pick up updates (say, from a page on en which was protected) and new approved articles every few days, and regenerate the downloads and the browsable copy. Whether people want this is a good question: eventually it might collide with the Stable versions project, I guess.
- There may be 1.7 million articles on Wikipedia, but the quality falls off quickly at around 10,000-20,000. However, there are lots of signs of improvement at this level.
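A minimal sketch of the diffing step from the first comment above, written in Python rather than Perl and not taken from the actual script. Only the Title=revision line format comes from the description; the function names and the decision to treat newly listed titles as changed are assumptions for illustration:

```python
def parse_revision_list(text):
    """Parse lines of the form 'Title=revision_id' into a dict."""
    revisions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the last '=' so titles containing '=' still parse.
        title, _, rev_id = line.rpartition("=")
        revisions[title] = int(rev_id)
    return revisions


def changed_articles(old, new):
    """Return titles that are new, or whose revision ID differs, sorted."""
    return sorted(title for title, rev in new.items() if old.get(title) != rev)
```

Only the titles returned by `changed_articles` would then need to be fetched again and passed through the cleanup step.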
Andrew aka BozMo@en
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l