It is possible to use api.php on a wiki to which one does not have access (read-only or otherwise) to do some things to which access through the interface is denied.
For example, I can obtain a list of all pages on board.wikimedia.org or internal.wikimedia.org (neither of which I have read or write access to), while attempting to view Special:Allpages on one of these gives a "login required" error.
Attempting to retrieve revision information via the API correctly gives a "no read permission" error, so I can't actually see the content of any pages.
Is this a bug, or a feature?
-Gurch
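For reference, a hypothetical sketch (not from the original message) of the two kinds of request being compared here. The hostname is a placeholder; list=allpages and prop=revisions are standard MediaWiki query-API modules.

```python
# Hypothetical sketch of the two requests being compared: the page-listing
# query that was answered even on a private wiki, and the revision-content
# query that was correctly refused. The hostname is a placeholder.
import urllib.parse
import urllib.request

API = "https://private.example.org/w/api.php"  # placeholder private wiki

def api_get(**params):
    params.setdefault("format", "xml")
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# Listing pages: on the wikis mentioned above this returned results even
# without read access.
print(api_get(action="query", list="allpages", aplimit="10"))

# Fetching page content: this correctly failed with a "no read permission"
# style error.
print(api_get(action="query", prop="revisions", rvprop="content",
              titles="Main Page"))
```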
Gurch wrote:
It is possible to use api.php on a wiki to which one does not have access to do some things to which access through the interface is denied. [...] Is this a bug, or a feature?
Shouldn't happen; I'm disabling the read API for private wikis pending security review.
-- brion vibber (brion @ wikimedia.org)
Thanks for the heads-up; it might be related to the recent change I committed.
Can someone help with setting up automated unit testing for the API?
Thanks! --Yuri
On 5/16/07, Brion Vibber brion@wikimedia.org wrote:
Shouldn't happen; I'm disabling the read API for private wikis pending security review.
On 17/05/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Can someone help with setting up automated unit testing for the API?
I don't mean to be rude (for a change), Yuri, but the last time I looked at the API code, it seemed to be a little bit confusing; there is a *lot* going on with OO there, and some of it is a bit of a mess.
I know I'm not the only developer with this concern. My point is that if we expect the API to be maintainable, then it might be great to have some documentation on how it works - code documentation and class diagrams would be a start.
Rob Church
Can someone help with setting up automated unit testing for the API?
My point is that if we expect the API to be maintainable, then it might be great to have some documentation on how it works - code documentation and class diagrams would be a start.
Here's a rough API subsystem class diagram from late April: http://files.nickj.org/MediaWiki/API-subsystem-late-April-2007-class-map.png (creating a version with fewer crossing lines and getting it to Phil Boswell to work his magic on is still on my todo list).
If I could perhaps suggest the quickest / best bang-for-buck method towards documenting the API subsystem, my suggestion would be:
- 2 or 3 sentence top-level class descriptions for all API classes, in JavaDoc comments above the class declaration, to describe what the purpose or intent of the class is.
- A handful of extra method comments here and there for 4 classes - the ApiBase class (already commented), the ApiQueryGeneratorBase class (semi-commented currently), the ApiQueryBase class (semi-commented currently), and the ApiFormatBase class (already commented apart from getNeedsRawData) - since those 4 are the parent classes for the whole API subsystem. Understand those and, I'm presuming, everything else should fall into place.
-- All the best, Nick.
My next check-in will have a substantially improved doc. Thanks for the feedback. Most of the main structures have already been ironed out, so it should be fairly straightforward to describe how it works.
Unit testing is still a big problem, though. Any recommendations for automated API validation? Regression bugs in the API are the worst kind -- unlike a minor UI change, which might not even get noticed, a change in the API output will likely break client code. Is there anything like NUnit / JUnit here?
I guess I could use the Test wiki for all tests - as long as no one changes the designated test pages. Does the test wiki ever get erased?
--Yuri
On 5/17/07, Nick Jenkins nickpj@gmail.com wrote:
If I could perhaps suggest the quickest / best bang-for-buck method towards documenting the API subsystem [...]
On 17/05/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
I guess I could use the Test wiki for all tests - as long as no one changes the designated test pages. Does the test wiki ever get erased?
I would recommend not using a public test wiki for that; stewards give out permissions on it like there's no tomorrow (including permissions some users aren't entitled to have, I note) and stuff gets messed about with and deleted all the time.
An automated testing suite for it might be neat, though, but then, so would a suite for the entire product. :)
Rob Church
Would it be hard to set up a small, very restricted additional wiki specifically for unit testing?
On 5/17/07, Rob Church robchur@gmail.com wrote:
I would recommend not using a public test wiki for that; stewards give out permissions on it like there's no tomorrow [...]
On 17/05/07, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Would it be hard to set up a small, very restricted additional wiki specifically for unit testing?
I don't see the point. It would be better to have automated tests that can be run. You could operate something quite similar to the parser test suite we have now, where a bunch of clone tables are created temporarily, some dummy data is inserted into them, and then a bunch of requests are performed "or something" and you see what you get out.
"Small, very restricted" wikis make it harder for the development community to add, delete and otherwise alter tests, as is typically necessary.
Rob Church
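Not an existing harness, just a hypothetical Python sketch of the kind of automated API regression check being discussed: replay a fixed set of api.php queries against a test wiki and compare the responses against stored expected output, in the spirit of the parser tests. The endpoint, test cases and file layout below are placeholders.

```python
# Hypothetical sketch of an automated API regression check: replay a fixed
# set of api.php queries against a test wiki and compare the responses with
# stored "expected" files, much like the parser test suite compares parser
# output. The endpoint and file layout here are placeholders.
import pathlib
import urllib.parse
import urllib.request

API = "http://test.example.org/w/api.php"   # placeholder test wiki
EXPECTED_DIR = pathlib.Path("api-test-expected")

CASES = {
    "allpages": {"action": "query", "list": "allpages",
                 "aplimit": "5", "format": "xml"},
    "siteinfo": {"action": "query", "meta": "siteinfo", "format": "xml"},
}

def fetch(params):
    url = API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def run():
    EXPECTED_DIR.mkdir(exist_ok=True)
    failures = []
    for name, params in CASES.items():
        actual = fetch(params)
        expected_file = EXPECTED_DIR / f"{name}.xml"
        if not expected_file.exists():
            # First run records the current output as the baseline.
            expected_file.write_text(actual, encoding="utf-8")
            continue
        # In practice volatile fields (timestamps etc.) would need
        # normalising before comparison.
        if expected_file.read_text(encoding="utf-8") != actual:
            failures.append(name)
    print("FAILED:" if failures else "OK", ", ".join(failures))

if __name__ == "__main__":
    run()
```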
Okay friends,
In 48 hours we are going public on the 2007 Wikipedia Schools Selection. See http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection for details and http://schools-wikipedia.org to browse. Brad has kindly given consent for the use of the Wikipedia logo, as this non-commercial (free, no adverts) venture is basically WMF-aligned. It contains all Good & Featured content (except adult content) and about 1200 other articles tagged {{WPCD}} by en editors (except rubbish: thanks guys). We think the end point looks good, and given Mr Sanger's recent comments in the UK press about the unsuitability of Wikipedia for UK schools, the timing couldn't be better.
The full version is available as a 3.5 GB download, which we are getting set up with the Torrent people, and as a 3.5 GB DVD free from our offices. A thumbnails-only version of about 1 GB is available as a straight old-fashioned download.
A few tech comments:
1) This selection is generated by a Perl script diffing a list of historical versions of articles of the form Arthropod=131728434 etc., fetching any changed articles and associated images and image pages, and then running the HTML through a cleanup script (a rough sketch of this update-detection step appears after this list). The other route, working from the database, looked tougher. However, the hardest part of this route is the Perl clean-up: removing red links is trivial, but identifying sentences whose sole purpose is to link to unincluded content, and inline editorial comments, is not. We think we are at over 95%, near 98%, on this (small sampling only).
2) The manual check for graffiti (all 4625 articles were checked by hand) found that about 1% of articles at any given time had graffiti: about 5 times more than a year ago. We all know this is getting worse; and remember, these are pretty core WP articles. Redirect vandalism, image vandalism and template vandalism were also found. The various bad-word scripts didn't really help in finding vandalism.
3) We also took out other content unsuitable for children. For example we judged http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm as historically important but the section on "Inmates" too graphic for an eight year old girl to read. We also took out references/external links sections since it looks like the community doesn't want to vouch for these. Largely this is done using a simple "exclude section xyz" on the entry data.
4) Categories are on a neat Ruby database which is completely easy to use. We have reworked them around a schools curriculum and filled in gaps.
5) We now have an option to run this as a continuous project rather than a versioned one. In principle we could pick up updates (say, from a page on en which was protected) and new approved articles every few days, and regenerate the downloads and the browsable copy. Whether people want this is a good question: eventually it might collide with the Stable versions project, I guess.
6) There may be 1.7 million articles on Wikipedia, but the quality falls off quickly after the first 10-20,000. However, there are lots of signs of improvement at this level.
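This is not Andrew's actual Perl script, just a hypothetical Python sketch of the update-detection step described in point 1: read a selection file of "Title=oldid" lines and ask the wiki which articles have moved on from the recorded revision, so only those need re-fetching. The selection-file format comes from the post; the API endpoint and parameters are the standard MediaWiki query API.

```python
# Hypothetical sketch (not the actual Perl script from the post): given a
# selection file of "Title=oldid" lines, ask the wiki which articles have
# changed since the recorded revision, so only those need re-fetching.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint

def load_selection(path):
    """Parse lines like 'Arthropod=131728434' into {title: oldid}."""
    selection = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and "=" in line:
                title, oldid = line.rsplit("=", 1)
                selection[title] = int(oldid)
    return selection

def latest_revid(title):
    """Return the current revision id of a page via the query API."""
    params = urllib.parse.urlencode({
        "action": "query", "prop": "revisions", "rvprop": "ids",
        "titles": title, "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["revid"]

def changed_articles(selection):
    """Yield (title, recorded, current) for every article that has changed."""
    for title, recorded in selection.items():
        current = latest_revid(title)
        if current != recorded:
            yield title, recorded, current
```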
Andrew aka BozMo@en
It looks very professional, so great work. I don't much like the idea of blanket-filtering out "adult" subjects, especially if this is done under an official Wikipedia brand. What is or isn't "adult" or age-appropriate is very much dependent on cultural & subcultural preferences.
Would it be possible to tag the exclusions with reasons, so that the script can be run in different configurations (e.g. exclude violence, sexually explicit content, etc.)?
On 5/22/07, Andrew Cates andrew@catesfamily.org.uk wrote:
In 48 hours we are going public on the 2007 Wikipedia Schools Selection. [...]
Thanks.
It is possible to run the script without any exclusions, and of course it doesn't have to be used for children. The most explicit text we allowed was in http://schools-wikipedia.org/wp/b/Birth_control.htm, which is about where I would draw the line with an 8-year-old.
However, there aren't a vast number of content-related section excludes (it's mainly done by choice of article: e.g. an article about an 18+-only computer game didn't go in), perhaps 20. But there are about a thousand section excludes, 99% of which are stub or empty sections in otherwise good articles.
Tagging with an exclusion reason would mean going through them all by hand, so it would be time-consuming but possible. I could easily give you the list?
Andrew ============ Erik Moeller wrote:
It looks very professional, so great work. I don't much like the idea of blanket-filtering out "adult" subjects, especially if this is done under an official Wikipedia brand. What is or isn't "adult" or age-appropriate is very much dependent on cultural & subcultural preferences.
Would it be possible to tag the exclusions with reasons, so that the script can be run in different configurations (e.g. exclude violence, sexually explicit content, etc.)?
On 5/22/07, Andrew Cates andrew@catesfamily.org.uk wrote:
Tagging with an exclusion reason would mean going through them all by hand, so it would be time-consuming but possible. I could easily give you the list?
If you implement the tagging, I will help with the adding of the tags. Is this stuff version-controlled somewhere (wiki or a VCS)? That will make it easier to collaborate on it.
It used to be all at http://en.wikipedia.org/wiki/User:BozMo/wpcd2, then diffs at http://en.wikipedia.org/w/index.php?title=Wikipedia:Wikipedia_CD_Selection&a... but I got bored with the fact that everyone emailed me updates rather than using the wiki, so now it's in a Subversion (TortoiseSVN) repository on one of our servers.
Personally I am more than happy putting it back on the project pages; we can pick lists up from anywhere into the database. Implementing the tagging I will put on a todo list for now, as there are various other bits I am keen to fix first. In practice there is always the 0.5 version, which is uncensored, if anyone really wants something more liberal for their kids...
Andrew ================== Erik Moeller wrote:
If you implement the tagging, I will help with the adding of the tags. Is this stuff version-controlled somewhere (wiki or a VCS)? That will make it easier to collaborate on it.
On 5/22/07, Andrew Cates andrew@catesfamily.org.uk wrote:
Implementing the tagging I will put on a todo list for now, as there are various other bits I am keen to fix first. [...]
OK, as long as it's something that eventually gets done .. please let me know when & I'll try to help.
Erik/ Martin,
We are proposing the following Q&A to accompany the press release (which has been moved to Tues after the UK public holiday weekend):
Question: What exactly is the relationship between this and the Release Version 0.5, 0.7 etc.?
We work very closely with the Release Version team. We launched our first children's Wikipedia CD a year ahead of 0.5, but their process for identifying and reviewing articles was much more thorough. We think our clean-up script and database are better, but their word index is certainly better. We are quite keen to maintain a "children-orientated" flavour to these selections, and there is discussion going on about merging our project into Wikiproject Release 1.0 when it comes out in a year or two. Wikimedia board member *Erik Möller* has suggested that we tag child-related editorial decisions in our script so that we can produce Release 1.0 and a children's release at the same time. This makes a lot of sense and we are looking at it technically.
I have put an updates/additions page at http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection/additions_and_... which I will use as the master list for new articles and updates.
The current list of section excludes is at http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection/section_exclud... including, at the top, the excludes applied to all articles, if you are happy with the syntax.
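The exact exclude-list syntax lives on the wiki page above and isn't reproduced in this thread, so the following is only a hypothetical Python sketch of the general "exclude section xyz" idea: drop a named section (its heading plus everything up to the next heading of the same or higher level) from an article's rendered HTML. The heading markup assumed here is typical MediaWiki output.

```python
# Hypothetical sketch of the "exclude section xyz" idea: remove a named
# section (its heading plus everything up to the next heading of the same
# or higher level) from rendered article HTML. The heading markup assumed
# here (<h2>..<span class="mw-headline">Title</span>..</h2>) matches typical
# MediaWiki output; the real exclude-list syntax on the wiki page may differ.
import re

def exclude_section(html, section_title):
    heading_re = re.compile(
        r'<h(?P<level>[1-6])[^>]*>.*?class="mw-headline"[^>]*>'
        r'(?P<title>[^<]+)</span>.*?</h(?P=level)>',
        re.IGNORECASE | re.DOTALL)
    for match in heading_re.finditer(html):
        if match.group("title").strip() != section_title:
            continue
        level = int(match.group("level"))
        # Cut until the next heading at the same or a higher level,
        # or to the end of the document if none follows.
        tail_re = re.compile(r"<h([1-%d])[^>]*>" % level, re.IGNORECASE)
        tail = tail_re.search(html, match.end())
        end = tail.start() if tail else len(html)
        return html[:match.start()] + html[end:]
    return html  # section not found; leave the page unchanged
```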
Andrew
PS You are welcome to the Press Release text and full list of Q&As if it won't bore you... ================= Erik Moeller wrote:
OK, as long as it's something that eventually gets done .. please let me know when & I'll try to help.
On 5/22/07, Andrew Cates andrew@catesfamily.org.uk wrote:
Okay friends, ... 3) We also took out other content unsuitable for children. For example we judged http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm as historically important but the section on "Inmates" too graphic for an eight year old girl to read. We also took out references/external links sections since it looks like the community doesn't want to vouch for these.
I've had a look at the website, and it looks great ... but ...
The #CITEREF links in the texts do not work for the articles I have viewed, and you mention above that the citations have been stripped. I think removing the external links section is a fine idea, but removing the citations seems odd. I expect that adults would appreciate having the citations in the compilation in order to know where to turn to answer the unique questions that children ask. Displaying citations also reinforces the notion that the facts and opinions recorded in Wikipedia are all there because the research was done in other works. Is it possible to keep the citation data but strip out any URLs in it? Is there anything we can do to clean up the citation data to make it acceptable?
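A hypothetical sketch of the kind of post-processing being asked about (not part of the actual cleanup script): keep the reference text but drop the hyperlinks inside the references list. The <ol class="references"> markup assumed below is what MediaWiki typically emits for footnotes.

```python
# Hypothetical sketch (not the actual cleanup script): keep the citation
# text but replace every hyperlink inside the references list with its
# plain link text, so no clickable URLs remain. Assumes MediaWiki's usual
# <ol class="references"> ... </ol> markup around the footnotes.
# Note: if the visible link text is itself a URL, that text is kept as-is.
import re

ANCHOR_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def strip_reference_urls(html):
    def delink(block):
        # Replace each <a ...>text</a> in the references block with "text".
        return ANCHOR_RE.sub(r"\1", block.group(0))
    # Only touch the references block, leaving article body links alone.
    return re.sub(
        r'<ol class="references">.*?</ol>',
        delink,
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )
```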
The text in disclaimer.htm is in need of a layout tidy-up; specifically it is missing a carriage return between "WIKIPEDIA MAKES NO GUARANTEE OF VALIDITY" and "Wikipedia is an...".
-- John
First let me say Bravo. I can't even imagine the amount of work that this took.
I do wonder about point 3, though.
As a parent I understand the need for some exclusion of information, but I worry about what appears to be a blanket removal (hiding?) of information. I agree with a methodology that suppresses information based on age, but it appears from your note that the information is suppressed, period. Does this same "too graphic for an eight year old girl to read" information get suppressed for a 13-, 15- or 18-year-old?
Once again, bravo for this endeavor; I just hope that the methodology will improve as time goes on. As much as I want to protect my children, I also would hate to have valid information kept from them.
Thank you so much for your hard work.
You wrote .. 3) We also took out other content unsuitable for children. For example we judged http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm as historically important but the section on "Inmates" too graphic for an eight year old girl to read. We also took out references/external links sections since it looks like the community doesn't want to vouch for these. Largely this is done using a simple "exclude section xyz" on the entry data.
DSig David Tod Sigafoos | SANMAR Corporation PICK Guy 206-770-5585 davesigafoos@sanmar.com
Andrew Cates wrote:
In 48 hours we are going public on the 2007 Wikipedia Schools Selection. [...]
Great work guys!
Surely checking each article by hand wasn't easy ;)
But what about search, other than the alphabetical index?
Articles about movies will always have that white gap unless you modify the template, like in http://schools-wikipedia.org/wp/b/Blade_Runner.htm
How are the related subjects at the top placed?
Okay friends,
In 48 hours we are going public on the 2007 Wikipedia Schools Selection. See http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_CD_Selection for details and http://schools-wikipedia.org to browse.
Do you want to have an external link in the footer back to the current article in the Wikipedia? (example possible footer: "(see www.wikipedia.org for <a href="http://en.wikipedia.org/wiki/Mauthausen-Gusen_concentration_camp">the original</a> and <a href="http://en.wikipedia.org/w/index.php?title=Mauthausen-Gusen_concentration_camp&action=history">details of authors</a> and sources")
[ Note: It does have the original source URL in the HTML source, but wrapped in a <div class="printfooter"> tag, and in the CSS it has ".printfooter { display: none; }", so it's completely invisible in the output. ]
The downside is that maybe there is material that mightn't be appropriate for young kids to see on the Wikipedia; but the upside is that a DVD is a static thing, and the wiki is a living thing - so the later versions will be available to them, and if they find any mistakes or omissions in the DVD material they can add or fix these (let's not underestimate the kids). Also a small link in the page footer isn't very noticeable, which is probably ideal.
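Purely as an illustration of the suggestion above, here is a hypothetical Python sketch of adding such a footer during post-processing. The hidden <div class="printfooter"> and the CSS rule are as described in the previous paragraph; everything else (function name, styling, where the footer is inserted) is assumed.

```python
# Hypothetical sketch: rather than leaving the source URL hidden inside the
# <div class="printfooter"> (display: none in the CSS, as noted above), add
# a small visible footer link back to the live Wikipedia article. The
# article title is assumed to be known for each page at build time.
import urllib.parse

def add_live_link_footer(html, article_title):
    title = urllib.parse.quote(article_title.replace(" ", "_"))
    footer = (
        '<div class="wp-live-link" style="font-size: smaller;">'
        f'See <a href="http://en.wikipedia.org/wiki/{title}">the current '
        'Wikipedia article</a> and its '
        f'<a href="http://en.wikipedia.org/w/index.php?title={title}'
        '&amp;action=history">list of authors</a>.</div>'
    )
    # Insert just before the closing body tag; fall back to appending.
    if "</body>" in html:
        return html.replace("</body>", footer + "\n</body>", 1)
    return html + footer
```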
We also took out other content unsuitable for children. For example we judged http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm as historically important but the section on "Inmates" too graphic for an eight year old girl to read. We also took out references/external links sections since it looks like the community doesn't want to vouch for these. Largely this is done using a simple "exclude section xyz" on the entry data.
If you cull sections, the links to sub-sections that were part of the culled section should probably also be de-linked. For example, the http://schools-wikipedia.org/wp/m/Mauthausen-Gusen_concentration_camp.htm page has an anchor link in the 3rd paragraph on the words "death toll"; however, that link doesn't work. It doesn't work because the "Inmates" section ( http://en.wikipedia.org/wiki/Mauthausen-Gusen_concentration_camp#Inmates ), which the "death toll" subsection comes under, has been culled. Therefore "death toll" should probably be de-linked / unlinked.
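A hypothetical Python sketch of the de-linking just described (not part of the actual cleanup script): after sections have been culled, unwrap any in-page anchor link whose target id no longer exists in the document, leaving just the plain text.

```python
# Hypothetical sketch of the de-linking suggested above: after sections are
# culled, unwrap any in-page anchor link (href="#...") whose target id no
# longer exists in the document, leaving just the plain link text.
import re

ID_RE = re.compile(r'\bid="([^"]+)"')
FRAGMENT_LINK_RE = re.compile(r'<a\b[^>]*href="#([^"]+)"[^>]*>(.*?)</a>',
                              re.IGNORECASE | re.DOTALL)

def delink_dead_anchors(html):
    ids = set(ID_RE.findall(html))
    def unwrap(match):
        target, text = match.group(1), match.group(2)
        # Keep the link if its target still exists; otherwise drop the <a>.
        return match.group(0) if target in ids else text
    return FRAGMENT_LINK_RE.sub(unwrap, html)
```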
-- All the best, Nick.