[Foundation-l] Wikisource and reCAPTCHA

Andre Engels andreengels at gmail.com
Fri Jun 25 03:13:57 UTC 2010

On Thu, Jun 24, 2010 at 4:37 PM, Samuel Klein <meta.sj at gmail.com> wrote:
> I love those proofreading features, and the new default layout for a
> book's pages and TOC.  Wikisource is becoming AWESOME.
> Do we have PGDP contributors who can weigh in on how similar the
> processes are?  Is there a way for us to actually merge workflows with
> them?

I am quite active on PGDP, but not on Wikisource, so I can describe
how things work there, but not how similar it is to Wikisource.

Typical of the PGDP workflow are an emphasis on quality over quantity
(exemplified by running not 1 or 2 but 3 rounds of human checking of
the OCR result; correctness in copying is well above 99.99% for most
books) and work being done in page-size chunks rather than whole
books, chapters, paragraphs, sentences, words or whatever else one
could think of.

There are a number of people involved, although one person can and
often does fill several roles for one book.

First, there is the Content Provider (CP).

He or she first contacts Project Gutenberg to get a clearance. This is
basically a statement from PG that they believe the work is out of
copyright. In general, US copyright is what is taken into account for
this, although there are also servers in other countries (Canada and
Australia as far as I know), which publish some material that is out
of copyright in those countries even if it is not in the US. Such
works do not go through PGDP, but may go through its sister projects
DPCanada or DPEurope.

Next, the CP will scan the book, or harvest the scans from the web,
and run OCR on them. They will usually also write a description of the
book for the proofreaders, so those can see whether they are
interested. The scans and the OCR are uploaded to the PGDP servers,
and the project is handed over to the Project Manager (PM) (although
in most cases CP and PM are the same person).

The Project Manager is responsible for the project in the next stages.
This means:
* specifying the rules and guidelines to be followed when
proofreading the book, at least where those differ from the
standard guidelines
* answering questions from proofreaders
* keeping the good and bad words lists up to date. These are used in
wordcheck (a kind of spellchecker) to decide which words it treats as
correct or incorrect
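
To make the role of the two word lists concrete, here is a rough
Python sketch. It is not PGDP's actual wordcheck code: the function
name, the flagging rule and the sample words are all assumptions made
for illustration. The idea is that a good word is accepted even if the
dictionary does not know it, and a bad word is flagged even if the
dictionary does.

```python
import re

def wordcheck(text, good_words, bad_words, dictionary):
    """Return the words on a page that a proofreader should look at.

    Assumed rule (not taken from PGDP's code): a word is flagged if it
    is on the project's bad words list, or if it is in neither the
    dictionary nor the project's good words list.
    """
    flagged = []
    for word in re.findall(r"[A-Za-z']+", text):
        key = word.lower()
        if key in bad_words:
            flagged.append(word)
        elif key not in dictionary and key not in good_words:
            flagged.append(word)
    return flagged

# Made-up sample data: "arid" is a real word but a likely OCR misread
# of "and", so this hypothetical PM has put it on the bad words list.
page = "The squire rode to Barchester arid saw tbe dog."
dictionary = {"the", "squire", "rode", "to", "and", "arid", "saw", "dog"}
good_words = {"barchester"}
bad_words = {"arid"}

flagged = wordcheck(page, good_words, bad_words, dictionary)
```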

The project then goes through a number of rounds. The standard number
is 5 rounds, of which 3 are proofreading and 2 are formatting, but it
is possible for the PM to make a request to skip one or more rounds or
go through a round twice.
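
As a small illustration of how skipping or repeating rounds changes a
project's path, here is a Python sketch. The round names P1 to P3 and
F1 to F2 are the ones used on the site; the function and its
parameters are hypothetical and only model the idea.

```python
# Standard path: three proofreading rounds, then two formatting rounds.
STANDARD_ROUNDS = ["P1", "P2", "P3", "F1", "F2"]

def plan_rounds(skip=(), repeat=()):
    """Build the list of rounds for one project.

    `skip` and `repeat` are hypothetical parameters standing in for
    the PM's request to skip a round or go through it twice.
    """
    plan = []
    for rnd in STANDARD_ROUNDS:
        if rnd in skip:
            continue
        plan.append(rnd)
        if rnd in repeat:
            plan.append(rnd)  # the same round is run a second time
    return plan

standard = plan_rounds()
without_p3 = plan_rounds(skip=["P3"])
double_p1 = plan_rounds(repeat=["P1"])
```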

In the first three rounds, the proofreading rounds, a proofreader
requests one page at a time, compares the OCR output (or the previous
proofreader's output) with the scan, and changes the text to
correspond to the scan. In the first round (P1) everyone can do this.
The second round (P2) is only accessible to those who have been at the
site for some time and done a certain number of pages (21 days and 300
pages, if I recall correctly). For the third round (P3) one has to
qualify: one's P2 pages are checked (using the subsequent edits of
P3). The norm is that one should not leave more than one error per
five pages.
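
The access rules can be written down as two small checks. This is only
a sketch in Python: the thresholds are the ones I quoted from memory
above, and the class, field and function names are made up for the
example rather than taken from the PGDP site.

```python
from dataclasses import dataclass

@dataclass
class Proofreader:
    days_on_site: int
    pages_done: int
    p2_pages_checked: int = 0   # P2 pages later reviewed in P3
    p2_errors_found: int = 0    # errors P3 still found on those pages

def may_enter_p2(p, min_days=21, min_pages=300):
    """P2 access: some time on the site and enough pages done."""
    return p.days_on_site >= min_days and p.pages_done >= min_pages

def meets_p3_norm(p, pages_per_error=5):
    """P3 norm: at most one error left per five checked P2 pages."""
    if p.p2_pages_checked == 0:
        return False  # nothing to judge yet
    return p.p2_errors_found * pages_per_error <= p.p2_pages_checked

newcomer = Proofreader(days_on_site=10, pages_done=500)
regular = Proofreader(days_on_site=40, pages_done=350,
                      p2_pages_checked=50, p2_errors_found=10)
sloppy = Proofreader(days_on_site=40, pages_done=350,
                     p2_pages_checked=50, p2_errors_found=11)
```

The error check multiplies instead of dividing so that the comparison
stays in whole numbers.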

After the three (or two or four) rounds of proofing come the foofing
(formatting) rounds. In these, again a proofreader (now called a
formatter) requests and edits one page at a time, but where the
proofreaders dealt with copying the text as precisely as possible, the
formatter deals with all other aspects of the work. Formatters mark
when text is italic, bold or otherwise in a special format, which
texts are chapter headers, how tables are laid out, et cetera. Here
there are two rounds, although the second one can be skipped or a
round duplicated, like before. The first formatting round (F1) has the
same entrance restrictions as P2; F2 has a qualification system
comparable to P3.

After this, the PM passes the book on to the Post-Processor (PP).
Again, this is often the same person, but not always. In some cases
the PP has already been appointed; in others the book will sit in a
pool until picked up by a willing PP. The PP does all that is needed
to get from the F2 output to something that can be put on Project
Gutenberg: they recombine the pages into one work, move stuff around
where needed, change the formatters' mark-up into something that's
more appropriate for reading, and in most cases generate an HTML
version.
A PP who has already post-processed several books well can then send
the book straight to PG. In other cases, the book will first go to the
PPV (Post-Processing Verifier), an experienced PP, who checks the PP's
work and either gives them hints on what should be improved or makes
those improvements themselves.

Finally, when the PP or PPV sends the book to PG, there is a whitewasher
who checks the book once again; however, that is outside the scope of
this (already too long) description, because it belongs to PG's
process rather than PGDP's.

To stop the rounds from becoming overcrowded with books, there are
queues for each round, containing books that are ready to enter the
round but have not yet done so. To keep some variety, there are
different queues by language and/or subject type. A problem with this
has been that the later rounds, having fewer people because of the
higher standards required, could not keep up with P1 and F1. Work has
been done to address this, and the P2 queues have been brought down to
a decent size, but in P3 and F2 books can literally sit in the queues
for years, and PP is still a bottleneck as well.
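
For those who think in code, the queueing idea can be sketched roughly
as follows in Python. This is a toy model under my own assumptions,
not how the PGDP site is implemented: the fixed capacity per round,
the class name and the take-one-per-queue-in-turn release rule are all
invented for the example.

```python
from collections import deque

class RoundQueues:
    """Waiting lists for one round, one queue per language/subject."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed cap on books active in the round
        self.active = []          # books currently in the round
        self.queues = {}          # category -> deque of waiting books

    def enqueue(self, category, book):
        self.queues.setdefault(category, deque()).append(book)

    def release(self):
        """Fill free slots, taking one book per queue in turn for variety."""
        released = []
        while len(self.active) < self.capacity:
            progressed = False
            for queue in self.queues.values():
                if queue and len(self.active) < self.capacity:
                    book = queue.popleft()
                    self.active.append(book)
                    released.append(book)
                    progressed = True
            if not progressed:
                break  # every queue is empty
        return released

p3 = RoundQueues(capacity=3)
p3.enqueue("English fiction", "Book A")
p3.enqueue("English fiction", "Book B")
p3.enqueue("French", "Book C")
p3.enqueue("English fiction", "Book D")
released = p3.release()
still_waiting = list(p3.queues["English fiction"])
```

With a small capacity and a steady inflow from the earlier rounds, the
waiting lists in such a model only grow, which is the bottleneck
described above.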

André Engels, andreengels at gmail.com
