Andre, this is a great summary -- I've linked to it from the English Wikisource Scriptorium.
Do you see opportunities for the two projects to coordinate their workflows better?
SJ
On Thu, Jun 24, 2010 at 11:13 PM, Andre Engels andreengels@gmail.com wrote:
On Thu, Jun 24, 2010 at 4:37 PM, Samuel Klein meta.sj@gmail.com wrote:
I love those proofreading features, and the new default layout for a book's pages and TOC. Wikisource is becoming AWESOME.
Do we have PGDP contributors who can weigh in on how similar the processes are? Is there a way for us to actually merge workflows with them?
I am quite active on PGDP, but not on Wikisource, so I can tell about how things work there, but not on how similar it is to Wikisource.
Two things are typical of the PGDP workflow. The first is an emphasis on quality above quantity, exemplified by running not 1 or 2 but 3 rounds of human checking of the OCR result (correctness in copying is well above 99.99% for most books). The second is that work is done in page-size chunks rather than whole books, chapters, paragraphs, sentences, words or whatever else one could think of.
A number of people are involved, although one person can and often does fill several roles for one book.
First, there is the Content Provider (CP).
He or she first contacts Project Gutenberg to get a clearance. This is basically a statement from PG that they believe the work is out of copyright. In general, US copyright is what is taken into account for this, although there are also servers in other countries (Canada and Australia as far as I know), which publish some material that is out of copyright in those countries even if it is not in the US. Such works do not go through PGDP, but may go through its sister projects DPCanada or DPEurope.
Next, the CP will scan the book, or harvest the scans from the web, and run OCR on them. They will usually also write a description of the book for the proofreaders, so those can see whether they are interested. The scans and the OCR are uploaded to the PGDP servers, and the project is handed over to the Project Manager (PM) (although in most cases CP and PM are the same person).
The Project Manager is responsible for the project in the next stages. This means:
- specifying the rules and guidelines that are to be followed when proofreading the book, at least where those differ from the standard guidelines
- answering questions from proofreaders
- keeping the good and bad words lists up to date. These are used in wordcheck (a kind of spellchecker) to decide which words it considers correct or incorrect
The project then goes through a number of rounds. The standard number is 5 rounds, of which 3 are proofreading and 2 are formatting, but it is possible for the PM to make a request to skip one or more rounds or go through a round twice.
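As a rough illustration of the round sequence described above, here is a minimal Python sketch. The names `DEFAULT_ROUNDS` and `plan_rounds` are made up for this example and are not taken from the PGDP software, and placing a repeated round directly after its first pass is an assumption.

```python
# Hypothetical sketch, not PGDP's actual code.
# The standard sequence: three proofreading rounds, then two formatting rounds.
DEFAULT_ROUNDS = ["P1", "P2", "P3", "F1", "F2"]


def plan_rounds(skip=(), repeat=()):
    """Return the list of rounds a project will go through.

    skip   -- rounds the PM has asked to leave out
    repeat -- rounds the PM has asked to run twice
              (assumed here to run back to back)
    """
    plan = []
    for round_name in DEFAULT_ROUNDS:
        if round_name in skip:
            continue
        plan.append(round_name)
        if round_name in repeat:
            plan.append(round_name)
    return plan


print(plan_rounds())               # the standard five rounds
print(plan_rounds(skip=["P3"]))    # a project that skips P3
print(plan_rounds(repeat=["P1"]))  # a project that goes through P1 twice
```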
In the first three rounds, the proofreading rounds, a proofreader requests one page at a time, compares the OCR output (or the previous proofreader's output) with the scan, and changes the text to correspond to the scan. Everyone can work in the first round (P1). The second round (P2) is only accessible to those who have been at the site for some time and have done a certain number of pages (21 days and 300 pages, if I recall correctly). For the third round (P3) one has to qualify. For qualification, one's P2 pages are checked (using the subsequent edits made in P3). The norm is that one should not leave more than one error per five pages.
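The access rules above could be expressed roughly as follows. This is only a sketch: the `Proofreader` class and both function names are hypothetical, the 21-day and 300-page thresholds are the figures recalled above (not verified), and the real P3 qualification involves more than this single ratio check.

```python
# Hypothetical sketch, not PGDP's actual code.
from dataclasses import dataclass


@dataclass
class Proofreader:
    days_on_site: int
    pages_done: int
    p2_pages_checked: int = 0  # P2 pages later reviewed in P3
    p2_errors_found: int = 0   # errors P3 found on those pages


def can_work_p2(p, min_days=21, min_pages=300):
    """P2 access: enough time on the site and enough pages done."""
    return p.days_on_site >= min_days and p.pages_done >= min_pages


def meets_p3_norm(p, pages_per_error=5):
    """P3 norm: no more than one error left per five checked P2 pages."""
    if p.p2_pages_checked == 0:
        return False  # nothing to judge yet
    # Integer comparison avoids floating-point rounding at the boundary.
    return p.p2_errors_found * pages_per_error <= p.p2_pages_checked


newcomer = Proofreader(days_on_site=5, pages_done=40)
regular = Proofreader(days_on_site=90, pages_done=1200,
                      p2_pages_checked=50, p2_errors_found=8)
print(can_work_p2(newcomer), can_work_p2(regular))
print(meets_p3_norm(regular))
```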
After the three (or two or four) rounds of proofing, the book goes through the foofing (formatting) rounds. In these, a proofreader (now called a formatter) again requests and edits one page at a time, but where the proofreaders dealt with copying the text as precisely as possible, the formatter deals with all other aspects of the work. They mark which text is italic, bold or otherwise in a special format, which texts are chapter headers, how tables are laid out, etcetera. There are two formatting rounds, although as before the second one can be skipped or a round can be repeated. The first formatting round (F1) has the same entrance restrictions as P2; F2 has a qualification system comparable to that of P3.
After this, the PM passes the book on to the Post-Processor (PP). Again, this is often the same person, but not always. In some cases the PP has already been appointed; in others the book will sit in a pool until picked up by a willing PP. The PP does all that is needed to get from the F2 output to something that can be put on Project Gutenberg: they recombine the pages into one work, move stuff around where needed, change the formatters' mark-up into something that's more appropriate for reading, in most cases generate an HTML version, etcetera.
A PP who has already post-processed several books well can then send the book straight to PG. In other cases, the book will first go to the PPV (Post-Processing Verifier), an experienced PP, who checks the PP's work and either gives them hints on what should be improved or makes those improvements themselves.
Finally, if the PP or PPV sends the book to PG, there is a whitewasher who checks the book once again; however, that is outside the scope of this (already too long) description, because it belongs to PG's process rather than PGDP's.
To stop the rounds from overcrowding with books, there are queues for each round, containing books that are ready to enter the round, but have not yet done so. To keep some variety, there are different queues by language and/or subject type. A problem with this has been that the later rounds, having less manpower because of the higher standards required, could not keep up with P1 and F1. There has been work to do something about it, and the P2 queues have been brought down to decent size, but in P3 and F2 books can literally sit in the queues for years, and PP still is a bottleneck as well.
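To make the queueing idea concrete, here is a small Python sketch. It assumes that a round holds a fixed number of active books and that waiting books are released by taking turns across the language/subject queues; both the capacity limit and the turn-taking are assumptions made for illustration, and `RoundQueues` is a hypothetical name, not part of the PGDP software.

```python
# Hypothetical sketch, not PGDP's actual release logic.
from collections import defaultdict, deque


class RoundQueues:
    """Waiting queues for one round, keyed by language and/or subject."""

    def __init__(self, capacity):
        self.capacity = capacity          # assumed limit on active books
        self.active = []                  # books currently in the round
        self.queues = defaultdict(deque)  # queue name -> waiting books

    def enqueue(self, book, queue_name):
        self.queues[queue_name].append(book)

    def release(self):
        """Move waiting books into the round while there is room.

        Takes one book from each queue in turn, to keep some variety.
        """
        released = []
        while len(self.active) < self.capacity:
            progressed = False
            for name in list(self.queues):
                if len(self.active) >= self.capacity:
                    break
                if self.queues[name]:
                    book = self.queues[name].popleft()
                    self.active.append(book)
                    released.append(book)
                    progressed = True
            if not progressed:
                break  # every queue is empty
        return released


p3 = RoundQueues(capacity=2)
p3.enqueue("Book A", "English")
p3.enqueue("Book B", "English")
p3.enqueue("Book C", "French")
print(p3.release())  # one book from each queue; Book B keeps waiting
```

In this sketch a book left in a queue waits until an active book leaves the round, which is one way the long waits in P3 and F2 could come about when few people are qualified to work there.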
-- André Engels, andreengels@gmail.com
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l