[Foundation-l] Wikisource and reCAPTCHA

Wed Jun 30 09:49:13 UTC 2010

Andre, this is a great summary -- I've linked to it from the English
ws Scriptorium.

Do you see opportunities for the two projects to coordinate their
workflows better?

SJ

On Thu, Jun 24, 2010 at 11:13 PM, Andre Engels <andreengels at gmail.com> wrote:
> On Thu, Jun 24, 2010 at 4:37 PM, Samuel Klein <meta.sj at gmail.com> wrote:
>> I love those proofreading features, and the new default layout for a
>> book's pages and TOC.  Wikisource is becoming AWESOME.
>>
>> Do we have PGDP contributors who can weigh in on how similar the
>> processes are?  Is there a way for us to actually merge workflows with
>> them?
>
> I am quite active on PGDP, but not on Wikisource, so I can describe
> how things work there, but not how similar that is to Wikisource.
>
> Typical of the PGDP workflow are an emphasis on quality above
> quantity (exemplified in running not 1 or 2 but 3 rounds of human
> checking of the OCR result - correctness in copying is well above
> 99.99% for most books) and work being done in page-size chunks rather
> than whole books, chapters, paragraphs, sentences, words or whatever
> else one could think of.
>
> There's a number of people involved, although people can and often do
> fill several roles for one book.
>
> First, there is the Content Provider (CP).
>
> He or she first contacts Project Gutenberg to get a clearance. This is
> basically a statement from PG that they believe the work is out of
> copyright. In general, US copyright is what is taken into account for
> this, although there are also servers in other countries (Canada and
> Australia as far as I know), which publish some material that is out
> of copyright in those countries even if it is not in the US. Such
> works do not go through PGDP, but may go through its sister projects
> DPCanada or DPEurope.
>
> Next, the CP will scan the book, or harvest the scans from the web,
> and run OCR on them. They will usually also write a description of the
> book for the proofreaders, so those can see whether they are
> interested. The scans and the OCR are uploaded to the PGDP servers,
> and the project is handed over to the Project Manager (PM) (although
> in most cases CP and PM are the same person).
>
> The Project Manager is responsible for the project in the next stages.
> This means:
> * specifying the rules and guidelines that are to be followed when
> proofreading the book, at least where those differ from the
> standard guidelines
> * answering questions from proofreaders
> * keeping the good and bad words lists up to date. These are used in
> wordcheck (a kind of spellchecker) so that words are considered
> correct or incorrect by it
>
> The project then goes through a number of rounds. The standard number
> is 5 rounds, of which 3 are proofreading and 2 are formatting, but it
> is possible for the PM to make a request to skip one or more rounds or
> go through a round twice.
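
The round sequence described above can be sketched in a few lines of
Python. This is only an illustration, not PGDP's actual code: the
function name plan_rounds and its skip/repeat parameters are
hypothetical; only the round names and the standard order of 3
proofreading plus 2 formatting rounds come from the description.

```python
# Standard order: three proofreading rounds, then two formatting rounds.
STANDARD_ROUNDS = ["P1", "P2", "P3", "F1", "F2"]


def plan_rounds(skip=(), repeat=()):
    """Return the list of rounds a project goes through.

    skip   -- round names the PM asked to leave out (hypothetical parameter)
    repeat -- round names the PM asked to run twice (hypothetical parameter)
    """
    plan = []
    for name in STANDARD_ROUNDS:
        if name in skip:
            continue
        plan.append(name)
        if name in repeat:
            # A repeated round runs a second time right after the first pass.
            plan.append(name)
    return plan
```

For example, plan_rounds(skip=["P3"]) leaves out the third proofreading
round, and plan_rounds(repeat=["P1"]) runs the first round twice.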
>
> In the first three, proofreading, rounds, a proofreader requests one
> page at a time, compares the OCR output (or the previous proofreader's
> output) with the scan, and changes the text to correspond to the scan.
> In the first round (P1) everyone can do this; the second round (P2) is
> only accessible to those who have been at the site for some time and
> done a certain number of pages (21 days and 300 pages, if I recall
> correctly); for the third round (P3) one has to qualify. For
> qualification one's P2 pages are checked (using the subsequent edits
> of P3). The norm is that one should not leave more than one error per
> five pages.
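
The access rules above can be written down as a small check. This is a
sketch under stated assumptions, not how the PGDP site implements it:
the Volunteer class, its field names and the function may_work_in are
hypothetical, the 21-day and 300-page thresholds are the figures quoted
above "if I recall correctly", and the real P3 qualification may involve
more than the one-error-per-five-pages norm.

```python
from dataclasses import dataclass

# Thresholds as recalled in the description above (may not be exact).
P2_MIN_DAYS = 21
P2_MIN_PAGES = 300
PAGES_PER_ALLOWED_ERROR = 5  # norm: at most one error per five pages


@dataclass
class Volunteer:
    days_on_site: int
    pages_done: int
    p2_pages_checked: int = 0  # P2 pages later reviewed via P3 edits
    p2_errors_found: int = 0   # errors P3 found in those pages


def may_work_in(volunteer, round_name):
    """Return True if the volunteer may request pages in the given round."""
    if round_name == "P1":
        return True  # open to everyone
    meets_p2 = (volunteer.days_on_site >= P2_MIN_DAYS
                and volunteer.pages_done >= P2_MIN_PAGES)
    if round_name == "P2":
        return meets_p2
    if round_name == "P3":
        # Assumption: qualifying requires P2 access plus some checked pages.
        if not meets_p2 or volunteer.p2_pages_checked == 0:
            return False
        # Integer form of errors / pages <= 1 / 5.
        return (volunteer.p2_errors_found * PAGES_PER_ALLOWED_ERROR
                <= volunteer.p2_pages_checked)
    raise ValueError("unknown round: " + round_name)
```

With these assumptions, a volunteer with 10 errors in 50 checked P2
pages just meets the norm, while 11 errors in 50 pages does not.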
>
> After the three (or two or four) rounds of proofing, the foofing
> (formatting) rounds are gone through. In these, again a proofreader
> (now called formatter) requests and edits one page at a time, but
> where the proofreaders dealt with copying the text as precisely as
> possible, the formatter will deal with all other aspects of the work.
> They denote when text is italic, bold or otherwise in a special
> format, which texts are chapter headers, how tables are laid out,
> etcetera. Here there are two rounds, although the second one can be
> skipped or a round duplicated, like before. The first formatting round
> (F1) has the same entrance restrictions as P2, F2 has a qualification
> system comparable to P3.
>
> After this, the PM passes the book on to the Post-Processor (PP).
> Again, this is often the same person, but not always. In some
> cases a PP has already been appointed; in others the book will sit in a
> pool until picked up by a willing PP. The PP does all that is needed
> to get from the F2 output to something that can be put on Project
> Gutenberg: they recombine the pages into one work, move stuff around
> where needed, change the formatters' mark-up into something that's more
> appropriate for reading, in most cases generate an HTML version,
> etcetera.
>
> A PP who has already post-processed several books well can
> then send the book to PG. In other cases, the book will then go to the PPV
> (Post-Processing Verifier), an experienced PP, who checks the PP's
> work, and gives them hints on what should be improved or makes those
> improvements themselves.
>
> Finally, if the PP or PPV sends the book to PG, there is a whitewasher
> who checks the book once again; however, that is outside the scope of
> this (already too long) description, because it belongs to PG's
> process rather than PGDP's.
>
> To stop the rounds from overcrowding with books, there are queues for
> each round, containing books that are ready to enter the round, but
> have not yet done so. To keep some variety, there are different queues
> by language and/or subject type. A problem with this has been that the
> later rounds, having less manpower because of the higher standards
> required, could not keep up with P1 and F1. There has been work to do
> something about it, and the P2 queues have been brought down to decent
> size, but in P3 and F2 books can literally sit in the queues for
> years, and PP still is a bottleneck as well.
>
>
> --
> André Engels, andreengels at gmail.com

-- 
Samuel Klein          identi.ca:sj           w:user:sj