[Foundation-l] Wikisource and reCAPTCHA

John Vandenberg jayvdb at gmail.com
Wed Jun 30 11:24:35 UTC 2010

On Wed, Jun 30, 2010 at 8:42 PM, Samuel J Klein <sj at wikimedia.org> wrote:
> On Wed, Jun 30, 2010 at 6:13 AM, John Vandenberg <jayvdb at gmail.com> wrote:
>> irrespective of whether it is verified, OCR
>> quality, or if it is vandalism.  However, wikisource keeps the images
>> and the text unified from day 0 to eternity.
> Some works become verified, and reach high OCR quality.
>> PGDP has a very strict and arduous workflow...  The
>> result is quality, however only the text is sent downstream.
> Why not send images and text downstream?

Good question! ;-)
Storage is one issue.
It would be interesting to estimate the storage requirements of
Wikisource if we had produced the PGDP etexts.

>> Wikisource and PGDP don't interoperate.  We *could*, but when I looked
>> at importing a PGDP project into Wikisource, I put it in the too hard basket.
> That's what I mean by 'coordinate'.  "hard" here seems like a one-time
> hardship followed by a permanent useful coordination.

They don't have an 'export' function, and I doubt they are going to
build one so that they can interoperate with us.

My 'import' function was a scraper, not something that can be used on
a large scale without their permission.

In the end, it is simpler to avoid starting WS projects that would
duplicate unfinished PGDP projects.  There are plenty of works that
have not been transcribed yet ;-)

>> Wikisource is trying to become a credible competitor to PGDP.
> Perhaps we have competing interfaces / workflows.

This is like saying that Wikipedia and Britannica have competing
interfaces / workflows.

The wikisource workflow is a *symptom* of it being a "wiki", with all
that entails.  There is a lot more than merely the workflow which
distinguishes the two projects.

> .. but I expect we
> would be glad to share 99.99%-verified high-quality
> texts-unified-with-images if it were easy for both projects to
> identify that combination of quality and comprehensive data.

Good luck with that.

PGDP publishes etexts via PG.

If PGDP gives images+text to Wikisource for projects that are stuck in
their rounds, it becomes published online immediately at whatever
stage it is at - it's a wiki.  That is at odds with the objective of
PGDP, unless they are completely abandoning the project.

It is more likely that PGDP will release images+text at the same time
they publish the etext to PG.
The best way for PGDP to do this is to produce a djvu with images and
verified text, and then upload it to archive.org so everyone benefits.

> and
> would be glad to share metadata so that a WS editor could quickly
> check to see if there's a PGDP effort covering an edition of the text
> she is proofing; and vice-versa.

IIRC, obtaining the list of ongoing PGDP projects requires a PGDP
account, but anyone can create an account.

The WS project list is in Google. ;-)

John Vandenberg
