On Wed, Jun 30, 2010 at 8:42 PM, Samuel J Klein <sj(a)wikimedia.org> wrote:
On Wed, Jun 30, 2010 at 6:13 AM, John Vandenberg
irrespective of whether it is verified, OCR
quality, or if it is vandalism. However, wikisource keeps the images
and the text unified from day 0 to eternity.
Some works become verified, and reach high OCR quality.
PGDP has a very strict and arduous workflow... The
result is quality, however only the text is sent
Why not send images and text downstream?
Good question! ;-)
Storage is one issue.
It would be interesting to estimate the storage requirements of
Wikisource if we had produced the PGDP etexts.
PGDP don't interoperate. We *could*, but when I looked
at importing a PGDP project into Wikisource, I put it in the too hard basket.
That's what I mean by 'coordinate'. "hard" here seems like a
hardship followed by a permanent useful coordination.
They don't have an 'export' function, and I doubt they are going to
build one so that they can interoperate with us.
My 'import' function was a scraper; not something that can be used on
a large scale without their permission.
In the end, it is simpler to avoid starting WS projects that would
duplicate unfinished PGDP projects. There are plenty of works that
have not been transcribed yet ;-)
trying to become a credible competitor to PGDP.
Perhaps we have competing interfaces / workflows.
This is like saying that Wikipedia and Britannica have competing
interfaces / workflows.
The wikisource workflow is a *symptom* of it being a "wiki", with all
that entails. There is a lot more than merely the workflow which
distinguishes the two projects.
.. but I expect we
would be glad to share 99.99%-verified high-quality
texts-unified-with-images if it were easy for both projects to
identify that combination of quality and comprehensive data.
Good luck with that.
PGDP publishes etexts via PG.
If PGDP gives images+text to Wikisource for projects that are stuck in
their rounds, it becomes published online immediately at whatever
stage it is at - it's a wiki. That is at odds with the objective of
PGDP, unless they are completely abandoning the project.
It is more likely that PGDP will release images+text at the same time
they publish the etext to PG.
The best way for PGDP to do this is to produce a DjVu file with images and
verified text, and then upload it to archive.org
so everyone benefits.
would be glad to share metadata so that a WS editor could quickly
check to see if there's a PGDP effort covering an edition of the text
she is proofing; and vice-versa.
IIRC, obtaining the list of ongoing PGDP projects requires a PGDP
account, but anyone can create an account.
The WS project list is in Google. ;-)
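
For what it's worth, the metadata check discussed above (does a PGDP effort already cover the edition a WS editor is proofing?) could be as simple as comparing normalized title/author/year records. This is only a rough sketch: the `Edition` record shape and the function names are hypothetical, and as far as I know neither project publishes its project list in this form.

```python
import re
import unicodedata
from typing import List, NamedTuple, Tuple


class Edition(NamedTuple):
    # Hypothetical record shape; not an actual PGDP or Wikisource format.
    title: str
    author: str
    year: int


def normalize(text: str) -> str:
    """Lowercase, drop accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def edition_key(edition: Edition) -> Tuple[str, str, int]:
    """Build a comparison key that ignores case and punctuation differences."""
    return (normalize(edition.title), normalize(edition.author), edition.year)


def find_overlaps(
    ws: List[Edition], pgdp: List[Edition]
) -> List[Tuple[Edition, Edition]]:
    """Return (WS edition, PGDP edition) pairs that share a comparison key."""
    pgdp_by_key = {edition_key(e): e for e in pgdp}
    return [
        (w, pgdp_by_key[edition_key(w)])
        for w in ws
        if edition_key(w) in pgdp_by_key
    ]
```

Exact matching on a normalized key will miss editions whose titles are catalogued differently, so a real check would probably want a human to confirm near-matches.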