On Wed, Jun 30, 2010 at 8:42 PM, Samuel J Klein sj@wikimedia.org wrote:
On Wed, Jun 30, 2010 at 6:13 AM, John Vandenberg jayvdb@gmail.com wrote:
irrespective of whether it is verified, OCR quality, or if it is vandalism. However, Wikisource keeps the images and the text unified from day 0 to eternity.
Some works become verified, and reach high OCR quality.
PGDP has a very strict and arduous workflow... The result is quality; however, only the text is sent downstream.
Why not send images and text downstream?
Good question! ;-) Storage is one issue. It would be interesting to estimate the storage requirements of Wikisource if we had produced the PGDP etexts.
Wikisource and PGDP don't interoperate. We *could*, but when I looked at importing a PGDP project into Wikisource, I put it in the too-hard basket.
That's what I mean by 'coordinate'. "Hard" here seems like a one-time hardship followed by permanent, useful coordination.
They don't have an 'export' function, and I doubt they are going to build one so that they can interoperate with us.
My 'import' function was a scraper; not something that can be used on a large scale without their permission.
In the end, it is simpler to avoid starting WS projects that would duplicate unfinished PGDP projects. There are plenty of works that have not been transcribed yet ;-)
Wikisource is trying to become a credible competitor to PGDP.
Perhaps we have competing interfaces / workflows.
This is like saying that Wikipedia and Britannica have competing interfaces / workflows.
The Wikisource workflow is a *symptom* of it being a "wiki", with all that entails. There is a lot more than merely the workflow which distinguishes the two projects.
.. but I expect we would be glad to share 99.99%-verified high-quality texts-unified-with-images if it were easy for both projects to identify that combination of quality and comprehensive data.
Good luck with that.
PGDP publishes etexts via PG.
If PGDP gives images+text to Wikisource for projects that are stuck in their rounds, the work becomes published online immediately at whatever stage it is at - it's a wiki. That is at odds with the objective of PGDP, unless they are completely abandoning the project.
It is more likely that PGDP will release images+text at the same time they publish the etext to PG. The best way for PGDP to do this is to produce a DjVu with images and verified text, and then upload it to archive.org so everyone benefits.
and would be glad to share metadata so that a WS editor could quickly check to see if there's a PGDP effort covering an edition of the text she is proofing; and vice-versa.
IIRC, obtaining the list of ongoing PGDP projects requires a PGDP account, but anyone can create an account.
The WS project list is in Google. ;-)
-- John Vandenberg