Re: [Wikisource-l] PDF/Djvu to Index

13 Apr 2010

Lars Aronsson wrote:
...
  It is increasingly common to add books to Wikisource
 by finding a PDF or Djvu file, uploading it to Commons,
 and then creating an Index: page on Wikisource
 for proofreading.

 But this would be much easier if:

 1) The fields (author, title, etc.) of the Index
 page were filled in from the data already given
 on Commons. (Yes, those could be wrong or need
 additional care, but this could always be
 edited afterwards, if initial values are fetched
 from Commons.)
    The problem is that the DjVu file pages on Commons
do not have a parsable format.
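
To illustrate the difficulty, here is a minimal best-effort sketch in Python. It assumes the file description happens to use a template such as {{Information}} with one "|key = value" parameter per line; many Commons pages do not follow that layout, and multi-line values or nested templates will defeat it. The function name extract_fields is hypothetical and is not part of any existing tool.

```python
import re

# Matches lines of the form "|Key = value" (one parameter per line).
# Assumption: the description page uses a template written this way.
FIELD_RE = re.compile(r"^[ \t]*\|[ \t]*(\w+)[ \t]*=[ \t]*(.*)$", re.MULTILINE)


def extract_fields(wikitext):
    """Return a dict of lower-cased parameter names to raw values.

    Best effort only: multi-line values and nested templates are not handled,
    and free-form descriptions yield an empty dict.
    """
    return {key.lower(): value.strip() for key, value in FIELD_RE.findall(wikitext)}


if __name__ == "__main__":
    sample = (
        "{{Information\n"
        "|Description = An example book\n"
        "|Author = Jane Doe\n"
        "|Date = 1900\n"
        "}}\n"
    )
    for name, value in extract_fields(sample).items():
        print(name, "=>", value)
```

Whatever such a script returns would only be a starting value for the Index page fields, to be checked and corrected by hand.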

...
  2) The <pagelist/> tag was already in the
 "pages" box.
    That's easy.
I did it for sites using http://wikisource.org/wiki/MediaWiki:IndexForm.js

...
  3) All pages were created automatically
 with the OCR text from Commons, instead
 of leaving a long list of red links. (This
 would require the text for each page to be
 extracted, something that pdftotext can do
 in seconds, but Commons takes weeks to do.)
    I do not understand what you mean.
What you describe is _already_ implemented:
when a page is created, its text is extracted
from the text layer of the corresponding DjVu or PDF file.
All you need to do is create DjVu files with a proper text layer.
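
Before uploading, one way to check that a file really has a text layer is to dump the text of a page locally. The sketch below is only an illustration, not what MediaWiki does internally: it assumes the djvutxt (DjVuLibre) and pdftotext (Poppler/Xpdf) command-line tools are installed, and the helper names page_text_command and has_text_layer are hypothetical.

```python
import subprocess
from pathlib import Path


def page_text_command(path, page):
    """Build the command line that prints the text of one page to stdout."""
    suffix = Path(path).suffix.lower()
    if suffix == ".djvu":
        # djvutxt prints the hidden text layer of the selected page.
        return ["djvutxt", f"--page={page}", str(path)]
    if suffix == ".pdf":
        # -f/-l select the first and last page; "-" sends output to stdout.
        return ["pdftotext", "-f", str(page), "-l", str(page), str(path), "-"]
    raise ValueError(f"unsupported file type: {suffix!r}")


def has_text_layer(path, page=1):
    """Return True if the given page yields any non-blank text."""
    result = subprocess.run(
        page_text_command(path, page),
        capture_output=True,
        text=True,
        check=True,
    )
    return bool(result.stdout.strip())
```

If has_text_layer("book.djvu") returns False for pages that visibly contain text, the file was probably produced without OCR and the created pages would come up empty.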
