It is not a trivial matter. The best bet would be to take an existing pdf import tool for a word processor, and try to write a similar tool for wikitext.
There is the Oracle PDF Import Extension for Open Office, the code can be browsed, maybe it can give you some ideas http://extensions.services.openoffice.org/project/pdfimport
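To make the idea of "a similar tool for wikitext" a bit more concrete, here is a minimal sketch of the last stage of such a tool: turning styled text runs into wiki markup. It assumes that some earlier step (not shown) has already extracted the runs from the PDF. The `Run` class and both function names are hypothetical and are not taken from the OpenOffice extension.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """Hypothetical styled text run, as a PDF import step might produce it."""
    text: str
    bold: bool = False
    italic: bool = False
    heading: bool = False


def run_to_wikitext(run: Run) -> str:
    """Wrap one run in the matching wikitext markup."""
    text = run.text.strip()
    if run.heading:
        return f"== {text} =="
    if run.bold and run.italic:
        return f"'''''{text}'''''"
    if run.bold:
        return f"'''{text}'''"
    if run.italic:
        return f"''{text}''"
    return text


def paragraphs_to_wikitext(paragraphs: List[List[Run]]) -> str:
    """Join runs with spaces and separate paragraphs with a blank line."""
    return "\n\n".join(
        " ".join(run_to_wikitext(run) for run in paragraph)
        for paragraph in paragraphs
    )


if __name__ == "__main__":
    sample = [
        [Run("Chapter 1", heading=True)],
        [Run("It was a"), Run("dark", italic=True), Run("night.")],
    ]
    print(paragraphs_to_wikitext(sample))
```

The hard part, as noted above, is recovering the runs, paragraphs and reading order from the PDF in the first place; the markup step itself is comparatively simple.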
Micru
On Wed, Jun 12, 2013 at 12:38 PM, Alex Brollo alex.brollo@gmail.com wrote:
When we tried to convert a PDF file, which is an opaque, closed format, into wiki code (a needed step to add links and to turn files into a "wiki hypertext"), the work turned into a nightmare. If we simply load free PDF books "as they are", I don't see any advantage other than to "feed Wikisource numbers/statistics", and this is presently far from my personal interest.
As you guess, I'm one of the users who don't support Aubrey's enthusiasm about texts born digital, even if free. :-)
Alex
2013/6/12 David Cuenca dacuetu@gmail.com
Nobody is saying anything about using copyrighted works; there are many books with an open license that would allow us to include them in Wikisource.
For instance in ca-ws we have this translation from 2009:
http://ca.wikisource.org/wiki/Llibre:El_secret_de_l%E2%80%99or_que_creix_%28...
The original is in the PD, and the translator gave away his rights. It would have been much easier to work directly with the pdf, instead of converting to djvu.
Micru
On Wed, Jun 12, 2013 at 10:47 AM, Aarti K. Dwivedi <ellydwivedi2093@gmail.com> wrote:
If I am not wrong, as of today, most books that were born digital are still under copyright. Of course, they are available freely on the internet, but we can't use the pirated copies. How would we go about the procurement of these books? If we procure these copyrighted books, then the only thing we would have to do is check for proper formatting, isn't it?
On Wed, Jun 12, 2013 at 7:58 PM, Lars Aronsson lars@aronsson.se wrote:
On 06/12/2013 02:48 PM, Andrea Zanni wrote:
We could define some tasks as
- corrected the page
- OPTIONAL added optional templates/links/annotations
*...
Geotagged all the photos, ...
The list doesn't end. You need a generic mechanism for any new feature you can invent. But aren't our existing templates and categories the best way to do this? You could just add to each page: {{done|proofread=user1|validated=user2|geotagged=user4|...}}
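As a rough illustration of how open-ended such a template is, the sketch below builds and reads back a `{{done|...}}` call from a plain mapping of task name to user. The `{{done}}` template is only the suggestion made above, not an existing Wikisource template, and both function names are hypothetical.

```python
from typing import Dict


def done_template(tasks: Dict[str, str]) -> str:
    """Render a mapping of task -> user as a {{done|...}} template call."""
    params = "|".join(f"{task}={user}" for task, user in tasks.items())
    return "{{done|" + params + "}}"


def parse_done_template(wikitext: str) -> Dict[str, str]:
    """Read the task -> user pairs back out of a {{done|...}} call."""
    inner = wikitext.strip()
    if not (inner.startswith("{{done|") and inner.endswith("}}")):
        raise ValueError("not a {{done|...}} template call")
    body = inner[len("{{done|"):-2]
    # Keep only name=value parameters; split on the first '=' only.
    return dict(part.split("=", 1) for part in body.split("|") if "=" in part)


if __name__ == "__main__":
    status = {"proofread": "user1", "validated": "user2", "geotagged": "user4"}
    text = done_template(status)
    print(text)
    print(parse_done_template(text))
```

Adding a new kind of task is then just a new key in the mapping; nothing else has to change.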
-- Lars Aronsson (lars@aronsson.se) Project Runeberg - free Nordic literature - http://runeberg.org/
_______________________________________________ Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
-- Aarti K. Dwivedi
-- Etiamsi omnes, ego non
If you are interested in working with PDFs, study this blog :-) http://blogs.ch.cam.ac.uk/pmr/
(these fellows are open access activists, btw)
Aubrey
On Wed, Jun 12, 2013 at 7:04 PM, David Cuenca dacuetu@gmail.com wrote:
It is not a trivial matter. The best bet would be to take an existing pdf import tool for a word processor, and try to write a similar tool for wikitext.
There is the Oracle PDF Import Extension for Open Office, the code can be browsed, maybe it can give you some ideas http://extensions.services.openoffice.org/project/pdfimport
Micru
2013/6/12 David Cuenca dacuetu@gmail.com:
It is not a trivial matter. The best bet would be to take an existing pdf import tool for a word processor, and try to write a similar tool for wikitext.
There is the Oracle PDF Import Extension for Open Office, the code can be browsed, maybe it can give you some ideas http://extensions.services.openoffice.org/project/pdfimport
PDF scraping is a technique that is gaining more and more attention, since a lot of data on the web is hidden in PDFs, so some libraries for this task are under development. My favorite language being Python, I will suggest this blog post: http://blog.scraperwiki.com/2010/12/17/scraping-pdfs-now-26-less-unpleasant-...
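As a small taste of what this kind of scraping can look like, here is a minimal sketch that rebuilds text lines from positioned text fragments. It assumes the PDF has already been converted to XML in the style of Poppler's `pdftohtml -xml` output; the `SAMPLE_XML` string is made up for illustration, and the function name `page_lines` is hypothetical. It is not the code from the blog post.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import List

# Made-up sample in the style of `pdftohtml -xml` output (an assumption).
# The first two fragments are deliberately out of reading order.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="140" width="30" height="15" font="0">One</text>
    <text top="100" left="50" width="80" height="15" font="0">Chapter</text>
    <text top="130" left="50" width="200" height="15" font="1">It was a <i>dark</i> night.</text>
  </page>
</pdf2xml>"""


def page_lines(xml_text: str) -> List[List[str]]:
    """Return, for each page, its text lines in top-to-bottom order.

    Fragments that share the same `top` value are treated as one line
    and are ordered left to right by their `left` value.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        rows = defaultdict(list)
        for node in page.iter("text"):
            # itertext() also picks up text inside nested <i>/<b> tags.
            content = "".join(node.itertext()).strip()
            if content:
                rows[int(node.get("top"))].append((int(node.get("left")), content))
        lines = [
            " ".join(text for _, text in sorted(rows[top]))
            for top in sorted(rows)
        ]
        pages.append(lines)
    return pages


if __name__ == "__main__":
    for line in page_lines(SAMPLE_XML)[0]:
        print(line)
```

Real PDFs are messier: fragments on the same visual line often differ by a pixel or two in `top`, so an exact match as used here would need a tolerance.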
C