Hi all,
I've attempted to start a phab ticket about what the import wizard should look like: https://phabricator.wikimedia.org/T154413
There are plenty of unanswered questions I'm sure, and lots missing still. Please edit the task or add comments about anything.
This is 2016 Wishlist #73, so I'm not sure it'll get much 'official' comm-tech time (yet; there *is* a plan to address further-down wishes, but they may take some time), but I'm keen to work on it in my own time anyway.
One thing I'd love to have in a Wikisource upload wizard is something I can show to GLAM people that makes it easier for them to see the value (and ease) of getting their material online and ready for crowd-sourced transcription. :-)
Thanks, Sam.
Very interesting.
About DjVu files for IA items: they can be built simply by running pdf2djvu on IA's PDF files, but the quality is very poor; or they can be built, with some more pain, from the _jp2.zip images merged with the _djvu.xml files, in which case the quality is high but the resulting DjVu is heavy.
As Aubrey said some time ago, it.source uses a Python script to do the latter job, but it is a DIY (do-it-yourself) script, just to prove that *it can be done*.
Alex
Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
Yes, I've been wondering about the best approach with that. Obviously, better quality is better, but we don't want to overwhelm the various tools that deal with the DjVus.
And if we're building DjVus for Commons from IA files (either PDFs or the JPEGs), should we also be adding those DjVus back to the IA item? (Actually, can we even edit IA items that we haven't created ourselves?) I'm leaning towards not doing so (but maybe adding a comment to the IA item that links to the DjVu on Commons).
—sam
Ideally, we should talk to IA about this. Adding a comment on the IA item is a very low-cost solution and I think it is important; adding the DjVu would be much better. We should check whether a script can edit every kind of item and add files (I think not).
Aubrey
Yes, good idea about talking to them.
I wonder about the workflow too, because what about the situation of someone uploading a new work with our tool: the script creates a new IA item then (I assume as the 'wikisource-import-tool' or whatever user) and then it will have full permissions over that item. So the update-DjVu scenario will only apply for IA items that already exist but which don't have DjVu files (i.e. only the last few months' worth). Which is good...
—sam
Please take a look at https://archive.org/details/spinoza_etica_paravia_djvu; this is precisely a DjVu-only item that I uploaded some days ago. I asked on the IA forum for permission to create "DjVu-only items" and I got it; this is the first item I created. As you can see, there is some "implicit convention" too: the name of the item is the original one plus a _djvu suffix (it was derived from https://archive.org/details/spinoza_etica_paravia), and the metadata are the same, except for a standard warning, "Derived from files into L'Etica https://archive.org/details/spinoza_etica_paravia", in the description field.
So far I have not done the last step, i.e. adding a "backlink" from the original item to the derived one.
internetarchive.py makes it possible to automate the whole job (downloading the metadata of the source item, building the new item name, adding the warning to the description field, and uploading the new item).
Alex
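A minimal sketch of the automation described above, assuming the third-party `internetarchive` Python package and configured IA credentials. The helper names, the list of copied metadata fields, and the HTML form of the warning are illustrative assumptions, not the script actually used on it.source.

```python
# Metadata fields copied from the source item (an assumed, conservative list;
# item-specific fields such as the uploader are deliberately left out).
COPIED_FIELDS = ("title", "creator", "date", "language",
                 "publisher", "subject", "mediatype")


def derived_identifier(source_identifier, suffix="_djvu"):
    """Name the derived item after the source item plus a _djvu suffix."""
    return source_identifier + suffix


def derived_metadata(source_metadata, source_identifier):
    """Copy selected metadata and put a 'Derived from' warning in the description."""
    metadata = {key: source_metadata[key]
                for key in COPIED_FIELDS if key in source_metadata}
    title = source_metadata.get("title", source_identifier)
    url = "https://archive.org/details/" + source_identifier
    warning = 'Derived from files in <a href="{0}">{1}</a>'.format(url, title)
    original = source_metadata.get("description", "")
    # Assumption: the warning is prepended to the original description.
    metadata["description"] = warning + ("<br />" + original if original else "")
    return metadata


def upload_derived(source_identifier, djvu_path):
    """Create the derived DjVu-only item (needs network access and IA credentials)."""
    # Imported here so the helpers above work without the package installed.
    from internetarchive import get_item, upload
    source = get_item(source_identifier)
    return upload(derived_identifier(source_identifier),
                  files=[djvu_path],
                  metadata=derived_metadata(source.metadata, source_identifier))
```

For example, `derived_identifier("spinoza_etica_paravia")` gives `spinoza_etica_paravia_djvu`, matching the item linked above. Adding the backlink (a review on the original item) is not covered here.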
Done :-)
Alex
Good idea. I guess it's not ideal to end up with two items, but at least the 2nd will be updateable from our end.
It looks like we can add HTML links to IA reviews too, which is nice: https://archive.org/details/spinoza_etica_paravia
I wonder if, rather than creating a new IA item, we should just link the original IA item to the DjVu on Commons (via a review)? Or is there a discoverability benefit to be had by having the DjVu also on IA?
You can see a great advantage of DjVu files over PDF files in the current file list of any IA item. IA has removed the DjVu files, but it still builds and publishes the _djvu.xml file. Why? I presume that IA uses that file to "map words" in its book viewer, since it has a good text structure while being *pretty simple*. It can be translated into hOCR, and by editing its text nodes the corrected text can be loaded back into the DjVu file. It.source is testing, on some texts, tricks to mass-fix the DjVu text layer (removing scannos etc.) *before* uploading the file to Commons.
It's a pity IMHO that this magic book format has been disregarded. Its structure is as *open* as the PDF structure is *closed*.
Alex
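To illustrate the idea of editing text nodes, here is a small sketch that applies a scanno table to the words of a _djvu.xml document. It assumes the usual DjVuXML layout, in which each word is a `WORD` element carrying a `coords` attribute; the function name, the sample data, and the replacement table are hypothetical, and this is not the it.source tooling.

```python
import re
import xml.etree.ElementTree as ET

# Tiny hand-made sample in the assumed DjVuXML layout (not taken from a real item).
SAMPLE = """<DjVuXML><BODY><OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="100,200,180,160">tbe,</WORD> <WORD coords="190,200,300,160">Ethics</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT></BODY></DjVuXML>"""


def fix_scannos(djvu_xml, replacements):
    """Replace known scannos in WORD text nodes; return (new_xml, words_changed)."""
    root = ET.fromstring(djvu_xml)
    changed = 0
    for word in root.iter("WORD"):
        text = word.text or ""
        # Split off leading/trailing punctuation so "tbe," still matches "tbe".
        lead, core, trail = re.fullmatch(r"(\W*)(.*?)(\W*)", text, re.DOTALL).groups()
        if core in replacements:
            # Only the text changes; the coords attribute is left untouched.
            word.text = lead + replacements[core] + trail
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


fixed_xml, count = fix_scannos(SAMPLE, {"tbe": "the"})
```

Because only the text of each word changes and the coordinates stay in place, the word mapping survives the correction. Writing the corrected text back into the DjVu file is a separate step that this sketch does not cover.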
There's also this new Phab task, which is looking at a more limited first step:
Investigation: Could we build a Tool Labs project to generate Djvu files for WikiSource https://phabricator.wikimedia.org/T154538
Very interesting.
About djvu files on IA, they can be built simply by pdf2djvu from pdf files of IA, but quality is very poor;
[...]
Did you try to set the -d parameter to something higher than the default 300? While converting PDF files from Polish digital libraries, I often use -d 450 or -d 600 with good results.
Ankry
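As an illustration of the suggestion above, a minimal sketch that runs pdf2djvu with a higher rendering resolution. It relies only on pdf2djvu's `-d` (resolution) and `-o` (output file) options; the function names and the default of 600 are assumptions chosen for the example.

```python
import shutil
import subprocess
from pathlib import Path


def pdf2djvu_command(pdf_path, djvu_path, dpi=600):
    """Build the pdf2djvu argument list for a given rendering resolution."""
    return ["pdf2djvu", "-d", str(dpi), "-o", str(djvu_path), str(pdf_path)]


def convert(pdf_path, dpi=600):
    """Convert one PDF to a DjVu next to the source file and return the new path."""
    if shutil.which("pdf2djvu") is None:
        raise RuntimeError("pdf2djvu is not installed or not on PATH")
    djvu_path = Path(pdf_path).with_suffix(".djvu")
    # check=True raises CalledProcessError if pdf2djvu exits with an error.
    subprocess.run(pdf2djvu_command(pdf_path, djvu_path, dpi), check=True)
    return djvu_path
```

A higher `-d` value only helps when the images inside the PDF actually have that much detail; it cannot recover detail that is not in the source file.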
The problem is that many new IA PDF files have poor resolution or too-high compression from the beginning, so their quality can't be improved.
The IA viewer doesn't use the PDF or DjVu file; it uses JPG images derived from the JP2 images. This explains why the images seen in the viewer are so beautiful, while the PDF or DjVu files are poor.
@Sam: About uploading a DjVu into an IA item that lacks one: no, nobody but the original contributor or a sysop can upload files into an item. But it can be uploaded as a new item linked to the original one; its link could be shown in the source item by adding a comment (a "review").