Back to the Scroll

Magnus Manske

14 Aug 2010

6:15 p.m.

Hi all,

after some discussion on wikitech-l, I made a Google Books-like display demo for wikisource content. It should work on any multipage djvu or PDF. To link to it, you'll need:

* The file name for the original (needs to be on commons)
* The total number of pages (couldn't find a way to get that automatically anywhere...)
* The page to start on

Thus armed, you can construct a URL like this: http://toolserver.org/~magnus/book2scroll/index.html?file=Transactions_of_th...

The default parameter-less URL will fall back on the DNB vol. 11: http://toolserver.org/~magnus/book2scroll/index.html
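A minimal sketch of how such a link might be assembled from the three pieces of information listed above. The `file` and `startpage` parameter names appear in URLs quoted later in this thread; `pages` is a hypothetical name for the total-page-count parameter, since the example URL above is truncated.

```javascript
// Base URL of the demo, as given in the message above.
const BASE_URL = "http://toolserver.org/~magnus/book2scroll/index.html";

// Build a viewer link from a Commons file name, a page count and a start page.
// NOTE: "pages" is an assumed parameter name, not confirmed by the thread.
function buildViewerUrl(file, totalPages, startPage) {
  const params = [
    ["file", file.replace(/ /g, "_")], // titles use underscores for spaces
    ["pages", String(totalPages)],     // hypothetical parameter name
    ["startpage", String(startPage)],
  ];
  const query = params
    .map(([key, value]) => key + "=" + encodeURIComponent(value))
    .join("&");
  return BASE_URL + "?" + query;
}

console.log(buildViewerUrl("Some book.djvu", 120, 5));
```

The file name `Some book.djvu` is a placeholder, not a real Commons file.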

Note that this is HTML/CSS/JS only; no toolserver backend script/database is involved.

Awaiting onslaught of critique, Magnus

Klaus Graf

14 Aug

6:26 p.m.

2010/8/14 Magnus Manske magnusmanske@googlemail.com:

...

Hi all,

after some discussion on wikitech-l, I made a Google Books-like display demo for wikisource content.

Please do not confuse en Wikisource with Wikisource. BTW: I don't understand the sense of this viewer.

Klaus Graf

Magnus Manske

7:18 p.m.

On Sat, Aug 14, 2010 at 7:26 PM, Klaus Graf klausgraf@googlemail.com wrote:

...

2010/8/14 Magnus Manske magnusmanske@googlemail.com:

...
Hi all,

after some discussion on wikitech-l, I made a Google Books-like display demo for wikisource content.

Please do not confuse en Wikisource with Wikisource.

Is that your way of saying "please enable other languages"?

...

BTW: I don't understand the sense of this viewer.

To browse books more fluently.

Cheers, Magnus

thomasV1＠gmx.de

7 p.m.

Nice demo; it would be nice to make it more efficient.

I see that you are generating images with a size that is adapted to the user's screen. If you do this, hundreds of thumbnails of all possible sizes will be generated at Commons for each page. To avoid this, you should quantize the size: use a width that is a multiple of 100 pixels, as ProofreadPage does (Tim asked me to do this). In addition, if you use this restricted set of widths, thumbnails will be more likely to already exist and will load faster.
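The quantization suggested here could be sketched as follows. Rounding to the nearest multiple is an assumption; the thread does not say whether ProofreadPage rounds up, down, or to the nearest step.

```javascript
// Round a desired pixel width to the nearest multiple of `step`,
// never going below a single step.
function quantizeWidth(desiredWidth, step = 100) {
  const rounded = Math.round(desiredWidth / step) * step;
  return Math.max(step, rounded);
}

console.log(quantizeWidth(637)); // 600
console.log(quantizeWidth(951)); // 1000
```

With a fixed step, every screen size maps onto a small set of widths, so the same thumbnails are requested again and again instead of a new one per visitor.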

To gain some speed, you could also preload pages p+1 and p-1, as Google Books does.
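A rough sketch of the preloading idea, under the assumption that page images can be fetched by URL. `urlForPage` is a hypothetical callback that maps a page number to its image URL; it is not part of the demo as described in this thread.

```javascript
// Pages adjacent to `page` that actually exist in a book of `totalPages` pages.
function neighbourPages(page, totalPages) {
  return [page - 1, page + 1].filter((p) => p >= 1 && p <= totalPages);
}

// Warm the browser cache for the neighbours of `page` and return their URLs.
function preloadNeighbours(page, totalPages, urlForPage) {
  const urls = neighbourPages(page, totalPages).map((p) => urlForPage(p));
  if (typeof Image !== "undefined") {
    // In a browser, assigning src is enough to start the download.
    urls.forEach((url) => {
      const img = new Image();
      img.src = url;
    });
  }
  return urls;
}

console.log(neighbourPages(5, 10)); // pages 4 and 6
console.log(neighbourPages(1, 10)); // only page 2, there is no page 0
```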

Also, in on_body_scroll, you could avoid the for loop: divide $('#body').position()['scrollTop'] by the height of an image.

Thomas

Magnus Manske

7:20 p.m.

On Sat, Aug 14, 2010 at 8:00 PM, thomasV1@gmx.de wrote:

...

nice demo ; it would be nice to make it more efficient.

I see that you are generating images with a size that is adapted to the user's screen. If you do this, hundreds of thumbnails of all possible sizes will be generated at commons for each page. To avoid this, you should quantize the size : use a width that is a multiple of 100 pixels, as ProofreadPage does (Tim asked me to do this). In addition, if you use this restricted set of widths, then thumbnails will be more likely to already exist and will load faster.

To gain some speed, you could also preload pages p+1 and p-1, as google books does.

Both good ideas, I'll do that.

...

Also, in on_body_scroll, you could avoid the for loop : divide $('#body').position()['scrollTop'] by the height of an image

'fraid not - sometimes the rendered text runs longer than the image, so the "row" can be higher than the image. Example: http://toolserver.org/~magnus/book2scroll/index.html (scroll down and you'll see it)
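One way to keep variable-height rows and still avoid a linear scan on every scroll event is to precompute each row's top offset and binary-search it. This is only a sketch under that assumption; it is not how the demo's scroll handler is known to work, and the offsets would need recomputing whenever a row changes height.

```javascript
// Top offset of every row, computed from its (variable) height.
function rowOffsets(rowHeights) {
  const offsets = [];
  let top = 0;
  for (const height of rowHeights) {
    offsets.push(top);
    top += height;
  }
  return offsets;
}

// Index of the last row whose top is <= scrollTop (binary search).
function rowAtScrollTop(offsets, scrollTop) {
  let lo = 0;
  let hi = offsets.length - 1;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (offsets[mid] <= scrollTop) lo = mid;
    else hi = mid - 1;
  }
  return lo;
}

// The second row is taller than its 800px image because the text runs long.
const offsets = rowOffsets([800, 1200, 800]);
console.log(rowAtScrollTop(offsets, 1700)); // 1
console.log(Math.floor(1700 / 800));        // 2, the plain division is off by one row
```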

Cheers, Magnus

Thomas Voegtlin

7:49 p.m.

...

...
Also, in on_body_scroll, you could avoid the for loop : divide

$('#body').position()['scrollTop'] by the height of an image

'fraid not - sometimes the rendered text runs longer than the image, so the "row" can be higher than the image. Example: http://toolserver.org/~magnus/book2scroll/index.html (scroll down and you'll see it)

Hmm, you are right; I had a "pure scan" version in mind.

But it would be nice to have a version that does not load the text, just to see whether the WMF servers are fast enough to provide the same fluidity as the Google Books interface.

For the size quantization, I think it is better to request a desired width than a desired height; the API does not give you exactly the height you request. In addition, if you quantize the width, you are likely to request thumbs that have already been created by ProofreadPage.

Also, for the text, I just had a crazy idea: instead of requesting the text of each page, you can do a single request for the whole book, using &action=parse (pass the <pages/> command to it, as in this script: http://wikisource.org/wiki/MediaWiki:Dictionary.js ).

Then we can split the returned string with a regexp that detects the page breaks (they are in a special span element), and place each piece in the corresponding div; things will break whenever an HTML formatting element ends on a different page from the one where it begins, but we could write a function that balances the missing elements.
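The splitting step could look roughly like this. The marker pattern is hypothetical: the thread only says that page breaks sit in "a special span element", so the `pagebreak` class and numeric `id` below are placeholders, and the sketch does nothing about formatting elements that straddle two pages.

```javascript
// Hypothetical page-break marker; the span really emitted by <pages/> may differ.
const PAGE_BREAK = /<span class="pagebreak" id="(\d+)"><\/span>/;

// Split the HTML of a whole book into a Map of page number -> HTML fragment.
function splitByPage(html) {
  // split() with a capture group keeps the captured page numbers in the result:
  // [textBeforeFirstMarker, "1", htmlOfPage1, "2", htmlOfPage2, ...]
  const parts = html.split(PAGE_BREAK);
  const pages = new Map();
  for (let i = 1; i + 1 < parts.length; i += 2) {
    pages.set(Number(parts[i]), parts[i + 1]);
  }
  return pages;
}

const sample =
  '<span class="pagebreak" id="1"></span><p>one</p>' +
  '<span class="pagebreak" id="2"></span><p>two</p>';
console.log(splitByPage(sample));
```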

Thomas

Magnus Manske

15 Aug

1:32 p.m.

On Sat, Aug 14, 2010 at 8:49 PM, Thomas Voegtlin thomasV1@gmx.de wrote:

...

...
...
Also, in on_body_scroll, you could avoid the for loop : divide

$('#body').position()['scrollTop'] by the height of an image

'fraid not - sometimes the rendered text runs longer than the image, so the "row" can be higher than the image. Example: http://toolserver.org/~magnus/book2scroll/index.html (scroll down and you'll see it)

hmm, you are right ; I had a "pure scan" version in mind.

But it would be nice to have a version that does not load the text, just in order to see if the WMF servers are fast enough to provide the same fluidity as in the Google Books interface.

I don't think the text retrieval is the slow step here...

...

For the size quantization, I think it is better to request a desired width than a desired height ; the API does not exactly give you the height you request. In addition, if you quantize the width you will be likely to request thumbs that are already created by ProofreadPage.

I've switched to specifying width rounded to 100s; however, the API still gives me one-off images (599 instead of 600 px). I could hack the API thumbnail URL, though. Better yet, I can probably skip that step entirely after the first one...

...

Also, for the text, I just had a crazy idea : instead of requesting the text of each page, you can do a single request for the whole book, using &action=parse (pass the <pages/> command to it, as in this script : http://wikisource.org/wiki/MediaWiki:Dictionary.js ).

Then we can split the returned string with a regexp that detects the page breaks (they are in a special span element), and place it in the corresponding divs ; things will break whenever a html formatting element ends on a different page than where it begins, but we could write a function that balances the missing elements.

Why load a giant text and then hack around on broken HTML, when I can just query each page individually? It's not really slow, at least not in Google Chrome.

Meanwhile, I added a feature to hide "header elements" like the proofread line, which kind of disrupts the reading flow. There's a checkbox to toggle header display.

And for Klaus, I added de.wikisource: http://toolserver.org/~magnus/book2scroll/index.html?lang=de&numlen=3&am...

Cheers, Magnus

Klaus Graf

3:20 p.m.

2010/8/15 Magnus Manske magnusmanske@googlemail.com:

...

And for Klaus, I added de.wikisource: http://toolserver.org/~magnus/book2scroll/index.html?lang=de&numlen=3&am...

Thank you!

Klaus Graf

Magnus Manske

5:02 p.m.

On Sun, Aug 15, 2010 at 4:20 PM, Klaus Graf klausgraf@googlemail.com wrote:

...

2010/8/15 Magnus Manske magnusmanske@googlemail.com:

...
And for Klaus, I added de.wikisource: http://toolserver.org/~magnus/book2scroll/index.html?lang=de&numlen=3&am...

Thank you!

And now with "search in this book" function (abusing page search through toolserver), with in-text highlighting, page markers, etc. ! :-)

Cheers, Magnus

ThomasV

5:46 p.m.

Magnus Manske wrote:

...

On Sat, Aug 14, 2010 at 8:49 PM, Thomas Voegtlin thomasV1@gmx.de wrote:

...
...
...
Also, in on_body_scroll, you could avoid the for loop : divide

$('#body').position()['scrollTop'] by the height of an image

'fraid not - sometimes the rendered text runs longer than the image, so the "row" can be higher than the image. Example: http://toolserver.org/~magnus/book2scroll/index.html (scroll down and you'll see it)

hmm, you are right ; I had a "pure scan" version in mind.

But it would be nice to have a version that does not load the text, just in order to see if the WMF servers are fast enough to provide the same fluidity as in the Google Books interface.

I don't think the text retrieval is the slow step here...

No, but the for loop in the scroll handler makes it a bit slow.

Another problem occurs when you are viewing page p and page p-1 is not loaded yet: if you scroll up, at the moment p-1 is loaded, the size of its container div increases, and the text you are viewing (page p) is pushed towards the bottom. On the Dictionary of National Biography this offset can be quite big, so you lose track of the text you are viewing.

I don't really know how to solve this; but it seems to me that using divs with variable size is part of the problem here too.
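One possible mitigation, offered only as a sketch and not as something the demo is known to do, is to shift the scroll position by the height change whenever a row that lies entirely above the viewport grows. It does not help when the growing row is itself partly visible.

```javascript
// Return the scrollTop that keeps the visible content in place after the row
// starting at `rowTop` changes from `oldHeight` to `newHeight`.
function compensatedScrollTop(scrollTop, rowTop, oldHeight, newHeight) {
  const rowIsAboveViewport = rowTop + oldHeight <= scrollTop;
  return rowIsAboveViewport ? scrollTop + (newHeight - oldHeight) : scrollTop;
}

// Page p-1 sits above the viewport and grows from 800px to 1500px.
console.log(compensatedScrollTop(1000, 0, 800, 1500));    // 1700
// A row below the viewport grows: nothing to compensate.
console.log(compensatedScrollTop(1000, 1200, 800, 1500)); // 1000
```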

...

I've switched to specifying width rounded to 100s; however, the API still gives me one-off images (599 instead of 600 px). I could hack the API thumbnail URL, though. Better yet, I can probably skip that step entirely after the first one...

I can see that too (599 instead of 600); but that's not a problem, because the filename does not change: it is still "600px-".

...

Why load a giant text and then hack around on broken HTML, when I can just query each page individually? It's not really slow, at least not in Google Chrome.

Oh, that was in order to display the text without headers, footers and page breaks; but I guess it's OK to show headers, because they are in the scans too. (Here I'm not talking about the headers that you hide with your button; I mean the other elements that are in this field: running title, references, etc.)

Thomas

Magnus Manske

8:07 p.m.

On Sun, Aug 15, 2010 at 6:46 PM, ThomasV thomasV1@gmx.de wrote:

...

Magnus Manske wrote:

...
On Sat, Aug 14, 2010 at 8:49 PM, Thomas Voegtlin thomasV1@gmx.de wrote:

...
...
...
Also, in on_body_scroll, you could avoid the for loop : divide

$('#body').position()['scrollTop'] by the height of an image

'fraid not - sometimes the rendered text runs longer than the image, so the "row" can be higher than the image. Example: http://toolserver.org/~magnus/book2scroll/index.html (scroll down and you'll see it)

hmm, you are right ; I had a "pure scan" version in mind.

But it would be nice to have a version that does not load the text, just in order to see if the WMF servers are fast enough to provide the same fluidity as in the Google Books interface.

I don't think the text retrieval is the slow step here...

No, but the for loop in the scroll handler makes it a bit slow.

Another problem occurs when you are viewing page p, and when p-1 is not loaded yet : if you scroll up, at the moment where p-1 is loaded, the size of its container div increases, and the text you are viewing (page p) is pushed towards the bottom. On the Dictionary of National Biography this offset can be quite big, so you lose track of the text you are viewing.

I don't really know how to solve this ; but it seems to me that using divs with variable size is part of the problem here too.

I tried to solve that by fixing the div height to the same as the image div and using overflow-y to get a per-page "sub-scroll". However, it does not seem to work with divs set to "display:table-cell", and altering that breaks the entire layout. I suppose I could go back to a good ol' table, but that would be a shame...

...

...
I've switched to specifying width rounded to 100s; however, the API still gives me one-off images (599 instead of 600 px). I could hack the API thumbnail URL, though. Better yet, I can probably skip that step entirely after the first one...

I can see that too (599 instead of 600); but that's not a problem, because the filename does not change, it is "600px-"

...
Why load a giant text and then hack around on broken HTML, when I can just query each page individually? It's not really slow, at least not in Google Chrome.

oh, that was in order to display the text without headers, footers and page breaks ; but I guess it's ok to show headers, because they are in the scans too. (here I'm not talking about the headers that you hide with your button ; I mean the other elements that are in this field : running title, references, etc.)

Yes, I know what you mean, but they're not really in that way...

Anyway, I've added a permalink.

Also, I've fiddled with the scrolling for loop; it should be much quicker now.

Cheers, Magnus

Magnus Manske

9:03 p.m.

Last one for today: automatic retrieval of the max page number and the "number length" (e.g. "001" instead of just "1"). Manual parameters will override (and save a query).
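The "number length" padding and the manual override could be sketched like this. `numlen` is the parameter name seen in the de.wikisource URL earlier in the thread; `detect` is a hypothetical callback standing in for whatever query the tool runs when no manual value is given.

```javascript
// Zero-pad a page number to `numlen` digits ("1" -> "001" for numlen 3).
function padPageNumber(page, numlen) {
  return String(page).padStart(numlen, "0");
}

// Prefer a manually supplied numlen; only call `detect` when none was given,
// which is what saves the extra query.
function resolveNumlen(manualNumlen, detect) {
  return manualNumlen !== undefined ? manualNumlen : detect();
}

console.log(padPageNumber(1, 3));    // "001"
console.log(padPageNumber(1234, 3)); // "1234", longer numbers are left alone
```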

Cheers, Magnus

Alex Brollo

16 Aug

12:02 p.m.

Thanks!

I just tried to build a template like this:

[[File:Library-logo-blue-outline.png|30px|link= http://toolserver.org/~magnus/book2scroll/index.html ?lang=it&file={{urlencode:{{PAGENAME}}}}]], to be used into Index: pages.

But urlencode converts spaces into +, where your script doesn't like the output of urlencode... it only likes names where spaces are replaced by underscores.

How can I obtain this transformation by a parser function/by a template? Or: can you modify the scripts, so that url encoded titles are accepted too?

I apologize if my question is banal.

Alex

Magnus Manske

12:22 p.m.

On Mon, Aug 16, 2010 at 1:02 PM, Alex Brollo alex.brollo@gmail.com wrote:

...

Thanks!

I just tried to build a template like this:

[[File:Library-logo-blue-outline.png|30px|link=http://toolserver.org/~magnus/book2scroll/index.html ?lang=it&file={{urlencode:{{PAGENAME}}}}]], to be used into Index: pages.

But urlencode converts spaces into +, where your script doesn't like the output of urlencode... it only likes names where spaces are replaced by underscores.

How can I obtain this transformation by a parser function/by a template? Or: can you modify the scripts, so that url encoded titles are accepted too?

I apologize if my question is banal.

Not at all, though http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=De%27+... seems to work fine. You can also try {{PAGENAMEE}} (note the two E).
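For readers wondering how a script might accept both the `+` form produced by urlencode and the underscore form, here is a minimal sketch. It is an illustration only, not the demo's actual parameter handling, and `normalizeFileParam` is a made-up name.

```javascript
// Turn a raw `file` query value into a title with underscores for spaces.
// Accepts "+", "%20", literal spaces and underscores alike.
function normalizeFileParam(raw) {
  // "+" means space in form encoding; a literal plus arrives as %2B and
  // is only decoded afterwards, so it survives.
  const decoded = decodeURIComponent(raw.replace(/\+/g, " "));
  return decoded.trim().replace(/ +/g, "_");
}

console.log(normalizeFileParam("De%27+matematici.djvu"));
console.log(normalizeFileParam("Rime_%28Vittorelli%29.djvu"));
```

Note that `decodeURIComponent` throws a `URIError` on malformed input such as a lone `%`, so real code would want to guard that call.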

Note that sometimes, the Index: name is not the same as the actual .djvu file, so you should allow attribute {{{1}}} to be the filename (PAGENAME by default), and maybe the start page as {{{2}}}, default "1". That would also allow for the template to be used anywhere, not just on the index page.

Cheers, Magnus

Alex Brollo

4:35 p.m.

2010/8/16 Magnus Manske magnusmanske@googlemail.com

...

Not at all, though http://toolserver.org/%7Emagnus/book2scroll/index.html?lang=it&file=De%27+matematici+italiani+anteriori+all%27invenzione+della+stampa.djvu seems to work fine. You can also try {{PAGENAMEE}} (note the two E).

My question was definitely banal. With {{PAGENAMEE}} the template http://it.wikisource.org/wiki/Template:BackToScroll runs perfectly: see http://it.wikisource.org/wiki/Indice:Rime_%28Vittorelli%29.djvu . Thanks Magnus (even if some of our it.source users pointed out my mistake, I appreciated your attention and suggestions a lot!)

Alex

Magnus Manske

17 Aug

8:11 a.m.

On Mon, Aug 16, 2010 at 5:35 PM, Alex Brollo alex.brollo@gmail.com wrote:

...

2010/8/16 Magnus Manske magnusmanske@googlemail.com

...
Not at all, though

http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=De%27+... seems to work fine. You can also try {{PAGENAMEE}} (note the two E).

My question was definitely banal. With {{PAGENAMEE}} the template http://it.wikisource.org/wiki/Template:BackToScroll runs perfectly: see http://it.wikisource.org/wiki/Indice:Rime_%28Vittorelli%29.djvu . Thanks Magnus (even if some of our it.source users pointed out my mistake, I appreciated your attention and suggestions a lot!)

Nice! Thanks!

Magnus

Billinghurst

10:20 a.m.

Hi Magnus,

The tool has choked on something for http://toolserver.org/~magnus/book2scroll/index.html?lang=en&file=Mrs_Ca... might it be the apostrophe in the url? This url just shows the one page, and no subsequent pages, and that is for whichever starting page one feeds into the url.

Regards Andrew

On 16 Aug 2010 at 13:22, Magnus Manske wrote:

...

On Mon, Aug 16, 2010 at 1:02 PM, Alex Brollo alex.brollo@gmail.com wrote:

...
Thanks!

I just tried to build a template like this:

[[File:Library-logo-blue-outline.png|30px|link=http://toolserver.org/~magnus/book2scroll/index.html ?lang=it&file={{urlencode:{{PAGENAME}}}}]], to be used into Index: pages.

But urlencode converts spaces into +, where your script doesn't like the output of urlencode... it only likes names where spaces are replaced by underscores.

How can I obtain this transformation by a parser function/by a template? Or: can you modify the scripts, so that url encoded titles are accepted too?

I apologize if my question is banal.

Not at all, though http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=De%27+... seems to work fine. You can also try {{PAGENAMEE}} (note the two E).

Note that sometimes, the Index: name is not the same as the actual .djvu file, so you should allow attribute {{{1}}} to be the filename (PAGENAME by default), and maybe the start page as {{{2}}}, default "1". That would also allow for the template to be used anywhere, not just on the index page.

Cheers, Magnus

Alex Brollo

1:15 p.m.

2010/8/17 Billinghurst billinghurst@gmail.com

...

Hi Magnus,

The tool has choked on something for http://toolserver.org/%7Emagnus/book2scroll/index.html?lang=en&file=Mrs_Caudle%2527s_curtain_lectures.djvu&startpage=1 might it be the apostrophe in the url? This url just shows the one page, and no subsequent pages, and that is for whichever starting page one feeds into the url.

Regards Andrew

I had a similar problem, solved simply by *avoiding urlencode* and using the plain output of {{PAGENAMEE}} or similar variables.

http://toolserver.org/%7Emagnus/book2scroll/index.html?lang=en&file=Mrs_Caudle%27s_curtain_lectures.djvu&startpage=1 runs.
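The `%2527` in the failing URL is what an already-encoded `%27` becomes when it is encoded a second time, because the percent sign itself is escaped as `%25`. The sketch below shows that effect and one hedged way a script could tolerate it; `fullyDecode` is a hypothetical helper, and it would mangle a title that legitimately contains a literal `%27`.

```javascript
// Decode repeatedly until the value stops changing (at most `maxRounds` times).
function fullyDecode(value, maxRounds = 3) {
  let current = value;
  for (let i = 0; i < maxRounds; i++) {
    let decoded;
    try {
      decoded = decodeURIComponent(current);
    } catch (err) {
      break; // malformed escape such as a lone "%": keep what we have
    }
    if (decoded === current) break;
    current = decoded;
  }
  return current;
}

console.log(encodeURIComponent("%27")); // "%2527": the "%" itself gets encoded
console.log(fullyDecode("Mrs_Caudle%2527s_curtain_lectures.djvu"));
```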

Alex

Alex Brollo

1:52 p.m.

2010/8/17 Alex Brollo alex.brollo@gmail.com

Marcus, a challenge for you...

http://toolserver.org/%7Emagnus/book2scroll/index.html?lang=it&file=Hymnus_in_Romam.djvu&startpage=9

This is a text from it.source using {{Iwpage}} ThomasV's trick, linking many pages into la.source . It would be great to see the content coming from Iwpage interwiki transclusion!

Alex

Magnus Manske

2:34 p.m.

On Tue, Aug 17, 2010 at 2:52 PM, Alex Brollo alex.brollo@gmail.com wrote:

...

2010/8/17 Alex Brollo alex.brollo@gmail.com

Marcus, a challenge for you...

http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=Hymnus...

This is a text from it.source using {{Iwpage}} ThomasV's trick, linking many pages into la.source . It would be great to see the content coming from Iwpage interwiki transclusion!

I'm not sure how to detect interwiki transclusion universally (in all languages) with a single system, so I fall back to "manual": http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=Hymnus...

Cheers, Magnus

Magnus Manske

2:40 p.m.

On Tue, Aug 17, 2010 at 3:34 PM, Magnus Manske magnusmanske@googlemail.com wrote:

...

On Tue, Aug 17, 2010 at 2:52 PM, Alex Brollo alex.brollo@gmail.com wrote:

...
2010/8/17 Alex Brollo alex.brollo@gmail.com

Marcus, a challenge for you...

http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=Hymnus...

This is a text from it.source using {{Iwpage}} ThomasV's trick, linking many pages into la.source . It would be great to see the content coming from Iwpage interwiki transclusion!

I'm not sure how to detect interwiki transclusion universally (in all languages) with a single system, so I fall back to "manual": http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=Hymnus...

ARGH! Only /some/ of the pages are transcluded!

Magnus Manske

6:19 p.m.

On Tue, Aug 17, 2010 at 2:52 PM, Alex Brollo alex.brollo@gmail.com wrote:

...

2010/8/17 Alex Brollo alex.brollo@gmail.com

Marcus, a challenge for you...

http://toolserver.org/~magnus/book2scroll/index.html?lang=it&file=Hymnus...

This is a text from it.source using {{Iwpage}} ThomasV's trick, linking many pages into la.source . It would be great to see the content coming from Iwpage interwiki transclusion!

Now reloads transcluded pages from the correct language automatically, no textlang parameter necessary!

Magnus

Alex Brollo

6:47 p.m.

2010/8/17 Magnus Manske magnusmanske@googlemail.com

...

Now reloads transcluded pages from the correct language automatically, no textlang parameter necessary!

Magnus

Great! :-) Alex


wikisource-l@lists.wikimedia.org
