Thank you very much.

> Did you look at the wikitext of that page?

I have now, and I see that the displayed text is not actually present in the wikitext (the page source). Instead I am seeing ".djvu include" lines like this:

<pages index="A simplified grammar of the Swedish language.djvu" include=7 />

What is this? Is it a common format for a Wikisource book? 

> prop=extracts works, but I would say it's a poor fit for many (most?) wikisource pages.

Why is that? Is it because it just pulls sentences out of the wikitext? And how does prop=revisions behave differently, for example?

> Plaintext as in wikitext or in parsed html converted to plaintext?

Whichever you think is preferable. The point is to end up with clean, readable text. If the parsed HTML has awkward formatting issues I might prefer the wikitext, or vice versa; whichever is easier to work with. Since wikitext is a markup format, might it be easier to pull the text out of the specific fields one is looking for? I don't know.

> You could use something like this to fetch every page 

Thanks. I tried replacing the title with that of a different, more typical book, and it didn't seem to work:

https://en.wikisource.org/w/api.php?generator=allpages&action=query&prop=revisions&rvprop=content&rvslots=main&gapprefix=Moby-Dick_(1851)_US_edition
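
In case it helps to reproduce this, here is a rough Python sketch of the same request. It is only a sketch under assumptions: format=json and formatversion=2 are parameters I added (they are not in the URL above), the function names are made up for illustration, and it does not follow API continuation, so it would only see the first batch of pages.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikisource.org/w/api.php"


def build_allpages_url(prefix):
    """Build the generator=allpages query from the URL above, plus a JSON format."""
    params = {
        "action": "query",
        "generator": "allpages",
        "gapprefix": prefix,
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",      # assumption: not in my original URL
        "formatversion": "2",  # assumption: gives a simpler JSON layout
    }
    return API + "?" + urllib.parse.urlencode(params)


def wikitext_by_title(response):
    """Map each returned page title to the wikitext of its main slot."""
    pages = response.get("query", {}).get("pages", [])
    result = {}
    for page in pages:
        revisions = page.get("revisions", [])
        if revisions:
            result[page["title"]] = revisions[0]["slots"]["main"]["content"]
    return result


def fetch(url):
    """Perform the request (needs network access; not called in this sketch)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikisource-text-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

Calling wikitext_by_title(fetch(build_allpages_url("Moby-Dick_(1851)_US_edition"))) should give the raw wikitext per page, which is presumably where I would again see the "pages index" lines rather than the book text itself.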


I guess it's the same problem: "revisions" also returns wikitext, but Wikisource wikitext pulls its text in from separate files?


So would the "parse" action of the API be the tool of choice?
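
If the "parse" action is the right tool, I imagine the approach would look roughly like the sketch below: ask for the rendered HTML of one page with action=parse and prop=text, then reduce that HTML to plain text. The helper names, the list of block-level tags, and the use of Python's standard html.parser are my own assumptions, and I have not checked how well this copes with real Wikisource markup.

```python
import urllib.parse
from html.parser import HTMLParser

API = "https://en.wikisource.org/w/api.php"


def build_parse_url(title):
    """Build an action=parse request for the rendered HTML of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",  # with version 2, parse.text is a plain string
    }
    return API + "?" + urllib.parse.urlencode(params)


def parsed_html(response):
    """Pull the HTML string out of a formatversion=2 parse response."""
    return response["parse"]["text"]


class _TextExtractor(HTMLParser):
    SKIP = {"style", "script"}  # contents are dropped entirely
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4"}  # assumed list

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags, keep block boundaries as line breaks, collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)
```

The idea would be to run html_to_text(parsed_html(response)) on the JSON returned for each page of the book and concatenate the results.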


> the WS Export tool can do that


Thanks very much, will give that a shot next.
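
For the WS Export route, I assume the download URL can be built along these lines. Only format=txt comes from the link quoted in this thread; the "lang" and "page" parameter names are my guess at how the tool identifies a book, and should be checked against the tool's own form before relying on them.

```python
import urllib.parse

WS_EXPORT = "https://ws-export.wmcloud.org/"


def build_export_url(title, lang="en", fmt="txt"):
    """Build a WS Export download URL for one book.

    'format' is taken from the link in this thread; 'lang' and 'page'
    are assumed parameter names, not confirmed here.
    """
    params = {"format": fmt, "lang": lang, "page": title}
    return WS_EXPORT + "?" + urllib.parse.urlencode(params)
```

For example, build_export_url("Moby-Dick (1851) US edition") would be the URL I try first.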


Thank you,

Julius






On Tue, Sep 20, 2022 at 2:14 AM Sam Wilson <sam@samwilson.id.au> wrote:
 

> > How can I get the full plaintext from an entire book on Wikisource with the API?
>
> Plaintext as in wikitext or in parsed html converted to plaintext?
>
> If it's the latter, the WS Export tool can do that: https://ws-export.wmcloud.org/?format=txt

