Thank you very much.
Did you look at the wikitext of that page?
I have now, and I see that the text displayed is not actually present in
the wikitext / source text. Instead I am seeing these ".djvu include" lines:
<pages index="A simplified grammar of the Swedish language.djvu" include=7 />
What is this? Is it a common format for a Wikisource book?
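
My tentative reading, which may well be wrong: the <pages> tag seems to come
from the ProofreadPage extension and pulls the text in from pages in the
"Page:" namespace. Below is a minimal sketch of how I would work out which
title actually holds the text. It assumes that include=7 refers to
"Page:<index>/7" and that "include" holds a single page number (no ranges).
The function names parse_pages_tag and page_title are my own.

```python
import re

# Matches attributes such as index="Some file.djvu" (quoted) or include=7 (bare).
ATTR_RE = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|([^\s/>]+))')


def parse_pages_tag(wikitext):
    """Return the attributes of the first <pages ... /> tag as a dict, or None."""
    match = re.search(r'<pages\b([^>]*?)/?>', wikitext,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return {name.lower(): quoted or bare
            for name, quoted, bare in ATTR_RE.findall(match.group(1))}


def page_title(attrs):
    """Build the assumed Page: namespace title for a single included page."""
    # Assumption: include=N maps to "Page:<index>/N".
    return f"Page:{attrs['index']}/{attrs['include']}"


wikitext = ('<pages index="A simplified grammar of the Swedish language.djvu" '
            'include=7\n/>')
attrs = parse_pages_tag(wikitext)
print(page_title(attrs))
```

If that assumption holds, fetching the wikitext of the printed title should
return the real text rather than another <pages> line.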
prop=extracts works, but I would say it's a poor fit for many (most?)
wikisource pages.
Why? Because it just pulls out sentences from the wikitext? What is
different about the functioning of prop=revisions, for example?
Plaintext as in wikitext or in parsed html converted to plaintext?
Whichever you think is preferable; the point is to have some clean, readable
text. If the parsed HTML has awkward formatting issues, I might prefer the
wikitext, or vice versa, whichever is easier to work with. Since wikitext is
a markup format, it might be easier to pull specific fields out of it? I
don't know.
You could use something like this to fetch every page
Thanks. I tried replacing the title with a different, more normal book and
it didn't seem to work.
https://en.wikisource.org/w/api.php?generator=allpages&action=query&…
I guess it's the same problem: "revisions" also returns the wikitext, but on
Wikisource the wikitext pulls its text in from separate files?
So would the "parse" action of the API be the tool of choice?
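
If so, something like the following might be a starting point. It is only a
rough sketch: it builds an action=parse request (prop=text, format=json,
formatversion=2) and strips the returned HTML down to text with the standard
library. The helper names and the User-Agent string are made up, and I have
not checked how well the stripping copes with real Wikisource markup.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser

API_URL = "https://en.wikisource.org/w/api.php"


def build_parse_url(title):
    """Build an action=parse URL that asks for the rendered HTML of a page."""
    params = {"action": "parse", "page": title, "prop": "text",
              "format": "json", "formatversion": "2"}
    return API_URL + "?" + urllib.parse.urlencode(params)


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <style>/<script> contents."""
    SKIP = {"style", "script"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")  # keep words in adjacent blocks apart

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Reduce HTML to text with all whitespace runs collapsed to one space."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def fetch_rendered_text(title):
    """Fetch one page via action=parse and return its text (needs network)."""
    request = urllib.request.Request(
        build_parse_url(title),
        headers={"User-Agent": "wikisource-text-sketch/0.1"},  # made-up name
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # With formatversion=2 the HTML is a plain string under parse.text.
    return html_to_text(data["parse"]["text"])
```

This handles a single page only; a whole book would still need the list of
its subpages from somewhere, which is not covered here.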
the WS Export tool can do that
Thanks very much, will give that a shot next.
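
For my own notes, a small sketch of building the export URL. The base URL
and format=txt are from your link; the "lang" and "page" parameter names are
my assumption about how the tool is addressed and would need checking
against its documentation. build_ws_export_url is just a name I picked.

```python
from urllib.parse import urlencode

WS_EXPORT_URL = "https://ws-export.wmcloud.org/"


def build_ws_export_url(title, lang="en", fmt="txt"):
    """Build a WS Export download URL for one work.

    Assumption: the tool accepts "lang" and "page" query parameters
    alongside the "format" parameter shown in the link.
    """
    query = urlencode({"lang": lang, "page": title, "format": fmt})
    return f"{WS_EXPORT_URL}?{query}"


print(build_ws_export_url("A simplified grammar of the Swedish language"))
```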
Thank you,
Julius
On Tue, Sep 20, 2022 at 2:14 AM Sam Wilson <sam(a)samwilson.id.au> wrote:
> > How can I get the full plaintext from an entire book on Wikisource with
> > the API?
>
> Plaintext as in wikitext or in parsed html converted to plaintext?
>
> If it's the latter, the WS Export tool can do that:
> https://ws-export.wmcloud.org/?format=txt
>
> _______________________________________________
> Mediawiki-api mailing list -- mediawiki-api(a)lists.wikimedia.org
> To unsubscribe send an email to mediawiki-api-leave(a)lists.wikimedia.org
>