Hey,
It seems the following API call works for Wikipedia pages:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsent...
But not for Wikisource pages:
https://en.wikisource.org/w/api.php?action=query&prop=extracts&exsen...
Is there documentation somewhere about the API not working for Wikisource or perhaps only certain actions / props working for certain sites?
How can I get the full plaintext from an entire book on Wikisource with the API?
Thanks, Julius
On Mon, 19 Sept 2022 at 17:03, Julius Hamilton juliushamilton100@gmail.com wrote:
Hey,
It seems the following API call works for Wikipedia pages:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsent...
But not for Wikisource pages:
https://en.wikisource.org/w/api.php?action=query&prop=extracts&exsen...
Is there documentation somewhere about the API not working for Wikisource or perhaps only certain actions / props working for certain sites?
Did you look at the wikitext of that page? https://en.wikisource.org/w/index.php?title=A_Simplified_Grammar_of_the_Swed...
prop=extracts works, but I would say it's a poor fit for many (most?) wikisource pages. https://en.wikisource.org/w/api.php?action=query&prop=extracts&exsen...
How can I get the full plaintext from an entire book on Wikisource with the API?
Plaintext as in wikitext or in parsed html converted to plaintext?
You could use something like this to fetch every page under A_Simplified_Grammar_of_the_Swedish_Language: https://en.wikisource.org/w/api.php?generator=allpages&action=query&...
Regards
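A minimal sketch of such a query, for illustration only. The URL above is truncated in the archive, so the exact parameters used there are unknown; the ones below (gapprefix, gapnamespace, gaplimit, rvprop, rvslots) are assumptions drawn from the standard MediaWiki Action API, and the function names are hypothetical. Note that this returns the wikitext of each page, not the rendered text.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikisource.org/w/api.php"


def subpages_query_url(prefix, limit=50, cont=None):
    """Build a query URL for pages whose title starts with `prefix`.

    Assumed parameters: generator=allpages lists the pages,
    prop=revisions with rvprop=content attaches their wikitext.
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "allpages",
        "gapprefix": prefix,
        "gapnamespace": "0",   # main namespace only
        "gaplimit": str(limit),
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
    }
    if cont:
        # Continuation values returned by a previous response, if any.
        params.update(cont)
    return API + "?" + urlencode(params)


def fetch_json(url):
    """Fetch one API response (performs a network request)."""
    req = Request(url, headers={"User-Agent": "wikisource-sketch/0.1 (example)"})
    with urlopen(req) as resp:
        return json.load(resp)
```

A caller would request subpages_query_url("A Simplified Grammar of the Swedish Language"), read data["query"]["pages"], and repeat with the "continue" object from the response until it is absent.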
How can I get the full plaintext from an entire book on Wikisource with the API?
Plaintext as in wikitext or in parsed html converted to plaintext?
If it's the latter, the WS Export tool can do that: https://ws-export.wmcloud.org/?format=txt
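As a rough sketch, a download link for the public instance could be built like this. Only format=txt appears in this thread; the `lang` and `page` parameter names are assumptions about the tool's query string and should be checked against the form on the WS Export page before relying on them. The function name is hypothetical.

```python
from urllib.parse import urlencode

WS_EXPORT = "https://ws-export.wmcloud.org/"


def ws_export_url(title, lang="en", fmt="txt"):
    """Build a WS Export link for one work.

    `lang` and `page` are assumed parameter names; `format=txt`
    is the only parameter confirmed by the thread.
    """
    return WS_EXPORT + "?" + urlencode({"lang": lang, "page": title, "format": fmt})
```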
Thank you very much.
Did you look at the wikitext of that page?
I did now. I see that the text displayed is not actually present in the wikitext / source text. Instead, I am seeing these ".djvu include" lines:
<pages index="A simplified grammar of the Swedish language.djvu" include=7 />
What is this? Is it a common format for a Wikisource book?
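For context: the <pages> tag comes from the ProofreadPage extension used on Wikisource. It transcludes transcribed scan pages from the Page: namespace, which is why the body text is absent from the wikitext of the main page. A small sketch, assuming a simple regular expression is good enough to spot these tags (it is not a full wikitext parser, and the function name is hypothetical):

```python
import re

# Matches <pages ... /> or <pages ...> and captures the attribute string.
PAGES_TAG = re.compile(r"<pages\b([^>]*?)/?>", re.IGNORECASE)
# Attribute values may be double-quoted, single-quoted, or bare (include=7).
ATTR = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'/>]+))')


def find_pages_tags(wikitext):
    """Return one dict of attributes per <pages> tag found in `wikitext`."""
    tags = []
    for match in PAGES_TAG.finditer(wikitext):
        attrs = {}
        for name, dq, sq, bare in ATTR.findall(match.group(1)):
            attrs[name.lower()] = dq or sq or bare
        tags.append(attrs)
    return tags
```

For the line quoted above, this yields the index file name and include value "7"; the transcribed text itself would then be found on the page titled "Page:" + index + "/7".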
prop=extracts works, but I would say it's a poor fit for many (most?) wikisource pages.
Why? Because it just pulls out sentences from the wikitext? What is different about the functioning of prop=revisions, for example?
Plaintext as in wikitext or in parsed html converted to plaintext?
Whatever you think is preferable; the point is to have some clean, readable text. If the parsed HTML has any awkward formatting issues, I might prefer the wikitext, or vice versa. Whichever is easier to work with. Technically, since wikitext is a markup format, it might be easier to pull out the specific fields you are seeking? I don't know.
You could use something like this to fetch every page
Thanks. I tried replacing the title with a different, more normal book and it didn't seem to work.
https://en.wikisource.org/w/api.php?generator=allpages&action=query&...
I guess it's the same problem: "revisions" also pulls out wikitext, but Wikisource wikitext pulls in its text from separate files?
So would the "parse" action of the API be the tool of choice?
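A sketch of what the "parse" route could look like, assuming action=parse with prop=text and formatversion=2 (standard Action API parameters, but not confirmed anywhere in this thread). Because action=parse returns rendered HTML with transclusions expanded, the HTML still has to be reduced to plain text; the converter below is a deliberately naive one built on the standard library, and all function names are hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikisource.org/w/api.php"


class _TextExtractor(HTMLParser):
    SKIP = {"style", "script"}  # contents dropped entirely
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags, collapse whitespace, and drop empty lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


def parse_url(title):
    """Build an action=parse URL for one page title."""
    params = {"action": "parse", "page": title, "prop": "text",
              "format": "json", "formatversion": "2"}
    return API + "?" + urlencode(params)


def fetch_rendered_text(title):
    """Fetch the rendered HTML of `title` and return it as plain text (network request)."""
    req = Request(parse_url(title), headers={"User-Agent": "wikisource-sketch/0.1 (example)"})
    with urlopen(req) as resp:
        data = json.load(resp)
    return html_to_text(data["parse"]["text"])
```

Page numbers, headers, and navigation boxes in the rendered HTML would still end up in the output, so some site-specific cleanup is likely needed on top of this.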
the WS Export tool can do that
Thanks very much, will give that a shot next.
Thank you,
Julius
On Tue, Sep 20, 2022 at 2:14 AM Sam Wilson sam@samwilson.id.au wrote:
How can I get the full plaintext from an entire book on Wikisource with the API?
Plaintext as in wikitext or in parsed html converted to plaintext?
If it's the latter, the WS Export tool can do that: https://ws-export.wmcloud.org/?format=txt
Mediawiki-api mailing list -- mediawiki-api@lists.wikimedia.org To unsubscribe send an email to mediawiki-api-leave@lists.wikimedia.org
I am currently trying to install ws-export (https://github.com/wikimedia/ws-export) and I'm having trouble with Composer. Would anyone know anything about this?
composer install --no-dev
Your lock file does not contain a compatible set of packages. Please run composer update.

composer update
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Root composer.json requires PHP extension ext-dom * but it is missing from your system. Install or enable PHP's dom extension.
  Problem 2
    - Root composer.json requires PHP extension ext-intl * but it is missing from your system. Install or enable PHP's intl extension.
  Problem 3
    - Root composer.json requires PHP extension ext-sqlite3 * but it is missing from your system. Install or enable PHP's sqlite3 extension.
  Problem 4
    - Root composer.json requires PHP extension ext-zip * but it is missing from your system. Install or enable PHP's zip extension.
  Problem 5
    - symfony/framework-bundle[v5.4.0, ..., v5.4.12] require ext-xml * -> it is missing from your system. Install or enable PHP's xml extension.
    - Root composer.json requires symfony/framework-bundle 5.4.* -> satisfiable by symfony/framework-bundle[v5.4.0, ..., v5.4.12].

To enable extensions, verify that they are enabled in your .ini files:
    - /etc/php/7.4/cli/php.ini
    - /etc/php/7.4/cli/conf.d/10-opcache.ini
    - /etc/php/7.4/cli/conf.d/10-pdo.ini
    - /etc/php/7.4/cli/conf.d/20-calendar.ini
    - /etc/php/7.4/cli/conf.d/20-ctype.ini
    - /etc/php/7.4/cli/conf.d/20-exif.ini
    - /etc/php/7.4/cli/conf.d/20-ffi.ini
    - /etc/php/7.4/cli/conf.d/20-fileinfo.ini
    - /etc/php/7.4/cli/conf.d/20-ftp.ini
    - /etc/php/7.4/cli/conf.d/20-gettext.ini
    - /etc/php/7.4/cli/conf.d/20-iconv.ini
    - /etc/php/7.4/cli/conf.d/20-json.ini
    - /etc/php/7.4/cli/conf.d/20-phar.ini
    - /etc/php/7.4/cli/conf.d/20-posix.ini
    - /etc/php/7.4/cli/conf.d/20-readline.ini
    - /etc/php/7.4/cli/conf.d/20-shmop.ini
    - /etc/php/7.4/cli/conf.d/20-sockets.ini
    - /etc/php/7.4/cli/conf.d/20-sysvmsg.ini
    - /etc/php/7.4/cli/conf.d/20-sysvsem.ini
    - /etc/php/7.4/cli/conf.d/20-sysvshm.ini
    - /etc/php/7.4/cli/conf.d/20-tokenizer.ini
You can also run `php --ini` in a terminal to see which files are used by PHP in CLI mode.
Alternatively, you can run Composer with `--ignore-platform-req=ext-dom --ignore-platform-req=ext-intl --ignore-platform-req=ext-sqlite3 --ignore-platform-req=ext-zip --ignore-platform-req=ext-xml` to temporarily ignore these required extensions.
I need to install these 5 extensions? Is that really the solution? Shouldn’t they be automatically installed?
Thank you, Julius
I need to install these 5 extensions? Is that really the solution? Shouldn’t they be automatically installed?
Yes. They are likely either already provided alongside PHP but not activated, or available as separate packages. For example, on Debian and its derivatives they are packaged as php-dom, php-intl...
More conveniently, you might also just use the public WS Export instance: https://ws-export.wmcloud.org/ It's not suitable if you want to export tens of thousands of pages, but for small workloads it should be fine.
Thomas
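A quick way to see which of the five extensions are still missing is to compare them against the output of `php -m`, which lists the modules loaded by the CLI. The sketch below assumes `php` is on the PATH; the extension names are taken from the Composer error above, and the function names are hypothetical.

```python
import subprocess

# Extension names as reported in the Composer error message.
REQUIRED = {"dom", "intl", "sqlite3", "zip", "xml"}


def missing_extensions(php_m_output, required=REQUIRED):
    """Return the required extensions absent from `php -m` output, sorted."""
    loaded = {
        line.strip().lower()
        for line in php_m_output.splitlines()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if line.strip() and not line.startswith("[")
    }
    return sorted(set(required) - loaded)


def check_local_php():
    """Run `php -m` locally and report what is missing (requires PHP installed)."""
    result = subprocess.run(["php", "-m"], capture_output=True, text=True, check=True)
    return missing_extensions(result.stdout)
```

An empty list from check_local_php() would suggest that `composer install --no-dev` no longer fails on platform requirements.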