Marek Czuma <marek.czuma(a)contractors.roche.com>
3:36 PM
to mediawiki-api
Good morning!
I'm at a loss, because I have a problem with the MediaWiki API and I really
can't find an answer.
I'm a programmer trying to work with the allpages endpoint: I fetch 500
pages, take apcontinue, and fetch the next 500 pages (from that apfrom
point).
Everything is fine until the moment I want to fetch a title like
Somenamespace:Page. I can't send a request with a colon inside it.
The response:
"error": {
    "code": "invalidtitle",
    "info": "Bad title \"Somenamespace:Page\".",
    "*": "See http://syswiki.gene.com/syswiki/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes."
}
Could you help me? I need to fetch this page, but I don't know how to do it
properly.
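For what it's worth, the fetch-a-batch/follow-apcontinue loop described above can be sketched with the standard library alone. The function names and endpoint handling below are mine, not from the API docs. One thing worth checking: urllib.parse.urlencode percent-encodes reserved characters such as ":", which hand-built query strings may not.

```python
import json
import urllib.parse
import urllib.request

def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def iter_all_pages(api_url, batch=500, fetch=http_get_json):
    """Yield every page title from list=allpages, following apcontinue."""
    params = {"action": "query", "list": "allpages",
              "aplimit": batch, "format": "json"}
    while True:
        # urlencode percent-encodes reserved characters such as ":",
        # so a continuation title like "Somenamespace:Page" survives intact.
        data = fetch(api_url + "?" + urllib.parse.urlencode(params))
        for page in data["query"]["allpages"]:
            yield page["title"]
        cont = data.get("continue")
        if not cont:
            break
        # Feed the server-supplied continuation back into the next request.
        params.update(cont)
```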
Currently the codes for uncaught exceptions include the class name, for
example "internal_api_error_ReadOnlyError", or
"internal_api_error_DBQueryError", or possibly something like
"internal_api_error_MediaWiki\Namespace\FooBarException". As you can see in
that last example, that can get rather ugly and complicates recent attempts
to verify that all error codes use a restricted character set.
Thus, we are deprecating these error codes. In the future all such errors
will use the code "internal_api_error". The date for that change has not
yet been set.
If a client for some reason needs to see the class of the uncaught
exception, this is available in a new 'errorclass' data property in the API
error. This will be returned beginning in 1.33.0-wmf.8 or later, see
https://www.mediawiki.org/wiki/MediaWiki_1.33/Roadmap for a schedule. Note
that database errors will report the actual class, such as
"MediaWiki\rdbms\DBQueryError", rather than the old unprefixed name that
had been maintained for backwards compatibility.
Clients relying on specific internal error codes or detecting internal
errors by looking for an "internal_api_error_" prefix should be updated to
recognize "internal_api_error" and to use 'errorclass' in preference to
using any class name that might be present in the error code.
In JSON format with errorformat=bc, an internal error might look something
like this:
{
    "error": {
        "code": "internal_api_error_InvalidArgumentException",
        "info": "[61e9f71eedbe401f17d41dd2] Exception caught: Testing",
        "errorclass": "InvalidArgumentException",
        "trace": "InvalidArgumentException at ..."
    },
    "servedby": "hostname"
}
With modern errorformats, it might look like this:
{
    "errors": [
        {
            "code": "internal_api_error_InvalidArgumentException",
            "text": "[61e9f71eedbe401f17d41dd2] Exception caught: Testing",
            "data": {
                "errorclass": "InvalidArgumentException"
            }
        }
    ],
    "trace": "InvalidArgumentException at ...",
    "servedby": "hostname"
}
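A client-side sketch of the recommended update (the helper function is mine, not from the announcement; it handles both response shapes shown above):

```python
def internal_error_class(err):
    """Return the exception class for an internal API error, else None.

    Prefers the 'errorclass' property (directly on the error object in
    errorformat=bc, under 'data' in the modern formats) and only falls
    back to parsing the deprecated class-bearing error code.
    """
    code = err.get("code", "")
    if code != "internal_api_error" and not code.startswith("internal_api_error_"):
        return None
    for container in (err, err.get("data") or {}):
        if "errorclass" in container:
            return container["errorclass"]
    # Deprecated fallback: class name embedded in the code itself.
    return code[len("internal_api_error_"):] or None
```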
--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation
_______________________________________________
Mediawiki-api-announce mailing list
Mediawiki-api-announce(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce
FYI
---------- Forwarded message ----------
From: Subramanya Sastry <ssastry(a)wikimedia.org>
Date: 14 November 2018 at 21:48
Subject: [Wikitech-l] Content Negotiation Protocol for Parsoid HTML in the
REST API
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hello everyone,
The Core Platform and Parsing teams at the Wikimedia Foundation are glad
to announce the implementation of a content negotiation protocol for
Parsoid HTML in the REST API [1]. This was deployed to the Wikimedia
cluster on October 1, 2018.
TL;DR
-----
Parsoid HTML clients can now use the Accept header to specify which
version of content they expect when requesting Parsoid HTML from the
REST API. If omitted, as before, they will get whatever version of the
HTML is in storage, regardless of any breaking changes it may contain.
Parsoid's HTML is versioned
---------------------------
An advantage of Parsoid’s HTML output is that it is both specced and
versioned [2]. By adhering to the principles of semantic versioning [3],
Parsoid can signal to clients what kinds of changes can be expected
in the output between versions.
However, until recently, Parsoid always returned the latest version
of its HTML. Naturally, this posed challenges when deploying breaking
changes since clients had to be prepared to consume the newer version.
Rolling out new HTML versions without breaking clients
------------------------------------------------------
Throughout Parsoid's history, its developers have had close enough
contact with the developers of Parsoid clients (most of which are
internal to the Wikimedia Foundation) to coordinate the deployment
of breaking changes to the HTML. This mainly involved ensuring all
known clients were forward and backwards compatible with the newer
HTML version before deploying the change. Needless to say, as more
clients came along, this informal process would no longer suffice;
a scalable and predictable version upgrade solution was needed.
Content Negotiation Protocol
----------------------------
To solve this problem, a content negotiation protocol [4] relying on
HTTP Accept headers was implemented. See RESTBase’s documentation [5]
for the exact details of the protocol. What follows is just an
informal description.
Parsoid clients are expected to pass an Accept header that specifies
the HTML version they can handle. If the version present in storage
does not satisfy the request, RESTBase will attempt to resolve the
inconsistency. However, if the requested version cannot be satisfied,
an (HTTP 406) error will be returned. The meaning of “satisfied” here
mostly follows semver’s caret semantics [6] (the main difference being
that the patch level is ignored).
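Informally, the "satisfied" check can be pictured like this. This is only my reading of the caret-semantics-minus-patch-level rule described above, not RESTBase's actual implementation:

```python
def satisfies(stored, requested):
    """Sketch of the 'satisfied' check: semver caret semantics with the
    patch level ignored. Same major version required; the stored minor
    version must be at least the requested minor version."""
    s_major, s_minor, _ = (int(x) for x in stored.split("."))
    r_major, r_minor, _ = (int(x) for x in requested.split("."))
    return s_major == r_major and s_minor >= r_minor
```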
If a client does not pass the Accept header, everything works exactly
like before, with all the downsides of the previous behaviour:
no protection from breaking changes; you get whatever HTML version
is currently in storage.
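Opting in is just a matter of sending the header. A small sketch of building it; the profile-URL format is my reading of the RESTBase documentation, so double-check it against [5]:

```python
def parsoid_accept(version):
    """Build an Accept header pinning a Parsoid HTML spec version.

    `version` is whatever spec version your client handles, e.g. "2.0.0".
    """
    return ('text/html; charset=utf-8; '
            'profile="https://www.mediawiki.org/wiki/Specs/HTML/%s"' % version)
```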
Caveats
-------
The deployed Parsoid version generates HTML versions 1.8.0 [7] and
2.0.0 [8]. But, it is worth mentioning that the oldest acceptable
version supported is 1.6.0, so if you’re sending an Accept header with
a version less than 1.6.0, your application will break. The reason
for this odd constraint is that we mistakenly released that version
without bumping the major version [9] even though it introduced a
breaking change. Mea culpa!
Also, RESTBase only stores the latest version so, as content gets
rerendered and storage gets replaced, clients requesting older content
have to pay a latency penalty while the stored content is downgraded
to an appropriate version. Hence, we encourage Parsoid HTML clients to
pay attention to announcements about major version changes and upgrade
promptly. Going forward, we’ll send announcements about Parsoid HTML
version changes on the mediawiki-api-announce mailing list.
How does this impact 3rd party wikis?
-------------------------------------
Finally, astute readers will have noted that this announcement is
concerning the REST API. However, many 3rd party installs have VE
communicating directly with Parsoid and may be wondering how they’ll
be impacted by the change.
Parsoid has had a similar protocol (the difference is mainly in
respecting the patch level) implemented since the v0.9.0 release [7].
So, going forward, when upgrading Parsoid or VE, if the HTML version
requested by VE can be provided by Parsoid, the upgrade will be safe.
In Conclusion
-------------
Content negotiation now allows us to deploy new Parsoid features to the
Wikimedia cluster without needing prior coordination with all clients.
Clients can continue to request older versions until they are ready to
update (assuming they don’t fall too far behind since we only plan on
supporting two major versions concurrently). And, conversely, they can
request newer versions with the guarantee that they will not receive
incompatible content.
[1]: https://phabricator.wikimedia.org/T128040
[2]: https://www.mediawiki.org/wiki/Specs/HTML
[3]: https://semver.org/
[4]: https://tools.ietf.org/html/rfc7231#section-5.3
[5]: https://www.mediawiki.org/wiki/API_versioning#Content_format_stability_and_negotiation
[6]: https://www.npmjs.com/package/semver#caret-ranges-123-025-004
[7]: https://www.mediawiki.org/wiki/Specs/HTML/1.8.0
[8]: https://www.mediawiki.org/wiki/Specs/HTML/2.0.0
[9]: https://lists.wikimedia.org/pipermail/mediawiki-l/2018-March/047337.html
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi,
I'm trying to create a dataset of summaries vs full text bodies for
automatic text summarization models.
I was looking at the online api for retrieving the summary of a page, so I
could recreate it in my Spark code for parsing wiki dumps. Specifically, I
was looking at the regex in:
https://phabricator.wikimedia.org/diffusion/ETEX/browse/master/includes/Api…
$regexp = '/^(.*?)(?=' . ExtractFormatter::SECTION_MARKER_START . ')/s';
With SECTION_MARKER_START filled in:
$regexp = '/^(.*?)(?=' . \1\2 . ')/s';
However, when I plug that expression into an online tester (regex101.com),
it reports for \2: "This token references a non-existent or invalid
subpattern". I am wondering whether this is a bug, or whether I'm placing
it incorrectly?
The alternative branch is taken when plaintext is set to false; that's for
parsing HTML, correct, and not applicable to the XML in wiki dumps?
Thanks for your help,
Dan Kramer
Now that MediaWiki has a pure-PHP tidying implementation, we are
deprecating non-tidy output.[1] Further, the future rewrite of Parsoid in
PHP[2] and its merge to core will have "tidying" as an integral feature.
Thus, the disabletidy parameter to action=parse is being deprecated and
will be removed at some point in the future. Clients should stop using the
parameter and begin using tidied HTML output.
This change should be deployed to Wikimedia wikis with 1.32.0-wmf.24 or
later, see https://www.mediawiki.org/wiki/MediaWiki_1.32/Roadmap for a
schedule.
[1]: https://phabricator.wikimedia.org/T198214
[2]: https://phabricator.wikimedia.org/tag/parsoid-php/
--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation
Hi,
I am trying to adapt a program that searches and fetches pages from the
main US Wikipedia to do the same against a private MediaWiki installation.
The program is written using the wikipedia python library
https://github.com/goldsmith/Wikipedia.
Unfortunately, I cannot figure out how to get the library to "point" at
MediaWiki installations other than the Wikipedia ones.
Can anyone either
a) show me how to change the target endpoint with this library or
b) point me to another Python library where I can do the same
and
c) provide me with a URL to a publicly accessible non-Wikipedia MediaWiki
site with API that I can use to test the results of a/b?
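On (a): every MediaWiki install exposes the same api.php interface, so retargeting is ultimately just a matter of changing the base URL. A hedged stdlib sketch (the host below is a placeholder, and this bypasses the wikipedia library entirely); for (b), mwclient is one library that accepts an arbitrary host and script path:

```python
import urllib.parse

def search_url(api_url, term, limit=10):
    """Build a list=search query against any MediaWiki api.php endpoint."""
    params = {"action": "query", "list": "search",
              "srsearch": term, "srlimit": limit, "format": "json"}
    return api_url + "?" + urllib.parse.urlencode(params)

# Placeholder endpoint; substitute your private wiki's api.php URL.
url = search_url("https://wiki.example.org/w/api.php", "annotation")
```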
Any help will be much appreciated.
Fred
--
Fred Zimmerman
Hi there!
I'm trying to generate 10 random articles in Hebrew using this request:
https://he.wikipedia.org/w/api.php?format=xml&action=query&generator=random…
As you can see, I'd like to get their intros, but some articles' intros
come back empty in the API response, even though they're not empty in
Wikipedia itself.
I tried making the same request in English, and the problem didn't happen.
This is the first time I've encountered this weird problem after making
this request in my app for over a year... Am I missing something? Is there
anything new I should add to the request, or is it just a bug?
Thanks!
In preparation for multi-content revisions (MCR), we've made[1] several
changes to action=compare. These changes should be deployed to Wikimedia
wikis with 1.32.0-wmf.19 or later. The changes should also be available on
the Beta Cluster[2] soon for testing.
*== Supplying content using templated parameters ==*
For MCR, when specifying content (as with the `fromtext` and `totext`
parameters) we need the ability to specify content for each "slot" in the
page. The way this works for action=compare is that (1) the base revision
is determined using the parameters that identify the page and/or revision
(`fromtitle`/`totitle`, `fromrev`/`torev`, and so on), then (2) the new
`fromslots`/`toslots` parameter specifies which slots are being changed,
and then (3) new parameters for each value of `fromslots`/`toslots` specify
the content for each of those slots.
In the API help, these new parameters for each value of
`fromslots`/`toslots` are described as "templated parameters" and have a
placeholder in their names. Where the help describes "totext-{slot}", it
means that if you supply "toslots=foo|bar" there will be corresponding
parameters "totext-foo" and "totext-bar" to supply the text for those two
slots.
In Special:ApiSandbox, input fields for "totext-foo" and "totext-bar" will
appear when you enter those values for "toslots".
In the future templated parameters will be introduced for action=edit and
action=parse as well, and other modules as the need arises.
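To make the pattern concrete, here is a hypothetical request touching two slots. The "aux" slot name and the revision id are invented for illustration:

```python
# Hypothetical action=compare request supplying content for two slots.
params = {
    "action": "compare",
    "fromrev": "12345",            # illustrative base revision id
    "toslots": "main|aux",         # which slots are being changed
    # One templated "totext-{slot}" parameter per value in toslots:
    "totext-main": "new wikitext for the main slot",
    "totext-aux": "new content for the aux slot",
    "format": "json",
}

# The templated parameter names are derived from the toslots value:
slot_params = ["totext-" + slot for slot in params["toslots"].split("|")]
```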
*== Deprecations and changes in action=compare ==*
The following parameters are deprecated, with replacements as indicated.
- `fromtext` is replaced with `fromtext-main` with `fromslots=main`.
- `fromcontentmodel` is replaced with `fromcontentmodel-main` with
`fromslots=main`.
- `fromcontentformat` is replaced with `fromcontentformat-main` with
`fromslots=main`.
- `totext` is replaced with `totext-main` with `toslots=main`.
- `tocontentmodel` is replaced with `tocontentmodel-main` with
`toslots=main`.
- `tocontentformat` is replaced with `tocontentformat-main` with
`toslots=main`.
The `fromsection` and `tosection` parameters are also deprecated with no
direct replacement. The intended use case for these parameters was to
simulate a diff of a section edit, by supplying the edited section's text
as `totext` and supplying `fromsection` to extract just the section being
edited from the current revision. This use case is now supported by
specifying `totext-main` as the edited section's text and supplying
`tosection-main` to identify the section being edited, which will be
combined into the existing content as for a section edit. This will result
in a diff more closely matching that returned for a section edit from the
web UI with respect to line numbers and context lines.
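The section-edit use case above might be encoded like this (the page title and section number are illustrative):

```python
import urllib.parse

# Simulate a section-edit diff: compare a page's current revision against
# edited text for one section, spliced in via tosection-main.
params = {
    "action": "compare",
    "fromtitle": "Example",                      # hypothetical page
    "toslots": "main",
    "totext-main": "== History ==\nEdited section text.",
    "tosection-main": "2",                       # splice into section 2
    "format": "json",
}
query = urllib.parse.urlencode(params)
```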
By default action=compare will return one HTML blob combining the diffs of
all slots, much as is shown in the web UI. The new `slots` parameter may be
used to get separate HTML blobs for each slot's diff and to limit which
slots' diffs are returned.
*== Other notes ==*
Note that the already-deprecated[3] diffing parameters to revision-related
modules, such as the rvdifftotext parameter to action=query&prop=revisions,
will not be updated for MCR. Code using these parameters should be updated
to use action=compare instead.
[1]: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/448160
[2]: e.g. https://en.wikipedia.beta.wmflabs.org/w/api.php?modules=compare
[3]:
https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2017-June/0001…
--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation
Thanks, Brad. Yes, I was being lazy since the token is always the same.
Matthew
|| Matthew Cahn | Linux Administrator | Dept. of Molecular Biology / Research Computing | Princeton University | (609) 258-5404 | mcahn(a)princeton.edu ||
Oh, of course, POST, thanks. Now it works, after also removing the "r" (raw string) prefix from the token, since it already has an escaped backslash, and removing urllib.parse.urlencode from the parameters. Here's the working version in case anyone would like to see it:
#!/bin/env python
import requests

baseUrl = 'http://chlamyannotations-test2.princeton.edu/api.php'
responseFilename = '/molbio2/mcahn/temp/createPagesResponse.html'

# Fetch a CSRF token (printed for reference; the edit below just uses the
# anonymous token '+\', which is always the same).
params = {'action': 'query',
          'meta': 'tokens'}
r = requests.get(baseUrl, params=params)
print(r)
print(r.text)

# Edits must be sent as a POST request.
params = {'action': 'edit',
          'title': 'TestPage3',
          'summary': 'Test summary',
          'text': 'article content',
          'token': '+\\'}
r = requests.post(baseUrl, data=params)
print(r)
with open(responseFilename, 'w') as f:
    f.write(r.text)