Hi all,
We are a group of undergraduates working on a project using the MediaWiki API. While working on this project, we ran into a unique issue involving pageviews. When trying to pull pageview data for a particular page, the redirects of a page would not be counted along with the original pageviews. For example, the Hong Kong protests page only has direct views, and not views from previous titles.
We attempted to use the wmflabs.org tool, but it only shows data from a certain date. (Example link: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2019-07-01&end=2020-01-25&pages=2019%E2%80%9320_Hong_Kong_protests%7CChina )
Then we tried using a page's redirects and their old page ids to grab the pageview data, but no data was returned. We also tried to grab data for a page that we knew had a long history, but the "pvipcontinue" parameter did not appear ( https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bpageview... ). (Example: https://www.mediawiki.org/wiki/Special:ApiSandbox#action=query&format=js... )
In the end, we are trying to get an accurate count of views for a certain page, no matter the source.
Any guidance or assistance is greatly appreciated.
Thanks, Jackie, James, Junyi, Kirby
Hi James,
I was aware of the first issue, but this is the first time that I can recall hearing about the second. See https://phabricator.wikimedia.org/T121912. You may want to ask your second question in that thread if no one responds to it here.
Good luck,
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
Hi,
When I tested the API, it seemed to work with redirects (e.g. https://mediawiki.org/w/api.php?action=query&format=json&prop=pagevi... where Main_Page redirects to the page MediaWiki).
> Then we attempted to use the redirects of a page and using the old page ids to grab the pageview data
Just to be clear, when a page is moved, it keeps its page_id. So redirects may have historically had the page_id that the target page has now.
If all else fails, you can look at the big dataset files at https://dumps.wikimedia.org/other/analytics/ . They should be available (in some form or another) going back to 2007, and I believe they are the source of the data that the api and all other tools return.
-- Brian
As an aside, this may be a case where generators in the API are useful, e.g. https://en.wikipedia.org/w/api.php?action=query&generator=redirects&... (Note: this does not include the actual non-redirect article in the results, and you have to pay close attention to the continue parameters.)
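For illustration, here is a minimal sketch of how the generator and its continuation could be driven from a script. It assumes `formatversion=2` (so `query.pages` is a list) and that every key of the returned `continue` object is sent back unchanged on the next request; the function names and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def redirects_params(title, cont=None):
    """Build action API parameters that generate the redirects to `title`."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "redirects",
        "titles": title,
        "grdlimit": "max",
    }
    # Send back every key of the previous response's "continue" object.
    params.update(cont or {})
    return params

def collect_redirects(title, fetch):
    """Keep requesting batches until the response has no "continue" object."""
    found, cont = [], None
    while True:
        data = fetch(redirects_params(title, cont))
        found.extend(p["title"] for p in data.get("query", {}).get("pages", []))
        cont = data.get("continue")
        if not cont:
            return found

def http_fetch(params):
    """Network fetch (hypothetical User-Agent); pass a stub instead to test offline."""
    url = API + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "redirect-views-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `collect_redirects("China", http_fetch)` would return only the redirect titles, so the target article itself still has to be added to the list by hand.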
On Sun, Feb 23, 2020 at 4:17 PM James Gardner via Wikitech-l <wikitech-l@lists.wikimedia.org> wrote:
> We attempted to use the wmflabs.org tool, but it only shows data from a certain date. (Example link: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2019-07-01&end=2020-01-25&pages=2019%E2%80%9320_Hong_Kong_protests%7CChina )
There's a redirectview tool (see the "redirects" links at the bottom of the page you linked) but it can't be filtered by date so it probably can't help you.
> Then we attempted to use the redirects of a page and using the old page ids to grab the pageview data, but there was no data returned. When we attempted to grab data for a page that we knew would have a long past, but the parameter of "pvipcontinue" did not appear ( https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bpageview... ). (Example: https://www.mediawiki.org/wiki/Special:ApiSandbox#action=query&format=js... )
That API displays a limited set of metrics and is focused on caching and being backend-agnostic. There is no way to get old data; pvipcontinue is for fetching data about more pages. If you need something more specific, you should use the Analytics Query Service (which the other APIs rely on) directly: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
I think you'll have to piece the data together using the MediaWiki redirects API and AQS.
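As a rough sketch of the AQS half, the snippet below builds a per-article request URL for an explicit date range and sums the daily counts in a response. It assumes the public `per-article` endpoint layout on wikimedia.org, `YYYYMMDD` dates, and a response whose `items` each carry a `views` field; the helper names are invented for this example.

```python
from urllib.parse import quote

AQS = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(title, start, end, project="en.wikipedia.org",
                    access="all-access", agent="user", granularity="daily"):
    """Build the per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores, and everything else is percent-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return "/".join([AQS, project, access, agent, article, granularity, start, end])

def total_views(response):
    """Sum the 'views' field over all items in a decoded JSON response."""
    return sum(item["views"] for item in response.get("items", []))
```

Running this once for the article and once for each redirect title, then adding up the totals, is one way to piece the data together.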
Hi all,
Thanks for all the help and advice with this issue, especially the pointer to the Redirect Views tool on wmflabs. We'll try using that tool to download the pageview data we need and manually filter by date to map redirects to the page. We'll also look into the Wikimedia REST API to see if it can help us as well.
Thanks again,
Jackie, James, Junyi, Kirby
> We attempted to use the wmflabs.org tool, but it only shows data from a certain date
I'm assuming you want relative dates, not exact dates? You can do this by using the range=latest-N URL parameter (where N is the number of days). See https://tools.wmflabs.org/pageviews/url_structure/ and https://tools.wmflabs.org/redirectviews/url_structure/ for Redirect Views. This mirrors the pvipdays parameter of the action API.
I'm sorry there is no backend for these tools, so if you need automation you'll have to scrape it or re-implement its logic yourself.
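If all you need is a link to the tool for the last N days, something like the following sketch may do. It reuses the project, platform, agent and pages parameters from the example link earlier in the thread and swaps start/end for range=latest-N; the function name is made up.

```python
from urllib.parse import quote, urlencode

def pageviews_tool_url(pages, days, project="en.wikipedia.org",
                       platform="all-access", agent="user"):
    """Build a Pageviews tool link covering the latest `days` days."""
    query = urlencode({
        "project": project,
        "platform": platform,
        "agent": agent,
        "range": f"latest-{days}",
        # Several pages are separated by "|" and use underscores for spaces.
        "pages": "|".join(p.replace(" ", "_") for p in pages),
    }, quote_via=quote)
    return "https://tools.wmflabs.org/pageviews/?" + query
```

This only produces a URL to open in a browser; it does not fetch any data.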
> In the end, we are trying to get an accurate count of view for a certain page no matter the source.
Keep in mind that redirects can change, and historically may not have been the "same" page. For instance, suppose I create the article Foo, someone else creates Bar, and some months later Foo is redirected to Bar. To accurately get the views of just Bar, you'll need to somehow exclude the time when Foo was a different article. Page moves can also cause unexpected results (Foo is moved to Baz, Bar is moved to Foo, etc.). Finally, page IDs can change too, say if I delete Foo, then move Bar to Foo. There isn't a foolproof solution, it seems, but simply including redirects is usually enough to give you what you want.
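To make the Foo/Bar case concrete, here is a small sketch that counts a redirect's views only from the day it became a redirect. It assumes you already have daily view counts per title and have worked out that date yourself; both function names are hypothetical.

```python
from datetime import date

def views_since(daily_views, redirect_since):
    """Sum the views on or after the day the title became a redirect."""
    return sum(v for day, v in daily_views.items() if day >= redirect_since)

def combined_views(target_views, redirects):
    """Add the target's views to each redirect's views from its cut-over date.

    `redirects` is a list of (daily_views, redirect_since) pairs.
    """
    total = sum(target_views.values())
    for daily_views, since in redirects:
        total += views_since(daily_views, since)
    return total
```

Views Foo received while it was still a separate article fall before the cut-over date and are left out of Bar's total.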
~ MA
Thanks for the clarification of how redirects work, and what we should keep in mind when trying to count pageviews. Do you know if there's a way to find the date(s) when a page is redirected using the API? We know we can get the 'old' page ids of redirected pages using the API, but we're not sure if using the creation date of these page ids would be accurate. Also, what's the difference between redirects and page moves if there is one?
We may stick to including redirects without trying to avoid overcounting, as this appears to be a more complicated issue than we thought. We are working to collect pageviews within a specific time frame, so relative dates aren't quite what we're looking for.
Thanks again!
Jackie, James, Junyi, Kirby
Unfortunately there's no proper log of redirect changes (I recently filed https://phabricator.wikimedia.org/T240065 for this). There are change tags https://www.mediawiki.org/wiki/Help:Tags that identify redirect changes -- "mw-new-redirect" and "mw-changed-redirect-target", specifically -- but I am not sure if this is easily searchable via the action API. Someone on this list might know.
Redirects can be created directly, say as an alternate name or misspelling of an article (e.g. "Barak Obama" redirects to "Barack Obama"; it was never an article on its own). Usually when a page is moved, a redirect is left behind at the old location ("20th Century Fox" was recently renamed to "20th Century Studios"), but sometimes the redirects are suppressed. So you could focus only on page moves, in which case you could query the page move log using the logevents API https://www.mediawiki.org/wiki/API:Logevents, specifically with letype=move. From that you could piece together the pageviews. You wouldn't be including traffic originating from redirects that were never the target article, but in my experience this is the minority, since Google and the like usually link to the target and not to redirects.
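A sketch of the move-log query might look like this. It assumes `formatversion=2` and that each move entry carries its destination under `params.target_title` when `leprop` includes `details`; the function names are invented, and the response shape should be checked against a real reply before relying on it.

```python
def move_log_params(title):
    """Action API parameters for the move log entries recorded under `title`."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "list": "logevents",
        "letype": "move",
        "letitle": title,
        "leprop": "title|timestamp|details",
        "lelimit": "max",
        "ledir": "newer",
    }

def moves(response):
    """Return (timestamp, old_title, new_title) tuples, oldest first."""
    out = []
    for ev in response.get("query", {}).get("logevents", []):
        # Assumed location of the destination title in the log entry.
        target = ev.get("params", {}).get("target_title")
        if target:
            out.append((ev["timestamp"], ev["title"], target))
    return sorted(out)
```

Each tuple gives a date boundary: before the timestamp the page's views were recorded under the old title, and after it under the new one.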
There's also a task to make the Toolforge tool go by the page move log automatically https://phabricator.wikimedia.org/T141332, since it is such a common need. The same caveats exist though; say articles Foo and Bar have been moved back and forth a few times, you might need to check the move logs of both and not just Foo. It can be quite tricky!
Overall I would say including all redirects is probably your best bet. Allow me to clarify that the Redirect Views tool https://tools.wmflabs.org/redirectviews/ does offer date filtering, just as the main Pageviews tool does. If you do need automation, you could write a script to query the redirects API https://www.mediawiki.org/wiki/API:Redirects and then the REST API https://w.wiki/J8K, which is all that the tool does.
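The overall shape of such a script could be as small as the sketch below. The two callables are placeholders you would supply yourself: `list_redirects(title)` should return redirect titles from the redirects API, and `fetch_daily_views(title, start, end)` should return a list of daily counts from the REST API for YYYYMMDD bounds.

```python
from datetime import date

def views_by_title(title, start, end, list_redirects, fetch_daily_views):
    """Return {title: total views} for the page and each of its redirects."""
    first, last = start.strftime("%Y%m%d"), end.strftime("%Y%m%d")
    # The redirects API does not return the target itself, so add it first.
    titles = [title] + list(list_redirects(title))
    return {t: sum(fetch_daily_views(t, first, last)) for t in titles}
```

Summing the values of the returned dict gives the combined count, and keeping the per-title breakdown makes it easier to spot a redirect that used to be a separate article.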
Hope this helps!
~ MA
On Tue, Feb 25, 2020 at 1:27 AM MusikAnimal musikanimal@gmail.com wrote:
> Unfortunately there's no proper log of redirect changes (I recently filed https://phabricator.wikimedia.org/T240065 for this). There are change tags https://www.mediawiki.org/wiki/Help:Tags that identify redirect changes -- "mw-new-redirect" and "mw-changed-redirect-target", specifically -- but I am not sure if this is easily searchable via the action API. Someone on this list might know.
You can do https://en.wikipedia.org/w/api.php?action=query&titles=2019%E2%80%9320%20Wuhan%20coronavirus%20outbreak&prop=revisions&rvprop=timestamp%7Ctags%7Cids%7Ccontent&rvlimit=max&rvtag=mw-new-redirect&formatversion=2&rvslots=main or https://en.wikipedia.org/w/api.php?action=query&titles=2019%E2%80%9320%20Wuhan%20coronavirus%20outbreak&prop=revisions&rvprop=timestamp%7Ctags%7Cids%7Ccontent&rvlimit=max&rvtag=mw-changed-redirect-target&formatversion=2&rvslots=main (you cannot do both in one query; you can only specify one tag at a time). Furthermore, it looks like, given a revision id, you would have to determine where it redirects yourself, which is unfortunate. I suppose you could look at https://en.wikipedia.org/w/api.php?action=parse&oldid=941491141&form... (taking the oldid as the revid from the other query) and either try to parse the HTML or just assume that, if there is only one main-namespace link, it is the right one.
Also keep in mind, really old revisions won't have those tags.
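Since the revisions query above already asks for the content, another option is to read the target straight out of the wikitext instead of parsing HTML. The sketch below is that swapped-in approach, not the action=parse route: it assumes the English `#REDIRECT [[Target]]` syntax only (other wikis have localized keywords), and the function names are made up.

```python
import re

# Matches "#REDIRECT [[Target]]", ignoring case, a section anchor and a pipe label.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)

def tagged_revisions_params(title, tag):
    """Parameters for one revisions query; only one rvtag is allowed per request."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|tags|ids|content",
        "rvslots": "main",
        "rvlimit": "max",
        "rvtag": tag,
    }

def redirect_target(wikitext):
    """Return the redirect target named in the wikitext, or None if there is none."""
    match = REDIRECT_RE.match(wikitext)
    if not match:
        return None
    return match.group(1).strip().replace("_", " ")
```

You would run the query once with "mw-new-redirect" and once with "mw-changed-redirect-target", then call `redirect_target` on each revision's content to see where the page pointed at that timestamp.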
-- Brian
There are two hard problems here. One is historical page titles. You can get those from our new dataset (docs here: https://dumps.wikimedia.org/other/mediawiki_history/readme.html) by downloading the months you're interested in from https://dumps.wikimedia.org/other/mediawiki_history/2020-01/enwiki/ and looking at the history of the pages you're interested in [1]. As others have mentioned, page histories can sometimes be very complicated, so do let us know if we didn't get it right for the pages you're interested in. We worked really hard at vetting the data, but there may be exceptions left unaccounted for.
The second problem is historical redirects. Sadly, there is no historical information about redirect status in the databases, only whether or not the page is a redirect right now. To find historical information, we have to parse the wikitext itself, which is why the answers above are complicated. We are starting to do this but don't yet have the compute power.
To clarify something from above, the flow of data is like this:
0. Historical aggregate data from 2007-2015, kept for reference but uses a slightly different counter so not directly comparable
1. Webrequest log flowing in through Kafka --> pageviews found in the log --> aggregate data simplified and pushed to the dumps referenced by bawolff --> aggregate data loaded into the Pageview API (a part of AQS referenced by Gergo) --> mediawiki API queries this to respond to action API queries about pageviews --> wmflabs pageviews tool does some crazy sophisticated stuff on top of the API
2. Wikitext dumps --> processed and loaded into Hadoop --> [FUTURE] parsed for content like historical redirects and published as an API or set of dumps files
[1] As a quick intro, each line is an "event" in this wiki, that is performed on a particular "entity" in {page, user, revision}. The first three fields are wiki, entity, and event type, so in your case you'd be interested in looking for lines starting with enwiki--->page--->move ... <page id you care about>. Each line has the page id, title of the page as of today, and title of the page as of the timestamp on that line. So this way you can collect all titles for a particular page id or page title.
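A sketch of that scan over a decompressed dump file could look like the following. It assumes the lines are tab-separated, and it deliberately takes the column positions of the timestamp, page id and historical title as arguments, because those indexes are not stated here and need to be looked up in the readme; the function name is invented.

```python
def page_move_titles(lines, wiki, page_id, ts_col, page_id_col, title_col):
    """Collect (timestamp, title at that time) for move events of one page id.

    The three *_col arguments are zero-based column indexes that must be
    taken from the dataset's readme; no defaults are assumed here.
    """
    found = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # The first three fields are wiki, entity and event type.
        if fields[:3] != [wiki, "page", "move"]:
            continue
        if fields[page_id_col] == str(page_id):
            found.append((fields[ts_col], fields[title_col]))
    return sorted(found)
```

The result is the list of titles the page has carried over time, which can then be fed to whatever pageview lookup you use.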
(if this is useful maybe I should put it on the Phab task about historical redirects)