Dear Analytics Mailing List,
Recently while querying pageviews of various pages, I discovered that
the page whose title is a single hyphen character (i.e. with the title
"-", with URL <https://en.wikipedia.org/wiki/->, which redirects to
<https://en.wikipedia.org/wiki/Hyphen-minus>) receives an unusually high
number of pageviews under the Pageview API. Taking October 2015 as an
example, the page received 5.4 million pageviews during that month
according to the API:
<https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedi…>.
However, according to stats.grok.se (which was still operational in the
same month), the page received only 1209 pageviews:
<http://stats.grok.se/en/201510/->.
Looking at the tabulation of pageviews on Wikipedia Views, the increase
in pageviews for this page coincides with the change to the Pageview
API in July 2015:
<http://wikipediaviews.org/displayviewsformultiplemonths.php?page=-&allmonth…>.
As I understand it, page titles must be URL-encoded before the query,
but the URL-encoding of "-" is itself.
I looked at the API documentation but did not see this behavior listed,
so I am wondering where these numbers are coming from.
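To make the URL-encoding point concrete, here is a minimal sketch of how such a query URL could be assembled. The helper name per_article_url and its default path values (all-access, all-agents, monthly) are assumptions chosen for illustration, not necessarily the exact parameters of the truncated link above.

```python
from urllib.parse import quote

# Base path of the per-article endpoint of the Wikimedia Pageview REST API.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(title, start, end, project="en.wikipedia",
                    access="all-access", agent="all-agents",
                    granularity="monthly"):
    """Build a per-article pageview URL (hypothetical helper, assumed defaults)."""
    # Spaces become underscores; every reserved character is percent-encoded.
    encoded = quote(title.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, encoded,
                     granularity, start, end])


# A hyphen is an unreserved character, so encoding leaves it unchanged.
print(quote("-", safe=""))
print(per_article_url("-", "2015100100", "2015103100"))
```

Because the encoded title is identical to the raw title, a request for "-" cannot be told apart from any other use of a bare hyphen in that position of the path.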
Best regards,
Issa
We're starting to wrap up the calendar year, so here's what we've accomplished
so far with Wikistats. We're really excited to have some data in our
production Hive database for people to play with. We worked really hard to
clean up and present an intuitive interface to all of mediawiki history.
The results are captured in the tables mentioned below, which we'll cover
more in an upcoming tech talk. Documentation for the project is here
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>.
Our goals so far and progress breakdown:
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public
consumption
4. [beta] Build pipeline to process and analyze *editing* data
5. [beta] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality
of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe)
analytics.wikipedia.org
<http://analytics.wikipedia.org/>*
***. [ ] Bonus: *replace dumps generation* based on the new data
pipelines
4 & 5. Since our last update, we've finished the pipeline that imports
data from mediawiki databases, cleans it up as best as possible, reshapes
it in an analytics-friendly way, and makes it easily queryable. I'm marking
these goals as "beta" because we're still tweaking the algorithm for
performance and productionizing the jobs. This will be completed early
next quarter, but in the meantime we have data for people to play with
internally. Sadly we haven't sanitized it yet so we can't publish it. For
those with internal access:
* https://pivot.wikimedia.org/#edit-history-test is the full history across
all wikis. It's a bit hard to understand how to slice and dice, so we will
host a tech talk and present it at the January metrics meeting if we can.
* In Hive, you can access this data in the wmf database. The tables are:
- wmf.mediawiki_history: denormalized full history with this schema
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_history>
- wmf.mediawiki_page_history: the sequence of states of each wiki page (schema
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_page_hist…>)
- wmf.mediawiki_user_history: the sequence of states of each user
account (schema
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_user_hist…>)
6. Sanitizing has not moved forward, as we need DBA time and they've been
overloaded. We will attempt to restart this effort in Q3.
7. We have begun the design process and will share more about this as we go.
Our goals and planning for next quarter support us finishing 4, 5, 7, and
8, so basically putting a UI on top of the data pipeline we have in place,
and updating it weekly. We also hope to have good progress on 6, but that
depends on collaboration with the DBA team and is harder than we originally
imagined.
And remember, voice your opinions about important reports in the current
Wikistats here:
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_r…
(thank you so so much to the many people who already chimed in).
Hi all.
Firstly, apologies for any duplicates or for posting the question on
the wrong mailing list.
Secondly, could anybody kindly explain to me whether some Wikipedia pages
have changed their IDs over time? If so, could you point me to where this
might be documented?
I have Wikipedia pages-articles XML dumps from the years 2006 and 2008
and when I was parsing those dumps I ran across some situations
such as the following one. In the dumps from 2006 and 2008 I found that
the South Africa page has the ID 68854, while in the most current
Wikipedia pages-articles XML dump (i.e. 2016) the same article has the
ID 17416221.
I am trying to match some Wiki pages by IDs across time, but the example
above is not helping.
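One possible workaround, sketched below under stated assumptions, is to match pages across dumps by title instead of by ID. The function title_to_id is a hypothetical helper, and the two inline XML snippets are tiny stand-ins for real dumps that reuse only the IDs quoted above; the revision IDs are placeholders. Titles can also change when pages are moved, so this is only an approximation.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def title_to_id(source):
    """Map each page title to its page ID in a pages-articles style dump."""
    mapping = {}
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = page_id = None
        # Only direct children are inspected, so revision IDs are ignored.
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = int(child.text)
        if title is not None and page_id is not None:
            mapping[title] = page_id
        elem.clear()  # release the finished page to keep memory use low
    return mapping


# Stand-in dumps; the revision IDs are placeholders, not real data.
OLD = (b"<mediawiki><page><title>South Africa</title><id>68854</id>"
       b"<revision><id>1</id></revision></page></mediawiki>")
NEW = (b"<mediawiki><page><title>South Africa</title><id>17416221</id>"
       b"<revision><id>2</id></revision></page></mediawiki>")

old_ids = title_to_id(io.BytesIO(OLD))
new_ids = title_to_id(io.BytesIO(NEW))

# Pair the old and new IDs of every title present in both dumps.
matches = {t: (old_ids[t], new_ids[t]) for t in old_ids.keys() & new_ids.keys()}
print(matches)
```

For a real dump, an open file object (for example from bz2.open on the compressed file) can be passed to title_to_id in place of the in-memory buffer.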
Many thanks in advance for any help.
--
Renato Stoffalette Joao
- PhD Student -
L3S Research Center / Leibniz Uni.
15th Floor, Room:1519
Appelstraße 9a
30167 Hannover, Germany
+49.511.762-17759