Hi all,
Last week I wrapped up a research project investigating the 100 most
frequent article section headings in the English, French, German, Italian,
and Spanish Wikipedias.
Below are the top 10 English section headings, along with the number of
English articles each heading appears in at least once and the percentage
of all English articles that contain it. For more information
(including a comparison with frequently used section titles in other
languages and a link to the full dataset of all section headings
from all articles) and documentation, see the meta page
<https://meta.wikimedia.org/wiki/Research:Investigate_frequency_of_section_t…>.
     number_of_articles   section_title      article_percentage
 1   4125018              References         78.19
 2   2338348              External links     44.33
 3   1134624              See also           21.51
 4    533444              History            10.11
 5    283206              Notes               5.37
 6    176458              Career              3.34
 7    152442              Biography           2.89
 8    148218              Further reading     2.81
 9    145087              Track listing       2.75
10    122415              Bibliography        2.32
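For anyone curious how such counts can be derived, here is a minimal,
hypothetical Python sketch (not the project's actual pipeline, which is
documented on the meta page): it tallies level-2 wikitext headings, counting
each heading at most once per article.

import re
from collections import Counter

# Hypothetical sketch (not the project's actual pipeline): count, for each
# level-2 wikitext heading, the number of articles it appears in at least once.
HEADING_RE = re.compile(r"^==\s*([^=].*?)\s*==\s*$", re.MULTILINE)

def section_heading_counts(article_texts):
    counts = Counter()
    total = 0
    for text in article_texts:
        total += 1
        # A set, so each heading is counted at most once per article.
        counts.update({m.group(1) for m in HEADING_RE.finditer(text)})
    return counts, total

# Toy usage example with two made-up "articles"
articles = [
    "Lead.\n== History ==\ntext\n== References ==\n* ref",
    "Lead.\n== References ==\n* ref\n== External links ==\n* link",
]
counts, total = section_heading_counts(articles)
for title, n in counts.most_common(10):
    print("%s: %d articles (%.2f%%)" % (title, n, 100.0 * n / total))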
Zareen Farooqui
Hi all,
the webrequest and pageview_hourly tables on Hive contain the very
useful user_agent_map field, which stores the following data extracted
from the raw user agent (still available as a separate field):
device_family, browser_family, browser_major, os_family, os_major,
os_minor and wmf_app_version. (The Analytics Engineering team has
built a dashboard that uses this data and last month published a
popular blog post about it.) I understand it is mainly based on the
ua-parser library (http://www.uaparser.org/).
In contrast, the event capsule in our EventLogging tables only
contains the raw, unparsed user agent.
* Does anyone on this list have experience parsing user agents in
EventLogging data for the purpose of detecting browser family, version,
etc., and would like to share advice on how to do this most
efficiently? (In the past, I have written some expressions in MySQL to
extract the app version number for the Wikipedia apps, but it seems a
bit of a pain to do that for classifying browsers in general. One
option would be to export the data and use the Python version of
ua-parser; however, doing it directly in MySQL would fit better into
existing workflows. A rough sketch of the export approach is at the end
of this message.)
* Assuming it is technically possible to add such a pre-parsed
user_agent_map field to the EventLogging tables, would other analysts
be interested in using it too?
This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.
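For reference, here is a rough sketch of the export-and-parse approach with
the Python ua-parser package; the user agent strings and field handling below
are just illustrative assumptions, not an existing pipeline:

from ua_parser import user_agent_parser

# Rough sketch: parse raw user agent strings (e.g. exported from the
# EventLogging event capsule) with the Python ua-parser package.
# The strings below are made-up examples, not real EventLogging data.
raw_user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
]

for ua in raw_user_agents:
    parsed = user_agent_parser.Parse(ua)
    browser = parsed["user_agent"]  # family, major, minor, patch
    os_info = parsed["os"]          # family, major, minor, ...
    device = parsed["device"]       # family, brand, model
    print(browser["family"], browser["major"], os_info["family"], device["family"])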
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Hi all!
We are replacing the server that hosts stats.wikimedia.org,
analytics.wikimedia.org, datasets.wikimedia.org, and various other sites
this week. The new server is ready to go, so I'd like to do this tomorrow,
December 13th, around 15:00 UTC.
There should be no noticeable downtime for these sites. If you notice any
discrepancies, such as data not being where you expect it, please let us know.
Thanks!
-Andrew + Analytics
This will happen today (Friday, December 9, 2016) in about one hour, at 13:30
UTC, and it will take another hour to restart all the servers, during which
some queries may be temporarily unavailable. No data will be lost for
EventLogging, because insertion will be temporarily stopped with the kind
help of Joseph.
I will report back again when the work has finished, and everything is back
to normal.
--
Jaime Crespo
<http://wikimedia.org>
Hi,
I would like to do a quick maintenance on the EventLogging MySQL servers
for security and certificate upgrades. This means that queries sent
there will be temporarily unavailable. I mentioned the need for
maintenance to several members of the Analytics team some time ago, and would
like to coordinate the stop of the EventLogging schema inserts and the
queries sent there (I know that not only Analytics, but also Research, Mobile,
and others use those databases).
Only a rolling restart of the database servers is needed this time,
which should not cause more than 20 minutes of unavailability and will be
done on each server in turn.
This Friday is probably a good moment to do it, before we enter the
deployment freeze window announced by Release Engineering, but I would like
to confirm this with you and have someone from Analytics available in case
something goes wrong.
--
Jaime Crespo
<http://wikimedia.org>
Alexander,
Please ask questions such as this one on our public list (cc'd).
We do not have public data on pageviews per country
other than what is available here:
https://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountryBreakdown.htm
You can, however, set up a collaboration with our research team to get
access to private data, but please be aware that our research team can
only handle so many collaborations at any one time.
> This data can be made available without losing confidentiality by using
> either only first IP-address numbers or by publishing only the country of
> users, as well as aggregating by the category.
No, not really. Countries like San Marino or Andorra are not much bigger
than cities, so anonymizing by country might still leak a lot of information,
especially if you include page titles.
Thanks,
Nuria
On Wed, Dec 7, 2016 at 7:04 AM, Alexander Ugarov <augarov(a)email.uark.edu>
wrote:
> Dear Ms. Ruiz,
>
> I am conducting a research project on the international determinants of
> education quality. In my view, Wikimedia statistics are a priceless
> resource of information on how much learning people do privately. The
> statistics you have made available have been very valuable to me so far. I
> would greatly appreciate your help with getting access to data that is not
> readily available on the Wikimedia Foundation website.
>
> I would like to access data on Wikipedia pageviews by country,
> language, and content area to measure private learning in different
> countries. My previous empirical results suggest that Wikipedia pageviews
> are highly correlated with education quality. Unfortunately, the available
> data does not allow separating educational pageviews from purely
> entertainment-driven ones (for example, celebrity biographies).
>
> I would appreciate it if you could answer two specific questions:
> 1) Is it potentially possible to extract information on pageviews
> by country and subject from your publicly available data? I can program and
> extract the information as soon as it is there.
> 2) If this information cannot be extracted from the publicly available
> dataset, is it possible to make it available to me or to researchers
> in general? This data could be made available without losing confidentiality
> by using either only the first numbers of IP addresses or by publishing only
> the country of users, as well as by aggregating by category.
>
> I am looking forward to hearing from you about the availability of this
> data. I am sure that many social scientists would also benefit from using
> such information (if you make it available) and would produce some
> policy-relevant research.
>
> Best regards,
> Alexander Ugarov,
> Sam M. Walton College of Business
> Department of Economics
> University of Arkansas.
>
Are you a user, or do you know anyone who is a user, of the SQL-formatted
dumps of the private tables that are available via /mnt/data on
stat1002/1003? Given that this data is available to folks in other
formats, we'd like to stop producing these. [1]
If you can help check in with the researchers who use those servers, I
would really appreciate it.
Thanks a lot,
Ariel
[1] https://phabricator.wikimedia.org/T152021
Hi Michael,
Please consider using the Analytics mailing list (cc'd) for your
questions; we prefer that to direct messages :)
The new pageview dataset includes (almost) all WMF projects (there are some
corner cases), so Wiktionary pageviews are normally in the dump files.
You'll find them where the project column value ends with ".d" (like en.d,
fr.d).
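As a rough, unofficial example (assuming the usual "project page_title
view_count response_bytes" line format of the hourly dump files), you could
tally English Wiktionary views from one file roughly like this:

import gzip
from collections import Counter

# Rough example: tally English Wiktionary ("en.d") pageviews from one hourly
# dump file downloaded from https://dumps.wikimedia.org/other/pageviews/
# The filename is just an example.
counts = Counter()
with gzip.open("pageviews-20161201-000000.gz", "rt",
               encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split(" ")
        if len(parts) == 4 and parts[0] == "en.d":
            project, title, views, _bytes = parts
            counts[title] += int(views)

# Print the 20 most viewed English Wiktionary pages for that hour
for title, views in counts.most_common(20):
    print(views, title)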
Cheers
Joseph
---------- Forwarded message ----------
From: Michael Douma <michael.douma(a)idea.org>
Date: Tue, Dec 6, 2016 at 8:03 PM
Subject: Wiktionary page view counts?
To: jallemandou(a)wikimedia.org
Hi Joseph,
I saw your posting about the Pagecount Datasets.
I'm working on a language-related project, a new dictionary/thesaurus,
using Wiktionary data. I need to rank the words (there are several ways to
do this).
Is human pageview data available for Wiktionary? I only see it for
Wikipedia here:
https://dumps.wikimedia.org/other/pageviews/
I'd love to basically have a list of all ~5 million Wiktionary URLs, plus
their view counts for a certain time period, e.g., all of 2016.
Is this available? If so, can you direct me?
The type of data in pageviews-20161201-000000 would be fine, if it included
Wiktionary.
en Furniture_Brands_International 1 0
en George_Coventry,_9th_Earl_of_Coventry 2 0
en George_Palaiologos 1 0
en Leningrad_(song) 2 0
en Olivet_Discourse 9 0
Thanks,
Michael Douma
www.idea.org
--
Joseph Allemandou
Data Engineer @ Wikimedia Foundation
IRC: joal
Forwarding to Analytics, Research, and Wikimetrics in case this is of
interest to people who aren't subscribed to the Labs mailing list.
Pine
---------- Forwarded message ----------
From: Bryan Davis <bd808(a)wikimedia.org>
Date: Tue, Dec 6, 2016 at 9:28 AM
Subject: [Labs-l] Tell us about the SQL that you can't get to work
To: labs-l <labs-l(a)lists.wikimedia.org>
In early January there is going to be a Developer Summit in San
Francisco [0]. Chase and I are in charge of scheduling talks on the
topic "Building on Wikimedia services: APIs and Developer Resources".
One of the talks proposed for this that I find most interesting is
"Labsdbs for WMF tools and contributors: get more data,
faster" by Jaime Crespo [1].
I know that most of you won't be able to attend in person, but if we
can show that there is enough interest in this topic we can get the
talk scheduled in a main room and recorded so anyone can watch it
later.
An idea I just had for showing interest is to get Tool Labs
maintainers and other Labs users to describe questions that they have
tried and failed to answer using SQL queries. We can look at the kinds
of questions that come up and ask Jaime (and others) if there are some
general recommendations that can be made about how to improve
performance or understand how the bits and pieces of our data model
fit together.
To kick things off, here's an example I tried to help with over the
weekend. A Quarry user was adapting a query they had used before to
find non-redirect File namespace pages not paired with binary files on
Commons. The query they had come up with was:
SELECT DISTINCT page_title, img_name
FROM (
SELECT DISTINCT page_title
FROM page WHERE page_namespace = 6
AND page_is_redirect = 0
) AS page
LEFT JOIN (
SELECT DISTINCT img_name
FROM image
) AS image ON page_title=img_name
WHERE img_name IS NULL;
The performance of this is horrible for several reasons, including the
excessive use of DISTINCT. The query was consistently killed by the
30-minute runtime limit. MaxSem and I both came up with roughly the same
optimization, which eliminates the sub-queries and the use of DISTINCT:
SELECT page_title, img_name
FROM page LEFT OUTER JOIN image ON page_title=img_name
WHERE page_namespace = 6
AND page_is_redirect = 0
AND img_name IS NULL;
This new query is not fast in any sense of the word, but it does
finish without timing out. There is still some debate about whether
the 906 rows it returned are correct or not [2].
[0]: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit
[1]: https://phabricator.wikimedia.org/T149624
[2]: https://quarry.wmflabs.org/query/14501
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855