Hi all,
For all Hive users on stat1002/1004: you might have seen a deprecation
warning when you launch the Hive client claiming that it is being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
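Once logged in, a session looks roughly like this (illustrative only: the wrapper supplies the connection string, so the exact prompt will reflect whatever it connects to):

```text
$ beeline
0: jdbc:hive2://...> USE wmf;
0: jdbc:hive2://...> SHOW TABLES;
0: jdbc:hive2://...> !quit
```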
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004, what?" - there should be an announcement
about it coming soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
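The last use case is simple to sketch: normalizing each referer's outgoing counts turns the dataset into a Markov transition matrix. The rows below are toy values, not taken from the actual dataset:

```python
from collections import defaultdict

# Toy (referer, article, count) rows standing in for the released dataset.
rows = [
    ("London", "United_Kingdom", 300),
    ("London", "England", 100),
    ("England", "London", 50),
]

# Total outgoing clicks per referer.
totals = defaultdict(int)
for referer, article, count in rows:
    totals[referer] += count

# Markov transition probabilities P(next article | referer).
transitions = {(referer, article): count / totals[referer]
               for referer, article, count in rows}

print(transitions[("London", "United_Kingdom")])  # 0.75
```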
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
(including analytics@ public list)
Rafael:
As I think we have mentioned before, please be so kind as to e-mail analytics@
rather than individual people.
>Again, we are exclusively looking for the absolute number of Wikipedia
updates per year per county.
I think you mean "edits" to Wikipedia. If so, we are currently working on a
project that sets out to deliver an estimate (likely an interval) of edits
happening in a country. We do not have an ETA for such a deliverable, but the
major project that you can follow is this one:
https://phabricator.wikimedia.org/T130256
You can ping us again by the end of next quarter (April 2017), by which time
we can probably give you more specific information.
Thanks,
Nuria
On Wed, Nov 2, 2016 at 12:53 PM, Rafael Escalona Reynoso <re32(a)cornell.edu>
wrote:
> Dear Dan,
>
> I hope you are doing fine and that you remember me. I am the lead
> researcher at The Global Innovation Index (GII). I contacted you last year
> searching for data on Wikipedia uploads per country. I believe that this
> request got assigned a task number at some point. Here is what I know:
>
>
>
>
>
> Can you please let me know of this request’s status?
>
>
>
> Also, if any legal issues seem to be obstructing the compilation of this
> data, can you please refer us to someone from your legal department to
> explore the possibility of tailoring a contract/confidentiality agreement
> between the GII and Wikimedia?
>
>
>
> Again, we are exclusively looking for the absolute number of Wikipedia
> updates per year per county. These used to be available via Wikimedia here:
>
> https://stats.wikimedia.org/wikimedia/squids/
>
>
>
> Hope to hear from you soon.
>
>
>
> Sincerely,
>
> Rafael Escalona Reynoso, PhD, MPA.
>
> Lead Researcher at The Global Innovation Index
>
> Samuel Curtis Johnson Graduate School of Management
>
> 207 Sage Hall
> Cornell University
> Ithaca, NY 14853-6201
>
> Phone: +1 (607) 262-0983
>
> Email: re32(a)cornell.edu
>
> http://www.johnson.cornell.edu
>
>
> On Mon, Apr 4, 2016 at 10:58 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
> wrote:
>
> Yes, once this data is properly anonymized, it should continue to be
> released in the same shape. We just have to make sure it's properly safe
> first.
>
>
>
> *From: *Rafael Escalona Reynoso
>
> *Sent: *Monday, April 4, 2016 19:21
>
> *To: *Dan Andreescu
>
> *Cc: *Jordan Litner; sacha.wunschvincent(a)wipo.int
>
> *Subject: *RE: On Wikipedia edits archive per county.
>
>
>
> Dan,
>
> Thank you for the update. This is kind of what we were expecting. I have a
> follow-up question: Would the data be collected in the same fashion for
> subsequent years (2016, 2017, etc.)? Or will this be a single time
> exercise? Do let me know whenever you can.
>
>
>
> Best,
>
>
>
>
>
> Rafael Escalona Reynoso, PhD, MPA.
>
> Lead Researcher at The Global Innovation Index
>
> Samuel Curtis Johnson Graduate School of Management
>
> 207 Sage Hall
>
> Cornell University
>
> Ithaca, NY 14853-6201
>
>
>
> Phone 1: +1 (607) 262-0983
>
> Phone 2: +1 (607) 255-9245
>
> Email: re32(a)cornell.edu
>
> http://www.johnson.cornell.edu
>
>
>
>
>
>
> See www.globalinnovationindex.org
>
>
>
>
>
>
>
> *From:* Dan Andreescu [mailto:dandreescu@wikimedia.org]
> *Sent:* Monday, April 04, 2016 6:13 PM
>
> *To:* Rafael Escalona Reynoso
> *Cc:* Jordan Litner; sacha.wunschvincent(a)wipo.int
> *Subject:* Re: On Wikipedia edits archive per county.
>
>
>
> Hey Rafael,
>
>
>
> We haven't been able to prioritize this work yet. It's been moved here:
>
>
>
> https://phabricator.wikimedia.org/T131280
>
>
>
> It has two stakeholders but no resources to get it done due to privacy
> issues. So we won't be able to get the 2015 data cleaned up before your
> deadline. But we're meeting about it again tomorrow and we will still do
> it so you can have this data for either next year's report or an amendment.
>
>
>
> On Mon, Apr 4, 2016 at 11:49 AM, Rafael Escalona Reynoso <re32(a)cornell.edu>
> wrote:
>
> Dear Dan,
>
> Hope you are doing fine. Just a quick note to follow up on the Wikipedia
> data. When do you think this data will be available? We are about to close
> the model and would very much like to have 2015 data included. Let me know.
>
>
>
> Best,
>
>
>
> Rafael Escalona Reynoso, PhD, MPA.
>
> Lead Researcher at The Global Innovation Index
>
> Samuel Curtis Johnson Graduate School of Management
>
> 207 Sage Hall
>
> Cornell University
>
> Ithaca, NY 14853-6201
>
>
>
> Phone 1: +1 (607) 262-0983
>
> Phone 2: +1 (607) 255-9245
>
> Email: re32(a)cornell.edu
>
> http://www.johnson.cornell.edu
>
>
>
>
>
>
> See www.globalinnovationindex.org
>
>
>
>
>
>
>
> *From:* Dan Andreescu [mailto:dandreescu@wikimedia.org]
> *Sent:* Thursday, February 18, 2016 8:19 PM
>
>
> *To:* Rafael Escalona Reynoso
> *Cc:* Jordan Litner; sacha.wunschvincent(a)wipo.int
> *Subject:* Re: On Wikipedia edits archive per county.
>
>
>
> That's perfect, I added it to the request:
> https://phabricator.wikimedia.org/T127409
>
> On Thursday, February 18, 2016, Rafael Escalona Reynoso <re32(a)cornell.edu>
> wrote:
>
> Dan,
>
> These were my thoughts exactly. Let me then elaborate on the value of the
> report, the index and why we feel that Wikipedia data is essential.
>
>
>
> The report is co-published by Cornell University, INSEAD, and the World
> Intellectual Property Organization (WIPO, a specialized agency of the
> United Nations), with the collaboration of three Knowledge Partners: the
> Confederation of Indian Industry, du, and A.T. Kearney and IMP³rove –
> European Innovation Management Academy. Now in its ninth edition, the
> report has established itself as a premier reference among innovation
> metrics and as a tool to facilitate public-private dialogue and
> evidence-based policymaking.
>
>
>
> The Global Innovation Index (GII) is a ranking of 141 economies in terms
> of their innovation capabilities and results. A total of 79 metrics in the
> form of data-based indicators are at its core. These rich metrics can be
> used —on the level of the index, the sub-indices, or as individual
> variables—to monitor performance over time and to benchmark developments
> against their peers. These can also help study country profiles over time,
> and to identify their relative strengths and weaknesses from the rich and
> unique GII dataset.
>
>
>
> Each year the GII results are presented within the framework of a
> top-level international event:
>
> • 2013 Geneva, Switzerland at the Opening Session of
> the United Nations Economic and Social Council (ECOSOC) High-Level Segment,
> organized by WIPO;
>
> • 2014 Sydney, Australia in the context of the B20/G20
> preparations; and
>
> • 2015 London, United Kingdom before the Minister of
> Innovation and Industry.
>
>
>
> This year the launch is scheduled for the summer in Beijing, China
> preceding the preparations for the 2016 G20 summit.
>
>
>
> Recognizing the need for a broad horizontal vision of innovation
> applicable to developed and emerging economies alike, the GII includes
> indicators that go beyond the traditional measures such as expenditure in
> research and development. That said, an area that is of great relevance and
> limited to the GII is that of creative outputs. Within it, *Wikipedia
> monthly page edits (per million population 15-69 y/o)* is a key metric.
> This indicator, along with others that measure the number of generic
> top-level and country-code top-level domains and video uploads in YouTube,
> helps capture what we define as online creativity.
>
>
>
> Lastly, we believe that the GII can be an important vehicle to signal that
> Wikipedia is a critical lever to innovation and a factor contributing to a
> new understanding of the digital information landscape and innovation
> globally.
>
>
>
> Based on all the above, we would like to request that our petition to
> collect data on Wikipedia monthly page edits per country, reported
> quarterly per year be given priority within your tasks.
>
>
>
> Sincerely,
>
>
>
>
>
> Rafael Escalona Reynoso, PhD, MPA.
>
> Lead Researcher at The Global Innovation Index
>
> Samuel Curtis Johnson Graduate School of Management
>
> 207 Sage Hall
>
> Cornell University
>
> Ithaca, NY 14853-6201
>
>
>
> Phone 1: +1 (607) 262-0983
>
> Phone 2: +1 (607) 255-9245
>
> Email: re32(a)cornell.edu
>
> http://www.johnson.cornell.edu
>
>
>
>
>
>
> See www.globalinnovationindex.org
>
>
>
>
>
>
>
> *From:* Dan Andreescu [mailto:dandreescu@wikimedia.org
> <dandreescu(a)wikimedia.org>]
> *Sent:* Thursday, February 18, 2016 11:17 AM
> *To:* Rafael Escalona Reynoso
> *Cc:* Jordan Litner; sacha.wunschvincent(a)wipo.int
> *Subject:* Re: On Wikipedia edits archive per county.
>
>
>
> Where the request is coming from, with all due respect, does not matter.
> We aim to be neutral in how we make knowledge available (namely, we try to
> make it available to everyone, for free).
>
>
>
> But, we have to prioritize somehow, and that process definitely takes into
> consideration the value our work has to the world. So, if you tell me more
> about what this data could help you accomplish, we could use that to argue
> that prioritizing your request might save lives, serve the mission of open
> knowledge, etc.
>
>
>
> But to answer your other question directly, yes, a letter to Jimmy Wales
> would not have any effect on this priority process and might be seen by the
> community we serve as an attempt to circumvent our planning process.
>
>
>
> On Thu, Feb 18, 2016 at 10:54 AM, Rafael Escalona Reynoso <
> re32(a)cornell.edu> wrote:
>
> Dan,
>
> Let me share with you the following thought. I just had a call with the
> Dean at the business school here at Cornell, who is the creator of the
> Global Innovation Index (and my direct boss). I explained the situation
> with the Wikipedia uploads data and how methodological changes are now
> making it impossible for us to collect it in the fashion that we were used
> to. He mentioned that he is acquainted with Jimmy Wales and offered to
> send him a letter explaining what we need and the importance of the
> indicator for our index. My notion here is that the issue has more to do
> with a shortage of labor and quite a large backlog than with where the
> request is coming from. Also, I do not want the letter to come across as an
> imposition or to give the wrong message. Based on the above, would this
> letter help prioritize the collection of this data?
>
>
>
> Let me know what you think.
>
>
>
> Best,
>
>
>
> Rafael Escalona Reynoso, PhD, MPA.
>
> Lead Researcher at The Global Innovation Index
>
> Samuel Curtis Johnson Graduate School of Management
>
> 207 Sage Hall
>
> Cornell University
>
> Ithaca, NY 14853-6201
>
>
>
> Phone 1: +1 (607) 262-0983
>
> Phone 2: +1 (607) 255-9245
>
> Email: re32(a)cornell.edu
>
> http://www.johnson.cornell.edu
>
>
>
>
>
>
> See www.globalinnovationindex.org
>
>
>
>
>
> *From:* Dan Andreescu [mailto:dandreescu@wikimedia.org
> <dandreescu(a)wikimedia.org>]
> *Sent:* Wednesday, February 17, 2016 2:57 PM
> *To:* Rafael Escalona Reynoso
> *Cc:* Jordan Litner; sacha.wunschvincent(a)wipo.int
> *Subject:* Re: On Wikipedia edits archive per county.
>
>
>
> As much as I love to help a fellow Cornellian, we are too small of a team
> to create one-off solutions like that. We either publish it for everyone
> or no-one. But even if we did that, we'd still have a lot of work to check
> whether cross-referencing that data with other data wouldn't hurt privacy.
>
>
>
> What would help is if you filed a task in Phabricator and tagged it with
> the "Analytics" project, and described very precisely what data you need,
> at what time granularity, and what you need it for. We'll use that as
> proof that we need to prioritize the work sooner rather than later.
>
>
>
> On Wed, Feb 17, 2016 at 2:45 PM, Rafael Escalona Reynoso <re32(a)cornell.edu>
> wrote:
>
> Dan,
>
> One last thing. We also report scaled data from Google on YouTube uploads
> and, as you mention, they have to protect privacy. However, we prepare for
> them an Excel sheet where they simply need to upload the totals for each
> country (which we never get to see) and they report back to us exclusively
> the normalized scores (0-100) and rankings for all countries we request
> information for. Using this procedure it becomes impossible to
> reverse-engineer the raw values used to obtain these totals and – again –
> we never get to see the actual data. Is there a chance that we could
> establish a similar type of arrangement with Wikimedia? Let me know.
>
>
>
> Best,
>
>
>
> Rafael.
>
>
>
> *From:* Dan Andreescu [mailto:dandreescu@wikimedia.org
> <dandreescu(a)wikimedia.org>]
> *Sent:* Wednesday, February 17, 2016 2:23 PM
> *To:* Rafael Escalona Reynoso
> *Subject:* Re: On Wikipedia edits archive per county.
>
>
>
> Thank you for this, again. Sorry to pester you with this again but, do you
> know of any other data (from a source different from Google) where online
> activity could be measured? Any leads would be quite appreciated.
>
>
>
> mmm, not geolocated that I know of, and it's unlikely that you'd find
> that. Because either
>
>
>
> * an organization is for-profit, in which case they would sell that data
>
> * an organization is non-profit, in which case they'd likely need to
> protect their users and bump up against the same *hard* problems we did
>
>
>
> But I could be wrong, good luck in your search and do report back if you
> find any, and especially if you find approaches that help to protect
> privacy.
analytics-store was brought down at 6am, and then again at 9am UTC on 25 Dec,
due to multiple executions of long-running queries (some of them 2 days
long) such as:
SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, clientIp,
webHost, COUNT(*) AS copies FROM log.PageContentSaveComplete ...
SELECT COUNT(*) AS count, term_entity_type, term_type, term_language FROM
wikidatawiki.wb_terms ...
select date('20161218000000') as day, actions, count(*) as repeated from
(select group_concat(event_action order by timestamp, action_order.ord
separator '-') as actions from (select ...
I would urge you to set up per-user/per-service query resource limits;
otherwise, poorly performing queries will affect all users (and, in cases
like this, create downtime). I have set up query limits for all
research/analytics users temporarily, until 3rd January.
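For reference, limits of this kind can be expressed with standard MySQL/MariaDB account resource options. A sketch only (the account name is hypothetical, and max_statement_time assumes MariaDB 10.1 or later):

```sql
-- Cap connections and query volume for one account (hypothetical name):
GRANT USAGE ON *.* TO 'research'@'%'
    WITH MAX_USER_CONNECTIONS 10
         MAX_QUERIES_PER_HOUR 1000;

-- MariaDB 10.1+: abort any statement running longer than 300 seconds.
SET GLOBAL max_statement_time = 300;
```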
--
Jaime Crespo
<http://wikimedia.org>
Hi all,
I'm Adrian Bielefeldt, and I am writing on behalf of Research:Understanding
Wikidata Queries
<https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries>,
a research project aimed at gaining insight into the way the query
service <https://query.wikidata.org/> of Wikidata is being used. We have
some questions and are wondering if any of you could provide us with
answers.
1. Is there a unique key for the query log? The log I am referring to is
the *wdqs_extract* table from the Hive database wmf. We would like to
be able to permanently link our own computed data with the log entry we
computed it from.
2. Is it possible to find out whether the query in a given log entry was
accepted by the SPARQL endpoint as valid?
3. Is there any other database system besides Hive installed on the server?
And finally, a question on conventions for this mailing list: am I
correct in sending one mail for multiple questions, or should I send
separate mails for each question?
Greetings,
Adrian Bielefeldt
Sherry,
Questions such as this one will get a better answer if posted to the
analytics@ public list. From the discussion it is not clear what users wish
to measure, but I am providing some links below that might help.
Alexa ranks sites relative to other sites, which we obviously do not do.
If a ranking fluctuates a lot, that tells you Alexa's methodology doesn't
work for your site (likely you get too little traffic).
We have pageview and unique-devices measures for all projects of a certain
size. Referral information ("where does traffic come from?") is also
available, but a bit harder to get. Looking at sources of publicly available
data, you can see Wikivoyage is pretty static when it comes to growth.
English Wikivoyage pageviews are mostly constant, around 60,000 daily;
mobile makes up about 20,000 of those. If you zoom in, you can see that
number is slightly higher in August, but really within the same order of
magnitude:
https://analytics.wikimedia.org/dashboards/vital-signs/#projects=enwikivoya…
English Wikivoyage daily unique devices: about 15,000 devices per day.
https://analytics.wikimedia.org/dashboards/vital-signs/#projects=enwikivoya…
About 300,000 devices per month:
https://analytics.wikimedia.org/dashboards/vital-signs/#projects=enwikivoya…
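For anyone who prefers pulling these numbers programmatically, the same daily aggregates are exposed by the public Pageview API. A sketch of building the request URL (endpoint shape per the wikimedia.org REST API; dates are YYYYMMDDHH):

```python
# Build an aggregate-pageviews URL for the public Wikimedia Pageview API.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"

def pageviews_url(project, start, end, access="all-access", agent="user",
                  granularity="daily"):
    """Return the REST URL for aggregate pageviews over a date range."""
    return f"{BASE}/{project}/{access}/{agent}/{granularity}/{start}/{end}"

# December 2016 daily pageviews for English Wikivoyage:
url = pageviews_url("en.wikivoyage.org", "2016120100", "2016123100")
print(url)
```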
Thanks,
Nuria
On Thu, Dec 22, 2016 at 12:51 PM, Whatamidoing (WMF)/Sherry Snyder <
ssnyder(a)wikimedia.org> wrote:
> Community members at the English Wikivoyage are interested in their Alexa
> rankings. There are two current discussions about it (start here, plus the
> one immediately after it: https://en.wikivoyage.org/
> wiki/Wikivoyage:Travellers%27_pub#2016_in_review )
>
> Do you have any information that would help them?
>
> --
> Sherry Snyder (WhatamIdoing)
> Community Liaison, Wikimedia Foundation
>
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday,
December 21, 2016 at 11:30 AM (PST) 18:30 (UTC).
YouTube stream: https://www.youtube.com/watch?v=nmrlu5qTgyA
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#December_2016>.
The December 2016 Research Showcase includes:
English Wikipedia Quality Dynamics and the Case of WikiProject Women
Scientists
By Aaron Halfaker <https://meta.wikimedia.org/wiki/User:Halfak_(WMF)>
With every productive
edit, Wikipedia is steadily progressing towards higher and higher quality.
In order to track quality improvements, Wikipedians have developed an
article quality assessment rating scale that ranges from "Stub" at the
bottom to "Featured Articles" at the top. While this quality scale has the
promise of giving us insights into the dynamics of quality improvements in
Wikipedia, it is hard to use due to the sporadic nature of manual
re-assessments. By developing a highly accurate prediction model (based on
work by Warncke-Wang et al.), we've developed a method to assess an
article's quality at any point in history. Using this model, we explore
general trends in quality in Wikipedia and compare these trends to those of
an interesting cross-section: Articles tagged by WikiProject Women
Scientists. Results suggest that articles about women scientists were lower
quality than the rest of the wiki until mid-2013, after which a dramatic
shift occurred towards higher quality. This shift may correlate with (and
even be caused by) this WikiProject's initiatives.
Privacy, Anonymity, and Perceived Risk in Open Collaboration: A Study of
Tor Users and Wikipedians
By Andrea Forte
In a recent qualitative study to
be published at CSCW 2017, collaborators Rachel Greenstadt, Naz Andalibi,
and I examined privacy practices and concerns among contributors to open
collaboration projects. We collected interview data from people who use the
anonymity network Tor who also contribute to online projects and from
Wikipedia editors who are concerned about their privacy to better
understand how privacy concerns impact participation in open collaboration
projects. We found that risks perceived by contributors to open
collaboration projects include threats of surveillance, violence,
harassment, opportunity loss, reputation loss, and fear for loved ones. We
explain participants’ operational and technical strategies for mitigating
these risks and how these strategies affect their contributions. Finally,
we discuss chilling effects associated with privacy loss, the need for open
collaboration projects to go beyond attracting and educating participants
to consider their privacy, and some of the social and technical approaches
that could be explored to mitigate risk at a project or community level.
--
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org
Hello,
I found "Page view statistics for Wikimedia projects"
(http://dumps.wikimedia.org/other/pagecounts-raw), and from this source I
can construct time series of HTTP requests on an hourly basis. Based on
these time series I can estimate a model for a cloud computing system.
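Each line of those hourly pagecounts-raw files is a space-separated record: project code, page title, request count, and bytes transferred. A minimal parser sketch (field layout per the dumps documentation; the sample line is illustrative):

```python
def parse_pagecounts_line(line):
    """Parse one pagecounts-raw record into a dict."""
    project, title, requests, size = line.split(" ")
    return {"project": project, "title": title,
            "requests": int(requests), "bytes": int(size)}

# Example record of the kind found in an hourly file:
rec = parse_pagecounts_line("en Main_Page 242332 4737756101")
print(rec["requests"])  # 242332
```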
However, this hourly rate of requests is not quite suitable for my
intended model. I am aiming at a model able to react at the level of
seconds or even faster, and for this goal I need time series of the HTTP
requests (pagecounts, traces, and so on) at a resolution of
milliseconds. I am interested only in the number of requests per
time unit (ms) and not in the actual source or destination of these
HTTP requests.
Would it be possible to find the above-mentioned time series at
millisecond (ms) resolution?
Thank you very much,
Laurentiu
Dear members of the Analytics Team!
Please consider my request for information or collaboration. I am
conducting a research project on the international determinants of
education quality. In my view, Wikimedia statistics are a priceless
resource of information on how much learning people do outside of
educational institutions.
I would like to access data on Wikipedia pageviews by country, language,
and content area to measure private learning in different countries. My
previous empirical results suggest that Wikipedia pageviews are highly
correlated with education quality. Unfortunately, the available data does
not allow separating educational pageviews from pure entertainment
pageviews (for example, celebrities' biographies).
I am aware that this data is currently not part of the publicly
available datasets. Please consider two options. First, I am ready to
collaborate with you on making this data available, as other researchers
have done in the past. I would appreciate it if you let me know which steps
I need to take in order to work with you on this task. Second, you could
consider making this data available after achieving the necessary level of
confidentiality. For example, you could group request types so that each
group has at least 1,000 unique IP addresses.
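The grouping idea in the last sentence is threshold-based suppression: publish a group's count only if it is backed by at least k unique IP addresses. A minimal sketch with made-up numbers:

```python
def suppress_small_groups(group_counts, k=1000):
    """Keep only groups whose unique-IP count meets the threshold k."""
    return {group: n for group, n in group_counts.items() if n >= k}

# Hypothetical unique-IP counts per request group:
counts = {"Mathematics": 5200, "Obscure_topic": 12, "Biology": 1800}
print(suppress_small_groups(counts))  # {'Mathematics': 5200, 'Biology': 1800}
```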
I am looking forward to hearing from you about my opportunities to use this data.
I think that it is going to be very interesting to know how much people
learn from Wikipedia, for example, in India versus Brazil and Egypt. Do
people in Indonesia learn less than people in Germany due to poor quality
school systems or low private incentives for learning? I am also sure that
many social scientists will also benefit from using such information (if
you make it available) and will produce some policy-relevant research.
Best regards,
Alexander Ugarov,
Ph. D. Candidate.
Sam M. Walton College of Business
Department of Economics
University of Arkansas
Office: ECOB260
E-mail: augarov(a)uark.edu.
(This is a little note I have meant to write for a while. I am sending it
both as a heads-up for other people who work with this data - many may have
encountered some of these issues, but not everybody may be aware of all of
them - and as a contribution to the discussion about the Analytics team's
"operational excellence" quarterly goal
<https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q3_Goals#Analy…>
for Q3.)
So, EventLogging has been a highly useful part of our analytics
infrastructure for years now, critical for the work of many teams. However,
over the course of this year there have been several longstanding issues
that make me wonder if we are giving it enough attention
infrastructure-wise.
1. https://phabricator.wikimedia.org/T146840 Major loss of events in many
different schemas, apparently differing by browser family. This affected
e.g. one of the main metrics we've been using to evaluate hovercards (page
previews) in the reading Web team and was the reason we had to restrict the
analysis of recent A/B tests there to Firefox only. It also created
confusion for users of the Discovery department's mobile search dashboard
and affected the Edit schema as well. No reaction on the task from
Analytics since September 28.
2. https://phabricator.wikimedia.org/T142667 Duplicate (spurious)
EventLogging rows, a long-term issue first observed independently by
people from the Reading web team and myself around April/May. The effect on
query results is small in most cases, but significant in some, and in any
case does not raise confidence in the quality of the data - we would at
least like to know what the most likely explanations are. No reaction from
Analytics since August, despite four "The World Burns" tokens by other data
analysts and a reminder from Reading management.
3. "ERROR 2013 (HY000): Lost connection to MySQL server during query" and
"ERROR 2006 (HY000): MySQL server has gone away" when trying to query EL
data from stat1003. Happening infrequently but often enough to be a major
nuisance at times. (I haven't filed a Phabricator task for this yet, but
brought it up on IRC various times. Arguably more of a database/service
quality issue, but I'm not certain it can't affect query results as well.)
There are various other EL issues I have been encountering more
sporadically (and in some cases still need to file Phabricator tasks for),
but these are some of the most important.
I am wondering whether this list may be a better venue for raising
awareness when things get stale on Phabricator.
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB