Hello volunteer developers & technical contributors!
The Wikimedia Foundation is asking for your feedback in a survey. We want
to know how well we are supporting your contributions on and off wiki, and
how we can change or improve things in the future. The opinions you
share will directly affect the current and future work of the Wikimedia
Foundation. To say thank you for your time, we are giving away 20 Wikimedia
T-shirts to randomly selected people who take the survey. The survey is
available in various languages and will take between 20 and 40 minutes.
Use this link to take the survey now:
You can find more information about this project here. This survey is
hosted by a third-party service and governed by this privacy statement.
Please visit our frequently asked questions page to find more information
about this survey. If you need additional help or have questions about
this survey, send an email to surveys@wikimedia.org.
Survey Specialist, Community Engagement
 This survey is primarily meant to get feedback on the Wikimedia
Foundation's current work, not long-term strategy.
Legal stuff: No purchase necessary. Must be the age of majority to
participate. Sponsored by the Wikimedia Foundation located at 149 New
Montgomery, San Francisco, CA, USA, 94105. Ends January 31, 2017. Void
where prohibited. Click here for contest rules.
> 1. Is there a unique key for the query log? The log I am referring to
> is the *wdqs_extract* table from the Hive database *wmf*. We would
> like to be able to permanently link our own computed data with the
> log entry we computed it from.
I think you can use hostname+sequence as a key (those fields are
preserved in wdqs_extract).
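For example, a composite key could be built by simply concatenating the two fields (a minimal sketch; the field names hostname and sequence come from this thread, but the dict layout and sample values are just illustrative, not the actual wdqs_extract schema):

```python
def make_key(row):
    """Build a stable composite key from the hostname and sequence
    fields of a log row. The dict layout and values here are
    hypothetical; only the two field names come from the thread."""
    return "{}_{}".format(row["hostname"], row["sequence"])

row = {"hostname": "cp1066.eqiad.wmnet", "sequence": 123456789}
print(make_key(row))  # cp1066.eqiad.wmnet_123456789
```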
> 2. Is it possible to find out if a query in a given log entry was
> accepted by the sparql endpoint as valid?
If it wasn't accepted, the HTTP result code should be 400.
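So one way to separate accepted from rejected queries is to filter on the status code, along these lines (a sketch; the field name http_status and the row layout are assumptions for illustration, not the actual table schema):

```python
# Split log rows into accepted and rejected queries by HTTP status.
# A 400 status indicates the SPARQL endpoint rejected the query as
# invalid; field names and sample rows here are illustrative only.
rows = [
    {"query": "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1", "http_status": 200},
    {"query": "SELECT ?s WHERE { broken", "http_status": 400},
]

rejected = [r for r in rows if r["http_status"] == 400]
accepted = [r for r in rows if r["http_status"] != 400]
print(len(accepted), len(rejected))  # 1 1
```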
> 3. Is there any other database system besides hive installed on the
I think the currently recommended interface is beeline; I'm not sure
about other DB systems.
> And finally a question on conventions for this mailing list: Am I
> correct in sending one mail for multiple questions or should I send
> separate mails for each question?
I think it's ok. For the questions regarding data and other WDQS
specifics you may also CC me or discovery@lists.wikimedia.org.
Do check the latest blog post by Andrew; the question of how to import
plain JSON into Hadoop from Kafka comes up frequently, and he explains
how to do it step by step:
Yours truly should not be listed as an author, as I just proofread it.
Just saying.
analytics-store was brought down at 6am, and then again at 9am UTC on
25 Dec, due to multiple executions of long-running queries (some of
them running for 2 days), such as:
SELECT LEFT(timestamp, 8) AS yearmonthday, timestamp, userAgent, clientIp,
webHost, COUNT(*) AS copies FROM log.PageContentSaveComplete ...

SELECT COUNT(*) AS count, term_entity_type, term_type, term_language FROM

SELECT date('20161218000000') AS day, actions, COUNT(*) AS repeated FROM
(SELECT group_concat(event_action ORDER BY timestamp, action_order.ord
SEPARATOR '-') AS actions FROM (SELECT ...
I would urge you to set up per-user/per-service query resource limits;
otherwise poorly performing queries will affect all users (and, in
cases like this, cause downtime). I have temporarily set up query
limits for all research/analytics users until 3rd January.