Hi all,
For all Hive users using stat1002/1004, you might have seen a deprecation
warning when you launch the hive client - that claims it's being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Forwarding this question to the public Analytics list, where it's good to
have these kinds of discussions. If you're interested in this data and how
it changes over time, do subscribe and watch for updates, notices of
outages, etc.
Ok, so on to your question. You'd like the *total # of articles for each
wiki*. I think the simplest way right now is to query the AQS (Analytics
Query Service) API, documented here:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
To get the # of articles for a wiki, let's say en.wikipedia.org, you can
get the timeseries of new articles per month since the beginning of time:
*https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900
<https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org…>*
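As a rough sketch of how the URL above could be used from a script: the snippet below builds the same URL for any wiki and sums the monthly counts. The response shape (`items` -> `results` -> `new_pages`) is an assumption, as are the helper names and the User-Agent string. Summing new pages per month only approximates a current article count, since it ignores deletions.

```python
import json
import urllib.request
from urllib.parse import quote

# Base of the endpoint used in the URL above.
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"


def new_pages_url(project, start="2001010100", end="2018032900"):
    """Build the monthly new-pages URL for one wiki (hypothetical helper)."""
    parts = [AQS_BASE, quote(project, safe=""), "all-editor-types",
             "all-page-types", "monthly", start, end]
    return "/".join(parts)


def total_new_pages(payload, field="new_pages"):
    """Sum monthly counts in a decoded response.

    Assumes the shape {"items": [{"results": [{field: N}, ...]}]};
    pass a different `field` if the API names the count differently.
    """
    return sum(
        row.get(field, 0)
        for item in payload.get("items", [])
        for row in item.get("results", [])
    )


def fetch_total(project):
    """Fetch and sum the timeseries for one wiki (performs a network call)."""
    request = urllib.request.Request(
        new_pages_url(project),
        headers={"User-Agent": "article-count-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return total_new_pages(json.load(response))
```

Calling `fetch_total("en.wikipedia.org")` would then return one number for that wiki, assuming the response really has the shape described.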
And to get a list of all wikis, to plug into that URL instead of "
en.wikipedia.org", the most up-to-date information is here:
https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or via the
mediawiki API:
https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&form….
Sometimes new sites won't have data in the AQS API for a month or two until
we add them and start crunching their stats.
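If you want to turn the sitematrix output into a list of hostnames, something like the following might work. It takes the already-decoded JSON, so it makes no assumptions about the request parameters. The response layout is an assumption on my part: language groups under numeric keys, each with a `site` list whose Wikipedia entry has `code` equal to `wiki`, and an optional `closed` flag. `wikipedia_hosts` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def wikipedia_hosts(sitematrix_response, include_closed=False):
    """Return sorted Wikipedia hostnames from a decoded sitematrix response.

    Assumed layout: {"sitematrix": {"count": N, "0": {"site": [...]}, ...,
    "specials": [...]}}, where each site has "url" and "code".
    """
    hosts = []
    matrix = sitematrix_response.get("sitematrix", {})
    for key, group in matrix.items():
        # Only numeric keys are language groups; skip "count" and "specials".
        if not str(key).isdigit():
            continue
        for site in group.get("site", []):
            if site.get("code") != "wiki":
                continue  # not a Wikipedia (e.g. wiktionary, wikibooks)
            # "closed" may be a boolean or a bare marker, depending on format.
            closed = site.get("closed", False) is not False
            if closed and not include_closed:
                continue
            hosts.append(urlparse(site["url"]).netloc)
    return sorted(hosts)
```

Each hostname returned can be plugged into the AQS URL in place of "en.wikipedia.org".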
The way I figured this out is to look at how our UI uses the API:
https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages.
So if you were interested in something else, you can browse around there
and take a look at the XHR requests in the browser console. Have fun!
On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <zzn(a)google.com
> wrote:
> Hi Dan,
>
> How are you? This is Victor. It's been a while since we met at the 2018
> Wikimedia Dev Summit. I hope you are doing great.
>
> As I mentioned to you, my team works on extracting knowledge from
> Wikipedia. We are currently working on a project that expands language
> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
> project. She plans to *monitor the list of all currently available
> Wikipedia sites and the number of articles for each language*, so that
> we can compare it with our extraction system's output to sanity-check
> whether there is a massive breakage of the extraction logic, or whether we
> need to add/remove languages when a new Wikipedia site is introduced
> to/removed from the Wikipedia family.
>
> I think your team at Analytics at Wikimedia probably knows best where
> we can find this data. Here are 4 places we already know of, but they don't
> seem to have the data.
>
>
> - https://en.wikipedia.org/wiki/List_of_Wikipedias has the
> information we need, but the list is manually edited, not automatic
> - https://stats.wikimedia.org/EN/Sitemap.htm has the full list, but
> the information seems pretty out of date (last updated almost a month ago)
> - StatsV2 UI: https://stats.wikimedia.org/v2/#/all-projects, where I can't
> find the full list or the number of articles
> - API https://wikimedia.org/api/rest_v1/, suggested by elukey on the
> #wikimedia-analytics channel; it doesn't seem to have number-of-articles
> information
>
> Do you know what is a good place to find this information? Thank you!
>
> Victor
>
>
>
> * • **Zainan Zhou(**周载南**) a.k.a. "Victor" * <http://who/zzn>
> * • *Software Engineer, Data Engine
> * •* Google Inc.
> * • *zzn(a)google.com <ecarmeli(a)google.com> - 650.336.5691
> * • * 1600 Amphitheathre Pkwy, LDAP zzn, Mountain View 94043
>
> ---------- Forwarded message ----------
> From: Yuan Gao <gaoyuan(a)google.com>
> Date: Wed, Mar 28, 2018 at 4:15 PM
> Subject: Monitor the number of Wikipedia sites and the number of articles
> in each site
> To: Zainan Victor Zhou <zzn(a)google.com>
> Cc: Wenjie Song <wenjies(a)google.com>, WikiData <wikidata(a)google.com>
>
>
> Hi Victor,
> as we discussed in the meeting, I'd like to monitor:
> 1) the number of Wikipedia sites
> 2) the number of articles in each site
>
> Can you help us contact WMF to get a realtime, or at least daily,
> update of these numbers? What we can find now is
> https://en.wikipedia.org/wiki/List_of_Wikipedias, but the number of
> Wikipedia sites is manually updated, and possibly out-of-date.
>
>
> The monitor can help us catch such bugs.
>
> --
> Yuan Gao
>
>
Hello Dear Analytics team,
I have a question about WikiMetrics from WMUK. I have users, e.g. user Sian
EJ and user Llywelyn2000, for whom WikiData does not appear amongst the
projects at all when I run reports on their usernames in WikiMetrics with
central auth authentication, while if we check their contributions
on WikiData, there is clearly activity for the same period, e.g.
https://www.wikidata.org/w/index.php?title=Special:Contributions/Sian_EJ&offset=&limit=500&target=Sian+EJ
Do you happen to know about a potential problem with authenticating the
wikidata accounts through central auth? For other users we do not have the
same problem so it might be connected to something else. Can you please
look into it for us?
Thank you very much for your opinion on this matter!!!
Wishing a lovely day,
Agnes
--
Best,
*Agnes Bruszik - *Programme Evaluation Assistant
Wikimedia UK
+44 2033720769
*Wikimedia UK* is the national chapter for the global Wikimedia open
knowledge movement. We rely on donations from individuals to support our
work to make knowledge open for all. Have you considered supporting
Wikimedia? https://donate.wikimedia.org.uk
Wikimedia UK is a Company Limited by Guarantee registered in England and
Wales, Registered No. 6741827. Registered Charity No.1144513. Registered
Office 4th Floor, Development House, 56-64 Leonard Street, London EC2A 4LT.
United Kingdom. Wikimedia UK is the UK chapter of a global Wikimedia
movement. The Wikimedia projects are run by the Wikimedia Foundation (who
operate Wikipedia, amongst other projects).
*Wikimedia UK is an independent non-profit charity with no legal control
over Wikipedia nor responsibility for its contents.*
Hello all,
The candidate slate has been deemed final.
Thanks to the new candidates joining the committee to support
this important mission of sustaining our code of conduct.
We didn't receive many candidacies this session, so we'd like
to ask you to consider over the next months whether you would like to serve
on the 2019 half I committee, and to be ready in October to come
on board.
[ Candidates URL ]
https://www.mediawiki.org/wiki/Code_of_Conduct/Committee/Candidates/2018-I
This page will be moved to the members page when the new committee
takes office, after a training period for new members.
[ Procedure ]
We haven't received any feedback at techconduct (at) wikimedia.org
regarding any of the candidates.
In addition to the procedure the Code of Conduct provides, we also
received additional comments at [[mw:Talk:Code of
Conduct/Committee/Candidates/2018-I]] and consider these comments
partly addressed and partly still to be resolved in the coming weeks, and
unrelated to the current candidates.
--
Sébastien Santoro aka Dereckson
http://www.dereckson.be/
[Crossposting to Research and Analytics lists]
Most Wikipedia articles with a weekly periodicity show more pageviews
on a typical weekday than a weekend. Some articles associated with
weekends (e.g. articles associated with a variety of hobbies) will
show relatively fewer page views on weekdays.
Suppose I wanted to plot a heatmap with colors corresponding to the
strength of the weekly periodicity of the pageviews of articles shown
in different geographic locations.
(1) Has anyone done anything like this before?
(2) Is sufficient information available from the current logging regime?
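To make "strength of the weekly periodicity" concrete, one possible measure is the autocorrelation of daily pageviews at a lag of 7 days. This is only a sketch of one such measure and not an established method for this data. The function names and the location-keyed input are hypothetical, and it assumes per-location daily counts are available at all.

```python
def lag_autocorrelation(series, lag=7):
    """Autocorrelation of a daily series at the given lag (7 = weekly)."""
    n = len(series)
    if n <= lag:
        raise ValueError("series must be longer than the lag")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a flat series has no periodicity to measure
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def strength_by_location(daily_views):
    """Map each location to its weekly-periodicity score.

    `daily_views` is a hypothetical dict of location -> list of daily counts.
    """
    return {loc: lag_autocorrelation(views) for loc, views in daily_views.items()}
```

The resulting per-location scores are what a heatmap would color by; values near 1 mean a strong weekly pattern, values near 0 mean none.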
Finally, I would also like to ask for review of this summarization, please:
https://www.mediawiki.org/w/index.php?title=Wikimedia_Technology%2FAnnual_P…
Best regards,
Jim
Hi everyone,
This is our last reminder for you to complete the Wikimedia Communities &
Contributors survey.
* To those of you who have taken the survey - thank you so much! We really
appreciate your responses. *
*This survey is closing in less than three days on Sunday 22 April 2018.*
If you are a volunteer developer and have contributed code to any pieces of
MediaWiki, gadgets, or tools, please complete the survey. The opinions you
share will affect the work of the Wikimedia Foundation.
Follow this link to take the survey:
https://wikimedia.qualtrics.com/jfe/form/SV_5ABs6WwrDHzAeLr?aud=DEV
If you have already seen a similar message on Phabricator, Mediawiki.org,
Discourse, or other platforms for volunteer developers, please don't take
the survey twice. It is available in various languages and will take about
20 minutes to complete.
You can find more information about this survey on the project page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/About_CE_>
and see how your feedback helps the Wikimedia Foundation support contributors
like you. This survey is hosted by a third-party service and governed
by this privacy statement
<https://wikimediafoundation.org/wiki/Community_Engagement_Insights_2018_Sur…>.
Please visit our frequently asked questions page
<https://meta.wikimedia.org/wiki/Community_Engagement_Insights/Frequently_as…>
to find more information about this survey. Feel free to email me directly
with any questions you may have.
Thank you!
Edward Galvez from the Community Engagement department
Wikimedia Foundation
--
Edward Galvez
Evaluation Strategist, Surveys
Learning & Evaluation
Community Engagement
Wikimedia Foundation
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, April 18,
2018 at 11:30 AM (PDT) 18:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=Z1pa-pr6xis
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here.
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#Upcoming_Showcase>
The Critical Relationship of Volunteer Created Wikipedia Content to
Large-Scale Online Communities
By *Nate TeBlunthuis*
The extensive Wikipedia
literature has largely considered Wikipedia in isolation, outside of the
context of its broader Internet ecosystem. Very recent research has
demonstrated the significance of this limitation, identifying critical
relationships between Google and Wikipedia that are highly relevant to many
areas of Wikipedia-based research and practice. In this talk, I will
present a study which extends this recent research beyond search engines to
examine Wikipedia’s relationships with large-scale online communities,
Stack Overflow and Reddit in particular. I will discuss evidence of
consequential, albeit unidirectional relationships. Wikipedia provides
substantial value to both communities, with Wikipedia content increasing
visitation, engagement, and revenue, but we find little evidence that these
websites contribute to Wikipedia in return. Overall, these findings
highlight important connections between Wikipedia and its broader ecosystem
that should be considered by researchers studying Wikipedia. Overall, this
talk will emphasize the key role that volunteer-created Wikipedia content
plays in improving other websites, even contributing to revenue generation.
The Rise and Decline of an Open Collaboration System, a Closer Look
By *Nate TeBlunthuis*
Do patterns of growth and stabilization found in large peer
production systems such as Wikipedia occur in other communities? This study
assesses the generalizability of Halfaker et al.’s influential 2013 paper on
“The Rise and Decline of an Open Collaboration System.” We replicate its
tests of several theories related to newcomer retention and norm
entrenchment using a dataset of hundreds of active peer production wikis
from Wikia. We reproduce the subset of the findings from Halfaker and
colleagues that we are able to test, comparing both the estimated signs and
magnitudes of our models. Our results support the external validity of
Halfaker et al.’s claims that quality control systems may limit the growth
of peer production communities by deterring new contributors and that norms
tend to become entrenched over time.
Kindest regards,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation | Hic
sunt leones
srodlund(a)wikimedia.org
Hi, Nuria:
I reviewed the closest data to what I am looking for, Phabricator
T128132, from
https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/
and the *webrequest* datasets:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest.
I still have a few questions.
1. Is `hashed_host_path' (in the cache dataset) the `hostname' or `uri_host'?
Phabricator T128132 shows the two fields. However, the available data
only shows `hashed_host_path'.
2. There are 6 fields - hashed_host_path, uri_query,
content_type, response_size, time_firstbyte, and x_cache - in the caching
dataset, as shown in the attachment screen snapshot.
Does the caching dataset not include page_id? The *webrequest* dataset
seems to contain page_id.
3. I didn't find the sequence field in the caching dataset. I learned that
sequence replaces the timestamp. Is `sequence' the file name of the downloads
in the caching dataset?
4. Does `dt' (in the *webrequest* dataset) mean a timestamp in ISO 8601
<https://en.wikipedia.org/wiki/en:ISO_8601> format? The *webrequest*
dataset might be what I am looking for, if it can provide per-second
access traces.
5. According to the descriptions on the *webrequest* webpage, the
*webrequest* datasets should contain at least `hostname', `page_id', and
`dt'. If true, the *webrequest* datasets seem to cover most of my
requirements. Is there any download link available for the *webrequest*
datasets?
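To illustrate what I mean by per-second access traces, here is a small sketch that groups records by `page_id` and second. It assumes each record is a dict with `page_id` and `dt` keys, and that `dt` looks like `2018-04-01T00:00:05` with an optional trailing `Z`. Both the record layout and the exact timestamp format are my assumptions, not confirmed properties of the dataset.

```python
from collections import Counter
from datetime import datetime


def parse_dt(value):
    """Parse an ISO 8601 timestamp with second resolution (assumed format)."""
    return datetime.strptime(value.rstrip("Z"), "%Y-%m-%dT%H:%M:%S")


def requests_per_second(rows):
    """Count requests per (page_id, second) over an iterable of records."""
    counts = Counter()
    for row in rows:
        counts[(row["page_id"], parse_dt(row["dt"]))] += 1
    return counts
```

Each key of the result is a (page, second) pair and each value is the number of requests seen for it.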
--
Sincerely,
TA-YUAN
Greetings group,
I saw something a couple of days ago about a server migration, so perhaps
that is related, but the data path
https://dumps.wikimedia.org/other/pagecounts-ez/merged/2018/
doesn't have any April data. Strangely enough, my ingestion process was
able to collect data up through APR-03, implying that the data/path did
exist before something changed. Thanks, -AGW
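For anyone else tracking this, a simple way to spot such gaps is to compare the days an ingestion job has collected against the expected range. This is a generic sketch and not my actual ingestion code; `missing_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def missing_days(ingested, start, end):
    """Return the dates in [start, end] that are absent from `ingested`."""
    have = set(ingested)
    missing = []
    day = start
    while day <= end:
        if day not in have:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```

With data collected through April 3 and an expected range through April 5, it would report April 4 and April 5 as missing.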