Please join for the following tech talk:
*Tech Talk:* New readership data: Some things we've been learning
recently about how Wikipedia is read
*Presenter:* Tilman Bayer
*Date:* March 18th, 2016
*Time:* 18:00 UTC
<http://www.timeanddate.com/worldclock/fixedtime.html?msg=Tech+Talk%3A+New+r…>
Link to live YouTube stream <http://www.youtube.com/watch?v=Qo4XIzCJZVs>
*IRC channel for questions/discussion:* #wikimedia-office
*Summary:* This talk will highlight various recent insights and new sources
of data on how readers read Wikipedia, going beyond the familiar pageview
numbers (that tell us which topics are popular and how overall traffic is
developing, but not e.g. which parts of articles are being read). While we
are still only beginning to understand some of these aspects, we now know
more than a year or two ago. The presentation is centered around data
analysis done by the Reading team, but will also include findings by other
WMF teams and by external researchers.
Hi all!
Over the last couple of months, I have worked on introducing a dependency
injection mechanism into MediaWiki core (don't fear, no auto-wiring). My
proposal is described in detail at <https://phabricator.wikimedia.org/T124792>
(yea, TL;DR - just read the top and search the rest if you have a question).
Before we discuss this again on IRC at the RFC meeting on Wednesday (March 23,
2pm PST / 22:00 CEST due to daylight confusion), I would like to invite you to
review the proposal as well as the patches that are up on gerrit. In particular,
any feedback would be appreciated on:
* Introduce top level service locator
<https://gerrit.wikimedia.org/r/#/c/264403/29>.
* Allow reset of global services <https://gerrit.wikimedia.org/r/#/c/270020/>
* WIP: Make storage layer services injectable.
<https://gerrit.wikimedia.org/r/#/c/267692/>
Perhaps also have a look at the documentation included in the change, in
particular the migration part:
<https://gerrit.wikimedia.org/r/#/c/264403/29/docs/injection.txt>
Before commenting on design choices on gerrit, please have a look at T124792 and
see whether I have written something about the issue in question there. I would
like to focus conceptual discussion on the RFC ticket on phabricator, rather
than on gerrit. On gerrit, we can talk about the implementation.
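To make the idea concrete: the proposal centers on a top-level service container from which code pulls fully wired services instead of reaching into global state. Here is a minimal, language-neutral sketch of that pattern in Python (the actual patches are PHP; the service names below are hypothetical, for illustration only):

```python
class ServiceContainer:
    """Minimal service locator: services are defined by factory
    callables and instantiated lazily, at most once each."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def define(self, name, factory):
        """Register a factory; the factory receives the container so a
        service can declare its own dependencies explicitly."""
        self._factories[name] = factory

    def get(self, name):
        # Lazy instantiation: the factory runs on first access only.
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]

    def reset(self):
        # Drop cached instances, e.g. between test cases
        # (cf. the "Allow reset of global services" patch).
        self._instances.clear()


services = ServiceContainer()
# Hypothetical service names, not the real MediaWiki ones:
services.define("LoadBalancer", lambda c: object())
services.define("RevisionStore", lambda c: {"db": c.get("LoadBalancer")})
```

Note there is no auto-wiring here: every dependency is spelled out in its factory, which is the spirit of the proposal.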
I very much want this to move forward. Perhaps we can even get the first bits of
this merged at the hackathon. So, criticize away!
Thanks for your help!
-- daniel
PS: phabricator event page (still blank, we'll fix that soon):
<https://phabricator.wikimedia.org/E66/27>
Hello all,
Google opened proposal submissions for GSoC 2016 a few hours ago.
Interested and eligible candidates should submit their proposals at
http://g.co/gsoc before the deadline of Friday, March 25 at 19:00 UTC.
Wikimedia evaluates your proposal's Phabricator task, but you are required
to have a copy of it in the GSoC portal too, to make sure it gets a
slot (if eligible). By March 25th, every application should have *2
mentors* connected with it, and should have a proposal copy in Phabricator
as well as the GSoC portal. Please mention the Phabricator task details in
your proposal, for convenience. If you are planning to apply, you should be
looking at
Life_of_a_successful_project#Coming_up_with_a_proposal
<https://www.mediawiki.org/wiki/Outreach_programs/Life_of_a_successful_proje…>
As of today, we have *8* projects featured for this round (strong idea + 2
mentors connected), and *13* projects missing one of the two mentors.
Interested in mentoring? See
https://phabricator.wikimedia.org/tag/possible-tech-projects/ and add
yourself as one.
The Outreachy round for May - August 2016 is open, with a deadline of *March
22, 2016*. Eligible applicants are advised to apply for *both* GSoC and
Outreachy, so that the project can still make it in case we miss a GSoC
slot for a strong applicant.
Thinking of motivating someone in your area to take part? Find
flyers and presentations here
<https://developers.google.com/open-source/gsoc/resources/media#logos_and_ar…>
for the GSoC 2016 round!
Thanks,
Tony Thomas <https://www.mediawiki.org/wiki/User:01tonythomas>
Home <http://www.thomastony.me> | Blog <http://blog.thomastony.me> |
ThinkFOSS <http://www.thinkfoss.com>
On Wed, Mar 23, 2016 at 1:06 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
> Dan Andreescu, 23/03/2016 15:58:
>
>>
>> *Clean-up:* Analytics data on dumps was crammed into /other with
>> unrelated datasets. We made a new page to receive current and future
>> datasets [3] and linked to it from /other and /. Please let us know if
>> anything there looks confusing or opaque and I'll be happy to clarify.
>>
>
> I assume the old URLs will redirect to the new ones, right?
>
Good question, we didn't change any old URLs actually, so if you're trying
to get to other/pagecounts-ez, other/pagecounts-raw and all that, they're
all still there, just linked to from /analytics. We did it this way
because we figured people had scripts that depended on those URLs. We
thought about moving and symlinking, but it's unlikely that we'll ever be
able to delete the other/** location anyway.
So mainly we just have a new page where we can do a better job of focusing
on the analytics datasets.
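For anyone with scripts pointed at those files, a hedged sketch of reading them: to my understanding each hourly dump file is plain text with one space-separated record per line, roughly `project page_title view_count byte_size` (please double-check against the current format docs before relying on this; the sample lines below are made up):

```python
from collections import namedtuple

PageviewRecord = namedtuple("PageviewRecord", "project title views size")

def parse_line(line):
    """Parse one line of an hourly pagecounts/pageviews file.
    Assumed format: 'project page_title view_count byte_size'
    (byte_size is reportedly always 0 in the newer pageviews files).
    """
    project, title, views, size = line.strip().split(" ")
    return PageviewRecord(project, title, int(views), int(size))

# Made-up sample lines, for illustration only.
sample = [
    "en Main_Page 1234 0",
    "de.m Berlin 56 0",
]
records = [parse_line(l) for l in sample]
```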
cc-ing our friends in research and wikitech (sorry I forgot initially)
> We're happy to announce a few improvements to Analytics data releases on
> dumps.wikimedia.org:
>
> * We are releasing a new dataset, an estimate of Unique Devices accessing
> our projects [1]
> * We are officially making available a better Pageviews dataset [2]
> * We are deprecating two older pageview statistics datasets
> * We moved Analytics data from /other to /analytics [3]
>
> Details follow:
>
>
> *Unique Devices:* Since 2009, the Wikimedia Foundation used comScore to
> report data about unique web visitors. In January 2016, however, we
> decided to stop reporting comScore numbers [4] because of certain
> limitations in the methodology; these limitations translated into
> misreported mobile usage. We are now ready to replace comScore numbers with
> the Unique Devices dataset [5][1]. While unique devices do not equal
> unique visitors, they are a good proxy for that metric, meaning that a major
> increase in the number of unique devices is likely to come from an increase
> in distinct users. We understand that counting uniques raises fairly big
> privacy concerns, so we use a very privacy-conscious way to count unique
> devices: it does not rely on any cookie by which your browsing history
> could be tracked [6].
>
> We invite you to explore this new dataset and hope it’s helpful for the
> Wikimedia community in better understanding our projects. This data can
> help measure the reach of Wikimedia projects on the web.
>
> *Pageviews:* This [2] is the best quality data available for counting the
> number of pageviews our projects receive at the article and project level.
> We've upgraded from pagecounts-raw to pagecounts-all-sites, and now to
> pageviews, in order to filter out more spider traffic and measure something
> closer to what we think is a real user viewing content. A short history
> might be useful:
>
> * pagecounts-raw: was originally maintained by Domas Mituzas and taken
> over by the analytics team. It was and still is the most used dataset,
> though it has some major problems. It does not count access to the mobile
> site, it does not filter out spider or bot traffic, and it suffers from
> unknown loss due to logging infrastructure limitations.
> * pagecounts-all-sites: uses the same pageview definition as
> pagecounts-raw, and so also does not filter out spider or bot traffic. But
> it does include access to mobile and zero sites, and is built on a more
> reliable logging infrastructure.
> * pagecounts-ez: is derived from the best data available at the time.
> So until December 2015, it was based on pagecounts-raw and
> pagecounts-all-sites, and now it's based on pageviews. This dataset is
> great because it compresses very large files without losing any
> information, still providing hourly page and project level statistics.
>
> So the new dataset, pageviews, is what's behind our pageview API and is
> now available in static files for bulk download back to May 2015. But the
> multiple ways to download pageview data are confusing for consumers, so
> we're keeping only pageviews and pagecounts-ez and deprecating the other
> two. If you'd like to read more about the current pageview definition,
> details are on the research page [7].
>
> *Deprecating:* We are deprecating the pagecounts-raw and
> pagecounts-all-sites datasets in May 2016 (discussion here:
> https://phabricator.wikimedia.org/T130656 ). This data suffers from many
> artifacts, lack of mobile data, and/or infrastructure problems, and so is
> not comparable to the new way we track pageviews. It will remain here
> because we have historical data that may be useful, but it will not be
> maintained or updated beyond May 2016.
>
> *Clean-up:* Analytics data on dumps was crammed into /other with
> unrelated datasets. We made a new page to receive current and future
> datasets [3] and linked to it from /other and /. Please let us know if
> anything there looks confusing or opaque and I'll be happy to clarify.
>
>
> [1] http://dumps.wikimedia.org/other/unique_devices
> [2] http://dumps.wikimedia.org/other/pageviews
> [3] http://dumps.wikimedia.org/analytics/
> [4] https://meta.wikimedia.org/wiki/ComScore/Announcement
> [5] https://meta.wikimedia.org/wiki/Research:Unique_Devices
> [6]
> https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_uni…
> [7] https://meta.wikimedia.org/wiki/Research:Page_view
>
Hi wikitech-l,
After the discussion in analytics-l [1][2] and Phabricator [3], the
Analytics team added a small amendment [4] to Wikimedia's user-agent policy
[5] with the intention of improving the quality of WMF's pageview
statistics.
The amendment asks Wikimedia bot/framework maintainers to optionally add
the word *bot* (case insensitive) to their user-agents. With that, the
analytical jobs that process request data into pageview statistics will be
better able to identify traffic generated by bots, and thus to isolate
traffic originating from humans (the corresponding code is already in
production [6]). The convention is optional, because modifications to the
user-agent can be a breaking change.
Targets of this convention are bots/frameworks that can generate Wikimedia
pageviews [7] against Wikimedia sites and/or the API and are not for in-situ
human consumption. Not targets: bots/frameworks used to assist in-situ human
consumption, and bots/frameworks that are otherwise well known and
recognizable, like WordPress, Scrapy, etc. Note that many editing bots also
generate pageviews; for example, when a bot copies content from one page to
another, the source page is requested and a corresponding pageview is
generated.
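As a rough illustration of the convention: a user-agent containing the word "bot", in any casing, would be picked up by this kind of classification. A sketch of such a check (this is *not* the actual refinery code [6], and the tool name in the example UA is hypothetical):

```python
import re

# Case-insensitive check for "bot" anywhere in the user-agent,
# per the amended policy's optional convention.
_BOT_RE = re.compile(r"bot", re.IGNORECASE)

def looks_like_bot(user_agent):
    """Return True if the UA would be classified as bot traffic
    under a simple substring interpretation of the convention."""
    return bool(_BOT_RE.search(user_agent))

# A hypothetical compliant user-agent for a maintenance script,
# including a contact URL/email as the user-agent policy [5] asks:
ua = "MyWikiSyncBot/1.0 (https://example.org/mywikisync; ops@example.org)"
```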
Cheers!
[1] https://lists.wikimedia.org/pipermail/analytics/2016-January/004858.html
[2]
https://lists.wikimedia.org/pipermail/analytics/2016-February/004882.html
[3] https://phabricator.wikimedia.org/T108599
[4]
https://meta.wikimedia.org/w/index.php?title=User-Agent_policy&type=revisio…
[5] https://meta.wikimedia.org/wiki/User-Agent_policy
[6]
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery…
[7] https://meta.wikimedia.org/wiki/Research:Page_view
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
Hello,
Here is the Discovery department's weekly status update.
* The completion suggester left beta and is now the default
search-as-you-type for all wikis (except Wikidata).
** http://blog.wikimedia.org/2016/03/17/completion-suggester-find-what-you-nee…
* Last week we enabled the Kartographer extension for Wikivoyage sites,
allowing users to add maps to wiki pages without any additional WMF Labs
tools or JavaScript tricks.
** A demo of Kartographer and VisualEditor integration can be found here:
http://vem3.wmflabs.org/wiki/Main_Page
This is our second week summarizing our work in this way and our first week
sharing it with wikitech-l. Feedback and suggestions are welcome.
Read the full update at the following link.
https://www.mediawiki.org/wiki/Discovery/Status_updates/2016_03_18
--
Yours,
Chris Koerner
Community Liaison - Discovery
Wikimedia Foundation
Hi Linxuan,
Thank you for your question:
>... What does the "reputation score" in the description refer to?
I've asked Priyanka to reply with her current design, but here is some
of the advice I gave her:
"Each reviewer needs, at a minimum, data indicating the number and
proportion of reviewers who have agreed with them. However, the third level
of tie-breaking review introduces an extra bit for each disagreement which
determines whether agreement or disagreement should be counted in their
favor. So, even if a given reviewer only agrees with 50% of the other
reviewers, the determination of the tie breaker in each case of
disagreement controls whether their reputation score ranges from 0% to
100%. (As too does the agreement proportion, which is unlikely to be
exactly 50%.)
"Do you want the reviewers to know their agreement ratios and reputation
scores? How might their behavior change if they are and aren't told those?
Could there ever be a case when you might want to withhold them? Would
there ever be a benefit from distorting them? How about displaying them as
a range instead of distorting or withholding them? That last possibility
seems superior to me. You might want to do that when you are unsure that
the precision of the mathematical values is near the accuracy of the
knowledge they represent. Do you want to be able to tell each reviewer the
responses which have contributed to defects in their reputation scores,
i.e., do you want them to know which disagreements were tie-broken against
their favor?"
Her reply at the time was:
"In case of two reviewers agreeing, we add a +1 to the reputation. In case
of disagreement, we seek the opinion of a 3rd reviewer. If A says Yes, B
says No and C says Yes to an edit, A and C will have an agreement ratio of
50% and reputation of 100%, whereas B will have an agreement ratio of 0
and reputation 0%? This would of course change as more edits are reviewed
by them."
I believe that is still an accurate description of the current design.
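Reading the two descriptions together, the example numbers fall out of a simple pairwise rule: an agreement always counts in a reviewer's favor, while a disagreement counts in their favor only if the tie-breaking majority sided with them. A sketch of that interpretation (my own reconstruction of the design, not Priyanka's actual code):

```python
from collections import Counter

def score_reviewers(verdicts):
    """verdicts: dict mapping reviewer -> verdict for one edit.
    Returns {reviewer: (agreement_ratio, reputation)}, both expressed
    as fractions of that reviewer's pairwise comparisons."""
    # Tie-breaking via the majority verdict among all reviewers.
    majority = Counter(verdicts.values()).most_common(1)[0][0]
    scores = {}
    for reviewer, verdict in verdicts.items():
        others = [v for r, v in verdicts.items() if r != reviewer]
        agreements = sum(1 for v in others if v == verdict)
        # A disagreement counts toward reputation only when the
        # tie-break matched this reviewer's verdict.
        favorable = agreements + sum(
            1 for v in others if v != verdict and verdict == majority)
        scores[reviewer] = (agreements / len(others),
                            favorable / len(others))
    return scores

# The example from the thread: A says Yes, B says No, C says Yes.
scores = score_reviewers({"A": "yes", "B": "no", "C": "yes"})
```

Under this rule A and C get an agreement ratio of 50% with reputation 100%, and B gets 0% on both, matching the example above.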
Finally, I regret that the GSoC program doesn't allow more than one student
per project.
Best regards,
Jim Salsman
Hi, I'm Li Linxuan, a second-year student from Peking University, China. I'm familiar with C/C++ and have experience using Python. I have also participated in some projects, including making games and game bots.
I am interested in the "Accuracy Review" project, but there is a note saying the estimated time for a senior contributor is 3 weeks. Other projects in the ideas list have similar estimated times. So should we complete more than one project during the three-month internship, or just one?
Thank you.
Sincerely,
Li Linxuan