Hi all
I'm doing some work with colleagues from the education sector at UNESCO to
look at improving some of the most viewed education articles on English
language Wikipedia.
I'm trying to use Treeviews to find out which are the most viewed
articles in Category:Education. Unfortunately, such large categories just
crash my browser, which means I would have to split the query up into at
least 50-100 smaller queries.
Does anyone know of a less manual way around this? Ideally the output would
be a spreadsheet of each article title and the number of page views of the
article over a 30-, 60- or 90-day period in the recent past. I will use
Treeviews if it is the only way, but I'd really love to save myself half a
day of data entry. I imagine this would also be useful for people working
with other organisations on other subjects.
Thanks
John
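(For what it's worth, this can also be scripted against the public APIs.
Below is a rough, untested Python sketch: it lists the pages in
Category:Education via the MediaWiki API, one level only, without the
subcategory recursion that Treeviews does, then sums each article's daily
pageviews from the Wikimedia REST API over an example 30-day range and
writes a CSV. The User-Agent string and date range are just placeholders.)

    import csv
    import requests
    from urllib.parse import quote

    S = requests.Session()
    # Any descriptive User-Agent will do; the APIs ask for one.
    S.headers.update({"User-Agent": "category-pageviews-sketch/0.1"})

    def category_members(cat):
        """Yield page titles in one category (no subcategory recursion)."""
        params = {"action": "query", "list": "categorymembers",
                  "cmtitle": cat, "cmtype": "page", "cmlimit": "500",
                  "format": "json"}
        while True:
            data = S.get("https://en.wikipedia.org/w/api.php",
                         params=params).json()
            for m in data["query"]["categorymembers"]:
                yield m["title"]
            if "continue" not in data:
                break
            params.update(data["continue"])

    def total_views(title, start="2016032100", end="2016041900"):
        """Sum daily pageviews for one article over an example range."""
        url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
               "per-article/en.wikipedia.org/all-access/user/"
               + quote(title.replace(" ", "_"), safe="")
               + "/daily/" + start + "/" + end)
        r = S.get(url)
        if r.status_code != 200:  # e.g. pages with no pageview data yet
            return 0
        return sum(item["views"] for item in r.json()["items"])

    with open("education_views.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["title", "views"])
        for t in category_members("Category:Education"):
            w.writerow([t, total_views(t)])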
On Wed, Apr 20, 2016 at 12:39 AM, <alexhinojo(a)gmail.com> wrote:
> Hi, as some of you may know, the Wikipedia gender indicator [1] tells us how many articles are biographies about women, by language/country/culture.
>
> In order to compare these numbers: does anyone know if there is an existing comparison with the gender balance in classical encyclopedias (Britannica, Larousse...)? Or, if not, could someone prepare a WD query about it?
>
> I think it could be a good argument for us to use, e.g. "at cawiki 12% of bios are about women, compared to 5% in GEC, our most famous encyclopedia".
>
> We could also compare it with thematic encyclopedias or other databases already existing in projects like Mix'n'match.
>
> Can someone help? thanks in advance
>
>
> [1] http://wigi.wmflabs.org/
>
>
> Àlex Hinojo
> User:Kippelboy
> Amical Wikimedia Programme manager
Interesting question. There may be more suitable venues for it, e.g.
the research mailing list (CCed). Anyway, to start with two examples:
http://reagle.org/joseph/pelican/social/gender-bias-in-wikipedia-and-britan…
https://meta.wikimedia.org/wiki/Research:Newsletter/2015/May#Notable_women_…
The second links to a comparison of Wikipedia with, among other sources, "Human
Accomplishment", a 2003 "ranking of geniuses throughout the ages and
around the world based on their prominence in contemporary
encyclopedias" (NYT)
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
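As for the WD query Àlex asked about, here is a minimal, untested sketch
that counts biographies of women with an article on one Wikipedia (cawiki
in the example) via the Wikidata Query Service. A count like this can time
out on the largest wikis, and the classical-encyclopedia side would still
have to be compiled by hand:

    import requests

    # Humans (Q5) with sex or gender (P21) = female (Q6581072) that have
    # a sitelink on ca.wikipedia.org; swap the site URL for other wikis.
    QUERY = """
    SELECT (COUNT(DISTINCT ?person) AS ?women) WHERE {
      ?article schema:about ?person ;
               schema:isPartOf <https://ca.wikipedia.org/> .
      ?person wdt:P31 wd:Q5 ;
              wdt:P21 wd:Q6581072 .
    }
    """

    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": QUERY, "format": "json"},
                     headers={"User-Agent": "gender-balance-sketch/0.1"})
    print(r.json()["results"]["bindings"][0]["women"]["value"])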
Here's another useful link to a form that helps you construct the API call:
https://wikimedia.org/api/rest_v1/?doc#!/Unique_devices_data/get_metrics_un…
On Tue, Apr 19, 2016 at 12:17 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> Hello!
>
> The analytics team is happy to announce that the Unique Devices data is
> now available to be queried programmatically via an API.
>
> This means that getting the daily number of unique devices [1] for English
> Wikipedia for the month of February 2016, for all sites (desktop and
> mobile) is as easy as launching this query:
>
>
> https://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/a…
>
> You can get started by taking a look at our docs:
> https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices#Quick_Start
>
> If you are not familiar with the Unique Devices data, the main thing you
> need to know is that it is a good proxy metric for unique users; more
> info below.
>
> Since 2009, the Wikimedia Foundation had used comScore to report data about
> unique web visitors. In January 2016, however, we decided to stop
> reporting comScore numbers [2] because of certain limitations in the
> methodology; these limitations translated into misreported mobile usage. We
> are now ready to replace the comScore numbers with the Unique Devices
> dataset. While unique devices do not equal unique visitors, they are a good
> proxy for that metric, meaning that a major increase in the number of
> unique devices is likely to come from an increase in distinct users. We
> understand that counting uniques raises fairly big privacy concerns, and we
> use a very privacy-conscious way to count unique devices: it does not
> involve any cookie by which your browsing history could be tracked [3].
>
>
> [1] https://meta.wikimedia.org/wiki/Research:Unique_Devices
> [2] https://meta.wikimedia.org/wiki/ComScore/Announcement
> [3]
> https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_unique_devices.3F
>
>
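For example, the February 2016 query quoted above, run from Python (a
minimal sketch):

    import requests

    # Daily unique devices for English Wikipedia, all sites, Feb 2016.
    url = ("https://wikimedia.org/api/rest_v1/metrics/unique-devices/"
           "en.wikipedia.org/all-sites/daily/20160201/20160229")
    r = requests.get(url,
                     headers={"User-Agent": "unique-devices-sketch/0.1"})
    for item in r.json()["items"]:
        print(item["timestamp"], item["devices"])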
I'd like to announce an IEG proposal I'm working on titled "Learning from
article revision histories" [1]. If anyone who has studied the evolution of
Wikipedia articles (e.g., the extent to which articles consistently improve
in quality over time) is interested in the project, please consider getting
in touch with me, as
I'd love to hear your thoughts. I'm excited about the possible usefulness
of this sort of research for the Wikipedia community, but I'm new to
Wikipedia and I know I am not the first one to ask some of these questions.
If more experienced Wikipedians would like to weigh in on the usefulness of
addressing the efficiency of the collaborative editing process, I've posted
some discussion topics on the IEG proposal's talk page [2].
Pierce Edmiston
[1]:
https://meta.wikimedia.org/wiki/Grants:IEG/Learning_from_article_revision_h…
[2]:
https://meta.wikimedia.org/wiki/Grants_talk:IEG/Learning_from_article_revis…
Hey folks, we have a couple of announcements for you today. The first is
that ORES has a large set of new functionality that you might like to take
advantage of. We also want to talk about a *BREAKING CHANGE on April
7th.*
Don't know what ORES is? See
http://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/
*New functionality*
*Scoring UI*
Sometimes you just want to score a few revisions in ORES, and remembering
the URL structure is hard. So, we've built a simple scoring user interface
<https://ores.wmflabs.org/ui/> that will allow you to more easily score a
set of edits.
*New API version*
We've been consistently getting requests to include more information in
ORES' responses. In order to make space for this new information, we needed
to change the structure of responses. But we wanted to do this without
breaking the tools that are already using ORES. So, we've developed a
versioning scheme that will allow you to take advantage of new
functionality when you are ready. The same old API will continue to be
available at https://ores.wmflabs.org/scores/, but we've added two
additional paths on top of this.
- https://ores.wmflabs.org/v1/scores/ is a mirror of the old scoring API
which will henceforth be referred to as "v1"
- https://ores.wmflabs.org/v2/scores/ implements a new response format
that is consistent between all sub-paths and adds some new functionality
*Swagger documentation*
Curious about the new functionality available in "v2", or maybe what
changed from "v1"? We've implemented a structured description of both
versions of the scoring API using Swagger -- which is becoming a de facto
standard for this sort of thing. Visit https://ores.wmflabs.org/v1/ or
https://ores.wmflabs.org/v2/ to see the Swagger user interface. Visit
https://ores.wmflabs.org/v1/spec/ or https://ores.wmflabs.org/v2/spec/
to get the specification in a machine-readable format.
*Feature values & injection*
Have you wondered what ORES uses to make its predictions? You can now ask
ORES to show you the list of "feature" statistics it uses to score
revisions. For example,
https://ores.wmflabs.org/v2/scores/enwiki/wp10/34567892/?features will
return the score with a mapping of feature values used by the "wp10"
article quality model in English Wikipedia to score oldid=34567892
<https://en.wikipedia.org/wiki/Special:Diff/34567892>. You can also
"inject" features into the scoring process to see how that affects the
prediction. E.g.,
https://ores.wmflabs.org/v2/scores/enwiki/wp10/34567892?features&feature.wi…
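A minimal Python sketch of both calls; the injected feature name below is
only a placeholder, so take real names from the first response:

    import json
    import requests

    url = "https://ores.wmflabs.org/v2/scores/enwiki/wp10/34567892"

    # Score plus the feature values the model used (the ?features flag).
    resp = requests.get(url, params={"features": ""},
                        headers={"User-Agent": "ores-features-sketch/0.1"})
    print(json.dumps(resp.json(), indent=2))

    # "Inject" a feature by passing feature.<name>=<value> alongside
    # ?features. "feature.some_feature_name" is a placeholder, not a real
    # feature name; use one from the response above.
    resp = requests.get(url, params={"features": "",
                                     "feature.some_feature_name": "42"})
    print(json.dumps(resp.json(), indent=2))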
*Breaking change -- new models*
We've been experimenting with new learning algorithms to make ORES work
better and we've found that we get better results with gradient boosting
<https://en.wikipedia.org/wiki/Gradient_boosting> and random forest
<https://en.wikipedia.org/wiki/Random_forest> strategies than we do with
the current linear SVC
<https://en.wikipedia.org/wiki/Support_vector_machine> models. We'd like to
get these new, better models deployed as soon as possible, but with the new
algorithm comes a change in the range of probabilities returned by the
model. So, when we deploy this change, any tool that uses hard-coded
thresholds on ORES' prediction probabilities will suddenly start behaving
strangely. Regrettably, we haven't found a way around this problem, so
we're announcing the change now and we plan to deploy this *BREAKING CHANGE
on April 7th*. Please subscribe to the AI mailing list
<https://lists.wikimedia.org/mailman/listinfo/ai> or watch our project page
[[:m:ORES <https://meta.wikimedia.org/wiki/ORES>]] to catch announcements
of future changes and new functionality.
In order to make sure we don't end up in the same situation the next time
we want to change an algorithm, we've included a suite of evaluation
statistics with each model. The filter_rate_at_recall(0.9),
filter_rate_at_recall(0.75), and recall_at_fpr(0.1) thresholds represent
three critical thresholds (should review, needs review, and definitely
damaging -- respectively) that can be used to automatically configure your
wiki tool. You can find out these thresholds for your model of choice by
adding the ?model_info parameter to requests. So, when the breaking change
lands, we strongly recommend basing your thresholds on these statistics.
We'll be working to submit patches to tools that use ORES in the next week
to implement this flexibility. Hopefully, all you'll need to do is work
with us on those.
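For example, a minimal sketch of such a request ("damaging" below is used
only as an example model name; substitute your model of choice):

    import json
    import requests

    # Fetch the evaluation statistics, including the recommended
    # thresholds, that ship with a model.
    resp = requests.get("https://ores.wmflabs.org/v2/scores/enwiki/damaging",
                        params={"model_info": ""},
                        headers={"User-Agent": "ores-thresholds-sketch/0.1"})
    print(json.dumps(resp.json(), indent=2))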
-halfak & The Revision Scoring team
<https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service>
Cross-posting this request to wiki-research-l. Anyone have data on
frequently used section titles in articles (any language), or know of
datasets/publications that examined this?
I'm not aware of any off the top of my head, Amir.
- Jonathan
---------- Forwarded message ----------
From: Amir E. Aharoni <amir.aharoni(a)mail.huji.ac.il>
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hi,
Did anybody ever try to collect statistics about frequent section titles in
Wikimedia projects?
For Wikipedia, for example, titles such as "Biography", "Early life",
"Bibliography", "External links", "References", "History", etc., appear in
a lot of articles, and their counterparts appear in a lot of languages.
There are probably similar things in Wikivoyage, Wiktionary and possibly
other projects.
Did anybody ever try to collect statistics of the most frequent section
titles in each language and project?
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
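One rough, untested way to get a first approximation: sample random
articles via the MediaWiki API and tally their section headings. For
serious statistics across whole projects you would parse the dumps
instead, but as a sketch:

    from collections import Counter
    import requests

    S = requests.Session()
    S.headers.update({"User-Agent": "section-title-sketch/0.1"})
    API = "https://en.wikipedia.org/w/api.php"

    counts = Counter()
    for _ in range(20):  # 20 batches of 10 random mainspace articles
        r = S.get(API, params={"action": "query", "list": "random",
                               "rnnamespace": "0", "rnlimit": "10",
                               "format": "json"}).json()
        for page in r["query"]["random"]:
            sec = S.get(API, params={"action": "parse",
                                     "page": page["title"],
                                     "prop": "sections",
                                     "format": "json"}).json()
            for s in sec.get("parse", {}).get("sections", []):
                counts[s["line"]] += 1  # "line" is the heading text

    for title, n in counts.most_common(20):
        print(n, title)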
Hello all,

I'm preparing to participate in the Individual Engagement Grants (IEG)
program and have an idea closely linked to the Accuracy Review project
raised by James Salsman.

Here is a brief summary of my proposal: out-of-date information and
references are common in Wikipedia articles, especially in Chinese
Wikipedia. I would therefore like to evaluate some existing approaches to
identifying out-of-date content, and create a new bot to identify such
information based on the test results. More detailed tests will follow,
using selected articles from Wikipedia and the cases that we compile.

Here is the URL of the project proposal:
https://meta.wikimedia.org/wiki/Grants:IdeaLab/Searching_for_out-of-date_in…

Please comment on the proposal on its discussion page:
https://meta.wikimedia.org/wiki/Grants_talk:IdeaLab/Searching_for_out-of-da…

Li Linxuan