Hi everyone,
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian
<https://it.wikipedia.org/wiki/Categoria:Stub> or English
<https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the lists
are in different formats, so separate code is required for each language,
which doesn't scale.
I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires knowing
what the relevant template is for each language. So if anyone could point
me to a list of stub templates in different languages, that would also be
appreciated.
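For what it's worth, here's roughly the grep fallback I have in mind, as a
Python sketch (the two patterns below are guesses for illustration;
collecting the real per-language template list is exactly the part I'm
missing):

    import bz2
    import re

    # per-language stub template patterns -- illustrative guesses only
    STUB_RE = {
        "en": re.compile(r"\{\{[^{}]*-stub\}\}", re.IGNORECASE),
        "it": re.compile(r"\{\{\s*[Ss]\s*[|}]"),
    }

    def stub_titles(dump_path, lang):
        """Scan a pages-articles XML dump (.bz2) and yield the titles
        of pages whose wikitext matches the stub pattern."""
        pattern = STUB_RE[lang]
        title, matched = None, False
        with bz2.open(dump_path, "rt", encoding="utf-8") as f:
            for line in f:
                if "<title>" in line:
                    title = line.split("<title>")[1].split("</title>")[0]
                    matched = False
                elif not matched and pattern.search(line):
                    matched = True
                    yield title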
Thanks!
Bob
--
Up for a little language game? -- http://www.unfun.me
We're starting to wrap up Q1, so it's time for another wikistats update.
First, a quick reminder:
-----
If you currently use the existing reports, PLEASE give feedback in the
section(s) at
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting what you use, how you use it, and explaining what
elements you most appreciate or might want added.
-----
Ok, so this is our list of high level goals, and as we were saying before,
we're focusing on taking a vertical slice through 4, 5, and 6 so we can
deliver functionality and iterate.
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public
consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality
of stats.wikimedia.org
9. [ ] Officially replace stats.wikimedia.org with *(maybe)
analytics.wikipedia.org*
*. [ ] Bonus: *replace dumps generation* based on the new data
pipelines
So here's the progress since last time by high level goal:
4. We can rebuild almost all page and user histories from the logging,
revision, page, archive, and user mediawiki tables. The scala / spark
algorithm scales well and can process English Wikipedia in less than an
hour. Once history is rebuilt, we want to join it into a denormalized
schema. We have an algorithm that works on simplewiki rather quickly, but
we're *still working on scaling* it to work with English wiki. For that
reason, our vertical slice this quarter may include *only simplewiki*. In
addition to denormalizing the data to make it very simple for analysts and
researchers to work with, we're also computing columns like "this edit was
reverted at X timestamp" or "this page was deleted at X timestamp". These
will all be available in one flat schema.
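To illustrate the "reverted at X timestamp" column, here's a minimal
Python sketch of the identity-revert idea (a revert meaning some later
revision restores an earlier revision's SHA1); this is not our actual
scala / spark code:

    def mark_reverts(revisions):
        """revisions: [(timestamp, sha1), ...] for one page, oldest
        first. Returns a list where entry i is the timestamp at which
        revision i was reverted, or None if it never was."""
        reverted_at = [None] * len(revisions)
        last_seen = {}  # sha1 -> index of latest revision with that hash
        for i, (ts, sha1) in enumerate(revisions):
            if sha1 in last_seen:
                # everything between the matching older revision and
                # this one was undone at time ts
                for j in range(last_seen[sha1] + 1, i):
                    if reverted_at[j] is None:
                        reverted_at[j] = ts
            last_seen[sha1] = i
        return reverted_at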
5. We loaded the simplewiki data into Druid and put Pivot on top of it.
It's fantastically fun; I had to close that tab or I would've lost a day
browsing around. For a small db like simplewiki, Druid should have no
problem maintaining an updated version of the computed columns mentioned
above. (I say updated because "this edit was reverted" is a fact that can
change from false to true at some point in the future.) We're still not
100% sure whether Druid can do that with the much larger enwiki data, but
we're testing that. And we're also testing ClickHouse, another highly
performant OLAP columnar store, just in case. In short, we can already
update *once a week*, and we're working on seeing how feasible it is to
update more often than that.
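For a flavor of what querying the loaded data looks like, here's a hedged
sketch using Druid's native JSON query API from Python (the datasource
name and broker host are made up; 8082 is Druid's default broker port):

    import json
    import requests

    query = {
        "queryType": "timeseries",
        "dataSource": "edits_simplewiki",  # hypothetical datasource
        "granularity": "day",
        "intervals": ["2016-01-01/2016-10-01"],
        "aggregations": [
            {"type": "longSum", "name": "edits", "fieldName": "count"}
        ],
    }
    resp = requests.post("http://druid-broker.example:8082/druid/v2/",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(query))
    for row in resp.json():
        print(row["timestamp"], row["result"]["edits"])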
6. We ran into a *problem* when thinking about sanitizing the data. Our
initial idea was to filter out the same columns that are filtered out when
data is replicated to labsdb. But we found that rows are also filtered,
and the process for doing that filtering is in need of a lot of love and
care. So we may sidetrack to see if we can help out our fellow DBAs and
labs ops in the process, maybe unifying the edit data sanitization.
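Conceptually, what we want is a whitelist over both columns and rows; a
minimal sketch (the column and flag names here are illustrative, not the
real labsdb filter config):

    # keep only columns approved for public release
    PUBLIC_COLUMNS = ("page_id", "rev_id", "rev_timestamp", "user_is_anon")

    def sanitize(rows):
        """rows: iterable of dicts, one per revision."""
        for row in rows:
            # row-level filtering (the part that needs love and care):
            # drop anything referencing suppressed or deleted material
            if row.get("rev_deleted") or row.get("user_suppressed"):
                continue
            # column-level filtering: keep only whitelisted columns
            yield {col: row[col] for col in PUBLIC_COLUMNS}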
Steps remaining for having simplewiki data in Druid / Pivot by the end of
Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play
with it
Hi all & thanks, Tilman!
We did not analyze Arabic WP, but the tools released alongside the paper
could be used to produce the analysis.
One challenge with working with redirects (and a motivation for the short
paper) is that redirects are actually quite dynamic. They may exist for a
time and then be re-routed as the coverage of a topic changes and moves.
Hence the discussion in the paper of redirect "spells" and all the work to
do something more than just drop them from the analysis.
Also relevant to Reem's original concern and the subsequent discussion
here: page views accrue to redirects even when the *content* that is viewed
exists on the page that is the target of the redirect. Thus even a page
that gets few/zero edits may be viewed via a redirect in a way that the
usual page view data does not account for very precisely.
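To make that concrete, here's a minimal sketch of re-crediting redirect
views to their targets, assuming you've already extracted a redirect ->
target map and per-title view counts (toy data below; note this uses a
single snapshot of the map, and the "spells" work exists precisely because
that map changes over time):

    from collections import Counter

    def fold_redirect_views(views, redirects):
        """views: {title: view count}; redirects: {redirect: target}"""
        folded = Counter()
        for title, n in views.items():
            # credit views on a redirect to the page holding the content
            folded[redirects.get(title, title)] += n
        return folded

    views = {"Barack Obama": 900, "Obama": 100}   # toy numbers
    redirects = {"Obama": "Barack Obama"}
    print(fold_redirect_views(views, redirects))
    # Counter({'Barack Obama': 1000})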
Not surprisingly, I agree with Tilman that it would be very interesting to
see how some of the comparisons/analyses discussed in this thread might
change with more precise accounting of redirects :)
later,
Aaron
On Thu, Sep 15, 2016 at 11:55 PM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:
> To Andrew's point about excluding redirects, see also this paper by
> Benjamin Mako Hill and Aaron Shaw (CCed):
> https://mako.cc/copyrighteous/consider-the-redirect
> (don't know if they have data for Arabic Wikipedia too)
>
> In short, the distribution of edits is very different for redirects and
> articles. In light of this, and to address Reem's original question, it's
> probably worth looking at the actual histogram before relying on the
> average or other statistical moments.
>
> Also interesting in this regard, although the data is not current:
> https://meta.wikimedia.org/wiki/Wikipedia_article_depth
>
> On Thu, Sep 15, 2016 at 7:00 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
> wrote:
>
>> Good point, updated to *exclude redirects* and rerun:
>>
>> total_namespace_0_revisions: 457,574,404
>> total_namespace_0_pages: 5,236,104
>>
>> per namespace 0 non-redirect article:
>>
>> standard deviation of edits: *324.45*
>> *average* edits: *87.54*
>> standard deviation of days between first and last edit: *1360.16*
>> *average* days between first and last edit: *2316.37*
>>
>> So you were right, Andrew: the numbers change, but I think the nature of
>> the data is roughly the same. It's interesting that the average number of
>> days between first and last edit is smaller than two standard deviations.
>> That suggests the curve is also slightly lopsided, with perhaps lots of
>> more recently created articles and few long-lived ones. But that "recent"
>> could be the spike in the 2007-2011 period. It may be interesting to play
>> with these metrics more, and I'll keep this in mind as we build the new
>> infrastructure (making these queries as fast as possible and easy to dig
>> into).
>>
>> On Wed, Sep 14, 2016 at 6:18 PM, Andrew Gray <andrew.gray(a)dunelm.org.uk>
>> wrote:
>>
>>> Hi Dan,
>>>
>>> Thanks for running these!
>>>
>>> I'm struck by the figure of 12.8m pages in ns0 - it looks like this
>>> includes redirects (there are ~7.6m ns0 redirects on enwiki, and ~5.2m
>>> articles). This will probably skew things a lot, as the majority of
>>> those will probably be edited once and never touched again, barring
>>> the target page being moved. Given they're ~60% of the pages, this
>>> will introduce a lot of extra weight for "articles with very few
>>> edits" and "articles that get edited very infrequently".
>>>
>>> It might be worth trying to filter out redirects - I suspect this
>>> would have a noticeable effect on both the distribution and the mean
>>> time between edits.
>>>
>>> Andrew.
>>>
>>> On 14 September 2016 at 22:01, Dan Andreescu <dandreescu(a)wikimedia.org>
>>> wrote:
>>> > Quick follow up 'cause I was curious. I calculated the average and
>>> > standard deviation for edits per namespace 0 article on enwiki. I tried
>>> > to do it on the research db replicas but it took forever, so I did it
>>> > on the hadoop cluster. Including archived pages isn't useful; it barely
>>> > changes the results. Including pages outside namespace 0 increases the
>>> > standard deviation and decreases the average. Here are the results:
>>> >
>>> > 484,170,218 edits on namespace 0
>>> > 12,756,342 pages in namespace 0
>>> >
>>> > standard deviation for edits per page: 213.58
>>> > average edits per page: 38.02
>>> > average days between first and last edit per page: 1215.27
>>> >
>>> > So, considering the standard deviation is much larger than the mean,
>>> > I'm pretty confident answering yes: I think the vast majority of
>>> > articles in namespace 0 on enwiki get very few edits. The dataset we're
>>> > working on releasing as part of wikistats 2.0 will allow these kinds of
>>> > questions to be answered really easily and really quickly. Stay tuned
>>> > over the next few quarters :)
>>> >
>>> > And the queries:
>>> > https://gist.github.com/milimetric/8b5f447e3ef09b6fe4384e0f75cc0b34
>>> >
>>> > If you want to edit those queries to find something else out, I'm
>>> > happy to run them one or two more times, but then I really have to get
>>> > back to my real job :)
>>> >
>>> > On Wed, Sep 7, 2016 at 12:42 PM, Andrew Gray <andrew.gray(a)dunelm.org.uk>
>>> > wrote:
>>> >>
>>> >> Hi Reem,
>>> >>
>>> >> Here's some rough estimates.
>>> >>
>>> >> English - https://stats.wikimedia.org/EN/TablesWikipediaEN.htm
>>> >>
>>> >> English has ~5.2 million articles, with an average of ~92 edits per
>>> >> article, not counting deleted edits (or deleted articles). Note that
>>> >> 80% of those articles are more than three years old, so they've had
>>> >> plenty of time to build up the 92 edits.
>>> >>
>>> >> [The page does not explicitly say that only article edits are counted
>>> >> in the tables, but this is easy to confirm -
>>> >> https://en.wikipedia.org/wiki/Wikipedia:Statistics has 847m edits]
>>> >>
>>> >> Arabic - https://stats.wikimedia.org/EN/TablesWikipediaAR.htm
>>> >>
>>> >> Arabic has ~437k articles, ~31 edits/article - but only half of these
>>> >> are more than three years old, so they're on average a lot younger
>>> >> than the English ones.
>>> >>
>>> >> As of July there are 3.3m edits/month in English - this is equal to an
>>> >> average of 0.63 edits/article/month - and 226k edits/month in Arabic,
>>> >> equal to 0.52 edits/article/month. July was a slow month for Arabic,
>>> >> and March had more than twice as many edits, 487k, across 415k articles.
>>> >>
>>> >> These are plain averages. The distribution is going to be very skewed,
>>> >> so high-edit articles get most of the attention, and the other articles
>>> >> easily go months without attention. If we assume an 80:20 distribution
>>> >> - which is a wild guess but sounds plausible - then the "long tail" of
>>> >> 80% of articles would get 20% of the edits. In this case, a plausible
>>> >> average would be:
>>> >>
>>> >> * English long tail, 4.16m articles and 660k edits/month = average of
>>> >> six months between each edit
>>> >> * Arabic (July) long tail, 350k articles and 45k edits/month = average
>>> >> of seven or eight months between each edit
>>> >> * Arabic (March) long tail, 332k articles and 97k edits/month = average
>>> >> of three and a half months between each edit
>>> >>
>>> >> This is a broad range, but it feels more or less right for all those
>>> >> unloved pages...
>>> >>
>>> >> Andrew.
>>> >>
>>> >>
>>> >> On 7 September 2016 at 14:52, Reem Al-Kashif <reemalkashif(a)gmail.com>
>>> >> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I always hear people saying that most of the articles usually
>>> >> > receive little to no edits (and that is used to encourage
>>> >> > participants to make sure their articles are good enough). I would
>>> >> > like to know if there are statistics that support this for the
>>> >> > English and Arabic Wikipedia.
>>> >> >
>>> >> > Best,
>>> >> > Reem
>>> >> >
>>> >> > --
>>> >> > Kind regards,
>>> >> > Reem Al-Kashif
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> - Andrew Gray
>>> >> andrew.gray(a)dunelm.org.uk
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> - Andrew Gray
>>> andrew.gray(a)dunelm.org.uk
>>>
>>
>>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
Hi,
I always hear people saying that most of the articles usually receive
little to no edits (and that is used to encourage participants to make sure
their articles are good enough). I would like to know if there are
statistics that support this for the English and Arabic Wikipedia.
Best,
Reem
--
*Kind regards,*
*Reem Al-Kashif*
The Wikimedia Foundation's Discovery and Research teams recently hosted an
introductory workshop on the SPARQL query language and the Wikidata Query
Service.
We made the video stream <https://www.youtube.com/watch?v=NaMdh4fXy18> and
materials (demo queries, slide decks)
<https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/2016_SPARQL_Wor…>
from this workshop publicly available. (A minimal example of querying the
service from Python follows after the speaker list.)
Guest speakers:
- Ruben Verborgh, *Ghent University* and *Linked Data Fragments*
- Benjamin Good, *Scripps Research Institute* and *Gene Wiki*
- Tim Putman, *Scripps Research Institute* and *Gene Wiki*
- Lucas, *@WikidataFacts*
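If you'd like to try the service programmatically, here's a minimal
example of hitting the public endpoint from Python (the query is the
classic "house cats" demo, not one of the workshop queries):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .   # items that are instances of house cat
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10
    """

    r = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"})
    for binding in r.json()["results"]["bindings"]:
        print(binding["itemLabel"]["value"])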
Dario and Stas
*Dario Taraborelli*
Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter <http://twitter.com/readermeter>
Jonathan,
Do send your questions to analytics@ to get a better/faster response.
> I recently discovered the wonderful Wikimedia Analytics/Pageview API and
> was wondering whether there are any plans to extend it to include
> pageviews by country? It would be a great way to find out what people are
> interested in in a specific country.
No, there are no plans to this effect, as the data needs to be sanitized
before we can do this. We release pageview country reports in another
format here:
https://stats.wikimedia.org/wikimedia/squids/SquidReportPageViewsPerCountry…
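In the meantime, the existing per-article endpoint is easy to script
against; for example (the article and date range here are arbitrary):

    import requests

    # daily views for one article from the public Pageview API
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
           "per-article/en.wikipedia/all-access/all-agents/"
           "Albert_Einstein/daily/20160901/20160907")
    for item in requests.get(url).json()["items"]:
        print(item["timestamp"], item["views"])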
Thanks,
Nuria
On Tue, Sep 13, 2016 at 4:04 AM, Jonathan Van Parys <jvanparys(a)gmail.com>
wrote:
> Dear Nuria,
>
> I recently discovered the wonderful Wikimedia Analytics/Pageview API and
> was wondering whether there are any plans to extend it to include pageviews
> by country? It would be a great way to find out what people are interested
> in in a specific country.
>
> Many thanks in advance,
>
> Jonathan
FYI
---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Sep 12, 2016 at 9:07 AM
Subject: [Research-Internal] Fwd: Dumps Rewrite getting underway (help needed!)
To: research-internal(a)lists.wikimedia.org
---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Sep 5, 2016 at 2:35 PM
Subject: Dumps Rewrite getting underway (help needed!)
To: Wikipedia Xmldatadumps-l <Xmldatadumps-l(a)lists.wikimedia.org>
Hello folks,
I know a number of you have subscribed to the Dumps Rewrite project (
https://phabricator.wikimedia.org/tag/dumps-rewrite/) but I bet none of you
actually watch it or any of its tasks. So here's a heads up.
I'm getting started on the job scheduler/workflow manager piece; this
would accept lists of dump tasks (in the current setup, e.g. "dump stubs
for el wikipedia"), call a callback to turn each of them into small jobs
that can be completed in less than an hour, submit and monitor these jobs
with retries, dependencies, etc., call a callback to recombine the outputs
of the jobs, and notify some caller on success of the whole operation.
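To make the shape of that flow concrete, a rough sketch (all names here
are illustrative, not actual code from the rewrite):

    def run_dump_task(task, split_cb, recombine_cb, submit, notify,
                      max_retries=3):
        # e.g. "dump stubs for el wikipedia" -> many small jobs
        jobs = split_cb(task)
        outputs = []
        for job in jobs:  # the real thing adds dependencies/parallelism
            for _attempt in range(max_retries):
                result = submit(job)  # each job runs in under an hour
                if result.ok:
                    outputs.append(result.output)
                    break
            else:
                raise RuntimeError("job failed after retries: %r" % (job,))
        final = recombine_cb(task, outputs)  # glue the pieces together
        notify(task, final)  # report success of the whole operation
        return final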
First up is evaluating existing packages and choosing one to use as a
foundation. Please contribute! See the following tasks:
https://phabricator.wikimedia.org/T143205: Draft usage scenarios for
job/workflow manager
https://phabricator.wikimedia.org/T143206: List requirements needed for
task/job/workflow manager
https://phabricator.wikimedia.org/T143207: Evaluate software packages for
job/task/workflow management
Also, can someone please forward this on to analytics-l and research-l?
I'm not on those lists but they will no doubt have a lot of useful
expertise here.
Thanks!
Ariel
_______________________________________________
Research-Internal mailing list
Research-Internal(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/research-internal
Hello Analytics,
A while ago I asked about the existence of an A/B testing framework. I
learned (thanks, Nuria!) that https://phabricator.wikimedia.org/T135762 is
in preparation. However, I assume that until this is in place, we need to
use custom solutions which utilize EventLogging.
Event logging itself is pretty clear to me, but not the splitting/cookie
logic.
Could anybody link me to some examples of such a self-implemented way to
show users their assigned content and, if they are not assigned to a group
yet, to assign them to A/B… bins?
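For concreteness, here is the kind of thing I imagine: deterministic,
hash-based bucketing keyed on a persistent token (a minimal sketch; the
names are made up and this is not any existing WMF API):

    import hashlib

    BUCKETS = ["control", "variant_a", "variant_b"]

    def assign_bucket(token, experiment, buckets=BUCKETS):
        """token: a persistent per-user value, e.g. a cookie set on
        first visit. Hashing makes the assignment deterministic and
        stable per user, so no server-side state is needed."""
        digest = hashlib.sha256(
            ("%s:%s" % (experiment, token)).encode("utf-8")).hexdigest()
        return buckets[int(digest, 16) % len(buckets)]

    # on each page view: read the token cookie (setting it if absent),
    # render the variant returned by assign_bucket(token, "my-experiment"),
    # and log the bucket name with every EventLogging event

Is that roughly the right pattern, or is there a better way?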
Jan
--
Jan Dittrich
UX Design/ User Research
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de
Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
I'm noticing occasional curious spikes in the usage stats (via
tools.wmflabs.org/pageviews/) for a Wikibook I wrote and maintain. I would
guess that many of the visitors are coming via a search engine.
Some blogs provide authors with a sanitized subset of HTTP referer [sic]
header information, specifically the search engine search terms. I'm
looking for that or something similar for Wikibooks.
How may I go about getting a sanitized list of the search terms used to
enter that Wikibook or its chapters?
Regards,
Lars