Erik,
Thanks a lot for the appreciation.
As Sajjad mentioned, we have already obtained an edit-per-location
dataset from Evan (Rosen) that has the following column structure:
*language,country,city,start,end,fraction,ts*
*start* and *end* denote the beginning and ending date for counting the
number of edits, and *ts* is time stamp.
The *fraction*, however, gives a national ratio of edit activity; that
is, it gives 'total edits from that city for that language
Wikipedia project' divided by 'total edits from that country for that
language Wikipedia project'. Hence, it cannot be used to understand
global edit contributions to a Wikipedia project (for a time period).
It seems that the original data (from which this dataset was extracted)
should also have the global fractions: total edits from a city divided
by total edits from the whole world, for a project, for a time period.
Would you know if the global fractions can also be derived from the XML
dumps? Or, even better, is the relevant raw data available in CSV form
somewhere else?
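To make the gap concrete, here is a minimal sketch of how a global fraction could be derived if per-country edit totals were also available. Those totals are not part of the dataset described above; `country_edits`, `world_edits`, and the sample numbers are hypothetical.

```python
def global_fraction(national_fraction, country_edits, world_edits):
    """Convert a city's national share of edits into its global share.

    national_fraction = city_edits / country_edits  (the dataset's *fraction*)
    global fraction   = city_edits / world_edits
                      = national_fraction * country_edits / world_edits
    """
    if world_edits <= 0:
        raise ValueError("world_edits must be positive")
    return national_fraction * country_edits / world_edits


# Hypothetical edit totals for one project and one time period.
country_edits = {"IN": 800, "US": 200}
world_edits = sum(country_edits.values())

# Two cities with the same national fraction but different global shares.
city_a = global_fraction(0.5, country_edits["IN"], world_edits)
city_b = global_fraction(0.5, country_edits["US"], world_edits)
print(city_a, city_b)
```

Identical national fractions map to different global fractions, which is why the *fraction* column alone cannot be used to compare cities across countries.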
Bests,
sumandro
-------------
sumandro
ajantriks.net
On Wednesday 15 May 2013 12:32 AM, analytics-request(a)lists.wikimedia.org
wrote:
> Date: Tue, 14 May 2013 19:40:00 +0200
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> To: "'A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics.'"
> <analytics(a)lists.wikimedia.org>
> Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
> Message-ID: <016f01ce50ca$0fe736b0$2fb5a410$(a)wikimedia.org>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Awesome work! I like the flexibility of the charts, easy to switch metrics
> and presentation mode.
>
>
>
> 1. WMF has never captured ip->geo data on city level, but afaik this is
> going to change with Kraken.
>
>
>
> 2. Total edits per article per year can be derived from the xml dumps. I may
> have some csv data that come in handy.
>
> For edit wars you need to track reverts on a per-article basis, right? That
> can also be derived from dumps.
>
> For a long history you need full archive dumps and need to calculate a
> checksum per revision text. (Stub dumps have checksums, but only for the
> last year or two.)
>
>
>
> Erik Zachte
>
>
>
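The checksum approach Erik describes can be sketched as follows. This is only an illustration under stated assumptions: the revision texts of one article are already extracted from a full-history dump in chronological order, SHA-1 is used as the checksum, and a "revert" means a revision whose text is identical to an earlier, non-adjacent revision. The function name `find_identity_reverts` is hypothetical.

```python
import hashlib


def find_identity_reverts(revision_texts):
    """Return (reverting_index, restored_index) pairs for one article.

    A revision counts as a revert when its checksum equals that of an
    earlier revision other than the one immediately before it.
    """
    last_seen = {}    # checksum -> index of the latest revision with that text
    previous = None   # checksum of the immediately preceding revision
    reverts = []
    for index, text in enumerate(revision_texts):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in last_seen and digest != previous:
            reverts.append((index, last_seen[digest]))
        last_seen[digest] = index
        previous = digest
    return reverts


# Hypothetical revision history: text "a" is restored twice.
history = ["a", "b", "a", "c", "a"]
reverts = find_identity_reverts(history)
print(reverts)
```

Counting such pairs per article gives a rough signal for edit wars; partial reverts are not detected by this method.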
Hi everyone!
Has anyone tried to observe how different Wikipedias use
templates: how often, what's the average depth of template calls, etc.?
-----
Yury Katkov, WikiVote
[Reposted from private discussion after Dario's request]
My problem is that of exploring the graph structure of Wikipedia
1) easily;
2) reproducibly;
3) in a way that does not depend on parsing artifacts.
Presently, when people want to do this they either do their own parsing of the dumps, use the SQL data, or download a dataset like
http://law.di.unimi.it/webdata/enwiki-2013/
which has everything "cooked up".
My frustration in the last few days came from trying to add the category links. I didn't realize (well, it's not well documented) that bliki extracts all links and renders them in HTML *except* for the category links, which are instead accessible programmatically. Once I got there, I was able to make some progress.
Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and category links) is really a mine of information, and it is a pity that a lot of huffing and puffing is necessary to do something as simple as a reverse visit of the category links from "People" to get, actually, all people pages (this is a bit more complicated--there are many false positives, but after a couple of fixes it worked quite well).
Moreover, one continuously has this feeling of walking on eggshells: a small change in bliki or a small change in the XML format, and everything might stop working in such a subtle manner that you realize it only after a long time.
I was wondering if Wikimedia would be interested in distributing the Wikipedia graph in compressed form. That would be the "official" Wikipedia graph--the benefits, in particular for people working on leveraging semantic information from Wikipedia, would be really significant.
I would (obviously) propose to use our Java framework, WebGraph, which is actually quite standard in distributing large (well, actually much larger) graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 http://lemurproject.org/clueweb12/ and the recent Common Web Crawl http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, even a pair of integers per line. The advantage of a binary compressed form is reduced network utilization, instantaneous availability of the information, etc.
Probably it would be useful to actually distribute several graphs with the same dataset--e.g., the category links, the content links, etc. It is immediate, using WebGraph, to build a union (i.e., a superposition) of any set of such graphs and use it transparently as a single graph.
In my mind the distributed graph should have a contiguous ID space, say, induced by the lexicographical order of the titles (possibly placing template pages at the start or at the end of the ID space). We should provide graphs, and a bidirectional node<->title map. All such information would use about 300M of space for the current English Wikipedia. People could then associate pages to nodes using the title as a key.
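As a rough illustration of the layout proposed above, here is a plain-Python sketch. It does not use WebGraph or its API; the toy titles and links are hypothetical, and the helper names are made up for this example. It shows a contiguous ID space induced by title order, a bidirectional node<->title map, a "pair of integers per line" edge list, a union of two graphs, and a reverse visit of the category links.

```python
from collections import deque


def build_id_maps(titles):
    """Assign contiguous IDs following the lexicographical order of titles."""
    id_to_title = sorted(set(titles))
    title_to_id = {title: node for node, title in enumerate(id_to_title)}
    return id_to_title, title_to_id


def to_edges(links, title_to_id):
    """Translate (source title, target title) links into integer pairs."""
    return {(title_to_id[s], title_to_id[t]) for s, t in links}


def reverse_visit(edges, start):
    """Breadth-first visit along reversed arcs: all nodes that reach start."""
    predecessors = {}
    for source, target in edges:
        predecessors.setdefault(target, []).append(source)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for pred in predecessors.get(node, []):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen


# Hypothetical toy data; real titles and links would come from the dumps.
titles = ["Category:People", "Category:Scientists", "Ada Lovelace", "Apple"]
category_links = [("Category:Scientists", "Category:People"),
                  ("Ada Lovelace", "Category:Scientists")]
content_links = [("Ada Lovelace", "Apple")]

id_to_title, title_to_id = build_id_maps(titles)
category_graph = to_edges(category_links, title_to_id)
content_graph = to_edges(content_links, title_to_id)
union_graph = category_graph | content_graph  # superposition of both graphs

# One "source<TAB>target" pair of integers per line.
edge_lines = [f"{s}\t{t}" for s, t in sorted(union_graph)]

people = reverse_visit(category_graph, title_to_id["Category:People"])
people_titles = sorted(id_to_title[node] for node in people)
print(edge_lines, people_titles)
```

The reverse visit returns subcategories as well as pages, so a real pipeline would still need to filter by namespace and handle the false positives mentioned above.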
But this last part is just rambling. :)
Let me know if you people are interested. We can of course take care of the process of cooking up the information once it is out of the SQL database.
Ciao,
seba
Greetings!
I'm a Professor at Macalester College in Minnesota, and I have been
collaborating with Brent Hecht and many students to develop a Java
framework for extracting multilingual knowledge from Wikipedia [1]. The
framework is pre-alpha now, but we hope to offer a stable release in the
next month.
Given a phrase (e.g. "apple"), our library must identify articles
associated with a phrase. This is a probabilistic question. How likely is
the phrase "apple" to refer to the article about the fruit vs the company?
This simple task (often called Wikification or disambiguation) forms the
basis of many NLP algorithms.
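A minimal sketch of the probabilistic question above, assuming the input is a list of (phrase, target article) observations such as link anchor texts. The function name and the sample data are hypothetical, and this is not the framework's actual implementation.

```python
from collections import Counter, defaultdict


def phrase_to_article_probs(observations):
    """Estimate P(article | phrase) from (phrase, article) observations."""
    counts = defaultdict(Counter)
    for phrase, article in observations:
        counts[phrase.lower()][article] += 1  # case-insensitive phrases
    probs = {}
    for phrase, article_counts in counts.items():
        total = sum(article_counts.values())
        probs[phrase] = {a: c / total for a, c in article_counts.items()}
    return probs


# Hypothetical observations: one link to the fruit, three to the company.
observations = [
    ("apple", "Apple"),
    ("Apple", "Apple Inc."),
    ("apple", "Apple Inc."),
    ("apple", "Apple Inc."),
]
probs = phrase_to_article_probs(observations)
print(probs["apple"])
```

With these toy counts, "apple" resolves to the company with probability 0.75 and to the fruit with probability 0.25.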
Google and Stanford have released an awesome dataset to support this task
[2]. It contains the *text* of all internet hyperlinks to Wikipedia
articles. This dataset makes the problem much easier, but it has two
serious deficiencies. First, it only contains links to articles in English
Wikipedia. Second, it was generated once by Google, and it is unlikely
Google will update it.
The WMF could create a similar dataset by publishing the most common
inbound search queries for all WP pages across all language editions. This
dataset would enable individuals, researchers and small companies (not just
Google and Microsoft) to harness Wikipedia data for their applications.
Does this seem remotely possible? I've thought a little about engineering
and privacy issues related to the dataset. Neither is trivial, but I think
both are solvable, and I'd be happy to volunteer my engineering effort.
If you think the idea has legs, how do we develop a more formal proposal
about the dataset?
Thanks for your feedback!
-Shilad
[1] https://github.com/shilad/wikAPIdia
[2]
http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.…
--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen(a)macalester.edu
651-696-6273
+analytics
On Mon, Dec 2, 2013 at 11:27 AM, Jon Robson <jrobson(a)wikimedia.org> wrote:
> Several theories:
> 1) I guess this now reflects reversions/deletions of edits?
> 2) We purge data collected before a certain date
>
> I suspect the former?
>
> On Wed, Nov 27, 2013 at 11:26 AM, Kenan Wang <kwang(a)wikimedia.org> wrote:
> > Hey everyone. The monthly successful edits graph no longer reflects a drop
> > in edits between August and September...
> >
> > Is it possible that we had data issues before that were causing us to see an
> > issue that wasn't there?
> >
> > OR
> >
> > Is there some sort of data issue now that is causing us to be missing data
> > that we used to be able to see?
> >
> >
> > On Thu, Oct 17, 2013 at 3:27 PM, Arthur Richards <arichards(a)wikimedia.org>
> > wrote:
> >>
> >> I guess it's tough to draw a baseline considering we only have a couple
> >> months' worth of data - but perhaps a good metric for our goal is returning
> >> to or exceeding our peak?
> >>
> >>
> >> On Thu, Oct 17, 2013 at 1:52 PM, Kenan Wang <kwang(a)wikimedia.org> wrote:
> >>>
> >>> Thanks to Juliusz we now have a graph of mobile editors that became
> >>> active editors. We can see a drop in the number of editors that became
> >>> active:
> >>>
> >>> http://mobile-reportcard.wmflabs.org/graphs/edits-monthly-5plus-editors
> >>>
> >>>
> >>> On Tue, Oct 15, 2013 at 2:31 PM, Kenan Wang <kwang(a)wikimedia.org> wrote:
> >>>>
> >>>> Well, looks like IndiaSummer95 got blocked today... but I've reached out
> >>>> to Nick1372 and KurtWags3 via their user talk pages.
> >>>>
> >>>>
> >>>> On Wed, Oct 9, 2013 at 9:43 PM, Jon Robson <jrobson(a)wikimedia.org>
> >>>> wrote:
> >>>>>
> >>>>> I hope this is not too stalker-y but I think it is interesting.
> >>>>>
> >>>>> I looked briefly at the top 5 editors for both August [A] and
> >>>>> September [B]. The most prolific editor, "Indiasummer95", registered
> >>>>> for 520 days [1], made 2908 edits in August but only 694 in September.
> >>>>> The second most prolific editor, "La Avatar Korra", registered for 88
> >>>>> days [2], made 954 edits in August but only 379 in
> >>>>> September. Number 3, Nick1372, a member for 306 days [3], had 762 edits
> >>>>> in August but only 445 in September; #4 was 이민혁, registered for 70
> >>>>> days [4] (558 rising to 623); and #5 was KurtWags3, registered for 84
> >>>>> days [5] (520 dropping to 102).
> >>>>>
> >>>>> Out of these, Indiasummer95 was the most interesting, so I looked at the
> >>>>> edits per page for August [C]:
> >>>>> Tell_Mama_UK 66
> >>>>> Abel_Xavier 60
> >>>>> Melilla 56
> >>>>> Vitória_S.C. 44
> >>>>> History_of_Roman_Catholicism_in_Portugal 40
> >>>>> Knattspyrnufélag_Reykjavíkur 40
> >>>>> Robot_Wars_(TV_series) 38
> >>>>> F.C._Paços_de_Ferreira 32
> >>>>> Symbols_of_Portugal 32
> >>>>> Politics_of_Toledo_(Spain) 32
> >>>>>
> >>>>> The same in September [D]:
> >>>>> Alfonso_I_of_Asturias 21
> >>>>> Indiasummer95 13
> >>>>> Vitālijs_Astafjevs 12
> >>>>> A.C._Perugia_Calcio 10
> >>>>> Morangos_com_Açúcar 9
> >>>>> Azad_Ali 9
> >>>>> Million_Muslim_March 8
> >>>>> Islam_in_the_Czech_Republic 7
> >>>>> Szilárd_Németh 7
> >>>>> Matteo_Ferrari 6
> >>>>>
> >>>>> Looking more closely at the dates of his/her edits to Tell_Mama_UK (page
> >>>>> id 39683645) [E]:
> >>>>> 2013-08-11 4
> >>>>> 2013-08-15 56
> >>>>> 2013-08-16 6
> >>>>>
> >>>>> However if I look at the history page I don't see 56 edits from
> >>>>> him/her on that page (I count 36 total on the entire page):
> >>>>>
> >>>>>
> >>>>> https://en.m.wikipedia.org/w/index.php?title=Tell_Mama_UK&action=history
> >>>>>
> >>>>> So.. that makes me wonder...
> >>>>> 1) Was something broken around 15th August in EventLogging - were we
> >>>>> reporting successes that weren't, so that the data is skewed (e.g.
> >>>>> AbuseFilter related)? Is there a logical explanation for this or is
> >>>>> something indeed weird - can anyone suggest a reason?
> >>>>> 2) Is this people getting used to the editor/wikitext - making lots of
> >>>>> mistakes early on and then getting better at using it?
> >>>>> 3) Is it indeed seasonal / people getting bored after initial
> >>>>> excitement of editing?
> >>>>>
> >>>>> [1] https://en.m.wikipedia.org/wiki/Special:UserProfile/Indiasummer95?mobileact…
> >>>>> [2] https://es.m.wikipedia.org/wiki/Special:UserProfile/La%20Avatar%20Korra?mob…
> >>>>> [3] https://en.m.wikipedia.org/wiki/Special:UserProfile/Nick1372?mobileaction=b…
> >>>>> [4] https://ko.m.wikipedia.org/wiki/특수기능:UserProfile/이민혁?mobileaction=beta
> >>>>> [5] https://en.m.wikipedia.org/wiki/Special:UserProfile/KurtWags3?mobileaction=…
> >>>>>
> >>>>> [A] SELECT event_username, count(*) from MobileWebEditing_5644223
> >>>>> where event_action = 'success' and timestamp LIKE '201308%' group by
> >>>>> event_username
> >>>>> [B] SELECT event_username, count(*) from MobileWebEditing_5644223
> >>>>> where event_action = 'success' and timestamp LIKE '201309%' group by
> >>>>> event_username
> >>>>> [C] SELECT enwiki.page.page_title, count(*) from
> >>>>> MobileWebEditing_5644223 INNER JOIN enwiki.page on enwiki.page.page_id
> >>>>> = event_pageId where wiki = 'enwiki' and event_action = 'success' and
> >>>>> event_userName = 'Indiasummer95' and timestamp LIKE '201308%' group by
> >>>>> event_pageId
> >>>>> [D] SELECT enwiki.page.page_title, count(*) from
> >>>>> MobileWebEditing_5644223 INNER JOIN enwiki.page on enwiki.page.page_id
> >>>>> = event_pageId where wiki = 'enwiki' and event_action = 'success' and
> >>>>> event_userName = 'Indiasummer95' and timestamp LIKE '201309%' group by
> >>>>> event_pageId
> >>>>> [E] SELECT DATE(timestamp), count(*) from MobileWebEditing_5644223
> >>>>> where wiki = 'enwiki' and event_action = 'success' and event_userName
> >>>>> = 'Indiasummer95' and timestamp LIKE '201308%' and event_pageId =
> >>>>> 39683645 group by DATE(timestamp)
> >>>>>
> >>>>>
> >>>>> On Wed, Oct 9, 2013 at 8:09 PM, Kenan Wang <kwang(a)wikimedia.org> wrote:
> >>>>> > I'll try to block out some time to look at creating those two graphs
> >>>>> > tomorrow. Let me know if you want to help.
> >>>>> >
> >>>>> >
> >>>>> > On Wed, Oct 9, 2013 at 8:09 PM, Kenan Wang <kwang(a)wikimedia.org>
> >>>>> > wrote:
> >>>>> >>
> >>>>> >> Adding Dario to this thread.
> >>>>> >>
> >>>>> >> So the main question is why is the number of edits going down from
> >>>>> >> August to September while the number of unique editors is going up?
> >>>>> >>
> >>>>> >> Few theories:
> >>>>> >>
> >>>>> >> 1) Problem with data - seems unlikely per Jon's analysis. Also, both
> >>>>> >> numbers come from the same data source.
> >>>>> >>
> >>>>> >> 2) Change in the makeup of editors - not conclusive from the data, but
> >>>>> >> it seems that the number of edits from new editors and experienced
> >>>>> >> editors are following the same general pattern.
> >>>>> >>
> >>>>> >> 3) Change in edits per session - no analysis done towards this yet.
> >>>>> >>
> >>>>> >> I'd like to take a look at:
> >>>>> >>
> >>>>> >> 1) a daily graph over the same two month period on unique daily
> >>>>> >> editors
> >>>>> >>
> >>>>> >> 2) a daily graph of edits per session over the same two month
> >>>>> >> period
> >>>>> >>
> >>>>> >> Few things to keep in mind:
> >>>>> >>
> >>>>> >> -It seems that there is not a seasonality difference in desktop
> >>>>> >> edits, but as pointed out by Steven Walling it's not necessarily true
> >>>>> >> that desktop edits and mobile edits will follow the same seasonality
> >>>>> >> patterns.
> >>>>> >>
> >>>>> >> -Seasonality also wouldn't account for the skew between unique
> >>>>> >> editors and raw edits.
> >>>>> >>
> >>>>> >> -This effect seems fairly centered on enwiki, however mobile captcha
> >>>>> >> support was introduced and may affect numbers for other projects in
> >>>>> >> some offsetting fashion
> >>>>> >>
> >>>>> >> -The spike in number of edits occurs roughly between Aug 12 - 30,
> >>>>> >> particularly 23-30. In that time section level editing was
> >>>>> >> introduced on August 27.
> >>>>> >>
> >>>>> >>
> >>>>> >> On Wed, Oct 9, 2013 at 6:08 PM, Jon Robson <jrobson(a)wikimedia.org>
> >>>>> >> wrote:
> >>>>> >>>
> >>>>> >>> If you look at my graphs this theory doesn't seem to hold....
> >>>>> >>>
> >>>>> >>> On 9 Oct 2013 12:24, "Tomasz Finc" <tfinc(a)wikimedia.org> wrote:
> >>>>> >>>>
> >>>>> >>>> On Tue, Oct 8, 2013 at 9:32 AM, Maryana Pinchuk
> >>>>> >>>> <mpinchuk(a)wikimedia.org> wrote:
> >>>>> >>>> > One theory: most of our enwiki power users tried mobile editing
> >>>>> >>>> > early on and made their customarily high number of edits per
> >>>>> >>>> > session in August, but they weren't retained on mobile and didn't
> >>>>> >>>> > stick around through September. Meanwhile, new users continue to
> >>>>> >>>> > flood in, but they only make an edit or two and leave. If this is
> >>>>> >>>> > the case, it makes a strong argument for working on things like
> >>>>> >>>> > mobile GettingStarted/KeepGoing to bump up early engagement for
> >>>>> >>>> > newbies, and article histories/contrib views and other power user
> >>>>> >>>> > features for oldbies :)
> >>>>> >>>>
> >>>>> >>>> This could be an interesting theory. Kenan, what do you think about
> >>>>> >>>> reaching out to a handful of those editors that dropped and asking
> >>>>> >>>> them why?
> >>>>> >>>>
> >>>>> >>>> --tomasz
> >>>>> >>
> >>>>> >>
> >>>>> >>
> >>>>> >>
> >>>>> >> --
> >>>>> >>
> >>>>> >> Kenan Wang
> >>>>> >> Product Manager, Mobile
> >>>>> >> Wikimedia Foundation
> >> --
> >> Arthur Richards
> >> Software Engineer, Mobile
> >> [[User:Awjrichards]]
> >> IRC: awjr
> >> +1-415-839-6885 x6687
--
Arthur Richards
Software Engineer, Mobile
[[User:Awjrichards]]
IRC: awjr
+1-415-839-6885 x6687
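The comparison at the heart of the thread above (total successful edits versus unique editors per month) can be sketched with a small in-memory example. Assumptions: SQLite stands in for the real EventLogging database, only three columns of MobileWebEditing_5644223 are modelled, and the user names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MobileWebEditing_5644223 ("
    " event_username TEXT, event_action TEXT, timestamp TEXT)"
)

# Hypothetical events; timestamps use the YYYYMMDDHHMMSS form seen in the queries.
rows = [
    ("UserA", "success", "20130811120000"),
    ("UserA", "success", "20130815093000"),
    ("UserA", "error", "20130815093100"),
    ("UserB", "success", "20130820180000"),
    ("UserA", "success", "20130902101500"),
]
conn.executemany("INSERT INTO MobileWebEditing_5644223 VALUES (?, ?, ?)", rows)


def monthly_summary(month_prefix):
    """Return (successful edits, unique editors) for a YYYYMM prefix."""
    return conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT event_username)"
        " FROM MobileWebEditing_5644223"
        " WHERE event_action = 'success' AND timestamp LIKE ?",
        (month_prefix + "%",),
    ).fetchone()


august = monthly_summary("201308")
september = monthly_summary("201309")
print(august, september)
```

Dividing the first number by the second gives edits per editor, which separates a drop in activity per person from a change in the number of people editing.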
I am wondering whether Wikidata is included in this table?
http://reportcard.wmflabs.org/graphs/active_editors
It looks like it is not, since it has >3.5k active editors but is not in
the list on the left, but I would like to be sure.
If it is not, why not? The project has been active for over a year.
All – I added a bunch of events to the Team Analytics calendar: please add any other events that are relevant to the team.
We need to start talking travel plans/schedule for 2014 asap. This is particularly sensitive for me: I just got my H1 visa extension and I’ll have to schedule an appointment at a US embassy to get a visa stamp in order to re-enter the US upon my first international trip.
Dario
Hi,
just a quick heads up that the replication lag on analytics' s5 slave
(s5-analytics-slave.eqiad.wmnet, db73.pmtpa.wmnet) slowly rose to
>2 hours since 2013-12-10 ~12:00 UTC.
I filed RT ticket 6487:
https://rt.wikimedia.org/Ticket/Display.html?id=6487
Best regards,
Christian
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------