Hi,
it seems ops received a request to add the negotiated cipher suite to
the cache logs for https requests.
Would it hurt any of our tools or would we expect breakage if for
example
$ssl_cipher
from
http://nginx.org/en/docs/http/ngx_http_ssl_module.html#variables
got appended as field #17 to the format currently described at
https://wikitech.wikimedia.org/wiki/Cache_log_format
?
Do we know of other reasons to veto such a change?
(If I do not hear of any problems by 2014-01-22, I'll let ops know that
appending $ssl_cipher is fine for us.)
Best regards,
Christian
P.S.:
* Webstatscollector ignores additional fields, hence should be safe.
* Wikipedia zero ignores additional fields, hence should be safe.
* The mobile jobs that we moved out of Hadoop some time ago (and which
I guess are currently unused) ignore additional fields, hence should
be safe.
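To illustrate why the tools above are unaffected, here is a minimal sketch (with a hypothetical field count; the real format is the one documented on the wikitech page above) of whitespace-split parsing that only reads the columns it knows about, so an appended $ssl_cipher column cannot break it:

```python
def parse_cache_log_line(line, known_fields=16):
    """Split a cache log line on whitespace and keep only the first
    `known_fields` columns; any appended columns (e.g. a 17th
    $ssl_cipher field) are silently ignored."""
    parts = line.split()
    return parts[:known_fields]

# A line with 16 fields and the same line with a cipher suite appended
# parse to the same result:
old = " ".join("f%d" % i for i in range(16))
new = old + " ECDHE-RSA-AES128-GCM-SHA256"
assert parse_cache_log_line(old) == parse_cache_log_line(new)
```

Tools that index fields from the end of the line, rather than the start, would of course still break.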
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Christian Aistleitner             Companies' registry: 360296y in Linz
Gruendbergstrasze 65a             Email: christian(a)quelltextlich.at
4040 Linz, Austria                Phone: +43 732 / 26 95 63
                                  Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
OpenPGP key transition from 0xEF78CCDE to 0x13C1072F:
http://quelltextlich.at/openpgp-transition-0xEF78CCDE-to-0x13C1072F.txt
So I had been replying via email to the bugzilla-daemon, for *months*, and
thinking that it would just update bugzilla. It does *not*. So
1. sorry for my apparent silence on the 8 bugs affected, 56030, 42259,
58416, 58633, 59846, 58450, 60095, 58208
2. keep this in mind in case you, too, are tempted to talk to soul-less
uncompassionate machines
http://istrategylabs.com/2014/01/3-million-teens-leave-facebook-in-3-years-…
True, people have different motivations (!) for using Wikipedia and FB.
But if other sites are strongly affected by changing usage patterns
among different age groups, maybe it's worth investigating whether
something similar is happening to Wikimedia sites, too? Do we have data
on page views/user activity by age group?
Cheers,
Andrew
According to the stats, the last 12 months have seen a decline in overall
page views by about 9-10%.
As far as I can tell these numbers do not include access to Wikipedia and
sister projects through the API. And if I understand correctly, this means
that e.g. access through the Wikipedia app is ignored.
Considering that mobile access through the browser is about 15-20% of the
overall access (if I read the graphs correctly), is it possible that the
current page view numbers are underreported?
If a reasonable portion of mobile users use the app instead of the browser,
then these numbers should somehow be included in the page view statistics,
or wrong conclusions might be drawn.
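As a back-of-the-envelope sketch with made-up numbers (the real app-versus-browser split is exactly what we don't know): if browser-based mobile traffic is ~17.5% of counted page views (the midpoint of the 15-20% estimate above) and, hypothetically, one in four mobile readers uses the app instead of the browser, the uncounted app traffic would already be noticeable:

```python
counted_total = 100.0            # counted page views (arbitrary units)
mobile_browser_share = 0.175     # midpoint of the 15-20% estimate above
app_to_browser_ratio = 1 / 3.0   # hypothetical: 1 app reader per 3 browser readers

mobile_browser = counted_total * mobile_browser_share
uncounted_app = mobile_browser * app_to_browser_ratio
undercount_pct = 100 * uncounted_app / (counted_total + uncounted_app)
print("estimated undercount: %.1f%%" % undercount_pct)
```

Under these assumed numbers roughly one in twenty page views would be missing from the statistics; the point is only that the undercount scales directly with the unknown app-usage ratio.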
Any ideas?
Denny
Just saw that my tool was mentioned on this list.
I have a running commons category intersection available as a gadget
on commons now.
https://commons.wikimedia.org/w/index.php?title=Help:FastCCI&withJS=MediaWi…
It is less of an analysis tool, and more of a user interface enhancement.
Right now the intersection is only used to dig for Featured pictures,
Quality images, and Valued images. Results are delivered quite fast
through a websocket connection, which allows the streaming of progress
updates.
Cheers,
Daniel
Hello Analytics people!
I have a specific analytics question about how the category tree on
commons is used. To get started I have drafted a schema at
https://meta.wikimedia.org/wiki/Schema:CommonsCategoryTreeUse
The description on the talk page of that schema is copied below.
Thanks for considering it!
Best,
Daniel - [[User:Dschwen]]
Question
How are anonymous users using the commons category tree to find
images, compared to logged-in users? Is the category tree being used
to discover images?
The proposed schema should emit events on page views
and on category link clicks. The event data should contain the login
status (logged in/not logged in) and the current namespace number.
Analysis
The following analyses would be performed on the dataset:
* Category page visitation frequency compared to image page visitation
frequency for logged-in and logged-out users. How much relative "time"
is each group spending in the category namespace? This could indicate
whether categories are a significant path for the discovery of images
(as opposed to direct jumps to image pages from internal/external
search).
* Category link click rates in the category and image namespaces. These
metrics (again for logged-in and logged-out users) would indicate
whether the category tree is actively browsed (rather than stumbled
upon).
* Category link clicks in the image namespace are an indicator for the
effectiveness of categories to find similar content.
* Category link clicks in the category namespace are an indicator for
browsing the category tree to find specific content.
Rationale
The motivation for this study is to find out the significance of the
category tree in content discovery on Wikimedia Commons. This directly
impacts decisions about gadget default deployment, such as for the
FastCCI gadget, which would benefit anonymous users (if the category
tree is a significant funnel for content discovery). The schema is
designed to collect a minimal amount of data in a maximally anonymized
way.
The data to be logged should be considered inexpensive (standard
identifiers isAnon and pageNS in the schema). I have no clue how the
link click action will be logged, but determining the namespace from
the link target should be rather trivial (using mw.Title, for example).
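As an illustrative sketch only: in production one would presumably use mw.Title to resolve the namespace of a link target, but the idea can be shown with a hypothetical standalone helper that maps a title prefix to the namespace number the schema's pageNS field would record (6 = File, 14 = Category on Commons):

```javascript
// Hypothetical helper, not the mw.Title API: map a page title's
// prefix to its namespace number (6 = File, 14 = Category; 0 = main).
function namespaceOf(title) {
  const prefixes = { "File:": 6, "Category:": 14 };
  for (const [prefix, ns] of Object.entries(prefixes)) {
    if (title.startsWith(prefix)) return ns;
  }
  return 0; // main namespace
}
```

A click handler on category links could then call something like namespaceOf on the link target's title and emit it alongside the login status.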
[Reposted from private discussion after Dario's request]
My problem is that of exploring the graph structure of Wikipedia
1) easily;
2) reproducibly;
3) in a way that does not depend on parsing artifacts.
Presently, when people want to do this, they either do their own parsing of the dumps, use the SQL data, or download a dataset like
http://law.di.unimi.it/webdata/enwiki-2013/
which has everything "cooked up".
My frustration in the last few days came when trying to add the category links. I didn't realize (well, it's not very well documented) that bliki extracts all links and renders them in HTML *except* for the category links, which are instead accessible programmatically. Once I got there, I was able to make some progress.
Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and category links) is really a mine of information, and it is a pity that so much huffing and puffing is necessary to do something as simple as a reverse visit of the category links from "People" to get, in effect, all people pages (this is a bit more complicated--there are many false positives--but after a couple of fixes it worked quite well).
Moreover, one continuously has this feeling of walking on eggshells: a small change in bliki, a small change in the XML format, and everything might stop working in such a subtle manner that you realize it only after a long time.
I was wondering if Wikimedia would be interested in distributing in compressed form the Wikipedia graph. That would be the "official" Wikipedia graph--the benefits, in particular for people working on leveraging semantic information from Wikipedia, would be really significant.
I would (obviously) propose to use our Java framework, WebGraph, which is actually quite standard in distributing large (well, actually much larger) graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 http://lemurproject.org/clueweb12/ and the recent Common Web Crawl http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, even a pair of integers per line. The advantage of a binary compressed form is reduced network utilization, instantaneous availability of the information, etc.
Probably it would be useful to actually distribute several graphs with the same dataset--e.g., the category links, the content link, etc. It is immediate, using WebGraph, to build a union (i.e., a superposition) of any set of such graphs and use it transparently as a single graph.
In my mind the distributed graph should have a contiguous ID space, say, induced by the lexicographical order of the titles (possibly placing template pages at the start or at the end of the ID space). We should provide graphs, and a bidirectional node<->title map. All such information would use about 300M of space for the current English Wikipedia. People could then associate pages to nodes using the title as a key.
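The scheme above can be sketched on toy data (made-up titles, not a real dump): assign contiguous node IDs by lexicographic title order, keep a bidirectional node<->title map, and superpose two edge sets (e.g. content links and category links) into a single graph, as a WebGraph union would:

```python
titles = ["Category:People", "Alan Turing", "Ada Lovelace"]

# Contiguous ID space induced by the lexicographic order of titles.
sorted_titles = sorted(titles)
title_to_id = {t: i for i, t in enumerate(sorted_titles)}
id_to_title = sorted_titles  # the inverse map is just list indexing

# Two graphs over the same ID space, as sets of (source, target) pairs.
content_links = {(title_to_id["Ada Lovelace"], title_to_id["Alan Turing"])}
category_links = {
    (title_to_id["Ada Lovelace"], title_to_id["Category:People"]),
    (title_to_id["Alan Turing"], title_to_id["Category:People"]),
}

# The union is a superposition usable transparently as one graph.
union = content_links | category_links
```

People could then associate any external per-page data to nodes by looking the title up in title_to_id, which is the point of fixing the ID space once.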
But this last part is just rambling. :)
Let me know if you people are interested. We can of course take care of the process of cooking up the information once it is out of the SQL database.
Ciao,
seba
Hi all,
The Foundation has released the first draft of the data retention
guidelines.
https://meta.wikimedia.org/wiki/Data_retention_guidelines
We'd like to solicit your feedback on these guidelines. You can use the
talk page on the above document.
We really appreciate your help in finding the right balance between our
values, privacy, and the ability to improve the site and make data
available to the community.
Thanks,
-Toby