I'm occasionally noticing curious spikes in the usage stats (via tools.wmflabs.org/pageviews/) for a Wikibook I wrote and maintain. I would guess that many of the visitors are coming via a search engine. Some blogs provide authors with a sanitized subset of the HTTP referer {sic} header information, specifically the search engine search terms. I'm looking for that, or something similar, for Wikibooks.
How may I go about getting a sanitized list of search terms used to enter that Wikibook or its chapters?
Regards, Lars
To get this, you'd have to query some raw data we have. So it's a one-off task, and I personally have a long queue of such tasks that I haven't gotten to yet. So basically, either create a general interest research project, sign an NDA, and get access to the data yourself, or find someone else with such an NDA that's willing to run a simple query for you. Defining the query ahead of time would help, I could run it quickly and get you results if you do that. The data we publish is described here (let me know if we have gaps in documentation, I would fix those): https://wikitech.wikimedia.org/wiki/Analytics/Data
Public search metrics are available here: https://discovery.wmflabs.org/. As Dan mentioned, your query is quite specific, and public data is provided in aggregate, not raw, form.
The Analytics team does not provide data upon request beyond the already-public datasets, but please see the Research FAQ on Meta (https://meta.wikimedia.org/wiki/Research:FAQ) to understand who at WMF owns research-related processes and resources, and where to find data or statistics about a specific product audience (such as editors or readers).
Thanks,
Nuria
Thanks, Dan and Nuria, for the responses.
I see that the 'webrequest' table [1], in its current schema, has a raw-header field containing a superset of the data I am looking for with regard to the Wikibook:
referer string Referer header of request
but I don't think I would be able to propose a generic database query that would produce sufficiently sanitized data. At this point, I'm looking for only the search strings.
I'm also not sure of the contents of uri_path or uri_query to know which one would restrict the search to specific Wikibooks.
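To make that concrete, here is a rough sketch (in Python, with a hypothetical book title) of how each field might identify the book's pages, assuming normal views put the title in uri_path while index.php-style requests put it in uri_query:

```python
# Illustration only: how uri_path vs uri_query could restrict requests to one
# Wikibook. The book title and the leading "?" handling are assumptions, not
# a statement about how the production pipeline works.
from urllib.parse import parse_qs

BOOK = "Ada_Programming"  # hypothetical target Wikibook

def is_book_page(uri_path: str, uri_query: str) -> bool:
    """True if the request looks like a view of the book or one of its chapters."""
    # Normal page views arrive as /wiki/Title, so uri_path carries the title.
    if uri_path.startswith(f"/wiki/{BOOK}"):
        return True
    # index.php?title=... style requests carry the title in uri_query instead.
    title = parse_qs(uri_query.lstrip("?")).get("title", [""])[0]
    return title.startswith(BOOK)

print(is_book_page("/wiki/Ada_Programming/Basic", ""))         # chapter view
print(is_book_page("/w/index.php", "?title=Ada_Programming"))  # query-style view
print(is_book_page("/wiki/Main_Page", ""))                     # unrelated page
```

A real query would of course also need to guard against prefix collisions between book titles, but this is the shape of the restriction I have in mind.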
So, what would be the process to request access to the raw data, and what would the conditions for such access be? If I were to pursue this as a general-interest research project, the referred search terms could be grouped by Featured Book (plus the one non-featured book I am aiming for). There are about 200 English-language Featured Books [2] at the moment.
Regards, Lars
[1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Current_Schema
[2] https://en.wikibooks.org/wiki/Wikibooks:Featured_books#Featured_books
Lars,
I am not sure we have the data you are looking for; the data we get from searches is only available for 60 days or less while it gets processed, and it is deleted after that. Aggregated pageview data is kept long term; search data is not.
> So, what would be the process to request access to the raw data and what would be the conditions for such access?

Access to raw data is normally restricted to research projects. You can perhaps request a one-time query but, as I was saying, the data you are looking for is not available long term.
You can read about data access here: https://meta.wikimedia.org/wiki/Research:FAQ
Thanks,
Nuria
On 09/05/2016 07:36 AM, Nuria Ruiz wrote:
> Lars,
>
> I am not sure we have the data you are looking for; the data we get from searches is only available for 60 days or less while it gets processed, and it is deleted after that. Aggregated pageview data is kept long term; search data is not.
Even the most recent 30 to 60 days' worth would help. The pageview data shows what is used but gives no hint about why.
>> So, what would be the process to request access to the raw data and what would be the conditions for such access?
>
> Access to raw data is normally restricted to research projects. You can perhaps request a one-time query but, as I was saying, the data you are looking for is not available long term.
I've made a request in Phabricator, if I understand the request procedure properly:
https://phabricator.wikimedia.org/T144714
> You can read about data access here: https://meta.wikimedia.org/wiki/Research:FAQ
Thanks. I'm wading through that one and the nearby pages.
By the way, what about alternate, external methods such as subscribing that particular wikibook to Google Search Console? If it is allowed, I might try it to see if it is possible and what it yields.
Regards, Lars
> By the way, what about alternate, external methods such as subscribing that particular wikibook to Google Search Console?
Our privacy policy prevents us from sending data to third parties, so sending analytics data to Google is not allowed.
Thanks,
Nuria
Hi Lars,
This is Leila from WMF Research. Recently, we have been receiving a lot of requests about search queries. Here is a response we gave to another researcher a few days ago, FYI; hopefully it will be helpful.
Best, Leila
------------------
As you well know, access to the data you're asking for is not straightforward, and it's a topic that resurfaces every few months, as the editor community is also very interested in it. See, for example, a recent discussion here: https://lists.wikimedia.org/pipermail/wikimedia-l/2016-July/084745.html.
There are a few things the Research team (that I'm a member of) needs to know before we can say more:
* We need a proposal from you and your collaborators explaining what the project is, a short description of the methodology or approaches you want to try, and how the project can contribute to the Wikimedia Foundation's mission/plans and/or the Wikimedia/Wikipedia community. If there is something in our annual plan (https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2016-2017/Final) that catches your eye as a potential alignment, please bring that up in your proposal.
You can create a page under https://meta.wikimedia.org/wiki/Research for your project and share a link with us. Note that the proposal shouldn't be long; see, for example, the proposal for this research: https://meta.wikimedia.org/wiki/Research:Increasing_article_coverage (search for "Proposal" on the page).
* If you are under time constraints, please be explicit about it in your proposal. Looking at the current list of our collaborations (https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#Current_list_of_formal_collaborators), and knowing that there are a few more in process, you may have to wait some time before one of us can work with you to make it happen, assuming, of course, that your proposal is accepted by the team.
* To learn more about our formal collaborations, which are the way such access to data can be made possible, please read here: https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations.
Leila Zia Senior Research Scientist Wikimedia Foundation
Hi, Leila,
Thanks for the background info. The thread from July [1] is interesting. What I was hoping to look at is even simpler, more of a small task [2] than a large project. But looking at the Wikimedia Foundation Annual Plan for 2016-2017 [3], I can see that it could be made into a project if treated more generically: an interface that does for search terms (internal or external) what the Pageviews Analysis tool does for page views. That would fit in two places in the plan:
* Such a utility would match, under "Technology", "Program 1: Improve tools that help us understand user needs", meeting "Objective 1 – Improve tools for data display for the Foundation and community". In particular, it would help us understand how visitors are discovering Wikibooks and their chapters.
* It would also match, under "Product", "Program 2: Maintain and improve content discovery", specifically "Goal 3: Evolve content discovery and interactive tools on Wikimedia projects". In particular, it would show how visitors use search engines to find and use Wikibooks and their chapters.
I'm asking for this data for a specific Wikibook; however, expanding it to cover all Wikibooks would suit my needs as well. The data I request is a subset of what is contained in two HTTP headers, and it can be extracted in a way that completely protects the privacy of the visitor: only the destination chapter, the corresponding search terms, and the date would be saved after extraction from the raw stats.
The basis for this is the idea that a great many visitors enter the Wikibooks via search engine activity.
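As a sketch of the extraction I have in mind (the search-engine parameter names here are assumptions; engines differ, and some no longer pass search terms in the referer at all):

```python
# Rough sketch: given a referer header value, keep only the search terms.
# Parameter names are assumptions: "q" is commonly used by Google, Bing and
# DuckDuckGo, "p" by Yahoo; engines that strip the query string yield nothing.
from typing import Optional
from urllib.parse import urlsplit, parse_qs

QUERY_PARAMS = ("q", "p", "query")  # assumed per-engine parameter names

def search_terms(referer: str) -> Optional[str]:
    """Return the search terms from a search-engine referer URL, if present."""
    params = parse_qs(urlsplit(referer).query)
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0]
    return None  # not a search referer, or the terms were stripped

print(search_terms("https://www.google.com/search?q=ada+programming+tasking"))
```

Everything else in the raw record (IP, user agent, full referer) would be discarded at this step, which is what I mean by sanitized.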
Regards, Lars
[1] https://lists.wikimedia.org/pipermail/wikimedia-l/2016-July/084745.html
[2] At least it was a simple task back in the late '90s, using Apache, to extract and index a subset of the HTTP_REFERER header.
[3] https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2016-2017/F...
> The thread from July [1] is interesting. What I was hoping to look at is even simpler, more of a small task [2] than a large project.
I agree; I think your query is more of a small task and does not justify a full project. Also, Leila was talking about research projects and collaborations, not projects that involve spinning up new infrastructure.

FYI, we have a task about compiling referral data; it is not scheduled to happen immediately, but it is on our backlog: https://phabricator.wikimedia.org/T112284
I have replied at this Phabricator task; it's probably best to keep the further discussion centralized there.