I'm occasionally noticing curious spikes in the usage stats (via tools.wmflabs.org/pageviews/) for a Wikibook I wrote and maintain. I would guess that many of the visitors are coming via a search engine. Some blogs provide authors with a sanitized subset of the HTTP referer {sic} header information, specifically the search engine search terms. I'm looking for that, or something similar, for Wikibooks.
How may I go about getting a sanitized list of search terms used to enter that Wikibook or its chapters?
Regards, Lars
To get this, you'd have to query some raw data we have. So it's a one-off task, and I personally have a long queue of such tasks that I haven't gotten to yet. So basically, either create a general interest research project, sign an NDA, and get access to the data yourself, or find someone else with such an NDA that's willing to run a simple query for you. Defining the query ahead of time would help, I could run it quickly and get you results if you do that. The data we publish is described here (let me know if we have gaps in documentation, I would fix those): https://wikitech.wikimedia.org/wiki/Analytics/Data
Public search metrics are available here: https://discovery.wmflabs.org/. As Dan mentioned, your query is quite specific, and public data is provided in aggregate, not raw, form.
The Analytics team does not provide data upon request beyond the already-public datasets, but please see the Research FAQ on Meta (https://meta.wikimedia.org/wiki/Research:FAQ) to understand who at WMF owns research-related processes and resources, and where to find data or statistics about a specific product audience (such as editors or readers).
Thanks,
Nuria
Thanks, Dan and Nuria, for the responses.
I see that the 'webrequest' table [1], in its current schema, has a raw-header field containing a superset of the data I am looking for with regard to the Wikibook:
referer string Referer header of request
but I don't think I would be able to propose a generic database query that would produce sufficiently sanitized data. At this point, I'm looking for only the search strings.
I'm also not sure of the contents of uri_path or uri_query to know which one would restrict the search to specific Wikibooks.
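To make that concrete, here is a rough sketch (in Python, with a hypothetical book title) of how each field might identify the book's pages, assuming normal views put the title in uri_path while index.php-style requests put it in uri_query:

```python
# Illustration only: how uri_path vs uri_query could restrict requests to one
# Wikibook. The book title and the leading "?" handling are assumptions, not
# a statement about how the production pipeline works.
from urllib.parse import parse_qs

BOOK = "Ada_Programming"  # hypothetical target Wikibook

def is_book_page(uri_path: str, uri_query: str) -> bool:
    """True if the request looks like a view of the book or one of its chapters."""
    # Normal page views arrive as /wiki/Title, so uri_path carries the title.
    if uri_path.startswith(f"/wiki/{BOOK}"):
        return True
    # index.php?title=... style requests carry the title in uri_query instead.
    title = parse_qs(uri_query.lstrip("?")).get("title", [""])[0]
    return title.startswith(BOOK)

print(is_book_page("/wiki/Ada_Programming/Basic", ""))         # chapter view
print(is_book_page("/w/index.php", "?title=Ada_Programming"))  # query-style view
print(is_book_page("/wiki/Main_Page", ""))                     # unrelated page
```

A real query would of course also need to guard against prefix collisions between book titles, but this is the shape of the restriction I have in mind.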
So, what would be the process to request access to the raw data, and what would the conditions for such access be? If I were to pursue this as a general-interest research project, the referred search terms could be grouped by Featured Book (plus the one non-featured book I am aiming for). There are about 200 English-language Featured Books [2] at the moment.
Regards, Lars
[1] https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Current_Schema
[2] https://en.wikibooks.org/wiki/Wikibooks:Featured_books#Featured_books
Lars,
I am not sure we have the data you are looking for; the data we get from searches is only available for 60 days or less while it gets processed, and it is deleted after that. Aggregated pageview data is kept long term; search data is not.
> So, what would be the process to request access to the raw data and what would be the conditions for such access?

Access to raw data is normally restricted to research projects. You can perhaps request a one-time query but, as I was saying, the data you are looking for is not available long term.
You can read about data access here: https://meta.wikimedia.org/wiki/Research:FAQ
Thanks,
Nuria
On 09/05/2016 07:36 AM, Nuria Ruiz wrote:
> Lars,
>
> I am not sure we have the data you are looking for; the data we get from searches is only available for 60 days or less while it gets processed, and it is deleted after that. Aggregated pageview data is kept long term; search data is not.
Even the most recent 30 to 60 days' worth would help. The pageview data shows what is used but gives no hint about why.
>> So, what would be the process to request access to the raw data and what would be the conditions for such access?
>
> Access to raw data is normally restricted to research projects. You can perhaps request a one-time query but, as I was saying, the data you are looking for is not available long term.
I've made a request in Phabricator, if I understand the request procedure properly:
https://phabricator.wikimedia.org/T144714
> You can read about data access here: https://meta.wikimedia.org/wiki/Research:FAQ
Thanks. I'm wading through that one and the nearby pages.
By the way, what about alternate, external methods such as subscribing that particular wikibook to Google Search Console? If it is allowed, I might try it to see if it is possible and what it yields.
Regards, Lars
> By the way, what about alternate, external methods such as subscribing that particular wikibook to Google Search Console?
Our privacy policy prevents us from sending data to third parties, so sending analytics data to Google is not allowed.
Thanks,
Nuria
Hi Lars,
This is Leila from WMF Research. Recently, we have been receiving a lot of requests about search queries. Here is a response we gave to another researcher a few days ago, FYI; hopefully it will be helpful.
Best, Leila
------------------
As you well know, access to the data you're asking for is not straightforward, and it's a topic that resurfaces every few months, as the editor community is also very interested in it. See, for example, a recent discussion here: https://lists.wikimedia.org/pipermail/wikimedia-l/2016-July/084745.html.
There are a few things the Research team (that I'm a member of) needs to know before we can say more:
* We need a proposal from you and your collaborators explaining what the project is, a short description of the methodology or approaches you want to try, and how the project can contribute to the Wikimedia Foundation's mission/plans and/or the Wikimedia/Wikipedia community. If there is something in our annual plan (https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2016-2017/Final) that catches your eye as a potential alignment, please bring that up in your proposal.
You can create a page under https://meta.wikimedia.org/wiki/Research for your project and share a link with us. Note that the proposal shouldn't be long; see, for example, the proposal for this research: https://meta.wikimedia.org/wiki/Research:Increasing_article_coverage (search for "Proposal" on the page).
* If you are under time constraints, please be explicit about it in your proposal. Looking at the current list of our collaborations (https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations#Current_list_of_formal_collaborators), and knowing that there are a few more in process, you may have to wait some time before one of us can work with you to make it happen, assuming, of course, that your proposal is accepted by the team.
* To learn more about our formal collaborations, which are the way such access to data can be made possible, please read here: https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations.
Leila Zia Senior Research Scientist Wikimedia Foundation
Hi, Leila,
Thanks for the background info. The thread from July [1] is interesting. What I was hoping to look at is even simpler, more of a small task [2] than a large project. But looking at the Wikimedia Foundation Annual Plan for 2016-2017 [3], I can see that it could be made into a project if treated more generically: an interface that does for search terms (internal or external) what the Pageviews Analysis tool does for page views. That would fit in two places in the plan:
* Such a utility would match, under "Technology", "Program 1: Improve tools that help us understand user needs", meeting "Objective 1 – Improve tools for data display for the Foundation and community". In particular, it would help us understand how visitors are discovering Wikibooks and their chapters.
* It would also match, under "Product", "Program 2: Maintain and improve content discovery", specifically "Goal 3: Evolve content discovery and interactive tools on Wikimedia projects". In particular, it would show how visitors use search engines to find and use Wikibooks and their chapters.
I'm asking for this data for a specific Wikibook; however, expanding it to cover all Wikibooks would suit my needs as well. The data I request is a subset of what is contained in two HTTP headers, and it can be extracted in a way that completely protects the privacy of the visitor: only the destination chapter, the corresponding search terms, and the date would be saved after extraction from the raw stats.
The basis for this is the idea that a great many visitors enter the Wikibooks via search engine activity.
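As a sketch of the extraction I have in mind (the search-engine parameter names here are assumptions; engines differ, and some no longer pass search terms in the referer at all):

```python
# Rough sketch: given a referer header value, keep only the search terms.
# Parameter names are assumptions: "q" is commonly used by Google, Bing and
# DuckDuckGo, "p" by Yahoo; engines that strip the query string yield nothing.
from typing import Optional
from urllib.parse import urlsplit, parse_qs

QUERY_PARAMS = ("q", "p", "query")  # assumed per-engine parameter names

def search_terms(referer: str) -> Optional[str]:
    """Return the search terms from a search-engine referer URL, if present."""
    params = parse_qs(urlsplit(referer).query)
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0]
    return None  # not a search referer, or the terms were stripped

print(search_terms("https://www.google.com/search?q=ada+programming+tasking"))
```

Everything else in the raw record (IP, user agent, full referer) would be discarded at this step, which is what I mean by sanitized.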
Regards, Lars
[1] https://lists.wikimedia.org/pipermail/wikimedia-l/2016-July/084745.html
[2] At least it was a simple task back in the late '90s, using Apache, to extract and index a subset of the HTTP_REFERER header.
[3] https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2016-2017/F...
> The thread from July [1] is interesting. What I was hoping to look at is even simpler, more of a small task [2] than a large project.
I agree; I think your query is more of a small task and does not justify a full project. Also, Leila was talking about research projects and collaborations, not projects that involve spinning up new infrastructure.

FYI, we have a task about compiling referral data; it is not scheduled to happen immediately, but it is on our backlog: https://phabricator.wikimedia.org/T112284
I have replied at this Phabricator task; it's probably best to keep the further discussion centralized there.