[Foundation-l] Google Webmaster Tools and Foundation projects

Fri Nov 2 22:39:05 UTC 2007

Even if it were agreed that releasing this data does not violate anyone's
rights, one portion of Google's Terms of Service might still require
permission from Google:

> 5.5 Unless you have been specifically permitted to do so in a separate
> agreement with Google, you agree that you will not reproduce, duplicate,
> copy, sell, trade or resell the Services for any purpose.


On 11/2/07, Brian <Brian.Mingus at colorado.edu> wrote:
>
> Given that Google is the single largest contributor of traffic to all
> Wikimedia projects, it seems that the simple act of signing up to Google
> Webmaster Tools would provide a vast amount of data that Google is already
> collecting on the projects for free. This data is extremely interesting, and
> includes:
>
>    - The exact phrases used in external links to your site
>    - The top search queries used to access your site, in the following
>    csv format:
>
> >    Site Information,Location,Search Type,Top search queries,Top
> >    search query clicks
> >    http://en.wikipedia.org/wiki/Main_Page, (India) google.co.in,Web
> >    Search,"[wikipedia:1][universal access to knowledge:6]["the world's best
> >    encyclopedia":6]"
>
>    - This includes the position of your site in the results for that
>    query
>    - The PageRank of all of your pages; the distribution of the
>    PageRank of all of your pages; your page with the highest PageRank
>    - The number of people who have subscribed to the RSS feeds on your
>    site using Google products that allow this.
>    - Something akin to the inverse document frequency of the words in
>    your site and the words used in external links to your site, as computed by
>    GoogleBot
>
> How this would work:
>
>    1. A Wikimedia representative creates a Google account for the
>    Foundation
>    2. Each language version of each project is added. This may be a
>    one-time, labor-intensive process, or it might be more straightforward. I
>    have no way of testing this right now.
>    3. Click the "Download data for all sites" button.
>    4. Profit ;)
>
> It seems that releasing this data does not violate the Foundation privacy
> policy. The search query data is collected under an agreement between the
> user and Google, before they ever enter the domain of a Foundation project.
> That point aside, there are no unique identifiers that link one user to a
> set of queries, only a relationship between a set of queries and a country.
> There is no specific information on when a query was performed.
>
> The community will no doubt come up with interesting visualizations and
> applications of this data. Articles that are receiving a relatively high
> amount of traffic but are of a relatively low level of quality can be
> targeted for improvement, for example.
>
> I would volunteer to automate as much of this as possible, including
> downloading the data at certain intervals.
>
> Please discuss! :)
> /Brian
>
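
As a rough illustration of how the "Top search queries" field in the quoted
CSV sample might be processed, here is a minimal Python sketch. It assumes
each entry has the form [query:number] and that the number is the position of
the site in the results for that query, as the quoted message suggests; the
function name parse_top_queries is hypothetical, and the real export format
may differ.

```python
import re

# Assumption: each entry in the field looks like [query:number].
# The lazy match lets a query contain colons or quotes itself.
ENTRY = re.compile(r"\[(.+?):(\d+)\]")


def parse_top_queries(field):
    """Split one 'Top search queries' field into (query, position) pairs.

    The number is assumed to be the site's position for that query;
    it is returned as an int. Quotes around a phrase query are kept.
    """
    return [(query, int(position)) for query, position in ENTRY.findall(field)]


if __name__ == "__main__":
    # Field taken from the sample row in the quoted message.
    sample = (
        "[wikipedia:1][universal access to knowledge:6]"
        "[\"the world's best encyclopedia\":6]"
    )
    for query, position in parse_top_queries(sample):
        print(position, query)
```

Only the bracketed field is parsed here, not the whole row, because the
sample row carries unescaped quotes inside a quoted field and a standard CSV
reader may not split it as intended.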