[Foundation-l] Google Webmaster Tools and Foundation projects

Brian Brian.Mingus at colorado.edu
Fri Nov 2 22:22:46 UTC 2007

Given that Google is the single largest contributor of traffic to all
Wikimedia projects, the simple act of signing up for Google Webmaster
Tools would provide, for free, a vast amount of data that Google is
already collecting on the projects. This data is extremely interesting,
and includes:

   - The exact phrases used in external links to your site
   - The top search queries used to access your site, in the following
   CSV format:

>    Site Information,Location,Search Type,Top search queries,Top search query clicks
>    http://en.wikipedia.org/wiki/Main_Page,(India) google.co.in,Web Search,"[wikipedia:1][universal access to knowledge:6]["the world's best encyclopedia":6]"

   - This includes the position of your site in the results for that query
   - The PageRank of each of your pages; the distribution of PageRank
   across your pages; your page with the highest PageRank
   - The number of people who have subscribed to the RSS feeds on your
   site using Google products that support this
   - Something akin to the inverse document frequency of the words on
   your site and the words used in external links to your site, as
   computed by Google

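The bracketed query:position pairs in the sample above are easy to pull apart with a short script. Here is a minimal sketch in Python, assuming the column layout shown; note that the embedded quotes in the original sample row would break standard CSV quoting, so the sample data here is simplified:

```python
import csv
import io
import re

# Hypothetical sample in the CSV layout shown above; the real export may
# differ in column order or quoting details.
SAMPLE = (
    'Site Information,Location,Search Type,Top search queries,'
    'Top search query clicks\n'
    'http://en.wikipedia.org/wiki/Main_Page,(India) google.co.in,'
    'Web Search,"[wikipedia:1][universal access to knowledge:6]",\n'
)

def parse_queries(field):
    """Split a '[query:position]...' field into (query, position) pairs."""
    return [(q, int(pos))
            for q, pos in re.findall(r'\[([^:\]]+):(\d+)\]', field)]

def parse_export(text):
    """Parse the export into a list of per-site records."""
    return [
        {
            'site': row['Site Information'],
            'location': row['Location'],
            'queries': parse_queries(row['Top search queries']),
        }
        for row in csv.DictReader(io.StringIO(text))
    ]

records = parse_export(SAMPLE)
print(records[0]['queries'])
# → [('wikipedia', 1), ('universal access to knowledge', 6)]
```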
How this would work:

   1. A Wikimedia representative creates a Google account for the
   Foundation.
   2. Each language version of each project is added. This may be a
   one-time, labor-intensive process, or it may be more straightforward. I
   have no way of testing this right now.
   3. Click the "Download data for all sites" button.
   4. Profit ;)
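Step 2, at minimum, could be scripted by enumerating one Webmaster Tools "site" per language edition. A toy sketch, with an invented project/language inventory (the real lists would come from the Wikimedia site matrix):

```python
from itertools import product

# Hypothetical inventory for illustration only; the actual lists of
# projects and languages would come from the Wikimedia site matrix.
PROJECTS = ['wikipedia', 'wiktionary', 'wikibooks']
LANGUAGES = ['en', 'de', 'fr']

def site_urls():
    """Enumerate one site URL per (language, project) pair (step 2)."""
    return ['http://%s.%s.org/' % (lang, proj)
            for lang, proj in product(LANGUAGES, PROJECTS)]

urls = site_urls()
print(len(urls))  # 9 sites to register with this toy inventory
```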

It seems that releasing this data does not violate the Foundation privacy
policy. The search query data is collected under an agreement between the
user and Google, before they ever enter the domain of a Foundation project.
That point aside, there are no unique identifiers that link one user to a
set of queries, only a relationship between a set of queries and a country.
There is no specific information on when a query was performed.

The community will no doubt come up with interesting visualizations and
applications of this data. For example, articles that receive relatively
high traffic but are of relatively low quality could be targeted for
improvement.
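The "high traffic, low quality" triage could work roughly as follows. The traffic counts and quality scores below are invented for illustration; real inputs would be the Webmaster Tools click data and some on-wiki quality assessment:

```python
# Toy data: article name -> search-traffic clicks and a quality score in
# [0, 1]. Both are invented here purely to illustrate the idea.
articles = {
    'Main_Page':    {'clicks': 50000, 'quality': 0.9},
    'Obscure_Moth': {'clicks': 12,    'quality': 0.2},
    'Pop_Star':     {'clicks': 30000, 'quality': 0.3},
}

def improvement_targets(articles, min_clicks=1000, max_quality=0.5):
    """Return high-traffic, low-quality articles, busiest first."""
    return sorted(
        (name for name, a in articles.items()
         if a['clicks'] >= min_clicks and a['quality'] <= max_quality),
        key=lambda name: -articles[name]['clicks'])

print(improvement_targets(articles))  # → ['Pop_Star']
```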

I would volunteer to automate as much of this as possible, including
downloading the data at regular intervals.
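The periodic download could be a simple polling loop around whatever retrieval mechanism turns out to work (browser automation, or an API if Google ever provides one). A skeleton, with the fetch step left as a placeholder:

```python
import time

def fetch_all_sites_csv():
    """Placeholder for the "Download data for all sites" step; the real
    retrieval mechanism is an open question."""
    return 'Site Information,Location,Search Type,Top search queries\n'

def poll(interval_seconds, iterations):
    """Grab a snapshot of the export every interval_seconds seconds."""
    snapshots = []
    for i in range(iterations):
        snapshots.append(fetch_all_sites_csv())
        if i < iterations - 1:
            time.sleep(interval_seconds)
    return snapshots

# e.g. poll(7 * 24 * 3600, 52) would take weekly snapshots for a year
```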

Please discuss! :)
