Hi Valerio,
This kind of request is a better fit for the Research mailing list. I've included the email for that list in the To: line of this email reply.
Pine
On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
Dear Wikimedia Foundation, in the context of an EU research project [1], we are interested in accessing Wikipedia access traces. In the past, such traces were given to other groups for research purposes [2]. Unfortunately, only a small percentage (10%) of that trace has been made available. We are interested in accessing the totality of that same trace (or, even better, a more recent one, but the same one will do).
If this is not the correct mailing list for such requests, could anyone please redirect me to the correct one?
Thanks again for your attention,
Valerio Schiavoni
Post-Doc Researcher
University of Neuchatel, Switzerland
1 - http://www.leads-project.eu
2 - http://www.wikibench.eu/?page_id=60
Hello, just bumping my email from last week, since I have not received any answer so far.
Should I consider that dataset to be somehow lost?
I've also contacted the researchers who partially released it, but making it publicly available is tricky for them due to its size (12 TB), which, on the other hand, might be well within the norms of the Wikipedia servers' daily operations.
Thanks again, Valerio
Just to confirm, https://dumps.wikimedia.org/other/pagecounts-raw/ won't work for you?
Hello Aaron, thanks for your reply.
On Wed, Sep 17, 2014 at 4:03 PM, Aaron Halfaker <aaron.halfaker@gmail.com> wrote:
Just to confirm, https://dumps.wikimedia.org/other/pagecounts-raw/ won't work for you?
Unfortunately, no. Those logs only provide page counts, without the associated timestamps ("when" those pages were accessed). If such logs exist, they would do perfectly.
2 - http://www.wikibench.eu/?page_id=60
By comparison, the logs in that dataset look like this:
3325795636 1191194118.711 http://en.wikipedia.org/w/index.php?title=MediaWiki:Monobook.css&usemsgc... -
3325795635 1191194118.803 http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Icono_aviso_borrar.... -
3325795639 1191194118.671 http://de.wikipedia.org/w/index.php?title=MediaWiki:Monobook.css&usemsgc... -
The first token is just a counter; the second is a Unix timestamp; then comes the Wikipedia URL from the request, followed by a flag indicating whether the request issued a database update (none of those three did).
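For anyone who wants to replay or analyze these traces, here is a minimal parsing sketch in Python; the field layout is inferred from the description above, and the example URL is made up, since the quoted lines are truncated:

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TraceEntry:
    seq: int          # per-request counter
    ts: float         # Unix timestamp, with fractional seconds
    url: str          # full requested URL
    db_update: bool   # whether the request issued a database update

def parse_trace_line(line: str) -> TraceEntry:
    # Whitespace-separated fields: counter, timestamp, URL, update flag.
    # The flag is assumed to be "-" for read-only requests, as in the
    # three sample lines above.
    seq, ts, url, flag = line.split()
    return TraceEntry(int(seq), float(ts), url, flag != "-")

entry = parse_trace_line(
    "3325795636 1191194118.711 http://en.wikipedia.org/wiki/Main_Page -"
)
print(datetime.fromtimestamp(entry.ts, tz=timezone.utc))  # "when" it was accessed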
best, Valerio
<quote who="Valerio Schiavoni" date="Wed, Sep 17, 2014 at 04:14:04PM +0200">
Unfortunately, no. Those logs only provide page counts, without the associated timestamps ("when" those pages were accessed). If such logs exist, they would do perfectly.
The pagecount data /has/ timing data, but it is "binned" by the hour.
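(For readers who haven't used those files: the hour is encoded only in the file name, e.g. pagecounts-20140901-000000.gz, which is what the one-hour binning amounts to. A minimal reading sketch, assuming the documented four-field "project title count bytes" line layout:)

import gzip

def read_pagecounts(path):
    # Yield (project, title, views, bytes) from one hourly pagecounts file.
    # There is no per-line timestamp; the file itself is the time bin.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 4:
                project, title, views, nbytes = parts
                yield project, title, int(views), int(nbytes)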
I don't think more comprehensive data (all pages, all languages, nearly all viewers) over a long period of time exists anywhere and I don't think any similarly comprehensive data exists before 2007 at all.
You might find more granular data for short periods of time (like the WikiBench data or maybe stuff that's been collected more recently by WMF but isn't published) or much more detailed data from longer periods of time for a subset of users on a particular network (perhaps like the Indiana data, or toolbar data like the Yahoo data that some WP researchers have used).
I would /love/ to hear that I am wrong about this and that there's some wonderful, granular, broad, long-term dataset of pageviews that I just don't know about. :)
Later, Mako
I suppose you could get more granular data by conducting an opt-in study of some kind, and you would need to be careful that users who haven't opted in are not accidentally included and do not indirectly have their privacy affected. I agree that collection at intervals shorter than an hour is going to raise a lot of privacy considerations for users who have not opted in.
Pine
<quote who="Pine W" date="Thu, Sep 18, 2014 at 12:07:53PM -0700">
I suppose you could get more granular data by conducting an opt-in study of some kind, and you would need to be careful that users who haven't opted in are not accidentally included or indirectly have their privacy affected. I agree that collection at intervals shorter than an hour is going to raise a lot of privacy considerations for users who have not opted in.
That would certainly work for some research questions and that's more or less what most toolbar data is.
The problem is that the questions answered with view data are often about the overall popularity or visibility of pages, which requires data that is representative. There are lots of reasons to believe that people who opt in aren't going to be representative of all Wikipedia readers.
Regards, Mako
Yes, but supposedly phone survey companies are able to get representative samples of broad populations despite many people refusing to respond to phone surveys. If opt-in users were chosen using similar methods, could arguably representative data be obtained?
Pine
On Thu, Sep 18 2014, Pine W wrote:
Yes, but supposedly phone survey companies are able to get representative samples of broad populations despite many people refusing to respond to phone surveys. If opt-in users were chosen using similar methods, could arguably representative data be obtained?
... Well, at least it could be representative of the opt-in population, and if that's an interesting enough population, it could be worthwhile.
For example people who opt to donate during the yearly fund-drive could be further invited to participate in page view tracking, say, and people who've opted in to both conditions might be taken to be representative of donors, who might be taken to be (vaguely) representative of the general population. The data from this group could be factored out against other people who opt into page view tracking who aren't donors, etc etc. (Probably I've described something that's already been done, or that can't be done; I'm not attached to the particular example!)
Further OT micro-rant about population research in free/open culture --
Although I'm very naive about Wikipedia research, I've been wondering if it would be possible to do crowd-sourced pattern-finding research on Emacs use, combining ideas from:
http://www.emacswiki.org/emacs/RepetitionDetection
http://popcon.debian.org/
At least in the programming world, I think the "moral" thing to do is to write programs that optimize repeated activities, and that there would be a potentially huge gain to doing this on a population-wide basis rather than on an individual basis. Because despite what I said above the first virtue of individual programmers is laziness! We're perhaps only "moral" at the population level.
<quote who="Pine W" date="Thu, Sep 18, 2014 at 01:49:13PM -0700">
Yes, but supposedly phone survey companies are able to get representative samples of broad populations despite many people refusing to respond to phone surveys. If opt-in users were chosen using similar methods, could arguably representative data be obtained?
The way that people build representative surveys from non-representative data is by understanding quite a lot about the nature and structure of the bias in the sample. You can think of how they do this as building a very complicated system of weights.
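As a toy sketch of that idea, here is post-stratification on a single made-up attribute; real weighting schemes juggle many interacting variables, which is where the decades of work go:

# The population is assumed to be 50/50 across groups A and B, but the
# opt-in sample is 80/20, so each respondent is weighted by
# population share / sample share to restore the population margins.
population_share = {"A": 0.5, "B": 0.5}
sample_share = {"A": 0.8, "B": 0.2}
weights = {g: population_share[g] / sample_share[g] for g in population_share}

sample = [("A", 1), ("A", 0), ("A", 1), ("A", 1), ("B", 0)]  # (group, answer)

raw_mean = sum(x for _, x in sample) / len(sample)              # 0.6
weighted_mean = (sum(weights[g] * x for g, x in sample)
                 / sum(weights[g] for g, _ in sample))          # 0.375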
Folks who do this for US phone surveys, for example, have spent many decades and many millions of dollars on research to understand how to get reliable results, and even then it's a quickly moving target. They still sometimes miss things and get things wrong.
That said, there are certainly things we can learn. Aaron Shaw and I actually did something related with one of the big Wikipedia surveys in this article:
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0065782
In our case, our study was only possible because (a) we had very good luck finding "ground truth" data from the right point in time, (b) we had detailed demographic data on folks from the WP survey, and (c) we made a series of untestable assumptions. After all that work, we still can't know that we've got it right. We really can only suggest that there are reasons to believe our estimates are better than pretending that the opt-in survey is unbiased.
In the case of signing up for a Wikipedia toolbar, we might not even attract a sub-population that even /can/ be reliably used to build representative estimates. :-(
Regards, Mako
See what you started, Pine? *This* is what happens when you get professors talking about research methods.
:P
- J
<quote who="Jonathan Morgan" date="Thu, Sep 18, 2014 at 02:43:35PM -0700">
See what you started, Pine? *This* is what happens when you get professors talking about research methods.
What, you get nearly identical messages written simultaneously by serial co-authors? ;)
Later, Mako
The basic issue in sampling is to decide what the target population T actually is. Then you weight the sample so that each person in the target population has an equal chance w and people not in it have weight zero.
So what is the target population we want to study?
--the world's population?
--the world's educated population?
--everyone with internet access?
--everyone who ever uses Wikipedia?
--everyone who uses it a lot?
--everyone who has knowledge to contribute in positive fashion?
--everyone who has the internet, skills and potential to contribute?
--everyone who has the potential to contribute but does not do so?
Richard Jensen rjensen@uic.edu
Let's loop back to the request at hand. Valerio, can you describe your use case for access traces at intervals shorter than one hour? The very likely outcome of this discussion is that the access traces at shorter intervals will not be made available, but I'm curious about what you would do with the data if you had it.
Pine
Hello everyone, it seems the discussion is sparking an interesting debate; thanks, everyone.
To put things back in context: we use Wikipedia as one of the few websites where users can access different 'versions' of the same page. Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen. New versions of a page are created as well. Finally, users might need to explore several old versions of a given page, for example by accessing the details of its history [1]. Access traces need to be accurate enough to model the workload on the servers that store the content being served. A resolution coarser than 1 second would not reflect the access patterns on Wikipedia, or on similarly versioned websites. We use these access patterns to test different version-aware storage techniques. For those interested, I can send the pre-print version of an article that I will present next month at the IEEE SRDS'14 conference.
As for potential privacy concerns about disclosing such traces, I would like to stress that we are not looking into 'who' requested a given URL or from 'where'. That information is completely absent from the Wikibench traces, and can/should remain absent from new traces.
Let's say Wikipedia somehow reveals the top-10 most-visited pages in the last minute: would that represent a privacy breach for some users? I highly doubt it, and I invite the audience to convince me of the contrary.
Best regards, Valerio
1- For example: http://it.wikipedia.org/w/index.php?title=George_W._Bush&action=history
Thanks for the explanation. On moderate- to high-traffic pages, let's say with a minimum of 10 hits per minute across the entire time span studied, perhaps the requested data could be provided while still providing strong privacy protection. Toby might need to discuss this with WMF Legal.
Pine
A few more thoughts:
* You probably don't need the full URLs of the content being accessed, so those could be anonymized and replaced with random identifiers to some degree, right? (See the sketch after this list.)
* Someone might be able to monitor the user's end of the transactions, such as by having university network logs that show destination domains and timestamps, in such a way that they could pair the university logs with Wikimedia access traces of one second granularity and thus defeat some measures of privacy for the university's Wikimedia users, correct?
* I am not sure that the staff time required to analyze this request and produce the data is a good use of resources on Wikimedia's end. Toby would be a good person to ask about this.
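On the first point, a minimal sketch of what such anonymization could look like: a keyed hash gives every URL a stable but opaque identifier, so per-page access patterns survive while the URLs themselves are not disclosed. The key is a placeholder here and would stay with the data publisher:

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-long-random-secret"  # placeholder, never published

def pseudonymize(url: str) -> str:
    # Identical URLs map to identical tokens, so access patterns per page
    # are preserved; without the key, tokens cannot be reversed or checked
    # against a dictionary of candidate URLs.
    return hmac.new(SECRET_KEY, url.encode("utf-8"), hashlib.sha256).hexdigest()[:16]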
Pine
Given that the request logs aren't transparent about which cached version of a page is being provided, I'm finding it pretty difficult to see how they'd help you answer interesting questions here :/.
Both the desire for highly granular data and the concerns about privacy seem somewhat caricatured in this conversation :)
Valerio writes:
Access traces need to be accurate enough to model the workload on the servers that store the content being served. A resolution coarser than 1 second would not reflect the access patterns on Wikipedia, or on similarly versioned websites.
I don't understand your last sentence. Why can't you do the analysis you describe with hour-resolution data? It might help this discussion if you did a sample analysis for one page & one day, with available data, and indicated where higher res would help.
Pine writes:
Someone might be able to monitor the user's end of the transactions, such as by having university network logs that show destination domains and timestamps, in such a way that they could pair the university logs with Wikimedia access traces of one second granularity and thus defeat some measures of privacy for the university's Wikimedia users, correct?
en.wp gets 2000+ pageviews/s, so not much privacy is lost in that scenario, which is already pretty narrow: if you have access to the university logs, you might have access to the full destination URL. I'm having a hard time seeing how high-res data (full URLs, no source) would be a privacy risk – but if needed, binning could likely be done closer to the second than to the hour.
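(Concretely, that binning is just flooring each timestamp to a bin boundary, so the same mechanism covers anything from one second to one hour; a sketch:)

def bin_timestamp(ts: float, width: int) -> int:
    # Floor a Unix timestamp to the start of its bin; width=1 gives
    # per-second bins, width=3600 reproduces the hourly pagecount binning.
    return int(ts // width) * width

bin_timestamp(1191194118.711, 1)     # -> 1191194118
bin_timestamp(1191194118.711, 3600)  # -> 1191193200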
Warmly, Sam
Hm, on the second point the person to ask is Toby, but it sounds like there are reasons for the minimum one-hour granularity, and with Oliver's point it sounds like this research approach won't produce the intended benefits anyway. Perhaps another reason for the one-hour minimum granularity is that the storage and other resource requirements for highly granular data are too expensive to justify the benefits.
Pine
Hi,
IR-Cache provides its traces at a granularity of less than a second, and has been doing so for years. The way they deal with the storage problem is by keeping a rotating log of at most one week: when they add a new file for today, they delete the one for Monday of last week. Anyone who needs more than one week of data has to write their own script or download the files at least once a week. Should Wikimedia provide such data, there shouldn't be a storage problem.
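A rough sketch of the kind of script that implies, run (e.g. from cron) at least weekly so nothing rotates out before it is fetched; the base URL and file-name pattern are hypothetical placeholders, not IR-Cache's real ones:

import datetime
import urllib.request

BASE = "http://traces.example.org"  # placeholder for the provider's base URL

def fetch_last_week():
    today = datetime.date.today()
    for back in range(7):
        day = today - datetime.timedelta(days=back)
        name = f"trace-{day:%Y%m%d}.log.gz"  # hypothetical naming scheme
        urllib.request.urlretrieve(f"{BASE}/{name}", name)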
Best, Ahmed
Hello,
On Mon, Sep 22, 2014 at 8:38 AM, Ahmed Aley <ahmeda@cs.umu.se> wrote:
IR-Cache provides its traces at a granularity of less than a second.
What and where is this IR-Cache? A quick Google search did not help...
Hi,
Their website has been down for a few days. Here is a cached version: http://webcache.googleusercontent.com/search?q=cache:apD3FN7QLxgJ:www.ircach...
++Ahmed
<quote who="Valerio Schiavoni" date="Fri, Sep 19, 2014 at 01:56:29PM +0200">
Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen.
At least as far as I know, views of historical versions of pages on Wikipedia don't show up in the access logs at all, because certain kinds of requests (like requests to /w/index.php?oldid=NUMBER) don't get recorded in the pageview data.
New versions of a page are created as well. Finally, users might need to explore several old versions of a given page, for example by accessing the details of its history [1].
AFAIK, viewing the history page itself is not recorded in the pageview data either.
Regards, Mako
Hello Mako,
On Wed, Sep 24, 2014 at 8:13 AM, Benj. Mako Hill <mako@atdot.cc> wrote:
Users mostly read the most recent version of a given page, but from time to time, read accesses to the 'history' of a page happen.
At least as far as I know, views of historical versions of pages on Wikipedia don't show up in the access logs at all, because certain kinds of requests (like requests to /w/index.php?oldid=NUMBER) don't get recorded in the pageview data.
I'm sorry to contradict you, but at least in the Wikibench traces, that information is very much present. I see things like:
1609418296 1190438479.078 http://en.wikipedia.org/w/index.php?title=Western_betrayal&oldid=9828122...
That is, back in 2007, users were accessing a version of that page that dated back to 2005 or so.
New versions of a page are created as well. Finally, users might need to explore several old versions of a given page, for example by accessing the details of its history [1].
AFAIK, viewing the history page itself is not recorded in the pageview data either.
Sorry to contradict you again, but there are indeed logs for that as well:
http://en.wikipedia.org/w/index.php?title=Marina_Nadiradze&action=histor...
I'm quite surprised that such information is not known to the community of Wikipedia researchers.
Best, Valerio
Hi Valerio,
Mako was referring to https://dumps.wikimedia.org/other/pagecounts-raw/ and the current logging practices. My understanding is also that these things are not logged on a routine basis. The Wikibench traces seem to have been a special case.
I've also contacted the researchers who partially released it, but making it publicly available is tricky for them due to its size (12 TB), which, on the other hand, might be well within the norms of the Wikipedia servers' daily operations.
Have the researchers looked into requester-pays data storage on Amazon or another provider? They should be able to make it public with no resources and at no cost to themselves, whatever the size.
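For what it's worth, a sketch of how that arrangement looks on Amazon S3 with boto3; bucket and object names here are made up. The owner flips the requester-pays flag once, and each downloader then passes RequestPayer and is billed for the transfer themselves:

import boto3

s3 = boto3.client("s3")

# Done once by the data owner:
s3.put_bucket_request_payment(
    Bucket="wikibench-full-trace",  # hypothetical bucket name
    RequestPaymentConfiguration={"Payer": "Requester"},
)

# Done by each researcher fetching a piece of the 12 TB:
s3.download_file(
    "wikibench-full-trace",
    "part-0001.gz",                 # hypothetical object key
    "part-0001.gz",
    ExtraArgs={"RequestPayer": "requester"},
)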
Cheers, Scott
<quote who="Valerio Schiavoni" date="Wed, Sep 24, 2014 at 12:09:44PM +0200">
I'm sorry to contradict you, but at least in the Wikibench traces, that information is very much present. I see things like:
1609418296 1190438479.078 http://en.wikipedia.org/w/index.php?title=Western_betrayal&oldid=9828122...
<snip />
I'm quite surprised that such information is not known to the community of Wikipedia researchers.
Well, my ignorance is my own and does not reflect the community of Wikipedia researchers. :)
But, as Scott pointed out, I was referring to pagecount data published by WMF (i.e., the data binned by hour that we were discussing in the sub-thread).
I was replying to the discussion about the granularity of the pagecount data to point out that increased granularity won't help you because the data you want isn't provided in /that/ dataset at all.
Wikibench is the only source of data I know of that includes hits to the "/w/index.php" pages for all of Wikipedia (I'd love to hear that I'm wrong about that). Unfortunately, Wikibench was, as far as I know, basically a one-off thing. It's great if you want a 10% sample of this kind of data for a ~3.5-month period in late 2007. If you want anything that is less stale, I think you're going to have to try to cut a deal with WMF to collect it.
Regards, Mako
Hi
On Wed, Sep 24, 2014 at 6:08 PM, Benj. Mako Hill <mako@atdot.cc> wrote:
I'm quite surprised that such information is not known to the community of Wikipedia researchers.
Well, my ignorance is my own and does not reflect the community of Wikipedia researchers. :)
I did not mean to offend anyone with that sentence. Please, community of Wikipedia researchers, accept my apologies.
If you want anything that is less stale, I think you're going to have to try to cut a deal with WMF to collect it.
If the WMF is open to discussing this aspect, I'll be more than ready to discuss possible agreements. Could anyone please point me to the right people to contact to discuss this possibility?
Thanks, Valerio
(Apologies if this has been referred to already on this list. If so, I missed it).
A couple of weeks ago, Brian Keegan published a very nice blog post [1] on the use of Python for Wikimedia research. He uses examples from the English Wikipedia, but the techniques he describes are applicable more generally.
It’s fascinating, and shows how much can be done with a few lines of code.
Michael
[1] http://nbviewer.ipython.org/github/brianckeegan/Wikipedia-Network-Analysis/b...
IPython notebook FTW. Thanks for sharing.
Make a great day, Max Klein ‽ http://notconfusing.com/
(shameless plug)
for those of you who are on Twitter, follow @WikiResearch https://twitter.com/WikiResearch and you won’t miss any of these announcements.
Dario
Folks,
Sorry for the interruption but I need to unsubscribe from this group. Can someone please help?
Many thanks, Divya
Valerio,
I didn't know such data existed. As an alternative, perhaps you could have a look at our click datasets, which contain requests to the Web at large (i.e., not just Wikipedia) generated from within the campus of Indiana University over a period of several months. HTH
http://carl.cs.indiana.edu/data/#click
Cheers
G
Giovanni Luca Ciampaglia
✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciampag@indiana.edu
Hello Giovanni, thanks for the pointer to the Click datasets. I'd have to take a look at the complete dataset to see how many of those requests touch Wikipedia.
Then, one of the requirements for accessing that data is: "The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment."
I have to check if this is possible and how long it might take to ship a hard drive from Switzerland and have it sent back. I'll let you know!
Best, Valerio
Hello Giovanni, on second thought, I think the Click dataset won't do either. I've parsed the smaller sample [1], which is said to be extracted from the bigger one.
In that dataset there are ~34k entries related to Wikipedia, but they look like the following:
{"count": 1, "timestamp": 1257181201, "from": "en.wikipedia.org", "to": " ko.wikipedia.org"}
That is, the log only reports the host/domain accessed, not the specific URL being requested (to be clear, the URL in the HTTP request issued by the client). That URL is what is of main interest to me.
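For reference, here is roughly the check I ran to get that figure: a minimal sketch in Python, assuming the sample is newline-delimited JSON (the filename click-sample.json is just a placeholder).

    import json

    # Count entries in the Click sample that mention a Wikipedia host.
    # Assumes one JSON object per line with "from"/"to" fields as in
    # the example record above; "click-sample.json" is a placeholder.
    n = 0
    with open("click-sample.json") as f:
        for line in f:
            record = json.loads(line)
            hosts = (record.get("from", ""), record.get("to", ""))
            if any(h.strip().endswith("wikipedia.org") for h in hosts):
                n += 1

    print(n, "entries touching wikipedia.org")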
Thanks for your interest anyway! Valerio
1 - http://carl.cs.indiana.edu/data/#traffic-websci14
Hi Valerio,
The page counts dataset has a time resolution of one hour. Is that too coarse? How fine a resolution do you need?
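For context, each hourly file is a gzipped plain-text list with one line per project and page title, in the form "project page_title view_count bytes_transferred". A minimal sketch of reading one hour of English-Wikipedia counts (the filename below is illustrative):

    import gzip

    # Each line of an hourly pagecounts-raw file looks like:
    #   en Main_Page 242332 4737756101
    # i.e. project code, page title, hourly view count, bytes served.
    # The filename below is illustrative.
    counts = {}
    with gzip.open("pagecounts-20140917-140000.gz", "rt",
                   encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4 or parts[0] != "en":
                continue  # keep only English Wikipedia lines
            counts[parts[1]] = counts.get(parts[1], 0) + int(parts[2])

    print(len(counts), "distinct pages viewed in this hour")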
Hello Aaron, one hour is way too coarse; a resolution of one second would be OK. Is that available?
I don't think we keep those logs historically; analytics-l (CC'd) might have more insight.
Do we have anything more granular than the hourly view logs available here: https://dumps.wikimedia.org/other/pagecounts-raw/ ?