I don't think that we keep those logs historically. analytics-l (CC'd) might have more insights.
Do we have anything more granular than the hourly view logs available here? https://dumps.wikimedia.org/other/pagecounts-raw/
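(For reference, each hourly pagecounts-raw file is plain text, one line per project/page pair. Below is a minimal sketch of a line parser, assuming the usual four space-separated fields: project code, page title, view count, and bytes transferred. The function name and dict keys are my own, not part of any official tooling.)

```python
# Sketch of a parser for one line of an hourly pagecounts-raw file,
# assuming four space-separated fields: project, title, views, bytes.
def parse_pagecounts_line(line):
    project, title, count, size = line.strip().rsplit(" ", 3)
    return {"project": project, "title": title,
            "count": int(count), "bytes": int(size)}

print(parse_pagecounts_line("en Main_Page 42 1234567"))
```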
On Wed, Sep 17, 2014 at 10:39 AM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
Hello Aaron, 1 hour is way too coarse. Let's say 1 second would be OK. Is that available?
On Wed, Sep 17, 2014 at 5:23 PM, Aaron Halfaker <aaron.halfaker@gmail.com> wrote:
Hi Valerio,
The page counts dataset has a time resolution of one hour. Is that too coarse? How fine a resolution do you need?
On Wed, Sep 17, 2014 at 9:44 AM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
Hello Giovanni, on second thought, I think the Click dataset won't do either. I've parsed the smaller sample [1], which is said to be extracted from the bigger one.
In that dataset there are ~34k entries related to Wikipedia, but they look like the following:
{"count": 1, "timestamp": 1257181201, "from": "en.wikipedia.org", "to": "ko.wikipedia.org"}
That is, the log only reports the host/domain accessed, but not the specific URL being requested (to be clear, the one in the HTTP request issued by the client).
This is what is of main interest to me.
Thanks for your interest anyway! Valerio
1 - http://carl.cs.indiana.edu/data/#traffic-websci14
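(To illustrate the limitation described above: a minimal sketch that tallies Wikipedia-related entries in such a log, assuming one JSON object per line with the field names shown in the sample entry. Note that it can only match on host, since no request URL is recorded; the function name is my own.)

```python
import json

# Sketch: tally Click-dataset entries whose source or target host is a
# Wikipedia domain. Only the host can be matched; no URL is recorded.
def count_wikipedia_entries(lines):
    total = 0
    for line in lines:
        entry = json.loads(line)
        hosts = (entry["from"], entry["to"])
        if any(h.endswith("wikipedia.org") for h in hosts):
            total += entry.get("count", 1)
    return total

sample = ['{"count": 1, "timestamp": 1257181201, '
          '"from": "en.wikipedia.org", "to": "ko.wikipedia.org"}']
print(count_wikipedia_entries(sample))  # → 1
```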
On Wed, Sep 17, 2014 at 4:24 PM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
Hello Giovanni, thanks for the pointer to the Click datasets. I'd have to take a look at the complete dataset to see how many of those requests touch Wikipedia.
Also, one of the requirements for accessing that dataset is: "The Click Dataset is large (~2.5 TB compressed), which requires that it be transferred on a physical hard drive. You will have to provide the drive as well as pre-paid return shipment."
I have to check whether this is possible and how long it might take to ship a hard drive from Switzerland and have it sent back. I'll let you know!
Best, Valerio
On Wed, Sep 17, 2014 at 4:09 PM, Giovanni Luca Ciampaglia <gciampag@indiana.edu> wrote:
Valerio,
I didn't know such data existed. As an alternative, perhaps you could have a look at our click datasets, which contain requests to the Web at large (i.e., not just Wikipedia) generated from within the campus of Indiana University over a period of several months. HTH
http://carl.cs.indiana.edu/data/#click
Cheers
G
Giovanni Luca Ciampaglia
✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciampag@indiana.edu
2014-09-17 9:53 GMT-04:00 Valerio Schiavoni <valerio.schiavoni@gmail.com>:
Hello, just bumping my email from last week, since I have not yet received any answer.
Should I consider that dataset to be somehow lost?
I've also contacted the researchers who partially released it, but making it publicly available is tricky for them due to its size (12 TB), a volume which might instead be routine for the daily operations of the Wikipedia servers.
Thanks again, Valerio
On Wed, Sep 10, 2014 at 4:15 AM, Valerio Schiavoni <valerio.schiavoni@gmail.com> wrote:
> Dear WikiMedia foundation,
> in the context of an EU research project [1], we are interested in accessing
> Wikipedia access traces.
> In the past, such traces were given for research purposes to other groups [2].
> Unfortunately, only a small percentage (10%) of that trace has been made
> available.
> We are interested in accessing the totality of that same trace (or even
> better, a more recent one, but the same one will do).
>
> If this is not the correct ML to use for such requests, could anyone please
> redirect me to the correct one?
>
> Thanks again for your attention,
>
> Valerio Schiavoni
> Post-Doc Researcher
> University of Neuchatel, Switzerland
>
> 1 - http://www.leads-project.eu
> 2 - http://www.wikibench.eu/?page_id=60
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi all,
I can't figure out the use case from the thread, but it's unlikely we would release unaggregated page views, as doing so would have privacy implications that we would need to consider very carefully. Hourly is likely the smallest granularity we will release.
Best,
-Toby
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics