I think it is extremely important to keep these files for later analysis by historians and others.
Mathias Schindler also keeps an archive, or at least did until April (the Berlin conference). He even bought a dedicated external drive for it.
I collect files daily and merge the 24 hourly files into one daily file. That saves a lot of disk space and makes processing faster. Titles with fewer than 10 requests per day are discarded, which also saves a lot.
For the remainder, instead of 24 comma-separated values I use a 'sparse array', as follows:
B2D15G2 means 2 views in the 2nd hour (0100-0200), 15 in the 4th, and 2 in the 7th. The string starts with the total for the whole day (redundant, but it eases processing for some purposes), so the full record is actually 19B2D15G2.
Example:
de Berlie_Doherty 9L2O1Q1R2T3
de Berliet 20E2F1K1M1N2O3P3Q4R2X1
de Berliet_GBC_8_KT 17B1E1J3M2N1O1P1Q1R2S1T1U1V1
de Berlin 8488A116B56C32D56E21F43G98H172I316J531K636L675M601N533O524P508Q510R576S426T492U530V508W328X200
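A minimal sketch of a decoder for this format (an illustration, not Erik's actual code; it assumes the letters A-X stand for hours 0-23, as the examples above suggest):

import re

def parse_day_record(record):
    """Decode a compacted day record like '19B2D15G2' into
    (daily_total, {hour: count}); letters A-X stand for hours 0-23."""
    m = re.match(r'(\d+)((?:[A-X]\d+)+)$', record)
    if not m:
        raise ValueError('malformed record: %r' % record)
    total = int(m.group(1))
    hours = {ord(letter) - ord('A'): int(count)
             for letter, count in re.findall(r'([A-X])(\d+)', m.group(2))}
    return total, hours

# parse_day_record('19B2D15G2') -> (19, {1: 2, 3: 15, 6: 2})
# The leading total is redundant by design: 2 + 15 + 2 == 19.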
I have files from August 2008 onward, roughly 3 GB per month now. And yes, a more permanent, fail-safe and more accessible storage location would be great.
Erik Zachte
-----Original Message-----
From: Frédéric Schütz [mailto:schutz@mathgen.ch]
Sent: Thursday, September 17, 2009 22:34
To: toolserver-l@lists.wikimedia.org
Cc: wikitech-l@lists.wikimedia.org; Erik Zachte
Subject: Re: [Toolserver-l] Archive of visitor stats
Lars Aronsson wrote:
Are visitor stats (as produced by Domas) safely archived somewhere, for example on the toolserver, where development projects can easily access them for analysis? I have made my own copies of the files (I guess my plan was to use them, but this hasn't started yet), but now I'm running out of disk space and I urgently need to clear some space on that server.
I just deleted September 2009 (last 2 weeks) and that freed 9 GB.
The oldest I have is pagecounts-20071209-180000.gz
As Platonides mentioned, they are in /mnt/user-store/stats on the toolserver; however, I would not call that "safely archived": one of my cron jobs just copies them from Domas' server, and that's it.
At the moment, there should be everything starting from 1 January 2009 (part of it disappeared at some point, but I managed to recover it).
However, this is definitely not a sustainable solution in the long run: the files currently take up 335 GB (out of 1.5 TB of total space).
Erik Zachte stores archives of visitor stats in a better format, aggregating some of the older data and storing several days of data in one file. I started looking into these files earlier this year, planning to spend some time playing with this data. One of my ideas was to replicate the statistical data that is on the WMF stats server somewhere on the toolserver -- and do it "officially" and not just by copying files using a personal cron job. Unfortunately, "real life" took over and I did not manage to continue this (and still can't). However, if there is any interest in improving the situation, I'd be glad to look into it as soon as I can.
I cc'ed Erik, who may have more to say.
Cheers,
Frédéric
2009/9/17 Erik Zachte erikzachte@infodisiac.com:
I think it is extremely important to keep these files for later analysis by historians and others.
Mathias Schindler also keeps an archive, or at least did until April (the Berlin conference). He even bought a dedicated external drive for it.
I collect files daily and merge the 24 hourly files into one daily file. That saves a lot of disk space and makes processing faster. Titles with fewer than 10 requests per day are discarded, which also saves a lot.
Careful, a recent analysis I did suggested that 15% of all page requests for articles on Wikipedia are for topics requested less than once per hour. There are a very large number of pages that rarely see hits, but collectively the traffic to such topics is important. You could end up biasing certain kinds of analysis if you always exclude the rarely visited pages.
-Robert Rohde
2009/9/18 Robert Rohde rarohde@gmail.com:
Careful, a recent analysis I did suggested that 15% of all page requests for articles on Wikipedia are for topics requested less than once per hour. There are a very large number of pages that rarely see hits, but collectively the traffic to such topics is important. You could end up biasing certain kinds of analysis if you always exclude the rarely visited pages.
Is there a link to that analysis? It would be interesting to see which are the least requested articles, for example.
Steve
On Thu, Sep 17, 2009 at 6:24 PM, Steve Bennett stevagewp@gmail.com wrote:
2009/9/18 Robert Rohde rarohde@gmail.com:
Careful, a recent analysis I did suggested that 15% of all page requests for articles on Wikipedia are for topics requested less than once per hour. There are a very large number of pages that rarely see hits, but collectively the traffic to such topics is important. You could end up biasing certain kinds of analysis if you always exclude the rarely visited pages.
Is there a link to that analysis? It would be interesting to see which are the least requested articles, for example.
That particular result is unpublished. I could make you a list of infrequently viewed articles, but it would be quite long.
-Robert Rohde
On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde rarohde@gmail.com wrote:
That particular result is unpublished. I could make you a list of infrequently viewed articles, but it would be quite long.
Could you make a list of the 100 least viewed? Or are there a large number which are essentially equal?
Steve
On Thu, Sep 17, 2009 at 9:25 PM, Steve Bennett stevagewp@gmail.com wrote:
On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde rarohde@gmail.com wrote:
That particular result is unpublished. I could make you a list of infrequently viewed articles, but it would be quite long.
Could you make a list of the 100 least viewed? Or are there a large number which are essentially equal?
Steve
There is a strong correlation between start/stub quality articles and the number of times they are viewed.
On Thu, Sep 17, 2009 at 9:28 PM, Brian Brian.Mingus@colorado.edu wrote:
There is a strong correlation between start/stub quality articles and the number of times they are viewed.
Further correlated with the number of edits.
On Fri, Sep 18, 2009 at 1:28 PM, Brian Brian.Mingus@colorado.edu wrote:
There is a strong correlation between start/stub quality articles and the number of times they are viewed.
Ah, ok. What about a list of exceptions to that: articles over 1000 characters that have been around for more than a year and still receive less than a hit a day, or something like that? I'm asking because perhaps something like this could help inform WP:NOT, WP:N, etc.
Steve
On Thu, Sep 17, 2009 at 8:25 PM, Steve Bennett stevagewp@gmail.com wrote:
On Fri, Sep 18, 2009 at 12:20 PM, Robert Rohde rarohde@gmail.com wrote:
That particular result is unpublished. I could make you a list of infrequently viewed articles, but it would be quite long.
Could you make a list of the 100 least viewed? Or are there a large number which are essentially equal?
My sample consisted of collating 30 non-consecutive hours of data on enwiki traffic, where each hour was randomly chosen from any point during the last 8 months. This was filtered to include only page titles that were valid mainspace pages.
In those 30 hours, there were 1.36 million valid article titles that were viewed exactly once [1].
Examples include:
129342_Ependes
1421_in_literature
Antiprotonic_helium
Antonella_Mularoni
Madhusoodhanan_Nair
Blue_Murder_(play)
Ozonotherapy
Veronika_Krausas
Verret,_New_Brunswick
Bare_Truth_(Nat_album)
As you can see, these are obscure topics, but they are not necessarily crazy topics. If I were to repeat it with a longer baseline (say 1000 hours rather than 30), I suspect you would get more interesting information on the tail, but right now probably the best I can say is that a cumulatively significant amount of traffic goes to relatively obscure pages.
-Robert Rohde
[1] Note: Because the traffic data is based on URL request strings, and some URL strings map to the same pages (e.g. Blue_Ocean and Blue%20Ocean), the number of valid article titles is not necessarily the same as the number of distinct pages. For practical reasons my analysis was based on the URL strings, and so probably overcounts the number of distinct articles involved, and to a degree overstates the fraction of traffic to obscure pages.
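A minimal sketch of such a tally (an illustration, not the unpublished analysis code; it assumes the hourly pagecounts files use Domas's usual 'project title count bytes' line format):

import gzip
from collections import Counter

counts = Counter()
for path in sample_paths:  # hypothetical list of 30 randomly chosen hourly files
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            fields = line.split(' ')
            # keep enwiki lines; checking that a title is a valid mainspace
            # page would need a further lookup against the list of articles
            if len(fields) == 4 and fields[0] == 'en':
                counts[fields[1]] += int(fields[2])

viewed_once = [title for title, n in counts.items() if n == 1]
# Note: URL-encoded variants (Blue_Ocean vs Blue%20Ocean) are counted
# separately here, matching the caveat in [1].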
On Thu, Sep 17, 2009 at 9:55 PM, Robert Rohde rarohde@gmail.com wrote:
How sure are you that they were viewed by a person and not a bot?
On Thu, Sep 17, 2009 at 8:58 PM, Brian Brian.Mingus@colorado.edu wrote:
How sure are you that they were viewed by a person and not a bot?
There is no differentiation between people and bots. (Some of these things are why it is an unpublished analysis. ;-) I was actually using traffic data for a totally different purpose, but decided to look at things like obscure pages while I was at it.)
-Robert Rohde
On Thu, Sep 17, 2009 at 10:18 PM, Robert Rohde rarohde@gmail.com wrote:
There is no differentiation between people and bots. (Some of these things are why it is an unpublished analysis. ;-) I was actually using traffic data for a totally different purpose, but decided to look at things like obscure pages while I was at it.)
-Robert Rohde
Oh, I see. It would be reassuring to know that there were a million or so articles not viewed at all?
Great, thank you. Even that is enough to begin to draw some conclusions.
129342_Ependes
Lol, the stereotypical asteroid article. Well, that's one more hit than I would expect it to get.
1421_in_literature
My eye was drawn to [[1421 in literature]], but that has always been a redirect, so perhaps the one hit was the person creating it. :)
Antiprotonic_helium
Looks like a decent article! But it was orphaned... so I linked to it from Antiproton.
Antonella_Mularoni
Excellent article - pity no traffic.
Madhusoodhanan_Nair
Redirect to borderline vanity
Blue_Murder_(play)
A redirect
Ozonotherapy
Redirect to fringe science
Veronika_Krausas
Ok article, but pretty obscure subject.
Verret,_New_Brunswick
Substub.
Bare_Truth_(Nat_album)
A crappy article about what sounds like an even crappier album. With offensive album art to boot.
Hmm, what conclusion to draw from all this? Most of those articles were redirects or crappy articles - Antonella_Mularoni was the only real exception.
Steve
* Steve Bennett stevagewp@gmail.com [Fri, 18 Sep 2009 14:20:53 +1000]:
1421_in_literature
My eye was drawn to [[1421 in literature]], but that has always been a redirect, so perhaps the one hit was the person creating it. :)
It redirects to the 15th-century literature article, which has no books written in 1421 mentioned. Lots of other years of the same century, but no 1421. And look at the infobox table in the top-right corner; there seem to be "every year in literature" links or redirects, probably created by some bot.

Dmitriy
On Friday 18 September 2009 06:20:53, Steve Bennett wrote:
Hmm, what conclusion to draw from all this? Most of those articles were redirects or crappy articles - Antonella_Mularoni was the only real exception.
I find Antiprotonic helium to be a very interesting and sufficiently informative article - certainly not crappy.
Steve Bennett wrote:
Is there a link to that analysis? It would be interesting to see which are the least requested articles, for example.
I don't have that, but you can visit http://stats.grok.se/en/200909/Mineral_County,_Montana to find out that this article was viewed 368 times during August 2009, whereas http://stats.grok.se/en/200908/Tabaning_Sita_Forest_Park was viewed only 29 times.
On sv.wikipedia there is a "gadget" for adding a "tab" to each article, a tab that links to this "stats" website.
In word frequency analysis, the expected case is that half of the distinct words in any text are used only once, a quarter are used only twice, an eighth are used 3 or 4 times, etc. There are different names for such models: Zipf's law, power-law distribution, the long tail, and so on. More often than not, such terms are used without fully understanding the math behind them.
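The frequency-of-frequencies pattern is easy to check on any text; a small sketch (assuming a plain-text file 'sample.txt' as a hypothetical input):

from collections import Counter

words = open('sample.txt', encoding='utf-8').read().split()
word_freq = Counter(words)                  # occurrences per distinct word
freq_of_freq = Counter(word_freq.values())  # how many words occur n times
distinct = len(word_freq)
for n in (1, 2, 3, 4):
    # under the rule of thumb above, n=1 should cover about half
    print(n, freq_of_freq[n], freq_of_freq[n] / distinct)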
This is a little different from the case of Wikipedia articles, where some articles are perhaps never viewed. But we should expect that a large number of articles are viewed very seldom.
So if you ask which articles are least requested, you should probably expect a list of 1.5 million articles (of the 3 million in the English Wikipedia). It's similar to asking which words are least frequently used. With time, we will add another 3 million articles about things that are even less interesting, and a few thousand articles on more interesting topics.
It's a different case if you ask the question for a limited set of articles, which you already know something about, for example those about the 56 counties in Montana, which should all be equally boring, or where interest should perhaps be proportional to the population. Which are more or less requested? Is something wrong with some of those articles?
Sure, info gets lost. And the Long Tail is meaningful for some research, no doubt. But my resources are finite.
Actually, I do store some all-inclusive counts in the compacted 24-hour file:
# Lines starting with an at sign (@) show totals per 'namespace' (including omitted counts for low-traffic articles)
# Since valid namespace strings are not known in the compression script, any string followed by a colon (:) counts as a possible namespace string
# Please reconcile with real namespace name strings later
# 'Namespaces' with count < 5 are combined in 'Other' (on larger wikis these are surely false positives)
@ aa.z Category 9
@ aa.z File 20
@ aa.z Image 9
@ aa.z MediaWiki 20
@ aa.z NamespaceArticles 163
@ aa.z Special 97
@ aa.z Talk 17
@ aa.z User 35
@ aa.z Wikipedia 16
@ aa.z -other- 11
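These total lines are straightforward to consume downstream; a minimal reader sketch (assuming the whitespace-separated '@ wiki namespace count' layout shown above, with namespace names containing no spaces):

def read_namespace_totals(lines):
    """Collect '@ wiki namespace count' totals into a nested dict."""
    totals = {}
    for line in lines:
        if line.startswith('@ '):
            _, wiki, namespace, count = line.split()
            totals.setdefault(wiki, {})[namespace] = int(count)
    return totals

# read_namespace_totals(['@ aa.z Category 9', '@ aa.z File 20'])
# -> {'aa.z': {'Category': 9, 'File': 20}}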
Erik Zachte
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Robert Rohde
Sent: Friday, September 18, 2009 02:33
To: Wikimedia developers
Cc: Mathias Schindler; Frédéric Schütz; toolserver-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] [Toolserver-l] Archive of visitor stats
Careful, a recent analysis I did suggested that 15% of all page requests for articles on Wikipedia are for topics requested less than once per hour. There are a very large number of pages that rarely see hits, but collectively the traffic to such topics is important. You could end up biasing certain kinds of analysis if you always exclude the rarely visited pages.
-Robert Rohde
Erik Zachte wrote:
Sure, info gets lost. And the Long Tail is meaningful for some research, no doubt. But my resources are finite.
Actually, I do store some all-inclusive counts in the compacted 24-hour file:
# Lines starting with an at sign (@) show totals per 'namespace' (including omitted counts for low-traffic articles)
# Since valid namespace strings are not known in the compression script, any string followed by a colon (:) counts as a possible namespace string
# Please reconcile with real namespace name strings later
# 'Namespaces' with count < 5 are combined in 'Other' (on larger wikis these are surely false positives)
Making the script aware of namespace names would be quite easy.
On Fri, Sep 18, 2009 at 10:02 AM, Platonides Platonides@gmail.com wrote:
Making the script aware of namespace names would be quite easy.
For English this is obviously true, but Erik writes scripts intended to be language-agnostic and work with all WMF projects. While it is certainly possible to teach it about namespaces in the general sense, it would take a fair bit of effort to call up the local namespace names and all legitimate variants for every different project/language in turn.
-Robert Rohde
Making the script aware of namespace names would be quite easy.
Yes, it is more a matter of priority than feasibility.
I already use localized namespace names in wikistats, obviously; without those the dumps can't be interpreted. Each full XML archive dump starts with a list of localized namespace names.
I also parse PHP files for the localization of reserved words like #REDIRECT, parse other PHP files for language name translations, and extract many more language name translations from wp:en interwiki links via the API.
But every such action takes time, needs safeguards (files can be moved, can be temporarily inaccessible, and formats change; maybe not in XML, but in PHP for sure) and requires occasional attention for maintenance.
So for a housekeeping job that almost no-one seemed to care about at the time, I just chose to keep it simple (this particular optimization can always be retrofitted).
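If that retrofit were ever done, one possible route (a sketch assuming the standard MediaWiki siteinfo API; this is not what wikistats actually does) would be to pull the localized namespace names per wiki:

import json
from urllib.request import urlopen

def fetch_namespace_names(host):
    """Fetch localized namespace names and aliases for one wiki,
    e.g. host = 'de.wikipedia.org' (illustrative only)."""
    url = ('https://%s/w/api.php?action=query&meta=siteinfo'
           '&siprop=namespaces%%7Cnamespacealiases&format=json' % host)
    data = json.load(urlopen(url))
    names = {ns['*'] for ns in data['query']['namespaces'].values()}
    names.update(a['*'] for a in data['query'].get('namespacealiases', []))
    names.discard('')  # the main namespace has an empty name
    return names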
If we find a better place to store them than the wikistats server, we might be able to store them unfiltered, but still condensed into one daily file, as this speeds up processing greatly, or maybe repackaged into a monthly file per wiki.
Erik Zachte
2009/9/18 Erik Zachte erikzachte@infodisiac.com:
I think it is extremely important to keep these files for later analysis by historians and others.
Mathias Schindler also keeps an archive, or at least did until April (the Berlin conference). He even bought a dedicated external drive for it.
Right now, I have a single copy of all the files from December 2007 to April 2009 on a single hard drive. I haven't done any integrity checks beyond some initial tests. The dataset has some missing spots from when the service that produces the files was not working. In some cases it is just an empty .gz file; in some cases no file was produced at all.
In my spare time, I will try to load the files from May to now onto this hard drive until it is full.
The situation is rather uncomfortable for me, since I am in no way able to guarantee the integrity and safety of these files over a longer time frame. While I might continue downloading and "storing" the files, I would be extremely happy to hear that the full and unabridged set of files is available a) to anyone, b) for an indefinite time span, c) free of charge, and d) with some backup and data integrity checks in place.
Speaking of wish lists, a web-accessible service to work with the data would be nice. We know for sure that journalists, and hopefully other groups as well, like the data, the numbers, and the resulting shiny graphs.
Mathias