Why hasn't Wikistats been updated since May?
Because Erik Zachte, the third party who makes them, hasn't updated his
scripts since May?
Excuse me?
In June I asked twice for new dumps, so that I could run the stats once more before the conversion to MediaWiki 1.5, knowing that I would not have time for yet another upgrade of the scripts soon.
As soon as the dumps were available I tried to run the stats. It turned out that the database had already been changed even before MediaWiki 1.5: lots of archive data had been moved to another database for which no dump was available at all!! And priorities were such that one was not to be expected soon. If anyone announced that publicly, e.g. on wikitech, I missed it, but it would not have changed much, except saving me the time of finding this out myself.
Then the MediaWiki 1.5 upgrade occurred, which brought a completely new database schema. After a while Brion announced he would produce new dumps in XML format, which would simplify things a lot. (It does.)
So I waited for that. The new dumps came about mid July when I had started to prepare for two Wikimania presentations. A week after Wikimania I started to work on updating the scripts.
The new XML dump format makes things much simpler (and hopefully more consistent), yet because I want to keep the scripts backwards compatible, at least for a while, with other MediaWiki installations that use them, adding code for the new dumps was not so simple after all.
Let me tell you that the scripts are not exactly trivial. They contain many thousands of lines of Perl code. Partly because the old SQL dumps became nearly impossible to parse offline, though I got it working within space constraints. Partly because they do much more than simple counting, e.g. timeline overviews, category trees, Wikibooks indexes; they have to process data in a certain sequence for that, and they also take care to present data somewhat differently for each project.
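For illustration only (the real scripts are Perl, and this is just a sketch of the idea, not Erik's code): a Python fragment that sniffs whether a dump is the new XML format or an old-style SQL dump, so one script can serve both:

    def dump_format(path):
        """Guess whether a dump file is a MediaWiki 1.5 XML export or an
        old-style SQL dump, so one script can keep serving both."""
        with open(path, 'rb') as f:
            head = f.read(64).lstrip()
        # XML dumps begin with '<?xml' or '<mediawiki'; SQL dumps begin
        # with mysqldump comments ('-- ...') or DROP/INSERT statements.
        return 'xml' if head.startswith(b'<') else 'sql'

    # Usage (the file name is made up):
    # print(dump_format('pages_full.xml'))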
So, considering I could only start on this >> two weeks ago <<, I'm making fair progress.
By the way, many people would like to see more frequent dumps. I also still do not know why dumps are not produced more often, but I trust there is a good reason, probably performance. There is no fixed schedule at all: one waits, people start asking for a new dump, often get no answer, and finally it arrives, more or less once a month these days.
Erik Zachte
Erik Zachte wrote:
So I waited for that. The new dumps came about mid July when I had started to prepare for two Wikimania presentations. A week after Wikimania I started to work on updating the scripts.
Great! Looking forward to them.
By the way, many people would like to see more frequent dumps. I also still do not know why dumps are not produced more often, but I trust there is a good reason, probably performance. There is no fixed schedule at all: one waits, people start asking for a new dump, often get no answer, and finally it arrives, more or less once a month these days.
The dump script is partially broken right now and needs to be fixed up to work properly.
-- brion vibber (brion @ pobox.com)
Erik Zachte wrote:
So, considering I could only start on this >> two weeks ago <<, I'm making fair progress.
By the way, many people would like to see more frequent dumps.
This progress is excellent news! I'm looking forward to it. Now that Wikistats has a long history, it should be continued as soon as possible, so we can watch trends over time.
However, the MediaWiki software could also produce some statistics more directly. Statistics is essentially the reduction of data volumes into useful numbers (e.g. reducing the list of article names to the article count), and the closer to the source such reductions take place, the more efficient they are. The dilemma is that this reduction is irreversible: you cannot reconstruct the full information from the reduced data. This is where the board or its officers can provide insight into which statistics are really useful for managing the project.
Page visit counters were an example of such direct statistics; they were also the first thing to be (blindly, in my opinion) disabled during the performance problems in 2002-2003. Perhaps they were never so useful anyway. Right now (on wikipedia-l) people are browsing through the "what links here" list to find the number of links. This work could be saved by presenting a "select count(*)" at the top of the [[Special:Whatlinkshere]] page, at virtually no extra cost. Such counts could also be presented for lists of user contributions, pages belonging to a category, etc.
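A minimal sketch of the aggregate query such a page could run, assuming the MediaWiki 1.5 pagelinks table (pl_namespace, pl_title) and made-up connection details; MySQLdb is just one possible driver:

    import MySQLdb  # any Python DB-API driver for MySQL would do

    # Connection details are made up; the table and column names follow
    # the MediaWiki 1.5 pagelinks schema as I understand it.
    conn = MySQLdb.connect(host='localhost', user='wikiuser',
                           passwd='secret', db='wikidb')
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM pagelinks"
        " WHERE pl_namespace = %s AND pl_title = %s",
        (0, 'Main_Page'))
    (inlinks,) = cur.fetchone()
    print('inlinks:', inlinks)
    conn.close()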
Collecting statistics from full database dumps is a slow and heavy process. We could do better. But only if we know which stats to collect.
On 8/25/05, Lars Aronsson <lars@aronsson.se> wrote:
Right now (on wikipedia-l) people are browsing through the "what links here" list to find the number of links. This work could be saved by presenting a "select count(*)" at the top of the [[Special:Whatlinkshere]] page, at virtually no extra cost. Such counts could also be presented for lists of user contributions, pages belonging to a category, etc.
Collecting statistics from full database dumps is a slow and heavy process. We could do better. But only if we know which stats to collect.
Having these counts in MediaWiki would be great, and not just for researchers. Of general use, apart from "What links here", would be User contributions and Recent Changes - if recent changes can be selected for the last 1, 3, 7... days, couldn't they be counted automatically?
I'm not sure how generally useful (or possible) it would be to count other, more specific things like, say, the number of times a given user has edited a given article, but the above are my top three.
Cormac
Cormac Lawler wrote:
Collecting statistics from full database dumps is a slow and heavy process. We could do better. But only if we know which stats to collect.
Most statistics can be created from the database dumps, but first you have to know how to get the data, where to put it and how to process it. I have updated http://meta.wikimedia.org/wiki/Help:Export but you still need some hardware and programming skills.
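As one hedged example of the "how to process it" part: a short Python SAX pass over a full-history XML dump that tallies revisions per logged-in contributor. Element names follow the MediaWiki XML export format; anonymous edits appear as <ip> rather than <username> and are skipped here, and the dump file name is made up.

    import xml.sax
    from collections import Counter

    class EditCounter(xml.sax.ContentHandler):
        """Tally revisions per logged-in contributor in an XML dump."""
        def __init__(self):
            self.counts = Counter()
            self.in_username = False
            self.buf = []
        def startElement(self, name, attrs):
            if name == 'username':
                self.in_username = True
                self.buf = []
        def characters(self, content):
            if self.in_username:
                self.buf.append(content)
        def endElement(self, name):
            if name == 'username':
                self.counts[''.join(self.buf).strip()] += 1
                self.in_username = False

    handler = EditCounter()
    xml.sax.parse('pages_full.xml', handler)  # file name is made up
    for user, n in handler.counts.most_common(10):
        print(n, user)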
Having these counts in MediaWiki would be great, and not just for researchers. Of general use, apart from "What links here", would be User contributions and Recent Changes - if recent changes can be selected for the last 1, 3, 7... days, couldn't they be counted automatically?
I'm not sure how generally useful (or possible) it would be to count other, more specific things like, say, the number of times a given user has edited a given article, but the above are my top three.
You can already get a count of user contributions with Kate's tool. http://kohl.wikimedia.org/~kate/cgi-bin/count_edits.cgi
A simple counter for "What links here" (= the number of inlinks) would be nice and easy to implement.
A dump of the Recent Changes (RSS) for the last 1 and 7 days is badly needed for research.
Full user contributions can be exported with scripts that parse HTML (the Python Robot framework), but of course an XML export format would be nice too.
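The Recent Changes feed can already be fetched and counted along these lines; the feed/days/limit URL parameters are written from memory, so verify them against your MediaWiki version:

    import urllib.request
    import xml.etree.ElementTree as ET

    # URL parameters are from memory of Special:Recentchanges; verify
    # against your MediaWiki version before relying on them.
    URL = ('http://en.wikipedia.org/w/index.php'
           '?title=Special:Recentchanges&feed=rss&days=7&limit=500')

    with urllib.request.urlopen(URL) as f:
        tree = ET.parse(f)

    items = tree.findall('.//item')
    print('changes in feed:', len(items))
    for item in items[:5]:
        print(item.findtext('title'), item.findtext('pubDate'))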
Greetings, Jakob
Jakob Voss wrote:
Cormac Lawler wrote:
Collecting statistics from full database dumps is a slow and heavy process. We could do better. But only if we know which stats to collect.
Most statistics can be created from the database dumps, but first you have to know how to get the data, where to put it and how to process it. I have updated http://meta.wikimedia.org/wiki/Help:Export but you still need some hardware and programming skills.
There are many statistics, particularly user counts and attempts to determine the authorship of a particular article or the revision histories of articles, that simply can't be obtained from the Special:Export feature as it is currently implemented. And for other statistics that would be useful, I fail to see how Special:Export is any different from simply scraping the HTML pages. In short, you need a full db dump to do most statistical analysis. I wish you could get user (contributor) information through the Special:Export pages, but I haven't been able to: that is, who did what, and what has been added by a given contributor. What is there in Special:Export is terrific, but it is only a good start.
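To make the limitation concrete, here is a small sketch (the page title is arbitrary) that fetches a page through Special:Export and prints whatever contributor fields come back; without the history option you see only the latest revision's contributor, which is exactly the complaint:

    import urllib.request
    import xml.etree.ElementTree as ET

    # An arbitrary page; Special:Export returns the current revision as XML.
    URL = 'http://en.wikipedia.org/wiki/Special:Export/Perl'

    with urllib.request.urlopen(URL) as f:
        tree = ET.parse(f)

    def local(tag):
        # The export XML carries a version-specific namespace; match on
        # the local part of the tag so this is not tied to one version.
        return tag.rsplit('}', 1)[-1]

    for elem in tree.iter():
        if local(elem.tag) in ('username', 'ip', 'timestamp'):
            print(local(elem.tag), (elem.text or '').strip())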
Robert Scott Horning wrote:
Jakob Voss wrote:
Most statistics can be created from the database dumps, but first you have to know how to get the data, where to put it and how to process it. I have updated http://meta.wikimedia.org/wiki/Help:Export but you still need some hardware and programming skills.
There are many statistics, particularly user counts and attempts to determine the authorship of a particular article or the revision histories of articles, that simply can't be obtained from the Special:Export feature as it is currently implemented.
Unfortunately, Special:Export is crippled. In principle you can get all the version history, but this feature seems to be disabled, like the internal search. Let's hope that "making available the content in a transparent copy" will get more focus after the next round of server growth.
Greetings, Jakob