We had a very useful collective notetaking effort during Felipe's Wikimania session on
Mining Wikipedia public data. To have a second copy, I've dumped the contents into
the Talk page for that session:
There are several interesting parts -- including a summary of Felipe's
I'll paste below just one section -- about tools/best practices -- because I'd
really like to see a central place to look up documentation on best practices, tools, and
methodologies. It could transclude from or point to the existing documentation.
Would that be useful to anyone else? If so, this list might give a sense of the scope of
the technical aspects, as a starting place. If such a single point of entry already exists,
I'd be delighted to know about it instead!
Here's part of that sync.in sheet -- worth looking at the whole thing, at
What tools/best practices can we share/should we know about?
Tools for analyzing particular articles
- number of contributors
- number of people who are watching a page
- most viewed pages, largest # of editors in a month
- page view statistics
- visualization in place
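Several of the per-article statistics above can be pulled straight from the MediaWiki query API. A minimal sketch, assuming the standard `api.php` endpoint on English Wikipedia and the `prop=contributors` module; it only builds the query URL, and the caller still has to fetch it and parse the JSON response:

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia's standard API endpoint
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def contributors_query_url(title: str) -> str:
    """Build an API query URL asking which registered users edited a page.

    Uses the `prop=contributors` module of the MediaWiki query API.
    """
    params = {
        "action": "query",
        "prop": "contributors",
        "titles": title,
        "pclimit": "max",   # up to 500 contributors per request
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

# Example: the query for the article "Wikimania"
print(contributors_query_url("Wikimania"))
```

The same pattern works for other modules (e.g. watcher counts require the `prop=info` module with `inprop=watchers`); only the `params` dict changes.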
Bots and code
pywikipediabot - queries the Wikipedia API
Talk to Daniel about Toolserver accounts
Tools for dealing with particular dumps
- Information on downloading the
What are these good for? (classify me)
quantitative analysis tool (from Felipe Ortega et
search tool for database dumps
- preprocessor for XML dumps,
"eliminates some information and adds other useful information"
- List of parsers
- Static HTML dumps
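Most of the dump tools above boil down to streaming a parser over the export XML, since a full dump is far too large to load into memory. A minimal sketch using only the Python standard library; the export namespace version string varies between dump generations, so the one used here is an assumption:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Namespace used by MediaWiki export XML; the exact version
# string varies between dump generations (assumed here).
NS = "{http://www.mediawiki.org/xml/export-0.4/}"

def iter_page_titles(stream):
    """Stream page titles out of a MediaWiki XML dump without
    loading the whole file into memory (dumps can be many GB)."""
    for _event, elem in ET.iterparse(stream):
        if elem.tag == NS + "page":
            title_el = elem.find(NS + "title")
            if title_el is not None:
                yield title_el.text
            elem.clear()  # free memory as we go

# Tiny inline sample standing in for a real dump file
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>Wikimania</title></page>
  <page><title>Toolserver</title></page>
</mediawiki>"""

print(list(iter_page_titles(BytesIO(sample))))
```

The `elem.clear()` call after each `<page>` is what keeps memory flat; the same loop structure works for pulling revisions or any other element out of a dump.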