On 07/10/13 09:53, Lydia Pintscher wrote:
On Sun, Oct 6, 2013 at 10:35 PM, Daniel Koller daniel@dakoller.net wrote:
Hi Lydia,
I would be willing to support here... are there other people available for support and/or guidance on where to start?
Great! The best start is probably to download the database dumps. You can find them here: https://www.wikidata.org/wiki/Wikidata:Database_download If you run into any issues or have questions, just ask here. I'm sure there are people around to help.
First, if you want to download and analyse dumps automatically, then the wda script is your friend [1]. It knows where to get the dumps, it can fetch all relevant dump files (including dailies), and it can iterate through all dumps (or through the most recent revision of each page across all dumps) to do something with each revision.
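To give an idea of what such an iteration looks like under the hood, here is a minimal sketch of streaming revisions out of a MediaWiki XML dump. This is not wda's actual API (its interface differs); the namespace URI and element layout are assumptions based on the export format in use at the time.

```python
import bz2
import xml.etree.ElementTree as ET

# Assumed MediaWiki export namespace; adjust to the version your dump uses.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"

def iter_revisions(xml_stream):
    """Yield (page_title, rev_timestamp, rev_text) triples from a
    MediaWiki XML dump stream, clearing elements to keep memory bounded."""
    title = None
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "revision":
            ts = elem.findtext(NS + "timestamp")
            text = elem.findtext(NS + "text") or ""
            yield title, ts, text
            elem.clear()  # free the revision subtree (dumps are multi-GB)

def iter_dump(path):
    """Iterate over a bzip2-compressed dump file on disk."""
    with bz2.open(path) as f:
        yield from iter_revisions(f)
```

The key point is streaming with `iterparse` rather than loading the whole tree: full-history dumps do not fit in memory.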
Second, a while back I created some statistics on development over time, using 14-day scan intervals. This is also done by the wda script, but it uses a MySQL database to store partly aggregated information from the dumps. The problem is that you have no random access to the dumps: they mostly contain all revisions of one page in sequence, while analysing development over time requires looking at the revisions of all pages at one point in time, so one needs to reshuffle all the data first. The software also does some basic calendar calculations to find out which 14-day interval a revision belongs to.
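The calendar part is simple in principle. A sketch of mapping a MediaWiki timestamp to a 14-day interval index might look like this; the epoch date is a hypothetical choice (the actual wda code may number intervals from a different reference date):

```python
from datetime import datetime, timezone

# Hypothetical reference date for interval numbering (Wikidata launch day);
# the real code may use a different epoch.
EPOCH = datetime(2012, 10, 29, tzinfo=timezone.utc)

def interval_of(timestamp: str, days: int = 14) -> int:
    """Map a MediaWiki timestamp like '2013-07-15T12:00:00Z' to the index
    of the 14-day interval it falls into, counted from EPOCH."""
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return (ts - EPOCH).days // days
```

With the revisions reshuffled into such buckets, per-interval aggregates (edit counts, statement counts, etc.) reduce to a group-by over the interval index.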
I attach a figure produced from the resulting data. It is from mid-July, so no longer current, but the software still exists if anyone wants to redo it now (most of the code should be in wda, though I might have some local scripts for actually using it, since this is not the standard operation of wda, obviously ;-). The code does not currently count references other than "imported from", but it would not be hard to add this. We will update these stats at some point in the future, but maybe not this weekend.
Cheers,
Markus