Hi Lars,
On 19-5-2010 13:45, Lars Aronsson wrote:
Wikipedia was created in 2001 and the image bank Wikimedia Commons a few years later. It now contains 6 million files, mostly images. Most of them use the template:Information which has a Date= field to indicate when the content was created. The ideal format is the ISO date format YYYY-MM-DD, but this is not always followed. When I tried to parse the year, I was successful for 3.5 million files. (Maybe I didn't try very hard.)
I guess you used a regex. Which one exactly? Or did you publish your code somewhere?
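To make the question concrete, here is a minimal sketch of the kind of regex-based year extraction I have in mind. It is not Lars's actual code: the pattern, the 1000-2099 year range, and the helper names `parse_year` and `decade` are all assumptions for illustration.

```python
import re

# Assumed pattern: a four-digit year from 1000 to 2099 that is not part of
# a longer run of digits. This is a guess, not the regex Lars used.
YEAR_RE = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def parse_year(date_field):
    """Return the first plausible year found in a Date= value, or None."""
    match = YEAR_RE.search(date_field)
    return int(match.group(1)) if match else None


def decade(year):
    """Map a year to the first year of its decade, e.g. 1885 -> 1880."""
    return year - year % 10
```

A pattern this loose accepts values such as "circa 1885" or "19-5-2010" as well as ISO dates, but it gives up on anything without a standalone four-digit year, which may be part of why so many files were left unparsed.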
So, when were our files created? Of course, most were created after Wikipedia was founded, in the most recent decade. Even for old buildings, new photos were taken and uploaded.
Among the older decades, we should expect more material from the more recent ones, since more cameras were in use and more books were published with each new decade. Exactly how big has that growth rate been?
It turns out we have roughly 2% more files for each new year. A graph plotting each year is very bumpy, but if you sum up each decade, it becomes quite smooth. This does not mean that content production increased by 2% annually, only that the content that survived and was copied to Wikimedia Commons has grown this fast.
But this is only true for the years between 1750 and 1900.
For years before 1750, before the Enlightenment, the growth rate is only 0.5 percent annually. Also quite reasonable.
The real surprise is that after 1900, there is no growth. We have roughly 30,000 files from each decade in the 20th century. These are the numbers I found:
Decade  Files
1850s   8,652
1860s   12,144
1870s   16,561
1880s   19,382
1890s   25,985
1900s   37,936
1910s   34,882
1920s   23,715
1930s   24,507
1940s   30,720
1950s   29,364
1960s   24,164
1970s   23,991
1980s   31,185
1990s   45,423
2000s   2,951,138
And the graph is found on http://commons.wikimedia.org/wiki/File:Wikimedia_Commons_files_per_decade.pn...
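As a rough check on the growth rates, an average annual rate can be estimated from the per-decade counts quoted in this message. This is only a sketch under a simplifying assumption: it compounds between two endpoint decades and ignores everything in between, which is not necessarily how the 2% figure was derived. The function name `annual_growth` is made up for this example.

```python
# Files per decade, copied from the figures quoted in this message.
# Keys are the first year of each decade. The 2000s are left out because
# they are dominated by uploads made after Wikipedia was founded.
FILES_PER_DECADE = {
    1850: 8652, 1860: 12144, 1870: 16561, 1880: 19382, 1890: 25985,
    1900: 37936, 1910: 34882, 1920: 23715, 1930: 24507, 1940: 30720,
    1950: 29364, 1960: 24164, 1970: 23991, 1980: 31185, 1990: 45423,
}


def annual_growth(counts, first_decade, last_decade):
    """Average compound annual growth rate between two decades.

    Assumes growth is smooth between the endpoints, so a single bumpy
    decade at either end shifts the result noticeably.
    """
    years = last_decade - first_decade
    ratio = counts[last_decade] / counts[first_decade]
    return ratio ** (1 / years) - 1


if __name__ == "__main__":
    print(f"1850s-1890s: {annual_growth(FILES_PER_DECADE, 1850, 1890):.2%}")
    print(f"1900s-1990s: {annual_growth(FILES_PER_DECADE, 1900, 1990):.2%}")
```

With only the decades listed here, the 19th-century figure comes out somewhat above 2% per year and the 20th-century figure close to zero, which fits the overall picture of growth up to 1900 and a plateau afterwards.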
My guess is that this is an effect of copyright laws, which lock down the use of 20th century content.
Nice stats! I wonder how the years are distributed for the images from the batch uploads, and whether this influences the overall statistics. Do you have a list of the 2.5 million files you couldn't parse? We might be able to add dates to some of these images or convert the dates to the ISO format.
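For the conversion part, something along these lines might already cover a share of the unparsed dates. It is a sketch only: the list of input formats is my assumption, not a survey of what is actually used on Commons, and `to_iso` is a hypothetical helper name. Slash-separated dates are deliberately left out because day/month order is ambiguous there.

```python
from datetime import datetime

# Assumed input formats, tried in order. Real Date= fields will contain
# many more variants than this.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d.%m.%Y")


def to_iso(date_field):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    text = date_field.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # Not this format; try the next one.
    return None
```

Anything that returns None here would still need a human look, or a bot with a longer list of formats.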
Maarten