On 19-5-2010 13:45, Lars Aronsson wrote:
Wikipedia was created in 2001 and the image bank
a few years later. It now contains 6 million files, mostly images.
Most of them use the template:Information which has a Date= field
to indicate when the content was created. The ideal format is the
ISO date format YYYY-MM-DD, but this is not always followed. When
I tried to parse the year, I was successful for 3.5 million files.
(Maybe I didn't try very hard.)
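For illustration only, here is a minimal sketch of how a year might be pulled out of a free-form Date= value. It is not the method actually used for the numbers above; the pattern `YEAR_RE`, the function name `parse_year`, and the accepted range (1000 to 2099) are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical pattern: a standalone four-digit year from 1000 to 2099.
# The lookarounds reject digits on either side, so a run such as
# "20100519" is not mistaken for the year 2010.
YEAR_RE = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def parse_year(date_field: str) -> Optional[int]:
    """Return the first plausible year in a Date= value, or None."""
    match = YEAR_RE.search(date_field)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for value in ["2010-05-19", "19 May 2010", "circa 1850", "unknown"]:
        print(repr(value), "->", parse_year(value))
```

A pattern this loose accepts any four-digit number in range, so a value like "negative no. 1234" would be read as the year 1234. A stricter parser would try the ISO form YYYY-MM-DD first and only then fall back to a bare year.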
I guess you used a regex. Which one exactly? Or did you publish your
So, when were our files created? Of course, most were
after Wikipedia was founded, in the most recent decade.
Even for old buildings, new photos were taken and uploaded.
For older decades, we should expect more information for the more
recent ones, since more cameras were in use and more books
published with each new decade. Exactly how big has that
growth rate been?
It turns out, we have roughly 2% more files for each new year.
A graph plotting each year is very bumpy, but if we sum up each
decade, it becomes quite smooth. This does not mean that content
production increased by 2% annually, but the content that
survived and was copied to Wikimedia Commons has grown this fast.
But this is only true for the years between 1750 and 1900.
For years before 1750, before the Enlightenment, the growth rate
is only 0.5 percent annually. Also quite reasonable.
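As a rough sketch of the arithmetic behind such figures, the snippet below groups years into decades and converts the ratio between two decade totals into an annual growth rate. The function names and the sample inputs are hypothetical; none of the numbers here come from the real Commons data. For orientation, 2% annual growth compounds to about 22% per decade (1.02^10 ≈ 1.219).

```python
from collections import Counter
from typing import Iterable


def count_by_decade(years: Iterable[int]) -> Counter:
    """Count files per decade, keyed by the decade's first year (1850, 1860, ...)."""
    return Counter((year // 10) * 10 for year in years)


def annual_growth_rate(earlier_count: int, later_count: int, years_apart: int) -> float:
    """Compound annual growth rate implied by two totals years_apart years apart."""
    return (later_count / earlier_count) ** (1.0 / years_apart) - 1.0


if __name__ == "__main__":
    # Made-up sample years, for illustration only.
    sample_years = [1851, 1853, 1859, 1860, 1862, 1864, 1868]
    decades = count_by_decade(sample_years)
    print(sorted(decades.items()))

    # Made-up totals for two consecutive decades.
    rate = annual_growth_rate(1000, 1219, 10)
    print(f"{rate:.2%} per year")
```

Comparing only two decades is sensitive to noise; fitting a straight line to the logarithm of the decade totals over the whole 1750 to 1900 range would give a steadier estimate.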
The real surprise is that after 1900, there is no growth.
We have roughly 30,000 files from each decade in the
20th century. These are the numbers I found:
1850s 8,652 files
2000s 2,951,138 files
And the graph is found on
My guess is that this is an effect of copyright laws,
which lock down the use of 20th century content.
Nice stats! I wonder what the distribution of years looks like for the
images from the batch uploads, and whether it influences the overall statistics.
Do you have a list of the 2.5 M files you couldn't parse? We might be
able to add dates to some of these images or convert the dates to the