Wikipedia was created in 2001 and the image bank Wikimedia Commons a few years later. It now contains 6 million files, mostly images. Most of them use the template:Information which has a Date= field to indicate when the content was created. The ideal format is the ISO date format YYYY-MM-DD, but this is not always followed. When I tried to parse the year, I was successful for 3.5 million files. (Maybe I didn't try very hard.)
So, when were our files created? Of course, most were created after Wikipedia was founded, in the most recent decade. Even for old buildings, new photos were taken and uploaded.
Among the older decades, we should expect more material from the more recent ones, since more cameras were in use and more books were published with each new decade. Exactly how big has that growth rate been?
It turns out we have roughly 2% more files for each new year. A graph plotting each year is very bumpy, but if we sum up each decade, it becomes quite smooth. This does not mean that content production increased by 2% annually, only that the content that survived and was copied to Wikimedia Commons has grown this fast.
But this is only true for the years between 1750 and 1900.
For years before 1750, before the Enlightenment, the growth rate is only 0.5 percent annually. Also quite reasonable.
The real surprise is that after 1900, there is no growth. We have roughly 30,000 files from each decade in the 20th century. These are the numbers I found:
1850s 8652 files
1860s 12144
1870s 16561
1880s 19382
1890s 25985
1900s 37936
1910s 34882
1920s 23715
1930s 24507
1940s 30720
1950s 29364
1960s 24164
1970s 23991
1980s 31185
1990s 45423
2000s 2,951,138 files
And the graph is found on http://commons.wikimedia.org/wiki/File:Wikimedia_Commons_files_per_decade.pn...
My guess is that this is an effect of copyright laws, which lock down the use of 20th century content.
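As a rough check of the figures above, here is a minimal Python sketch that derives the annual growth rate implied by two decade totals and averages the 20th century decades. The function names and the choice of the 1850s and 1890s as endpoints are assumptions made for illustration. The 2% figure in the post covers 1750 to 1900, and only the decades from the 1850s onward are listed here, so the two numbers are not directly comparable.

```python
# Decade totals copied from the list in the post (files whose Date= year
# could be parsed). The 2000s are left out because they dwarf the rest.
FILES_PER_DECADE = {
    1850: 8652, 1860: 12144, 1870: 16561, 1880: 19382, 1890: 25985,
    1900: 37936, 1910: 34882, 1920: 23715, 1930: 24507, 1940: 30720,
    1950: 29364, 1960: 24164, 1970: 23991, 1980: 31185, 1990: 45423,
}


def annual_growth(counts, first_decade, last_decade):
    """Compound annual growth rate implied by two decade totals."""
    years = last_decade - first_decade
    ratio = counts[last_decade] / counts[first_decade]
    return ratio ** (1 / years) - 1


def mean_per_decade(counts, first_decade, last_decade):
    """Average number of files per decade over an inclusive range."""
    decades = range(first_decade, last_decade + 10, 10)
    return sum(counts[d] for d in decades) / len(decades)


if __name__ == "__main__":
    print(f"1850s-1890s: {annual_growth(FILES_PER_DECADE, 1850, 1890):.1%} per year")
    print(f"1900s-1990s: {mean_per_decade(FILES_PER_DECADE, 1900, 1990):.0f} files per decade")
```

On these numbers the 1850s to 1890s rate comes out a little under 3% per year, and the 20th century decades average about 30,600 files each, in line with the "roughly 30,000" quoted above.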
Hi Lars,
On 19-5-2010 13:45, Lars Aronsson wrote:
Wikipedia was created in 2001 and the image bank Wikimedia Commons a few years later. It now contains 6 million files, mostly images. Most of them use the template:Information which has a Date= field to indicate when the content was created. The ideal format is the ISO date format YYYY-MM-DD, but this is not always followed. When I tried to parse the year, I was successful for 3.5 million files. (Maybe I didn't try very hard.)
I guess you used a regex. Which one exactly? Or did you publish your code somewhere?
Nice stats! I wonder what the distribution of years looks like for the images from the batch uploads, and whether it influences the overall statistics. Do you have a list of the 2.5 M files you couldn't parse? We might be able to add dates to some of these images or convert the dates to the ISO format.
Maarten
Maarten Dammers wrote:
On 19-5-2010 13:45, Lars Aronsson wrote:
I tried to parse the year, I was successful for 3.5 million files. (Maybe I didn't try very hard.)
I guess you used a regex. Which one exactly? Or did you publish your code somewhere?
No, I did not publish my code or regex, and I don't intend to. This was a quick hack, and I know I might have missed lots of files. For example, just one random image from the huge Bundesarchiv image donation has a "Date=0-00-00", http://commons.wikimedia.org/wiki/File:Bundesarchiv_Bild_147-0435,_Wolfgang_...
(It's a mystery to me, why this is displayed as "november 1999".)
Then again, another random Bundesarchiv image has "Date=1950-07-05", which should be covered by my hack, http://commons.wikimedia.org/wiki/File:Bodo_Uhse.jpg
We would have far fewer images from the 1950s if it weren't for this donation.
I want to encourage others to invent their own regex and see if they can find other results than mine. My numbers are posted on the talk page of the graph.
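For anyone taking up that invitation, here is one possible starting point in Python. It is a minimal sketch and not the regex used for the numbers above, which was never published. The names `DATE_FIELD`, `YEAR`, `parse_year` and `decade_label` are made up for illustration, and the pattern assumes the date sits in a `|Date=` parameter and contains a four-digit year between 1000 and 2099.

```python
import re

# Hypothetical patterns, not the original hack.
# Grab whatever follows "|Date=" up to the next pipe or end of line.
DATE_FIELD = re.compile(r"\|\s*[Dd]ate\s*=\s*([^|\n]*)")
# Accept a standalone four-digit year from 1000 to 2099 (an assumption;
# earlier years and two-digit years are deliberately ignored).
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def parse_year(wikitext):
    """Return the first plausible year in the Date= field, or None."""
    field = DATE_FIELD.search(wikitext)
    if field is None:
        return None
    year = YEAR.search(field.group(1))
    return int(year.group(1)) if year else None


def decade_label(year):
    """Map 1957 to '1950s', matching the labels used in the list."""
    return f"{year // 10 * 10}s"


if __name__ == "__main__":
    for sample in ("|Date=1950-07-05", "|Date=0-00-00", "|Date=19 May 2010"):
        print(sample, "->", parse_year(sample))
```

With this sketch "Date=1950-07-05" yields 1950 and "Date=0-00-00" yields nothing, so files like the first Bundesarchiv example would land among the unparsed ones. Dates wrapped in templates or given as ranges would need more work.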
On Thu, May 20, 2010 at 10:07 AM, Lars Aronsson lars@aronsson.se wrote:
No, I did not publish my code or regex, and I don't intend to. This was a quick hack, and I know I might have missed lots of files. For example, just one random image from the huge Bundesarchiv image donation has a "Date=0-00-00", http://commons.wikimedia.org/wiki/File:Bundesarchiv_Bild_147-0435,_Wolfgang_...
(It's a mystery to me, why this is displayed as "november 1999".)
Probably "year 00" interpreted as 2000; January would be "month 01", so "month 00" subtracts one, making it December 1999. The first day would be "01", so subtracting one lands you on an undefined November day...
Magnus
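The guess above can be played through in code. This is a minimal sketch assuming exactly the rollover rules described there: year "00" is read as 2000, month "00" as the month before January, and day "00" as the day before the 1st. The function `lenient_date` is hypothetical and is not taken from MediaWiki or any Commons template.

```python
from datetime import date, timedelta


def lenient_date(year, month, day):
    """Normalize out-of-range parts by rolling backwards (assumed rules)."""
    if year < 100:
        year += 2000  # assumption: a two-digit year is read as 20xx
    # Count months from year 0 so that month 0 becomes December of the
    # previous year.
    months = year * 12 + (month - 1)
    y, m = divmod(months, 12)
    first_of_month = date(y, m + 1, 1)
    # Day 1 is the first of the month, so day 0 is the day before it.
    return first_of_month + timedelta(days=day - 1)


if __name__ == "__main__":
    print(lenient_date(0, 0, 0))
    print(lenient_date(1950, 7, 5))
```

Under these rules "0-00-00" rolls back to 30 November 1999, which would match the "november 1999" display, while a well-formed date such as 1950-07-05 passes through unchanged.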
On Wed, May 19, 2010 at 12:45 PM, Lars Aronsson lars@aronsson.se wrote:
The ideal format is the ISO date format YYYY-MM-DD, but this is not always followed.
....
And the graph is found on http://commons.wikimedia.org/wiki/File:Wikimedia_Commons_files_per_decade.pn...
...using date format "19 May 2010" :-)
Magnus Manske wrote:
And the graph is found on http://commons.wikimedia.org/wiki/File:Wikimedia_Commons_files_per_decade.pn...
...using date format "19 May 2010" :-)
Fail. See the wikitext. ISO dates are autotranslated. Use http://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Commons_files_... and you will get 2010年5月19日 instead.
On Thu, May 20, 2010 at 12:48 AM, Platonides Platonides@gmail.com wrote:
Fail. See the wikitext. ISO dates are autotranslated. Use http://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Commons_files_... and you will get 2010年5月19日 instead.
Gah! Hidden magic! Away, dark spirits!