[Commons-l] Files per decade

Maarten Dammers maarten at mdammers.nl
Wed May 19 21:18:58 UTC 2010


Hi Lars,

On 19-5-2010 13:45, Lars Aronsson wrote:
> Wikipedia was created in 2001 and the image bank Wikimedia Commons
> a few years later. It now contains 6 million files, mostly images.
> Most of them use the template:Information which has a Date= field
> to indicate when the content was created. The ideal format is the
> ISO date format YYYY-MM-DD, but this is not always followed. When
> I tried to parse the year, I was successful for 3.5 million files.
> (Maybe I didn't try very hard.)
>    
I guess you used a regex. Which one exactly? Or did you publish your 
code somewhere?
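
For reference, here is a minimal sketch of what such a regex-based year parser might look like. This is not Lars's code, which has not been posted; the pattern and the names parse_year, decade_of and count_per_decade are assumptions for illustration only. It takes the first standalone four-digit year between 1000 and 2099 from a Date= value and counts files per decade, keeping the values it could not parse.

```python
import re
from collections import Counter

# Assumed pattern: a four-digit year (1000-2099) that is not part of a
# longer run of digits, so "20100519" is rejected but "19-5-2010" works.
YEAR_RE = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")


def parse_year(date_field):
    """Return the first plausible year in a Date= value, or None."""
    match = YEAR_RE.search(date_field)
    return int(match.group(1)) if match else None


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1895 -> 1890."""
    return year // 10 * 10


def count_per_decade(date_fields):
    """Count parsed years per decade and collect the unparsed values."""
    counts = Counter()
    unparsed = []
    for value in date_fields:
        year = parse_year(value)
        if year is None:
            unparsed.append(value)
        else:
            counts[decade_of(year)] += 1
    return counts, unparsed
```

Taking the first match is a simplification: a value such as "1895 (scanned 2008)" gives 1895, but "photo 2008 of an 1895 painting" gives 2008, so the results depend on how the field is written.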
> So, when were our files created? Of course, most were created
> after Wikipedia was founded, in the most recent decade.
> Even for old buildings, new photos were taken and uploaded.
>
> For older decades, we should expect more information for more
> recent ones, since more cameras were in use and more books
> published with each new decade. Exactly how big has that
> growth rate been?
>
> It turns out, we have roughly 2% more files for each new year.
> A graph plotting each year is very bumpy, but if you sum up each
> decade, it becomes quite smooth. This does not mean that content
> production increased by 2% annually, but the content that
> survived and was copied to Wikimedia Commons has grown this fast.
>
> But this is only true for the years between 1750 and 1900.
>
> For years before 1750, before the Enlightenment, the growth rate
> is only 0.5 percent annually. Also quite reasonable.
>
> The real surprise is that after 1900, there is no growth.
> We have roughly 30,000 files from each decade in the
> 20th century. These are the numbers I found:
>
> 1850s  8652 files
> 1860s 12144
> 1870s 16561
> 1880s 19382
> 1890s 25985
> 1900s 37936
> 1910s 34882
> 1920s 23715
> 1930s 24507
> 1940s 30720
> 1950s 29364
> 1960s 24164
> 1970s 23991
> 1980s 31185
> 1990s 45423
> 2000s 2,951,138 files
>
> The graph can be found at
> http://commons.wikimedia.org/wiki/File:Wikimedia_Commons_files_per_decade.png
>
> My guess is that this is an effect of copyright laws,
> which lock down the use of 20th century content.
>    
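
To make the growth-rate arithmetic concrete, here is a small sketch that derives an annual rate from two decade counts in the table above and converts an annual rate into a per-decade factor. The function names are made up for this example, and the two-point estimate is an assumption about method: how the 2% figure was actually obtained is not stated, and a fit over all decades from 1750 to 1900 would be less sensitive to bumps in single decades.

```python
# Decade counts copied from the table above (1850s to 1890s only).
counts = {1850: 8652, 1860: 12144, 1870: 16561, 1880: 19382, 1890: 25985}


def annual_growth(first_count, last_count, years):
    """Compound annual growth rate between two counts `years` apart."""
    return (last_count / first_count) ** (1 / years) - 1


def decade_factor(annual_rate):
    """Factor by which a decade total grows at a given annual rate."""
    return (1 + annual_rate) ** 10


# Two-point estimate between the 1850s and the 1890s (40 years apart).
rate = annual_growth(counts[1850], counts[1890], 40)
print(f"annual growth 1850s-1890s: {rate:.1%}")

# What a steady 2% per year would mean from one decade to the next.
print(f"decade factor at 2% per year: {decade_factor(0.02):.3f}")
```

The same annual_growth call on two 20th century decades is a quick way to check the "no growth after 1900" observation.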

Nice stats! I wonder what the distribution of years looks like for the 
batch-uploaded images, and whether it influences the overall statistics.
Do you have a list of the 2.5 M files you couldn't parse? We might be 
able to add dates to some of these images or convert the dates to the 
ISO format.

Maarten



