On Tue, Jan 22, 2013 at 6:32 PM, Ward Cunningham <ward@c2.com> wrote:
Laura -- If there are extra fields I should capture from the article's markup, I'm happy to add them to my results.


My understanding is that the only video file format that really works on Commons is Ogg (.ogg and .ogv), so you run into the issue of picking up sound files as well.  http://en.wikipedia.org/wiki/Wikipedia:Spoken_articles is the biggest example of where I think you'd get false positives from sound files if you use the extension as a search phrase. Music articles are probably another, because I'm pretty sure audio files are still limited to .ogg and .oga, given the licensing limitations of the non-CC-compliant codecs used by formats like .mp3.
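
To make the false-positive concern concrete, here is a minimal sketch of how the split could be handled while parsing. It assumes media is embedded with [[File:...]] or [[Image:...]] links; the function name classify_media and the three buckets are hypothetical, not anything from the existing parse. The idea is simply that .ogv can be counted as video, .oga as audio, and .ogg set aside for a second look, since the extension alone does not say which it is.

```python
import re

# Assumption: media is embedded as [[File:Name.ext|...]] or [[Image:Name.ext|...]].
# Other embedding styles (templates, galleries) are not handled by this sketch.
FILE_LINK = re.compile(r"\[\[\s*(?:File|Image)\s*:\s*([^|\]]+)", re.IGNORECASE)

VIDEO_EXTS = {".ogv"}      # video only
AUDIO_EXTS = {".oga"}      # audio only
AMBIGUOUS_EXTS = {".ogg"}  # may be either, so it needs a second look


def classify_media(wikitext):
    """Group Ogg-family file names found in wikitext by likely media type."""
    results = {"video": [], "audio": [], "ambiguous": []}
    for match in FILE_LINK.finditer(wikitext):
        name = match.group(1).strip()
        dot = name.rfind(".")
        ext = name[dot:].lower() if dot != -1 else ""
        if ext in VIDEO_EXTS:
            results["video"].append(name)
        elif ext in AUDIO_EXTS:
            results["audio"].append(name)
        elif ext in AMBIGUOUS_EXTS:
            results["ambiguous"].append(name)
        # Any other extension (images, PDFs, ...) is ignored.
    return results
```

Anything that lands in the ambiguous bucket would still need checking against the file's own description page before being counted as a video.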
 
I'm 2.6 million articles through the parse of the Jan 2 dump. I've found close to 3000 videos so far.


I tend to create project-by-project lists so I can do comparisons, as I'm less interested in the actual total volume and more interested in seeing how things differ from one group to another.  There might be a good reason why you don't have much data, or why numbers have gone up, if you look at things based on, say, WikiProject or category inclusion.  (Maybe videos from older movies suddenly came into the public domain and, with the support of a GLAM donation, a major effort was made to include them in articles.  Maybe the Wiki Loves Monuments people opened up to video and people made a big push to include their monument videos in articles.)  Approaching the research question from a "what is going on, and how can this be explained?" angle seems a better way to contextualize the data and understand why things are moving.
 
My parse is running at 6 MB per second. I have to run to the train station to pick up a colleague. I'm going to see if I can finish on battery. (I should have done this in the cloud but thought it would be handy to have a current copy of Wikipedia on my laptop.)


Not a worry.  Just a side point.


--
mobile: 0412183663
twitter: purplepopple
blog: ozziesport.com