Re: [Wikimedia-l] [Wikitech-l] Request for comments: How to deal with open datasets?

16 May 2014

David,

I would strongly prefer a system that keeps the parts together, while
at the same time, keeping all the parts separate and interchangeable.
I hate that the .djvu files are blobs now, because if I find a better
scan of an engraving from a book, I would like to replace the crappy
scan that is in the .djvu file. I suppose you need to keep the version
you uploaded, but you always want to present the best you have to the
reader.

I have looked at problems with datasets for a small GLAM, and have
seen just how bad the data can be. I am mostly a web-surfer of
poorly-designed GLAM datasets, which is why I have spent many hours
thinking about these things. I have since given up trying to preach
the evangelism of open data to GLAMs and started thinking more about
what Wikipedia can do to curate the world's art. Many GLAMs are
willing to share their data, but believe me when I say we may not want
it. The backlog in batch uploads to Commons is not the technical
upload queue; it is all the data massaging by hand that Wikipedians
need to do beforehand. That work, which is done by Commons wizards,
goes largely unrecognized today.

Theoretically, a specific artwork is both a data item and a dataset.
If you look at our artwork template on Commons you may have noticed
how it has grown in the past 4 years and is fast becoming a fairly
comprehensive standard dataset for certain items. The next step is to
create a way to index these per object (yes we have categories - is
that really the best we can do?).

For popular artworks that are architectural features, Wiki Loves
Monuments has harvested so many images of these from all different
angles that you could probably make the case that Wikimedia Commons
has more images than any other publication about that specific item.
If you browse the various language versions and their representation
of the object, you will notice that individual Wikipedians have
selected different images, but these are rarely linked to each other
and the casual Wikipedia reader has no idea that they can probably
view the object in 3-D if they want to, or see a short movie about how
it was made. Indeed, let's face it, most casual readers have only
heard of Wikipedia and are completely unaware of Wikimedia Commons and
have never heard of Wikimedia Commons categories.

Take the case of the Sagrada Família:
https://commons.wikimedia.org/wiki/Category:Sagrada_Fam%C3%ADlia

This category is augmented by a gallery page, with the helpful text
"The Sagrada Família is an unfinished church in the Catalan city of
Barcelona, considered to be architect Antoni Gaudí's masterpiece. For
images of the Holy Family (Jesus, Mary, and Joseph), see Category:Holy
Family." :
https://commons.wikimedia.org/wiki/Sagrada_Fam%C3%ADlia

Is this really the best we can do? Has anyone ever stopped and counted
the rate at which we accumulate photos of the Sagrada Familia each
year? We don't want to deter people from uploading, because we are
probably still missing important photos of various internal features.
But how do we show the gaps in our coverage of this object, while
presenting an encyclopedic view? The English Wikipedia page includes
about 40 images with a link to the category, but no other hints for
media navigation.

This is just one example; there are many more. I would like to see a
system by which the normal Wiki-collaboration process can be used to
slowly integrate all of the Commons files into datasets per item, and
then include these into datasets per city or artist or GLAM or
whatever. I suppose it should be lists of categories, gallery pages,
and templates, most of them blank (like the artwork template - you can
use the fields or not, as long as you include the minimum for the
upload wizard). Wikidata can help with the template fields as
properties.
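
A minimal sketch of the per-item dataset idea described above, assuming
hypothetical names throughout (ItemDataset, Collection, and the field
names are illustrations only, not an existing Commons or Wikidata
structure, and the file name is a placeholder). It shows one item
holding files, categories, and optional template fields, and a larger
dataset per city that reports which fields are still blank:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


# Hypothetical structures for illustration; not a Commons or Wikidata API.
@dataclass
class ItemDataset:
    """All media and optional template fields for one object."""
    title: str
    files: List[str] = field(default_factory=list)       # Commons file names
    categories: List[str] = field(default_factory=list)  # Commons categories
    # Artwork-template style fields; a value of None means "left blank".
    fields: Dict[str, Optional[str]] = field(default_factory=dict)

    def blank_fields(self) -> List[str]:
        """Return the names of template fields nobody has filled in yet."""
        return sorted(name for name, value in self.fields.items() if not value)


@dataclass
class Collection:
    """A dataset of item datasets, e.g. per city, artist, or GLAM."""
    name: str
    items: List[ItemDataset] = field(default_factory=list)

    def gaps(self) -> Dict[str, List[str]]:
        """Map each item title to its blank fields, skipping complete items."""
        return {
            item.title: item.blank_fields()
            for item in self.items
            if item.blank_fields()
        }


sagrada = ItemDataset(
    title="Sagrada Família",
    files=["Example exterior.jpg"],  # placeholder, not a real file name
    categories=["Sagrada Família"],
    fields={"artist": "Antoni Gaudí", "date": None, "dimensions": None},
)
barcelona = Collection(name="Barcelona", items=[sagrada])
print(barcelona.gaps())
```

Because blank fields are allowed, an item can start out nearly empty
and be filled in gradually through ordinary wiki editing, while the
gaps stay visible at the collection level.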

Jane

2014-05-15 18:14 GMT+02:00, David Cuenca <dacuetu(a)gmail.com>:
...
  Jane,

 Thanks for your input! I never thought of datasets as incorporating images,
 but just as a table (whose elements might point to images, but not contain
 them). Are people in the GLAM scene expecting other files embedded when
 talking about datasets?

 Well, if it is a standard format (csv or json), then it is easy to keep the
 whole dataset together, you just need to consider it a text file, and then
 you upload a new one, like any other file in Commons :)

 Micru

 On Thu, May 15, 2014 at 5:18 PM, Jane Darnell <jane023(a)gmail.com> wrote:

  David,
 This is an interesting question. I think that a dataset is just like
 any other table such as the ones included in Wikipedia, with lots more
 entries and maybe even pieces attached that can't go on Wikipedia such
 as pictures, audio, short films, pieces of software code, or other
 media.

 So I guess this page should be merged with the DataNamespace page. The
 problem is how to reference a dataset or table. Images on Commons are
 timestamped with a source link that is often {{self}}, but more often
 a weblink somewhere that may or may not die within a year or two.
 Since the image is something that you can't really change easily, this
 is generally not an issue, but how do you see this with data that can
 be manipulated? I don't really see how you can upload datasets as
 whole "blobs" that will keep all the pieces together the way a .djvu
 file keeps the text with the images.

 Jane

 2014-05-15 16:46 GMT+02:00, David Cuenca <dacuetu(a)gmail.com>:
  On Thu, May 15, 2014 at 1:42 PM, Cristian Consonni
  <kikkocristian(a)gmail.com> wrote:

  Thanks for the pointer. "How can I put this open data on Wikidata?" is a
  question that I have been asked many times; this page was needed.

 Thanks for your comment!

 On Thu, May 15, 2014 at 3:59 PM, Samuel Klein <meta.sj(a)gmail.com> wrote:

  Thanks Micru! I think we should start by including datasets on
  wikisource, with descriptions about them (storing the files on commons
  where possible). And adding more data formats to the formats
  accepted on commons.

 I don't follow you... why would you put datasets on Wikisource when they
 are only used in Wikipedia and have to be stored somewhere else? As it is
 now, it doesn't seem a good dataset management solution.
 Besides that, it would conflict with its identity as a repository for
 textual sources.
 About Commons, I don't know if it is relevant to their mission as a media
 sharing platform either... I hope someone from their community can share
 their views.

 Thanks for the input,
 Micru
 _______________________________________________
 Wikimedia-l mailing list, guidelines at:
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
 Wikimedia-l(a)lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
 <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe> 

 --
 Etiamsi omnes, ego non