Re: [Wikimedia-l] [Wikitech-l] Request for comments: How to deal with open datasets?


David Cuenca

15 May 2014

10:46 p.m.

On Thu, May 15, 2014 at 1:42 PM, Cristian Consonni kikkocristian@gmail.com wrote:

Thanks for the pointer. "How can I put this open data on Wikidata?" is a question that I have been asked many times; this page was needed.

Thanks for your comment!

On Thu, May 15, 2014 at 3:59 PM, Samuel Klein meta.sj@gmail.com wrote:

Thanks Micru! I think we should start by including datasets on Wikisource, with descriptions of them (storing the files on Commons where possible), and by adding more data formats to those accepted on Commons.

I don't follow you... why would you put datasets on Wikisource when they are only used in Wikipedia and have to be stored somewhere else? As it stands, that doesn't seem like a good dataset management solution. Besides, it would conflict with Wikisource's identity as a repository for textual sources. As for Commons, I don't know whether datasets are relevant to its mission as a media-sharing platform either... I hope someone from that community can share their views.

Thanks for the input, Micru


Jane Darnell

15 May

11:18 p.m.

New subject: [Wikitech-l] Request for comments: How to deal with open datasets?

David, This is an interesting question. I think that a dataset is just like any other table such as the ones included in Wikipedia, with lots more entries and maybe even pieces attached that can't go on Wikipedia such as pictures, audio, short films, pieces of software code, or other media.

So I guess this page should be merged with the DataNamespace page. The problem is how to reference a dataset or table. Images on Commons are timestamped with a source link that is often {{self}}, but more often a weblink somewhere that may or may not die within a year or two. Since the image is something that you can't really change easily, this is generally not an issue, but how do you see this with data that can be manipulated? I don't really see how you can upload datasets as whole "blobs" that will keep all the pieces together the way a .djvu file keeps the text with the images.

Jane

_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe

David Cuenca

16 May

12:14 a.m.

New subject: [Wikitech-l] Request for comments: How to deal with open datasets?

Jane,

Thanks for your input! I never thought of datasets as incorporating images, but just as a table (whose elements might point to images, but not contain them). Are people in the GLAM scene expecting other files to be embedded when they talk about datasets?

Well, if it is in a standard format (CSV or JSON), then it is easy to keep the whole dataset together: you just need to treat it as a text file, and then you upload a new version, like any other file on Commons :)
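
To make the "dataset as a text file" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two revisions are made-up data, and the helper names (`parse_table`, `fingerprint`, `changed_ids`) are hypothetical, not part of any Commons or MediaWiki API. It only shows that a CSV revision can be identified as one unit (by hashing the whole file) while still being compared row by row against the previous upload.

```python
import csv
import hashlib
import io

# Hypothetical example data: two revisions of the same small table.
REVISION_1 = "id,label,value\n1,alpha,10\n2,beta,20\n"
REVISION_2 = "id,label,value\n1,alpha,10\n2,beta,25\n3,gamma,30\n"


def parse_table(text):
    """Parse CSV text into a dict of rows keyed by the 'id' column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: row for row in reader}


def fingerprint(text):
    """Hash the whole file, so each revision is identified as one unit."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_ids(old_text, new_text):
    """Return the ids of rows that were added, removed, or modified."""
    old, new = parse_table(old_text), parse_table(new_text)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


if __name__ == "__main__":
    print(fingerprint(REVISION_1) != fingerprint(REVISION_2))
    print(changed_ids(REVISION_1, REVISION_2))
```

The point of the sketch is the design choice, not the code: the file is stored and versioned whole, and any row-level view is derived from it afterwards.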

Micru


-- Etiamsi omnes, ego non

Jane Darnell

5:35 p.m.

New subject: [Wikitech-l] Request for comments: How to deal with open datasets?

David,

I would strongly prefer a system that keeps the parts together, while at the same time, keeping all the parts separate and interchangeable. I hate that the .djvu files are blobs now, because if I find a better scan of an engraving from a book, I would like to replace the crappy scan that is in the .djvu file. I suppose you need to keep the version you uploaded, but you always want to present the best you have to the reader.

I have looked at problems with datasets for a small GLAM, and have seen just how bad the data can be. I am mostly a web-surfer of poorly-designed GLAM datasets, which is why I have spent many hours thinking about these things. I have since given up preaching open data to GLAMs and started thinking more about what Wikipedia can do to curate the world's art. Many GLAMs are willing to share their data, but believe me when I say we may not want it. The backlog in batch uploads to Commons is not the technical upload queue; it is all the data massaging by hand that Wikipedians need to do beforehand. That work, which is done by Commons wizards, goes largely unrecognized today.

Theoretically, a specific artwork is both a data item and a dataset. If you look at our artwork template on Commons you may have noticed how it has grown in the past 4 years and is fast becoming a fairly comprehensive standard dataset for certain items. The next step is to create a way to index these per object (yes we have categories - is that really the best we can do?).

For popular artworks that are architectural features, Wiki Loves Monuments has harvested so many images of these from all different angles that you could probably make the case that Wikimedia Commons has more images than any other publication about that specific item. If you browse the various language versions and their representation of the object, you will notice that individual Wikipedians have selected different images, but these are rarely linked to each other and the casual Wikipedia reader has no idea that they can probably view the object in 3-D if they want to, or see a short movie about how it was made. Indeed, let's face it, most casual readers have only heard of Wikipedia and are completely unaware of Wikimedia Commons and have never heard of Wikimedia Commons categories.

Take the case of the Sagrada Família: https://commons.wikimedia.org/wiki/Category:Sagrada_Fam%C3%ADlia

This category is augmented by a gallery page, with the helpful text "The Sagrada Família is an unfinished church in the Catalan city of Barcelona, considered to be architect Antoni Gaudí's masterpiece. For images of the Holy Family (Jesus, Mary, and Joseph), see Category:Holy Family." : https://commons.wikimedia.org/wiki/Sagrada_Fam%C3%ADlia

Is this really the best we can do? Has anyone ever stopped and counted the rate at which we accumulate photos of the Sagrada Familia each year? We don't want to deter people from uploading, because we are probably still missing important photos of various internal features. But how do we show the gaps in our coverage of this object, while presenting an encyclopedic view? The English Wikipedia page includes about 40 images with a link to the category, but no other hints for media navigation.

This is just one example; there are many more. I would like to see a system by which the normal wiki collaboration process can be used to slowly integrate all of the Commons files into datasets per item, and then include these in datasets per city or artist or GLAM or whatever. I suppose it should be lists of categories, gallery pages, and templates, most of them blank (like the artwork template: you can use the fields or not, as long as you include the minimum for the upload wizard). Wikidata can help by supplying the template fields as properties.

Jane


David Cuenca

8:15 p.m.

New subject: [Wikitech-l] Request for comments: How to deal with open datasets?

Thanks for sharing your frustrations. I also notice that the word "dataset" has so many interpretations that we should first delimit its meaning, or at least add some clarification. For instance, "dataset" can be interpreted as:
- a static data table: just a table containing some information in each field, which gets updated/versioned as a whole
- a collection of files and their associated metadata: what the GLAMs release
- an object and its representations

My intention was that the RFC refer only to the first interpretation of the word "dataset", that is, "static data tables". Sorry if my use of the word caused some misunderstanding; I will try to make it clearer in the RFC.
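
As a rough illustration of how these three readings of "dataset" differ, here is a small Python sketch. The class and field names are hypothetical, chosen only for this example; they do not correspond to any existing Wikimedia data model. Only the first class, the static table that is replaced as a whole, matches the scope intended for the RFC.

```python
from dataclasses import dataclass, field


@dataclass
class StaticTable:
    """Interpretation 1: a table that is updated/versioned as a whole."""
    columns: list
    rows: list
    revision: int = 1

    def replace_with(self, rows):
        # A new upload yields a new revision; the old one is left untouched.
        return StaticTable(self.columns, rows, self.revision + 1)


@dataclass
class FileCollection:
    """Interpretation 2: a set of files plus their associated metadata."""
    files: dict = field(default_factory=dict)  # file name -> metadata dict


@dataclass
class ObjectWithRepresentations:
    """Interpretation 3: one real-world object and the files depicting it."""
    name: str
    representations: list = field(default_factory=list)  # file names


if __name__ == "__main__":
    v1 = StaticTable(["id", "label"], [["1", "alpha"]])
    v2 = v1.replace_with([["1", "alpha"], ["2", "beta"]])
    print(v1.revision, v2.revision, len(v2.rows))
```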

The issues that you mention in your email are not within the scope of the RFC; however, they deserve attention. I have identified the following subtopics:
1. Changing a page of a djvu
2. Dirty metadata from GLAMs
3. Not recognizing the good job done by Commons
4. Files associated with a concept
5. Users not classifying their data in proper subcategories
6. Showing gaps in our coverage
7. Files as parts of a real object item

To keep this thread on-topic, I will address them in a separate email that I will call "Dealing with GLAM data". I hope you don't mind.

Micru


-- Etiamsi omnes, ego non
