Hi all,
Looking at more "orphaned items", I found several pairs of items that look like these two:
https://www.wikidata.org/wiki/Q17574663 https://www.wikidata.org/wiki/Q17569687
Same label and description, same coordinates, no Wikipedia articles, "identified" by different Historic Scotland IDs. If you follow the ID links, however, you can see that the first item has data that does not match its ID, while the second is correct.
The direct question is: How to fix these errors? There are other cases, such as Q17572335 and Q17570206. I did not do a systematic study, but something seems to have gone wrong here in more than one case. I cannot fix mass edits one by one without having a clue what has happened and why.
The indirect question is: How can I find out who did this and maybe ask the person to fix it? The history is of no help (Reinheitsgebot/Widar). Posting every error in Wikidata to this list to ask also seems like a bad idea.
Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key? In any case, a data integration helper application that looks at equal labels+descriptions would probably make sense, especially for orphaned items. (As I know Wikidata, someone might well reply to this email with a link to where this is already found ;-).
Regards
Markus
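The helper suggested in the message above can be sketched in a few lines of Python. This is only an illustration, not an existing tool: the `Item` record, its fields, and the sample values are made up for the example, and real input would have to come from a Wikidata dump or the API.

```python
from collections import defaultdict
from typing import NamedTuple


class Item(NamedTuple):
    """Hypothetical, simplified view of a Wikidata item in one language."""
    qid: str
    label: str
    description: str  # empty string if no description was entered


def duplicate_candidates(items):
    """Group items by (label, description) and keep groups with 2+ members."""
    groups = defaultdict(list)
    for item in items:
        # Normalise case and surrounding whitespace before comparing.
        key = (item.label.strip().lower(), item.description.strip().lower())
        groups[key].append(item.qid)
    return {key: sorted(qids) for key, qids in groups.items() if len(qids) > 1}


# Made-up sample data; the identifiers and labels are placeholders.
sample = [
    Item("Qa", "Example House", ""),
    Item("Qb", "Example House", ""),
    Item("Qc", "Example Bridge", "bridge in Scotland"),
]
candidates = duplicate_candidates(sample)
print(candidates)
```

Items that share a key are only candidates for merging; as the rest of the thread shows, two items with the same label can still be distinct entities, so a manual check is still needed.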
Markus Krötzsch, 01/06/2015 22:26:
How can I find out who did this
The tool (https://meta.wikimedia.org/wiki/Mix%27n%27match ) does have a log, though it's not so easy to search it IIRC.
Nemo
On 01.06.2015 22:37, Federico Leva (Nemo) wrote:
Markus Krötzsch, 01/06/2015 22:26:
How can I find out who did this
The tool (https://meta.wikimedia.org/wiki/Mix%27n%27match ) does have a log, though it's not so easy to search it IIRC.
How do you know that the data comes from this tool? The history does not mention it:
https://www.wikidata.org/w/index.php?title=Q17574663&action=history
Markus
On 1 June 2015 at 21:26, Markus Krötzsch markus@semantic-mediawiki.org wrote:
The indirect question is: How can I find out who did this and maybe ask the person to fix it? The history is of no help (Reinheitsgebot/Widar).
Did you look at:
https://www.wikidata.org/wiki/User:Reinheitsgebot ?
On 01.06.2015 23:45, Andy Mabbett wrote:
On 1 June 2015 at 21:26, Markus Krötzsch markus@semantic-mediawiki.org wrote:
The indirect question is: How can I find out who did this and maybe ask the person to fix it? The history is of no help (Reinheitsgebot/Widar).
Did you look at:
https://www.wikidata.org/wiki/User:Reinheitsgebot ?
Yes, that's what I did first, but the page just says that the bot makes mass edits on behalf of other (unknown) users. But you are right that one should probably still ask the bot author first:
Magnus, do you know on which basis these edits were made and how the errors could have sneaked in? Do you have any idea of the scale of the problem? (So far I have no idea: maybe I was just very (un)lucky to find several such cases in a row, or maybe the problem affects a relevant portion of the >80,000 orphaned items in the UK ...).
Regards,
Markus
On 01.06.2015 at 22:26, Markus Krötzsch wrote:
Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key?
The uniqueness constraint does indeed not apply if there is no description.
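To make the rule concrete, here is a small sketch of a label+description check that is skipped when the description is empty. It only illustrates the behaviour described in this reply; it is not Wikibase code, and the function name and the set-of-pairs layout are assumptions made for the example.

```python
def violates_uniqueness(existing, label, description):
    """Return True if (label, description) is already taken in this language.

    `existing` is a hypothetical set of (label, description) pairs.
    An empty description means the constraint is not applied at all.
    """
    if not description:
        return False  # no description: nothing to check against
    return (label, description) in existing


# Placeholder data, not real items.
existing = {("Example House", ""), ("Example House", "building in Scotland")}

# A second item with the same label and no description is accepted.
no_description = violates_uniqueness(existing, "Example House", "")
# The same label with an already-used description is rejected.
same_description = violates_uniqueness(existing, "Example House", "building in Scotland")
print(no_description, same_description)
```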
I created (some/most of) these items as part of the Wiki Loves Monuments UK 2014 drive, to run the campaign from Wikidata rather than from a bespoke database. This allows the community (TM) to maintain the data, rather than one poor sod (e.g., myself) having to frantically update all of it every year ;-)
"Consumer" tool is here: https://tools.wmflabs.org/wlmuk/index_wd.html
These are based on "official" data from National Heritage, provided to me via Wikimedia UK. Grade A (or Grade I/II* in England) structures should be noteworthy by default.
It appears (as per your examples) that some of these were created as duplicates/with wrong IDs. As I said, this is based on "official" data, so it's the best I could do at the time. With mass creation, there are bound to be a few strays. If you can find some large-scale, systemic issue I'll try to fix it, but the one-offs will always fall back to manual fixing. At least, with Wikidata, we can fix them together.
On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 01.06.2015 at 22:26, Markus Krötzsch wrote:
Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key?
The uniqueness constraint does indeed not apply if there is no description.
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Update: There appear to be quite a few items with duplicate Scotland IDs (not all of them may be erroneous!): http://wdq.wmflabs.org/stats?action=doublestring&prop=709
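For readers without access to the WDQ report, the same kind of check can be sketched offline. The snippet assumes you already have (item, ID) pairs for the Historic Scotland ID property (P709, as in the URL above) from some export; the pairs shown are invented placeholders, and the function name is made up for the example.

```python
from collections import defaultdict


def ids_used_more_than_once(pairs):
    """Map each ID value to the items using it, keeping only shared IDs."""
    items_by_id = defaultdict(set)
    for qid, external_id in pairs:
        items_by_id[external_id].add(qid)
    return {eid: sorted(qids) for eid, qids in items_by_id.items() if len(qids) > 1}


# Placeholder data, not real items or IDs.
pairs = [
    ("Qa", "ID-100"),
    ("Qb", "ID-100"),  # same ID on a second item: worth a manual look
    ("Qc", "ID-200"),
]
shared = ids_used_more_than_once(pairs)
print(shared)
```

A shared ID is a prompt for review rather than proof of an error, since not all of the duplicates may be erroneous.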
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.
On 02.06.2015 11:30, Magnus Manske wrote:
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.
Yes, I noticed such cases too. From the information in Wikidata, it is not clear to me why this is sometimes done and sometimes not done.
For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):
https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185
In many other cases, adjacent houses with the same ID are combined into one item:
https://www.wikidata.org/wiki/Q17806587
(note, however, that the house addresses given in the ID and in the item label do not match, though they overlap on most of the houses.)
Finally, there are also cases where there are different IDs and we have several items, but they have the same labels that merge the contents of the two IDs:
https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137
It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites. However, the coordinates give Historic Scotland as their reference -- I wonder if Historic Scotland might be changing frequently or exist in several versions.
Regards,
Markus
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as
https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)
Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL... http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
Overall, this seems to be an example of an ID that really should not be considered "identity providing", since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional IDs from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
Regards,
Markus
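The many-to-many observation can be checked mechanically. The sketch below, again on invented (item, ID) pairs, reports whether some ID is shared by several items and whether some item carries several IDs; if both happen, the ID cannot serve as a key in either direction. The function name and the returned strings are assumptions made for the example.

```python
from collections import defaultdict


def relation_shape(pairs):
    """Classify an item-to-ID relation from (item, ID) pairs."""
    ids_per_item = defaultdict(set)
    items_per_id = defaultdict(set)
    for qid, external_id in pairs:
        ids_per_item[qid].add(external_id)
        items_per_id[external_id].add(qid)
    item_has_many_ids = any(len(ids) > 1 for ids in ids_per_item.values())
    id_has_many_items = any(len(qids) > 1 for qids in items_per_id.values())
    if item_has_many_ids and id_has_many_items:
        return "many-to-many"
    if id_has_many_items:
        return "one ID, many items"
    if item_has_many_ids:
        return "one item, many IDs"
    return "one-to-one"


# Placeholder data: one ID shared by two items, one item with two IDs.
pairs = [("Qa", "ID-1"), ("Qb", "ID-1"), ("Qc", "ID-2"), ("Qc", "ID-3")]
shape = relation_shape(pairs)
print(shape)
```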
Hi Markus, Magnus and Wikidatans,
I can't yet add data to, for example, https://www.wikidata.org/wiki/Q933000 (real item) by clicking "save," since the "save" button isn't an active link, but the "cancel" button is. I tried to add this URL - http://www.forthroadbridge.org/home (which I'm not actually able to see in my browser at present; all I see is a blank white page, unusually) - and to add the word "Fife" to various fields of this "Forth Road Bridge" Q item. Will this be possible in the near future?
Scott
I have added the (un-broken) URL as "official website".
Not sure which property to use for "Fife", though.
On Tue, Jun 2, 2015 at 11:35 PM Scott MacLeod < worlduniversityandschool@gmail.com> wrote:
Hi Markus, Magnus and Wikidatans,
I can't yet add data to, for example, https://www.wikidata.org/wiki/Q933000 (real item) by clicking "save," since the "save" button isn't an active link, but the "cancel" button is. I tried to add this URL - http://www.forthroadbridge.org/home (which I'm not actually able to see in my browser at present; all I see is a blank white page, unusually) - and to add the word "Fife" to various fields of this "Forth Road Bridge" Q item. Will this be possible in the near future?
Scott
--
- Scott MacLeod - Founder & President
- http://worlduniversityandschool.org
- 415 480 4577
- PO Box 442, (86 Ridgecrest Road), Canyon, CA 94516
- World University and School - like Wikipedia with best STEM-centric OpenCourseWare - incorporated as a nonprofit university and school in California, and is a U.S. 501 (c) (3) tax-exempt educational organization, both effective April 2010.
Thanks, Magnus,
What I was hoping to be able to do is wiki-add resources myself to this and other Q items, but I still can't do this even after the changes you made (thank you!) ... I'd add Fife, for example, to - located in the administrative territorial entity https://www.wikidata.org/wiki/Property:P131 - but would also like to add further resources to Wikidata and Q items too ... which is what I think makes Wikidata so potentially great ... and which is what led to the growth of Wikipedia too, I think. Thank you again, M, M & Wikidatans!
Best, Scott
On Tue, Jun 2, 2015 at 3:59 PM, Magnus Manske magnusmanske@googlemail.com wrote:
I have added the (un-broken) URL as "official website".
Not sure which property to use for "Fife", though.
On Tue, Jun 2, 2015 at 11:35 PM Scott MacLeod < worlduniversityandschool@gmail.com> wrote:
Hi Markus, Magnus and Wikidatans,
I can't yet add data to, for example, this - https://www.wikidata.org/wiki/Q933000 (real item) - by clicking "save," since the "save" button isn't an active link, but the "cancel" button is. I tried to add this URL - http://www.forthroadbridge.org/home (which I"m not actually able to see in my browser presently - all I see is a blank white page, unusually) - as well as to add the word "Fife" to various fields to this "Forth Road Bridge" Q item. Will this be possible in the near future?
Scott
On Tue, Jun 2, 2015 at 4:12 AM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as
https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)
Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional IDs from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
Regards,
Markus
On 02.06.2015 13:01, Markus Krötzsch wrote:
On 02.06.2015 11:30, Magnus Manske wrote:
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.
Yes, I noticed such cases too. From the information in Wikidata, it is not clear to me why this is sometimes done and sometimes not done.
For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):
https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185
In many other cases, adjacent houses with the same ID are combined into one item:
https://www.wikidata.org/wiki/Q17806587
(note, however, that the house addresses given in the ID and in the item label do not match, though they overlap on most of the houses.)
Finally, there are also cases where there are different IDs and we have several items, but they have the same labels that merge the contents of the two IDs:
https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137
It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites. However, the coordinates give Historic Scotland as their reference -- I wonder if Historic Scotland might be changing frequently or exist in several versions.
Regards,
Markus
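The three patterns described above (one ID shared by several items, several IDs on one item, and plain one-to-one) can be told apart mechanically once the item-to-ID pairs are available. The following is only a minimal sketch in Python: the item names and the "HS-n" ID strings are hypothetical placeholders, not real Wikidata items or Historic Scotland IDs, and the input dictionary stands in for whatever export or query result is actually used.

```python
from collections import defaultdict

def items_per_id(items):
    """Map each external ID to the sorted list of items that carry it."""
    groups = defaultdict(list)
    for qid, ext_ids in items.items():
        for ext_id in ext_ids:
            groups[ext_id].append(qid)
    return {ext_id: sorted(qids) for ext_id, qids in groups.items()}

def shared_ids(items):
    """IDs used on more than one item (legitimate sharing or true duplicates)."""
    return {e: q for e, q in items_per_id(items).items() if len(q) > 1}

def items_with_several_ids(items):
    """Items that carry more than one external ID."""
    return {qid: sorted(ids) for qid, ids in items.items() if len(ids) > 1}

# Hypothetical placeholder data, for illustration only.
sample = {
    "Q_church": ["HS-1"],
    "Q_churchyard": ["HS-1"],
    "Q_bridge": ["HS-2", "HS-3"],
    "Q_house": ["HS-4"],
}

print(shared_ids(sample))
print(items_with_several_ids(sample))
```

A report like this cannot decide which shared IDs are errors; it only narrows the list that a person has to look at.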
On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske <magnusmanske@googlemail.com> wrote:
Update: There appear to be quite a few items with duplicate Scotland IDs (not all of them may be erroneous!): http://wdq.wmflabs.org/stats?action=doublestring&prop=709
On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske <magnusmanske@googlemail.com> wrote:
I created (some/most of) these items as part of the Wiki Loves Monuments UK 2014 drive, to run the campaign from Wikidata rather than from a bespoke database. This allows the community (TM) to maintain the data, rather than one poor sod (e.g., myself) having to frantically update all of it every year ;-)
"Consumer" tool is here: https://tools.wmflabs.org/wlmuk/index_wd.html
These are based on "official" data from National Heritage, provided to me via Wikimedia UK. Grade A (or Grade I/II* in England) structures should be noteworthy by default.
It appears (as per your examples) that some of these were created as duplicates/with wrong IDs. As I said, this is based on "official" data, so it's the best I could do at the time. With mass creation, there are bound to be a few strays. If you can find some large-scale, systemic issue I'll try to fix it, but the one-offs will always fall back to manual fixing. At least, with Wikidata, we can fix them together.
On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
On 01.06.2015 at 22:26, Markus Krötzsch wrote:
> Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key?
The uniqueness constraint does indeed not apply if there is no description.
-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
On Tue, Jun 2, 2015 at 12:12 PM Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as
https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)
Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional IDs from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
They are not unique on the Historic Scotland site, but the items can still have the correct IDs on Wikidata, even if those IDs are non-unique. What will be required for this (and other external IDs) in the long run is an automated or semi-automated check against the foreign data corpus, with heuristics highlighting potential issues. This includes new items in the external source (or ones we missed during the initial import).
Given that I received the original data as a CSV from WMUK, who got it from Historic Scotland under Freedom of Information (IIRC), this might prove tricky.
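A check of this kind could look roughly like the sketch below. It is an illustration only, written in Python under several assumptions: the CSV column names ("id", "name"), the word-overlap heuristic, and all sample rows are hypothetical and do not reflect the actual layout of the Historic Scotland data.

```python
import csv
import io
import re

def load_external(csv_text, id_field="id", name_field="name"):
    """Read the external list into {external ID: name}. Column names are assumed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_field].strip(): row[name_field].strip() for row in reader}

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def compare(wikidata, external):
    """wikidata: list of (item, external ID, label). Returns a report for manual review."""
    report = {"unknown_id": [], "label_mismatch": [], "not_imported": []}
    seen = set()
    for qid, ext_id, label in wikidata:
        if ext_id not in external:
            report["unknown_id"].append(qid)      # ID not found in the external source
            continue
        seen.add(ext_id)
        if not words(label) & words(external[ext_id]):
            report["label_mismatch"].append(qid)  # no word in common: suspicious
    report["not_imported"] = sorted(set(external) - seen)  # new or missed entries
    return report

# Hypothetical sample data, for illustration only.
csv_text = "id,name\n100,Old Mill\n200,Parish Church\n300,Toll House\n"
wikidata = [
    ("Q_a", "100", "Old Mill, High Street"),
    ("Q_b", "200", "Harbour Wall"),
    ("Q_c", "999", "Stone Bridge"),
]
report = compare(wikidata, load_external(csv_text))
print(report)
```

The heuristic is deliberately crude: it only highlights candidates and leaves the decision to a human.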
This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.
Bridges often span a river which is the border between two jurisdictions (in this case, council areas). Each local area counts the bridge as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifier).
The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?
I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.
Andrew.
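The "allow two IDs if it is some kind of a bridge" rule could be expressed roughly as follows. This is a sketch in Python and not how Wikidata constraints are actually configured; the class names and the parent_of table are hypothetical stand-ins for the instance-of/subclass-of data, while the two Menai identifiers are the ones quoted above.

```python
def is_kind_of(cls, target, parent_of):
    """Walk up a (hypothetical) subclass table to see whether cls is a kind of target."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == target:
            return True
        seen.add(cls)
        cls = parent_of.get(cls)
    return False

def allowed_id_count(item_class, parent_of):
    """Bridges may carry one ID per bank; everything else gets one."""
    return 2 if is_kind_of(item_class, "bridge", parent_of) else 1

def violates_constraint(item, parent_of):
    return len(item["ids"]) > allowed_id_count(item["class"], parent_of)

# Hypothetical class table; real data would come from Wikidata's class hierarchy.
parent_of = {"suspension bridge": "bridge", "bridge": "structure", "house": "building"}
menai = {"class": "suspension bridge", "ids": ["4049", "18572"]}

print(violates_constraint(menai, parent_of))
```

With a rule like this, the Menai bridge with both register numbers passes, while a house with two IDs would still be reported.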
Maybe there is a case to separate import and verification here?
There are many statements in Wikidata nowadays, but they get really "trustworthy" through references (other than "imported from Wikipedia"). But for external IDs, references are superfluous; they are their own reference, by definition. So how about marking IDs with a "verified" (or "last verified on") qualifier? Much of such work could be done by bots; we could then filter the problematic ones out for manual verification.
As we have no control over external lists, this would have to be re-checked every so often; but, again, bots to the rescue.
On 03.06.2015 13:57, Magnus Manske wrote:
Maybe there is a case to separate import and verification here?
There are many statements in Wikidata nowadays, but they get really "trustworthy" through references (other than "imported from Wikipedia"). But for external IDs, references are superfluous; they are their own reference, by definition. So how about marking IDs with a "verified" (or "last verified on") qualifier? Much of such work could be done by bots; we could then filter the problematic ones out for manual verification.
As we have no control over external lists, this would have to be re-checked every so often; but, again, bots to the rescue.
Yes, I fully support this proposal.
What do you think about making "last verified on" not a qualifier but (part of) the reference information? The reference could state where the bot has looked up the ID and give a time. This would be somewhat similar to what is now used for Freebase IDs, e.g., in https://www.wikidata.org/wiki/Q42.
In general, it might be useful to have such a "last verified on" property that can be added to arbitrary references. There are many other uses for this. One common case would be that a user has changed the value without even being aware of the reference -- then one would be able to detect this automatically by comparing the last modification time with the "last verified on" date.
Putting the "last verified on" into the references also makes it possible to have different dates for different references there.
Regards,
Markus
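The comparison of modification time and "last verified on" date suggested above could be sketched like this. It is a Python illustration under assumptions: the dictionary layout, the field names "last_modified" and "last_verified", and the sample sources are hypothetical, since no such property existed at the time of this thread.

```python
from datetime import date

def stale_references(statement):
    """Return the sources whose verification predates the last change of the value.

    A reference without any verification date is treated as stale as well.
    Dates have day granularity here, so a same-day check counts as current.
    """
    stale = []
    for ref in statement["references"]:
        verified = ref.get("last_verified")
        if verified is None or verified < statement["last_modified"]:
            stale.append(ref["source"])
    return stale

# Hypothetical statement with one verification date per reference.
statement = {
    "last_modified": date(2015, 6, 5),
    "references": [
        {"source": "A", "last_verified": date(2015, 6, 3)},
        {"source": "B", "last_verified": date(2015, 6, 7)},
        {"source": "C"},
    ],
}
print(stale_references(statement))
```

Keeping the date on each reference, rather than on the statement, is what allows source B to stay valid while A and C are flagged.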
I second this. For a related effort, see:
https://github.com/pav-ontology/pav/
in particular, pav:sourceLastAccessedOn, pav:lastRefreshedOn, pav:lastUpdateOn http://pav-ontology.github.io/pav/#d4e846
Coming back to Magnus's suggestion ... I think the existing property "retrieved" (P813) could be used for this "last verified on" property, that is, for setting the time at which some external reference was last compared to a claim in Wikidata.
Magnus also pointed out that many external IDs are "self-verifying" in that they are their own reference. The situation is somewhat similar for homepages. Should we adopt the practice of giving a single retrieved value (without any further information) as the reference for such cases?
Adding P813 dates more widely would also open up new ways of maintaining data, since one would have a way to filter statements by how long ago they had last been checked.
Best wishes,
Markus
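Filtering by the age of the last check could work along these lines. This is only a sketch in Python: the statement records, the "retrieved" key standing in for a P813 value, and the one-year default are assumptions made for illustration.

```python
from datetime import date, timedelta

def overdue(statements, today, max_age_days=365):
    """Statements never checked, or last checked more than max_age_days ago.

    Never-checked statements come first, then the oldest checks.
    """
    cutoff = today - timedelta(days=max_age_days)
    hits = [s for s in statements
            if s.get("retrieved") is None or s["retrieved"] < cutoff]
    return sorted(hits, key=lambda s: s.get("retrieved") or date.min)

# Hypothetical records; "retrieved" stands in for a P813 date in the reference.
statements = [
    {"id": "a", "retrieved": date(2015, 1, 1)},
    {"id": "b", "retrieved": date(2013, 5, 1)},
    {"id": "c"},
    {"id": "d", "retrieved": date(2014, 6, 6)},
]
queue = [s["id"] for s in overdue(statements, today=date(2015, 6, 7))]
print(queue)
```

The resulting queue could feed a bot run or a manual worklist, depending on how much trust the check needs.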
One question remaining is: Should there be a difference between "human-verified" and "bot-verified"? A bot can check if e.g. the label (or the words in the label) occur on the page at the URL to check, but it can't know for sure. Human review is more reliable, but vastly slower and not likely to happen for many/most such statements. Two different properties could act as different confidence levels. But maybe I'm just over-engineering this ;-)
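The bot-side check mentioned here (do the words of the label occur on the target page?) might be sketched as below. This is a Python illustration, not an existing bot: the 0.8 threshold, the verdict strings, and the sample page texts are assumptions, and the page text is passed in as a plain string instead of being fetched.

```python
import re

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def label_coverage(label, page_text):
    """Fraction of the label's words that also occur in the page text."""
    label_words = tokens(label)
    if not label_words:
        return 0.0
    return len(label_words & tokens(page_text)) / len(label_words)

def bot_verdict(label, page_text, threshold=0.8):
    """A bot can only reach the lower confidence level; the rest goes to humans."""
    if label_coverage(label, page_text) >= threshold:
        return "bot-verified"
    return "needs human review"

# Hypothetical page texts, for illustration only.
print(bot_verdict("Forth Road Bridge", "The Forth Road Bridge is a suspension bridge."))
print(bot_verdict("Forth Road Bridge", "Old parish church and churchyard."))
```

A match only shows that the words appear on the page, which is why a separate, stronger level for human review may still be worth having.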
On Sun, Jun 7, 2015 at 4:19 PM Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Coming back to Magnus's suggestion ... I think the existing property "retrieved" (P813) could be used for this "last verified on" property, that is, for setting the time a which some external reference was last compared to a claim in Wikidata.
Magnus also pointed out that many external IDs are "self-verifying" in that they are their own reference. The situation is somewhat similar for homepages. Should we adopt the practice of giving a single retrieved value (without any further information) as the reference for such cases?
Adding P813 dates more widely would also open up new ways of maintaining data, since one would have a way to filter statements by how long ago they had last been checked.
Best wishes,
Markus
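Editorial illustration of filtering by "how long ago a statement was last checked": a minimal sketch, assuming the P813 dates have already been extracted into a plain dictionary. The statement keys and the one-year threshold are made up for the example.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def stale_statements(retrieved: Dict[str, Optional[date]],
                     today: date,
                     max_age_days: int) -> List[str]:
    """Return keys of statements never checked, or checked too long ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(key for key, checked in retrieved.items()
                  if checked is None or checked < cutoff)


# Hypothetical statement keys mapped to their "retrieved" date (None = no date).
retrieved = {
    "statement-a": date(2015, 6, 1),
    "statement-b": date(2014, 1, 15),
    "statement-c": None,
}

print(stale_statements(retrieved, today=date(2015, 6, 7), max_age_days=365))
# ['statement-b', 'statement-c']
```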
On 03.06.2015 15:56, Markus Krötzsch wrote:
On 03.06.2015 13:57, Magnus Manske wrote:
Maybe there is a case to separate import and verification here?
There are many statements in Wikidata nowadays, but they get really "trustworthy" through references (other than "imported from Wikipedia"). But for external IDs, references are superfluous; they are their own reference, by definition. So how about marking IDs with a "verified" (or "last verified on") qualifier? Much of such work could be done by bots; we could then filter the problematic ones out for manual verification.
As we have no control over external lists, this would have to be re-checked ever so often; but, again bots to the rescue.
Yes, I fully support this proposal.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On 07.06.2015 18:29, Magnus Manske wrote:
One question remaining is: Should there be a difference between "human-verified" and "bot-verified"? A bot can check if e.g. the label (or the words in the label) occur on the page at the URL to check, but it can't know for sure. Human review is more reliable, but vastly slower and not likely to happen for many/most such statements. Two different properties could act as different confidence levels. But maybe I'm just over-engineering this ;-)
It depends. For structured data sources, a bot should be able to do a thorough verification (possibly better than a human), e.g., by comparing name, birthdate and deathdate of a person at once. I would focus on these cases first since we have enough of them ;-)
For cases where a bot can only make a guess, it might be better to add a human to the loop, as in your (truly amazing!) sourcerer game. The game also shows that it may depend on the items how well this approach works, since text matches are sometimes completely meaningless (e.g., "Human parent taxon homo" cannot be verified by looking for "Homo" since every page that might contain this fact also mentions "Homo sapiens" many times). For such difficult cases, I am not sure if a bot-defined information "looked correct, but I am not sure" would really be very helpful. It depends ;-)
Cheers,
Markus
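Editorial illustration of the two situations described above: a minimal sketch, assuming both the Wikidata item and the external record have been reduced to flat dictionaries. The field names and the person record are invented for the example.

```python
from typing import Dict, List


def mismatched_fields(ours: Dict[str, str],
                      theirs: Dict[str, str],
                      fields: List[str]) -> List[str]:
    """Return the fields whose values differ between the two records."""
    return [f for f in fields if ours.get(f) != theirs.get(f)]


FIELDS = ["name", "birth_date", "death_date"]

# Invented record; a structured source lets a bot compare several fields at once.
ours = {"name": "Jane Example", "birth_date": "1850-02-01", "death_date": "1920-11-30"}
theirs_ok = dict(ours)
theirs_bad = {**ours, "death_date": "1921-11-30"}

print(mismatched_fields(ours, theirs_ok, FIELDS))   # []
print(mismatched_fields(ours, theirs_bad, FIELDS))  # ['death_date']

# Plain text matching is much weaker: "Homo" is also found inside "Homo sapiens",
# so a hit says nothing about the parent taxon statement.
page_text = "Homo sapiens is the only extant human species."
print("Homo" in page_text)  # True
```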
On 7 Jun 2015 at 17:19, "Markus Krötzsch" markus@semantic-mediawiki.org wrote:
Coming back to Magnus's suggestion ... I think the existing property "retrieved" (P813) could be used for this "last verified on" property, that is, for setting the time at which some external reference was last compared to a claim in Wikidata.
Magnus also pointed out that many external IDs are "self-verifying" in that they are their own reference. The situation is somewhat similar for homepages. Should we adopt the practice of giving a single retrieved value (without any further information) as the reference for such cases?
Adding P813 dates more widely would also open up new ways of maintaining data, since one would have a way to filter statements by how long ago they had last been checked.
Sounds ok, but how will we do it? And should we wait for the identifier datatype to be ready?
L.
On 07.06.2015 20:07, Luca Martinelli wrote:
Sounds ok, but how will we do it?
As editors, we can just do it from now on. I was always unsure what to use as a reference for ids and homepages. Now I'll use this.
Bot operators can do the same. Magnus is already using P813 in the sourcerer game as well.
I don't know if there is a good place on Wikidata to document such things. I always struggle to find documentation about how to do references (best practices, e.g., how to cite an online news portal correctly).
And should we wait for the identifier datatype to be ready?
Time information makes sense for any online reference, so we do not really need to know if the statement we are editing is for an ID property. Whether a statement is "self-verifying" so that a single P813 would already work as reference depends on the context. The properties that this is mainly true for are those of type URL and those where you can get a URL or URI to verify things (i.e., those with properties P1630 or P1921).
Markus
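Editorial illustration of the rule of thumb above: a minimal sketch, assuming we already know a property's datatype and which properties are used in statements on the property's own page. The function name and the datatype string "url" are assumptions made for the example.

```python
from typing import Set

# Properties named in the message above: P1630 and P1921.
FORMATTER_PROPERTIES = {"P1630", "P1921"}


def single_retrieved_suffices(datatype: str, claims_on_property: Set[str]) -> bool:
    """Guess whether a lone "retrieved" (P813) date can serve as the reference.

    True for URL-typed properties and for properties that carry a formatter
    (P1630 or P1921), since a URL or URI to check against can be derived.
    """
    return datatype == "url" or bool(FORMATTER_PROPERTIES & claims_on_property)


print(single_retrieved_suffices("url", set()))         # True
print(single_retrieved_suffices("string", {"P1630"}))  # True
print(single_retrieved_suffices("string", {"P31"}))    # False
```

As the message says, this depends on context, so the result is best treated as a hint rather than a rule.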
On Sun, Jun 7, 2015 at 11:30 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
I don't know if there is a good place on Wikidata to document such things. I always struggle to find documentation about how to do references (best practices, e.g., how to cite an online news portal correctly).
That'd be https://www.wikidata.org/wiki/Help:Sources
Cheers Lydia
Hi
On 07.06.2015 at 17:18, Markus Krötzsch wrote:
Magnus also pointed out that many external IDs are "self-verifying" in that they are their own reference. The situation is somewhat similar for homepages. Should we adopt the practice of giving a single retrieved value (without any further information) as the reference for such cases?
I'd use the reference URL the value was imported from together with the retrieved value.
Best regards Bene
On 08.06.2015 11:16, Bene* wrote:
I'd use the reference URL the value was imported from together with the retrieved value.
Yes, that's a good point. Even if an ID has a value for "formatter URL" that defines the URL, it might be good to record the URL that was used to verify the data, since the formatter URL might change. Bots especially should add this, since it's no extra work for them. However, a single "retrieved" value is still better than nothing there. For homepages and other URL properties, I would maybe not store the URL again in the reference.
Note that most often the reference is not where the value is "imported from" but simply an external reference. In many cases, we have imported data from Wikipedia but it is then verified from another dataset.
Cheers,
Markus
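Editorial illustration of such a reference: a minimal sketch that builds a reference with a "retrieved" (P813) date and, optionally, the URL that was actually checked. The use of P854 ("reference URL"), the JSON layout (modelled on the Wikibase JSON format as I understand it), and the example.org formatter are assumptions to verify against the current documentation before use.

```python
import json
from datetime import date
from typing import Optional


def time_value(day: date) -> dict:
    """Day-precision time value in the Gregorian calendar."""
    return {
        "time": f"+{day.isoformat()}T00:00:00Z",
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": 11,  # 11 = day precision
        "calendarmodel": "http://www.wikidata.org/entity/Q1985727",
    }


def expand_formatter(formatter_url: str, external_id: str) -> str:
    """Fill an ID into a formatter URL of the form '...$1...'."""
    return formatter_url.replace("$1", external_id)


def build_reference(retrieved_on: date, checked_url: Optional[str] = None) -> dict:
    """Reference with P813, plus P854 when the checked URL is recorded."""
    snaks = {
        "P813": [{"snaktype": "value", "property": "P813",
                  "datavalue": {"value": time_value(retrieved_on), "type": "time"}}],
    }
    if checked_url is not None:
        snaks["P854"] = [{"snaktype": "value", "property": "P854",
                          "datavalue": {"value": checked_url, "type": "string"}}]
    return {"snaks": snaks}


# Hypothetical formatter URL and ID, for illustration only.
url = expand_formatter("https://example.org/id/$1", "12345")
print(json.dumps(build_reference(date(2015, 6, 8), url), indent=2))
```

Calling build_reference with only a date gives the "single retrieved value" variant discussed in the thread.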
On 3 June 2015 at 12:48, Andrew Gray andrew.gray@dunelm.org.uk wrote:
The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?
The constraint should be "usually one ID" (i.e. "SHOULD have only one ID"), not "MUST have only one ID".
Wikidata already allows for this, and the constraints are editable.
See also the talk page and report for P496 for an example of a listed exception.
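Editorial illustration of a "SHOULD" constraint with listed exceptions: a minimal sketch, not the actual Wikidata constraint machinery. The report function and the item names other than Q581526 are hypothetical; the two register numbers are the ones quoted in this thread for the Menai Suspension Bridge.

```python
from typing import Dict, List, Set


def single_value_report(ids_by_item: Dict[str, Set[str]],
                        exceptions: Set[str]) -> List[str]:
    """List items with more than one ID that are not known exceptions.

    The result is a to-do list for review (SHOULD), not a rejection (MUST).
    """
    return sorted(item for item, ids in ids_by_item.items()
                  if len(ids) > 1 and item not in exceptions)


ids_by_item = {
    "Q581526": {"4049", "18572"},  # bridge listed in two registers
    "item-x": {"111", "222"},      # hypothetical item needing a look
    "item-y": {"333"},             # hypothetical item with a single ID
}
exceptions = {"Q581526"}

print(single_value_report(ids_by_item, exceptions))  # ['item-x']
```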
Thanks, Andrew, for the clarification. This makes perfect sense.
I don't see a problem with one bridge having two IDs in some external database. We already have this for other ID-like properties for other reasons. What is important though is that it still is a single bridge, and should therefore be one item.
Your clarification is reassuring since it suggests that the problem is not overly common after all. Maybe one can just merge these cases manually. Once the (multiple) ids are found in the merged items, avoiding future duplicates will be done as usual (which is still difficult with the Scottish Heritage ids since we have many legit Wikidata items that have the same id -- but this at least is an independent problem).
Regards,
Markus
On 03.06.2015 13:48, Andrew Gray wrote:
This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.
Bridges often span a river which is the border for two jurisdictions (in this case, council areas). Each local area counts it as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifier).
The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?
I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.
Andrew.
On 2 June 2015 at 12:12, Markus Krötzsch markus@semantic-mediawiki.org wrote:
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as
https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)
Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL... http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional ids from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
Regards,
Markus
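Editorial illustration of the many-to-many relationship: a minimal sketch, assuming the (item, external ID) pairs have been exported to a list. All item names and ID values below are placeholders, not real Historic Scotland data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def non_functional(pairs: Iterable[Tuple[str, str]]
                   ) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (IDs used by several items, items carrying several IDs)."""
    items_by_id = defaultdict(set)
    ids_by_item = defaultdict(set)
    for item, ext_id in pairs:
        items_by_id[ext_id].add(item)
        ids_by_item[item].add(ext_id)
    shared_ids = {i: sorted(v) for i, v in items_by_id.items() if len(v) > 1}
    multi_id_items = {i: sorted(v) for i, v in ids_by_item.items() if len(v) > 1}
    return shared_ids, multi_id_items


# Placeholder data mirroring the cases in the thread: one ID shared by two
# items (church and churchyard) and one item with two IDs (a border bridge).
pairs = [
    ("church", "ID-1"), ("churchyard", "ID-1"),
    ("bridge", "ID-2"), ("bridge", "ID-3"),
    ("house", "ID-4"),
]

shared_ids, multi_id_items = non_functional(pairs)
print(shared_ids)      # {'ID-1': ['church', 'churchyard']}
print(multi_id_items)  # {'bridge': ['ID-2', 'ID-3']}
```

When both dictionaries are non-empty, the ID is neither unique per item nor unique per value, so it cannot by itself decide whether two items are duplicates.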
On 02.06.2015 13:01, Markus Krötzsch wrote:
On 02.06.2015 11:30, Magnus Manske wrote:
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.
Yes, I noticed such cases too. From the information in Wikidata, it is not clear to me why this is sometimes done and sometimes not done.
For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):
https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185
In many other cases, adjacent houses with the same ID are combined into one item:
https://www.wikidata.org/wiki/Q17806587
(note, however, that the house addresses given in the ID and in the item label do not match, though they overlap on most of the houses.)
Finally, there are also cases where there are different IDs and we have several items, but they have the same labels that merge the contents of the two IDs:
https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137
It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites. However, the coordinates give Historic Scotland as their reference -- I wonder if Historic Scotland might be changing frequently or exist in several versions.
Regards,
Markus
On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske <magnusmanske@googlemail.com> wrote:
Update: There appear to be quite a few items with duplicate Scotland IDs (not all of them may be erroneous!): http://wdq.wmflabs.org/stats?action=doublestring&prop=709
On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske <magnusmanske@googlemail.com> wrote:
I created (some/most of) these items as part of the Wiki Loves Monuments UK 2014 drive, to run the campaign from Wikidata rather than from a bespoke database. This allows the community (TM) to maintain the data, rather than one poor sod (e.g., myself) having to frantically update all of it every year ;-)
"Consumer" tool is here: https://tools.wmflabs.org/wlmuk/index_wd.html
These are based on "official" data from National Heritage, provided to me via Wikimedia UK. Grade A (or Grade I/II* in England) structures should be noteworthy by default.
It appears (as per your examples) that some of these were created as duplicates/with wrong IDs. As I said, this is based on "official" data, so it's the best I could do at the time. With mass creation, there are bound to be a few strays. If you can find some large-scale, systemic issue I'll try to fix it, but the one-offs will always fall back to manual fixing. At least, with Wikidata, we can fix them together.
On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
On 01.06.2015 at 22:26, Markus Krötzsch wrote:
Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key?
The uniqueness constraint does indeed not apply if there is no description.
-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
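Editorial illustration of the helper suggested at the start of the thread (looking for items with equal label and description): a minimal sketch, assuming the terms for one language are available as a dictionary. Because the uniqueness constraint does not apply when the description is empty, such items can share a label freely, and this grouping finds them. Item names and labels are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def duplicate_candidates(terms: Dict[str, Tuple[str, str]]) -> List[List[str]]:
    """Group items sharing the same (label, description) in one language.

    An empty description is treated like any other value, so items that
    escaped the uniqueness check are grouped as well.
    """
    groups = defaultdict(list)
    for item, (label, description) in terms.items():
        groups[(label, description)].append(item)
    return sorted(sorted(members) for members in groups.values() if len(members) > 1)


# Placeholder terms: item -> (label, description).
terms = {
    "item-a": ("Old Bridge", ""),
    "item-b": ("Old Bridge", ""),
    "item-c": ("Old Bridge", "bridge in Fife"),
    "item-d": ("Town Hall", ""),
}

print(duplicate_candidates(terms))  # [['item-a', 'item-b']]
```

The groups are only candidates: as the bridge and church/churchyard cases show, equal terms do not prove that two items describe the same thing.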
A related suggestion... I've wondered before if what we could use for such imports is a "meta" value for the P31 property - something like "instance of: imported unchecked item". When a person has corrected or checked the items, added sitelinks, etc it's easy to remove this value. This would let us easily identify ones that might still need assistance, eg to check for duplicates or to mark them as a part of a larger item, without continually having to go through the list.
Commons does something similar with hidden tracking categories for bulk uploads, and it's quite useful there.
Andrew.
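Editorial illustration of the tracking-value idea: a minimal sketch, assuming "instance of" (P31) values are available as sets per item. The marker "Q-unchecked-import" is a hypothetical placeholder; no such item is claimed to exist.

```python
from typing import Dict, List, Set

# Hypothetical marker standing in for "imported unchecked item".
UNCHECKED_IMPORT = "Q-unchecked-import"


def needs_review(instance_of: Dict[str, Set[str]]) -> List[str]:
    """Items that still carry the tracking value."""
    return sorted(item for item, classes in instance_of.items()
                  if UNCHECKED_IMPORT in classes)


def mark_checked(classes: Set[str]) -> Set[str]:
    """Drop the tracking value once a person has reviewed the item."""
    return classes - {UNCHECKED_IMPORT}


instance_of = {
    "item-a": {"bridge", UNCHECKED_IMPORT},
    "item-b": {"church"},
}

print(needs_review(instance_of))  # ['item-a']
instance_of["item-a"] = mark_checked(instance_of["item-a"])
print(needs_review(instance_of))  # []
```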
Hoi, What is wrong with identifying NOTHING with there being a problem?
To me it seems a bit too much.. I like to keep tabs on the items that have 0 statements.. They are useless in many ways. Thanks, GerardM
On 7 June 2015 at 19:44, Andrew Gray andrew.gray@dunelm.org.uk wrote:
A related suggestion... I've wondered before if what we could use for such imports is a "meta" value for the P31 property - something like "instance of: imported unchecked item". When a person has corrected or checked the items, added sitelinks, etc it's easy to remove this value. This would let us easily identify ones that might still need assistance, eg to check for duplicates or to mark them as a part of a larger item, without continually having to go through the list.
Commons does something similar with hidden tracking categories for bulk uploads, and it's quite useful there.
Andrew.
On 3 June 2015 at 14:48, Markus Krötzsch markus@semantic-mediawiki.org wrote:
Thanks, Andrew, for the clarification. This makes perfect sense.
I don't see a problem with one bridge having two IDs in some external database. We already have this for other ID-like properties for other reasons. What is important though is that it still is a single bridge,
and
should therefore be one item.
Your clarification is reassuring since it suggests that the problem is
not
overly common after all. Maybe one can just merge these cases manually.
Once
the (multiple) ids are found in the merged items, avoiding future
duplicates
will be done as usual (which is still difficult with the Scottish
Heritage
ids since we have many legit Wikidata items that have the same id -- but this at least is an independent problem).
Regards,
Markus
On 03.06.2015 13:48, Andrew Gray wrote:
This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.
Bridges often span a river which is the border for two jurisdictions (in this case, council areas). Each local area counts it as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifer).
The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?
I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.
Andrew.
On 2 June 2015 at 12:12, Markus Krötzsch <markus@semantic-mediawiki.org
wrote:
Another interesting type of Scottish historic orphans are those that
are
duplicates of items that do have site links. Even very prominent ones
are
duplicated, such as
https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)
Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be an many-to-many relationship between Wikidata and Historic Scottland. Orphans should receive additional ids from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
Regards,
Markus
On 02.06.2015 13:01, Markus Krötzsch wrote:
On 02.06.2015 11:30, Magnus Manske wrote:
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church
and
churchyard, respectively). They were as two entities in the original dataset, sharing the same ID.
Yes, I noticed such cases too. From the information Wikidata, it is
not
clear to me why this is sometimes done and sometimes not done.
For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):
https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185
In many other cases, adjacent houses with the same ID are combined
into
one item:
https://www.wikidata.org/wiki/Q17806587
(note, however, that the house addresses given in the ID and in the
item
label do not match, though they overlap on most of the houses.)
Finally, there are also cases where there are different IDs and we
have
several items, but they have the same labels that merge the contents
of
the two IDs:
https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137
It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites.
However,
the coordinated give Historic Scotland as their reference -- I wonder
if
Historic Scotland might be changing frequently or exist in several versions.
Regards,
Markus
On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske <magnusmanske@googlemail.com mailto:magnusmanske@googlemail.com> wrote:
Update: There appear to be quite a few items with duplicate
Scotland IDs (not all of them may be erroneous!): http://wdq.wmflabs.org/stats?action=doublestring&prop=709
On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske <magnusmanske@googlemail.com> wrote:
I created (some/most of) these items as part of the Wiki Loves Monuments UK 2014 drive, to run the campaign from Wikidata rather than from a bespoke database. This allows the community (TM) to maintain the data, rather than one poor sod (e.g., myself) having to frantically update all of it every year ;-)
"Consumer" tool is here: https://tools.wmflabs.org/wlmuk/index_wd.html
These are based on "official" data from National Heritage, provided to me via Wikimedia UK. Grade A (or Grade I/II* in England) structures should be noteworthy by default.
It appears (as per your examples) that some of these were created as duplicates/with wrong IDs. As I said, this is based on "official" data, so it's the best I could do at the time. With mass creation, there are bound to be a few strays. If you can find some large-scale, systemic issue I'll try to fix it, but the one-offs will always fall back to manual fixing. At least, with Wikidata, we can fix them together.
On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
On 01.06.2015 at 22:26, Markus Krötzsch wrote:
> Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key?
The uniqueness constraint does indeed not apply if there is no description.
-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
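To make the rule in the reply above concrete, here is a small Python sketch of a label+description key that is only checked when both parts are present. The data layout, the function name and the sample items are assumptions for illustration; this is a toy model, not the actual Wikibase code.

```python
def find_key_conflicts(items, language="en"):
    """Return pairs of item IDs that share the same (label, description) key.

    Items without a label or without a description in the given language
    have no key, so they are never reported as conflicting.
    """
    first_item_for_key = {}
    conflicts = []
    for item_id, terms in items.items():
        label = terms.get("labels", {}).get(language)
        description = terms.get("descriptions", {}).get(language)
        if not label or not description:
            continue  # no key, so the uniqueness check does not apply
        key = (label, description)
        if key in first_item_for_key:
            conflicts.append((first_item_for_key[key], item_id))
        else:
            first_item_for_key[key] = item_id
    return conflicts


# Hypothetical items: a and b share a label but have no description,
# c and d share both label and description.
sample_items = {
    "item-a": {"labels": {"en": "Old Bridge"}, "descriptions": {}},
    "item-b": {"labels": {"en": "Old Bridge"}, "descriptions": {}},
    "item-c": {"labels": {"en": "Old Bridge"}, "descriptions": {"en": "bridge in Fife"}},
    "item-d": {"labels": {"en": "Old Bridge"}, "descriptions": {"en": "bridge in Fife"}},
}
conflicts = find_key_conflicts(sample_items)
```

In this model the two items without descriptions pass unnoticed, which is the situation described for the duplicated orphans.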
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
--
- Andrew Gray andrew.gray@dunelm.org.uk
Ah, that might be a little bit overkill. It would have to be "re-instanced" on every subsequent edit. Not to mention the "contamination" of P31 with maintenance items.
On Sun, Jun 7, 2015 at 6:45 PM Andrew Gray andrew.gray@dunelm.org.uk wrote:
A related suggestion... I've wondered before whether what we could use for such imports is a "meta" value for the P31 property - something like "instance of: imported unchecked item". When a person has corrected or checked the items, added sitelinks, etc., it's easy to remove this value. This would let us easily identify ones that might still need assistance, e.g. to check for duplicates or to mark them as a part of a larger item, without continually having to go through the list.
Commons does something similar with hidden tracking categories for bulk uploads, and it's quite useful there.
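A minimal Python sketch of how such a marker could be used, assuming items are plain dictionaries with a list of P31 values. The marker ID below is a hypothetical placeholder (no such item is implied to exist), and the helper names are invented for illustration.

```python
# Hypothetical placeholder for an "imported unchecked item" marker value.
UNCHECKED_IMPORT = "Q_UNCHECKED_IMPORT"


def needs_review(item):
    """True if the item still carries the unchecked-import marker in P31."""
    return UNCHECKED_IMPORT in item.get("P31", [])


def mark_checked(item):
    """Return a copy of the item with the marker removed from P31."""
    remaining = [value for value in item.get("P31", []) if value != UNCHECKED_IMPORT]
    return {**item, "P31": remaining}


# Hypothetical sample: one freshly imported item, one already reviewed.
imported = {"id": "item-1", "P31": ["Q_BUILDING", UNCHECKED_IMPORT]}
reviewed = mark_checked(imported)
review_queue = [i["id"] for i in (imported, reviewed) if needs_review(i)]
```

The review queue is then just the set of items still carrying the marker, so nobody has to walk the full import list again.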
Andrew.
On 3 June 2015 at 14:48, Markus Krötzsch markus@semantic-mediawiki.org wrote:
Thanks, Andrew, for the clarification. This makes perfect sense.
I don't see a problem with one bridge having two IDs in some external database. We already have this for other ID-like properties for other reasons. What is important though is that it still is a single bridge, and should therefore be one item.
Your clarification is reassuring since it suggests that the problem is not overly common after all. Maybe one can just merge these cases manually. Once the (multiple) ids are found in the merged items, avoiding future duplicates will be done as usual (which is still difficult with the Scottish Heritage ids since we have many legit Wikidata items that have the same id -- but this at least is an independent problem).
Regards,
Markus
On 03.06.2015 13:48, Andrew Gray wrote:
This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.
Bridges often span a river which is the border for two jurisdictions (in this case, council areas). Each local area counts it as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifier).
The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?
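One way to picture the constraint suggested above is the following Python sketch: more than one heritage ID counts as a violation unless the item is some kind of bridge, in which case two are allowed. The field names, the class strings and the function name are assumptions for illustration, not an existing Wikidata constraint definition; the two Menai IDs are the ones quoted above.

```python
def violates_single_id_constraint(item, multi_id_classes=frozenset({"bridge"})):
    """Check the 'one heritage ID per item, two for bridges' rule.

    `item` is a plain dict with hypothetical keys "classes" and "heritage_ids".
    """
    ids = item.get("heritage_ids", [])
    if len(ids) <= 1:
        return False
    is_exempt_class = bool(multi_id_classes & set(item.get("classes", [])))
    if is_exempt_class and len(ids) == 2:
        return False
    return True


# The Menai Suspension Bridge example, with one ID from each register.
menai = {"classes": ["bridge"], "heritage_ids": ["4049", "18572"]}
# Hypothetical non-bridge item that carries two IDs.
house = {"classes": ["house"], "heritage_ids": ["ID-A", "ID-B"]}
```

Under this rule the bridge passes with its two IDs, while the house would be flagged for manual review.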
I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.
Andrew.
On 2 June 2015 at 12:12, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as
https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)
Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...
Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional ids from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
Regards,
Markus
On 02.06.2015 13:01, Markus Krötzsch wrote:
On 02.06.2015 11:30, Magnus Manske wrote:
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.