No links, wrong data: Scotland's orphans need help

Markus Krötzsch

1 Jun 2015

1:26 p.m.

Hi all,

Looking at more "orphaned items", I found several pairs of items that look like these two:

https://www.wikidata.org/wiki/Q17574663 https://www.wikidata.org/wiki/Q17569687

Same label and description, same coordinates, no Wikipedia articles, "identified" by different Historic Scotland IDs. If you follow the ID links, however, you can see that the first of the items has data that does not match the ID, while the second is correct.

The direct question is: How to fix these errors? There are other cases, such as Q17572335 and Q17570206. I did not do a systematic study, but something seems to have gone wrong here in more than one case. I cannot fix mass edits one by one without having a clue what has happened and why.

The indirect question is: How can I find out who did this and maybe ask the person to fix it? The history is of no help (Reinheitsgebot/Widar). Posting every error in Wikidata to this list to ask also seems like a bad idea.

Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key? In any case, a data integration helper application that looks at equal labels+descriptions would probably make sense, especially for orphaned items. (As I know Wikidata, someone might well reply to this email with a link to where this is already found ;-).
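A minimal sketch of the kind of helper suggested above, assuming items are available as plain dictionaries with per-language labels and descriptions. The record layout, the function name and the sample IDs are hypothetical and are not taken from any existing Wikidata tool. A missing description is treated as an empty string, so items without descriptions still group together by label.

```python
from collections import defaultdict


def find_label_description_twins(items, lang="en"):
    """Return {(label, description): [item ids]} for groups with 2+ items."""
    groups = defaultdict(list)
    for item in items:
        label = item.get("labels", {}).get(lang)
        if label is None:
            continue  # nothing to compare without a label in this language
        # A missing description is treated as the empty string.
        description = item.get("descriptions", {}).get(lang, "")
        groups[(label, description)].append(item["id"])
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample records, not real Wikidata content.
sample = [
    {"id": "Q_A", "labels": {"en": "Old Mill"}, "descriptions": {}},
    {"id": "Q_B", "labels": {"en": "Old Mill"}, "descriptions": {}},
    {"id": "Q_C", "labels": {"en": "Old Mill"},
     "descriptions": {"en": "mill in Fife"}},
    {"id": "Q_D", "labels": {"en": "Toll House"}, "descriptions": {}},
]

twins = find_label_description_twins(sample)
print(twins)
```

Restricting the input to orphaned items (items without site links) would narrow the candidate list to the cases discussed in this thread.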

Regards

Markus

Federico Leva (Nemo)

1 Jun

1:37 p.m.

Markus Krötzsch, 01/06/2015 22:26:

...

How can I find out who did this

The tool (https://meta.wikimedia.org/wiki/Mix%27n%27match ) does have a log, though it's not so easy to search it IIRC.

Nemo

Markus Krötzsch

1:42 p.m.

On 01.06.2015 22:37, Federico Leva (Nemo) wrote:

The tool (https://meta.wikimedia.org/wiki/Mix%27n%27match ) does have a log, though it's not so easy to search it IIRC.

How do you know that the data comes from this tool? The history does not mention it:

https://www.wikidata.org/w/index.php?title=Q17574663&action=history

Markus

Federico Leva (Nemo)

2:18 p.m.

Markus Krötzsch, 01/06/2015 22:42:

...

How do you know that the data comes from this tool? The history does not mention it:

I know from memory. IIRC it's stated on the user page of the account, but might be stated elsewhere.

Nemo

Andy Mabbett

2:45 p.m.

On 1 June 2015 at 21:26, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...

The indirect question is: How can I find out who did this and maybe ask the person to fix it? The history is of no help (Reinheitsgebot/Widar).

Did you look at:

https://www.wikidata.org/wiki/User:Reinheitsgebot ?

-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk

Markus Krötzsch

11:43 p.m.

On 01.06.2015 23:45, Andy Mabbett wrote:

Did you look at:
 https://www.wikidata.org/wiki/User:Reinheitsgebot ?

Yes, that's what I did first, but the page just says that the bot makes mass edits on behalf of other (unknown) users. But you are right that one should probably still ask the bot author first:

Magnus, do you know on which basis these edits were made and how the errors could have sneaked in? Do you have any idea of the scale of the problem? (So far I have no idea: maybe I was just very (un)lucky to find several such cases in a row, or maybe the problem affects a relevant portion of the >80,000 orphaned items in the UK ...).

Regards,

Markus

Daniel Kinzler

2 Jun

2:01 a.m.

Am 01.06.2015 um 22:26 schrieb Markus Krötzsch:

...

Finally, the technical question is: Why is this even possible? I thought that, in each language, label+description are a key (globally unique), yet here we have many pairs of items with exactly the same label and description. Or is the problem that no description was entered and so the system does not apply the key?

The uniqueness constraint does indeed not apply if there is no description.
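To illustrate the behaviour described here, the following is a rough sketch of a per-language label+description key that is only enforced when a description is present. It is not the actual Wikibase implementation; the class name, method name and sample IDs are assumptions made for this example.

```python
class DuplicateTermError(ValueError):
    """Raised when two items claim the same (language, label, description)."""


class TermKeyIndex:
    """Toy index enforcing uniqueness of (language, label, description)."""

    def __init__(self):
        self._owner = {}  # (lang, label, description) -> item id

    def register(self, item_id, lang, label, description=""):
        """Return True if the key was enforced, False if it was skipped."""
        if not description:
            # No description: the uniqueness constraint is not applied.
            return False
        key = (lang, label, description)
        owner = self._owner.setdefault(key, item_id)
        if owner != item_id:
            raise DuplicateTermError(
                f"{item_id} conflicts with {owner} on {key}")
        return True


index = TermKeyIndex()
# Two items with the same label and no description are both accepted.
enforced_first = index.register("Q_A", "en", "Old Mill")
enforced_second = index.register("Q_B", "en", "Old Mill")
# With a description, the second identical pair is rejected.
index.register("Q_C", "en", "Old Mill", "mill in Fife")
try:
    index.register("Q_D", "en", "Old Mill", "mill in Fife")
    conflict = None
except DuplicateTermError as err:
    conflict = str(err)
print(enforced_first, enforced_second, conflict)
```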

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

Magnus Manske

2:23 a.m.

I created (some/most of) these items as part of the Wiki Loves Monuments UK 2014 drive, to run the campaign from Wikidata rather than from a bespoke database. This allows the community (TM) to maintain the data, rather than one poor sod (e.g., myself) having to frantically update all of it every year ;-)

"Consumer" tool is here: https://tools.wmflabs.org/wlmuk/index_wd.html

These are based on "official" data from National Heritage, provided to me via Wikimedia UK. Grade A (or Grade I/II* in England) structures should be noteworthy by default.

It appears (as per your examples) that some of these were created as duplicates/with wrong IDs. As I said, this is based on "official" data, so it's the best I could do at the time. With mass creation, there are bound to be a few strays. If you can find some large-scale, systemic issue I'll try to fix it, but the one-offs will always fall back to manual fixing. At least, with Wikidata, we can fix them together.

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

Magnus Manske

2:26 a.m.

Update: There appear to be quite a few items with duplicate Scotland IDs (not all of them may be erroneous!): http://wdq.wmflabs.org/stats?action=doublestring&prop=709
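The check behind such a report can be sketched as follows, assuming the (item, identifier) pairs for property 709 have already been exported to a list. The function name and the sample values are hypothetical. As noted above, a shared ID is only a candidate for review, not proof of an error.

```python
from collections import defaultdict


def items_sharing_an_id(claims):
    """Return {external id: [item ids]} for IDs used on more than one item."""
    by_id = defaultdict(set)
    for item_id, external_id in claims:
        by_id[external_id].add(item_id)
    return {ext: sorted(items)
            for ext, items in by_id.items() if len(items) > 1}


# Hypothetical (item, identifier) pairs, not real data.
claims = [
    ("Q_church", "HB-1"),
    ("Q_churchyard", "HB-1"),
    ("Q_bridge", "HB-2"),
]

shared = items_sharing_an_id(claims)
print(shared)
```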

Magnus Manske

2:30 a.m.

Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.

Markus Krötzsch

4:01 a.m.

On 02.06.2015 11:30, Magnus Manske wrote:

...

Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.

Yes, I noticed such cases too. From the information in Wikidata, it is not clear to me why this is sometimes done and sometimes not done.

For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):

https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185

In many other cases, adjacent houses with the same ID are combined into one item:

https://www.wikidata.org/wiki/Q17806587

(note, however, that the house addresses given in the ID and in the item label do not match, though they overlap on most of the houses.)

Finally, there are also cases where there are different IDs and we have several items, but they have the same labels that merge the contents of the two IDs:

https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137

It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites. However, the coordinates give Historic Scotland as their reference -- I wonder if Historic Scotland might be changing frequently or exist in several versions.

Regards,

Markus

Markus Krötzsch

4:12 a.m.

Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL... http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional IDs from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
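Whether an identifier behaves as "identity providing" can be checked mechanically in both directions. The sketch below classifies a set of (item, identifier) pairs. The function name, the category strings and the sample pairs are hypothetical, and the sample assumes a duplicate item has already been merged into the real one so that one item carries two IDs.

```python
from collections import defaultdict


def classify_mapping(pairs):
    """Classify (item, external id) pairs by cardinality in both directions."""
    ids_per_item = defaultdict(set)
    items_per_id = defaultdict(set)
    for item_id, external_id in pairs:
        ids_per_item[item_id].add(external_id)
        items_per_id[external_id].add(item_id)
    item_has_many = any(len(ids) > 1 for ids in ids_per_item.values())
    id_has_many = any(len(items) > 1 for items in items_per_id.values())
    if item_has_many and id_has_many:
        return "many-to-many"
    if id_has_many:
        return "one ID, several items"
    if item_has_many:
        return "one item, several IDs"
    return "one-to-one"


# Hypothetical pairs: one ID shared by two items, one item with two IDs.
pairs = [
    ("Q_church", "HB-1"),
    ("Q_churchyard", "HB-1"),
    ("Q_bridge", "HB-2"),
    ("Q_bridge", "HB-3"),
]

kind = classify_mapping(pairs)
print(kind)
```

Only a result of "one-to-one" would justify using the identifier on its own to detect duplicates.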

Regards,

Markus

Scott MacLeod

3:34 p.m.

Hi Markus, Magnus and Wikidatans,

I can't yet add data to, for example, this item - https://www.wikidata.org/wiki/Q933000 (real item) - by clicking "save", since the "save" button isn't an active link, but the "cancel" button is. I tried to add this URL - http://www.forthroadbridge.org/home (which I'm not actually able to see in my browser at present; all I see is a blank white page, unusually) - as well as to add the word "Fife" to various fields of this "Forth Road Bridge" item. Will this be possible in the near future?

Scott

-- - Scott MacLeod - Founder & President - http://worlduniversityandschool.org - 415 480 4577 - PO Box 442, (86 Ridgecrest Road), Canyon, CA 94516 - World University and School - like Wikipedia with best STEM-centric OpenCourseWare - incorporated as a nonprofit university and school in California, and is a U.S. 501 (c) (3) tax-exempt educational organization, both effective April 2010. World University and School is sending you this because of your interest in free, online, higher education. If you don't want to receive these, please reply with 'unsubscribe' in the body of the email, leaving the subject line intact. Thank you.

Magnus Manske

3:59 p.m.

I have added the (un-broken) URL as "official website".

Not sure which property to use for "Fife", though.

On Tue, Jun 2, 2015 at 11:35 PM Scott MacLeod < worlduniversityandschool@gmail.com> wrote:

...

Hi Markus, Magnus and Wikidatans,

I can't yet add data to, for example, this - https://www.wikidata.org/wiki/Q933000 (real item) - by clicking "save," since the "save" button isn't an active link, but the "cancel" button is. I tried to add this URL - http://www.forthroadbridge.org/home (which I"m not actually able to see in my browser presently - all I see is a blank white page, unusually) - as well as to add the word "Fife" to various fields to this "Forth Road Bridge" Q item. Will this be possible in the near future?

Scott

On Tue, Jun 2, 2015 at 4:12 AM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:

...
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be an many-to-many relationship between Wikidata and Historic Scottland. Orphans should receive additional ids from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.

Regards,

Markus

On 02.06.2015 13:01, Markus Krötzsch wrote:

...
On 02.06.2015 11:30, Magnus Manske wrote:

...
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were as two entities in the original dataset, sharing the same ID.

Yes, I noticed such cases too. From the information Wikidata, it is not clear to me why this is sometimes done and sometimes not done.

For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):

https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185

In many other cases, adjacent houses with the same ID are combined into one item:

https://www.wikidata.org/wiki/Q17806587

(note, however, that the house addresses given in the ID and in the item label do not match, though they overlap on most of the houses.)

Finally, there are also cases where there are different IDs and we have several items, but they have the same labels that merge the contents of the two IDs:

https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137

It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites. However, the coordinated give Historic Scotland as their reference -- I wonder if Historic Scotland might be changing frequently or exist in several versions.

Regards,

Markus

...
On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske <magnusmanske@googlemail.com mailto:magnusmanske@googlemail.com> wrote:
Update: There appear to be quite a few items with duplicate Scotland
IDs (not all of them may be erroneous!):
http://wdq.wmflabs.org/stats?action=doublestring&prop=709

On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske
<magnusmanske@googlemail.com <mailto:magnusmanske@googlemail.com>>
wrote:

    I created (some/most of) these items as part of the Wiki Loves
    Monuments UK 2014 drive, to run the campaign from Wikidata
    rather than from a bespoke database. This allows the community
    (TM) to maintain the data, rather than one poor sod (e.g.,
    myself) having to frantically update all of it every year ;-)

    "Consumer" tool is here:
    https://tools.wmflabs.org/wlmuk/index_wd.html

    These are based on "official" data from National Heritage,
    provided to me via Wikimedia UK. Grade A (or Grade I/II* in
    England) structures should be noteworthy by default.

    It appears (as per your examples) that some of these were
    created as duplicates/with wrong IDs. As I said, this is based
    on "official" data, so it's the best I could do at the time.
    With mass creation, there are bound to be a few strays. If you
    can find some large-scale, systemic issue I'll try to fix it,
    but the one-offs will always fall back to manual fixing. At
    least, with Wikidata, we can fix them together.

    On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:

        On 01.06.2015 at 22:26, Markus Krötzsch wrote:
         > Finally, the technical question is: Why is this even possible?
         > I thought that, in each language, label+description are a key
         > (globally unique), yet here we have many pairs of items with
         > exactly the same label and description. Or is the problem that
         > no description was entered and so the system does not apply the
         > key?

        The uniqueness constraint does indeed not apply if there is
        no description.

        --
        Daniel Kinzler
        Senior Software Developer

        Wikimedia Deutschland
        Gesellschaft zur Förderung Freien Wissens e.V.

        _______________________________________________
        Wikidata mailing list
        Wikidata@lists.wikimedia.org
        <mailto:Wikidata@lists.wikimedia.org>
        https://lists.wikimedia.org/mailman/listinfo/wikidata
--

Scott MacLeod - Founder & President

http://worlduniversityandschool.org

415 480 4577

PO Box 442, (86 Ridgecrest Road), Canyon, CA 94516

World University and School - like Wikipedia with best STEM-centric

OpenCourseWare - incorporated as a nonprofit university and school in California, and is a U.S. 501 (c) (3) tax-exempt educational organization, both effective April 2010.

World University and School is sending you this because of your interest in free, online, higher education. If you don't want to receive these, please reply with 'unsubscribe' in the body of the email, leaving the subject line intact. Thank you.

Scott MacLeod

4:21 p.m.

New subject: No links, wrong data: Scotland's orphans need help

Thanks, Magnus,

What I was hoping to be able to do is wiki-add resources myself to this and other Q items, but I still can't do this even after the changes you made (thank you!) ... I'd add Fife, for example, to "located in the administrative territorial entity" (https://www.wikidata.org/wiki/Property:P131), but would also like to add further resources to Wikidata and Q items too ... which is what I think makes Wikidata so potentially great ... and which is what led to the growth of Wikipedia too, I think. Thank you again, M, M & Wikidatans!

Best, Scott

On Tue, Jun 2, 2015 at 3:59 PM, Magnus Manske magnusmanske@googlemail.com wrote:

...

I have added the (un-broken) URL as "official website".

Not sure which property to use for "Fife", though.

On Tue, Jun 2, 2015 at 11:35 PM Scott MacLeod <worlduniversityandschool@gmail.com> wrote:

...
Hi Markus, Magnus and Wikidatans,

I can't yet add data to, for example, this - https://www.wikidata.org/wiki/Q933000 (real item) - by clicking "save," since the "save" button isn't an active link, but the "cancel" button is. I tried to add this URL - http://www.forthroadbridge.org/home (which I'm not actually able to see in my browser presently - all I see is a blank white page, unusually) - as well as to add the word "Fife" to various fields of this "Forth Road Bridge" Q item. Will this be possible in the near future?

Scott

On Tue, Jun 2, 2015 at 4:12 AM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:

...
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional IDs from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.

Regards,

Markus

On 02.06.2015 13:01, Markus Krötzsch wrote:

...
On 02.06.2015 11:30, Magnus Manske wrote:

...
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.

Yes, I noticed such cases too. From the information in Wikidata, it is not clear to me why this is sometimes done and sometimes not done.

For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):

https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185


Magnus Manske

4:04 p.m.

New subject: No links, wrong data: Scotland's orphans need help

On Tue, Jun 2, 2015 at 12:12 PM Markus Krötzsch <markus@semantic-mediawiki.org> wrote:

...

Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional IDs from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.

They are not unique on the Historic Scotland site, but they can still have the correct IDs on Wikidata, even if they are non-unique. What will be required for this (and other external IDs) in the long run is an automated or semi-automated check against the foreign data corpus, with heuristics highlighting potential issues. This includes new items in the external source (or ones we missed during initial import).

Given that I received the original data as a CSV from WMUK, who got it from Historic Scotland under Freedom of Information (IIRC), this might prove tricky.
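
A minimal sketch of what such a check against an external list could look like, assuming the external data is available as CSV text with the hypothetical columns `id` and `name`, and that the Wikidata side has already been exported as (item, external ID, label) tuples. The column names, the function name and the sample values are illustrative assumptions, not the layout of the actual Historic Scotland file.

```python
import csv
import io
from collections import defaultdict


def check_against_external(csv_text, wikidata_rows):
    """Compare (item, external_id, label) rows with an external CSV list.

    Returns a dict of potential issues for manual review:
      'unknown_id': items whose ID does not occur in the external list
      'shared_id':  IDs used by more than one item (not necessarily wrong)
      'missing':    external IDs that no item uses yet
    """
    # Hypothetical column names "id" and "name"; adapt to the real file.
    reader = csv.DictReader(io.StringIO(csv_text))
    external = {row["id"]: row["name"] for row in reader}

    items_by_id = defaultdict(list)
    for item, ext_id, _label in wikidata_rows:
        items_by_id[ext_id].append(item)

    return {
        "unknown_id": sorted(
            item
            for ext_id, items in items_by_id.items()
            if ext_id not in external
            for item in items
        ),
        "shared_id": {
            ext_id: sorted(items)
            for ext_id, items in items_by_id.items()
            if len(items) > 1
        },
        "missing": sorted(set(external) - set(items_by_id)),
    }


# Made-up sample data, for illustration only.
sample_csv = "id,name\n100,Example Church\n101,Example Bridge\n"
sample_rows = [
    ("ITEM_A", "100", "Example Church"),
    ("ITEM_B", "100", "Example Churchyard"),
    ("ITEM_C", "999", "Stray item"),
]
report = check_against_external(sample_csv, sample_rows)
```

The report only highlights candidates: a shared ID can be legitimate (for example a church and its churchyard), so the output is meant as input for manual review rather than for automatic fixes.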


Andrew Gray

3 Jun 3 Jun

4:48 a.m.

New subject: [Spam] Re: No links, wrong data: Scotland's orphans need help

This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.

Bridges often span a river which is the border for two jurisdictions (in this case, council areas). Each local area counts it as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifier).

The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?

I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.

Andrew.
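
The "allow two IDs if the item is some kind of bridge" idea can be sketched as a small check. This is only an illustration under stated assumptions: the data layout (a dict with `classes` and `ids`), the function name and the house entry are hypothetical, and Wikidata's real constraint system is not modelled here.

```python
def single_value_violations(items, exempt_classes=frozenset({"bridge"})):
    """Return the names of items carrying more IDs than they are allowed.

    `items` maps an item name to a dict with:
      'classes': set of class names the item belongs to
      'ids':     list of external IDs recorded on the item
    Items in an exempt class may carry two IDs, all others only one.
    """
    violations = []
    for name, data in items.items():
        limit = 2 if data["classes"] & exempt_classes else 1
        if len(set(data["ids"])) > limit:
            violations.append(name)
    return sorted(violations)


sample_items = {
    # IDs as quoted in the message above (Gwynedd and Anglesey registers).
    "Menai Suspension Bridge": {"classes": {"bridge"}, "ids": ["4049", "18572"]},
    # Made-up entry: a non-bridge item with two IDs would be flagged.
    "Example house": {"classes": {"house"}, "ids": ["111", "222"]},
}
flagged = single_value_violations(sample_items)
```

With these inputs only the made-up house is flagged, while the bridge with one ID per register passes.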


-- - Andrew Gray andrew.gray@dunelm.org.uk

Magnus Manske

4:57 a.m.

New subject: [Spam] Re: No links, wrong data: Scotland's orphans need help

Maybe there is a case to separate import and verification here?

There are many statements in Wikidata nowadays, but they get really "trustworthy" through references (other than "imported from Wikipedia"). But for external IDs, references are superfluous; they are their own reference, by definition. So how about marking IDs with a "verified" (or "last verified on") qualifier? Much of such work could be done by bots; we could then filter the problematic ones out for manual verification.

As we have no control over external lists, this would have to be re-checked every so often; but, again, bots to the rescue.
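
A rough sketch of how a bot might pick the IDs that are due for (re-)verification, assuming each ID statement carries an optional "last verified on" date. The field name `last_verified`, the 180-day interval and the sample statements are assumptions made for illustration; no such qualifier is defined in the discussion above.

```python
from datetime import date, timedelta


def needs_verification(statements, today, max_age_days=180):
    """Return statements never verified, or verified before the cutoff date."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        s for s in statements
        if s.get("last_verified") is None or s["last_verified"] < cutoff
    ]


# Made-up statements, for illustration only.
sample_statements = [
    {"item": "ITEM_A", "id": "100", "last_verified": date(2015, 5, 1)},
    {"item": "ITEM_B", "id": "101", "last_verified": date(2014, 6, 1)},
    {"item": "ITEM_C", "id": "102", "last_verified": None},
]
due = needs_verification(sample_statements, today=date(2015, 6, 3))
due_items = [s["item"] for s in due]
```

Here the recently verified statement is skipped, while the stale one and the never-verified one are queued for checking; anything the bot cannot confirm would then go to manual verification.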


Markus Krötzsch

6:56 a.m.

New subject: [Spam] Re: No links, wrong data: Scotland's orphans need help

On 03.06.2015 13:57, Magnus Manske wrote:

...

Yes, I fully support this proposal.

What do you think about making "last verified on" not a qualifier but (part of) the reference information? The reference could state where the bot has looked up the ID and give a time. This would be somewhat similar to what is now used in Freebase IDs, e.g., in https://www.wikidata.org/wiki/Q42.

In general, it might be useful to have such a "last verified on" property that can be added to arbitrary references. There are many other uses for this. One common case would be that a user has changed the value without even being aware of the reference -- then one would be able to detect this automatically by comparing the last modification time with the "last verified on" date.

Putting the "last verified on" into the references also makes it possible to have different dates for different references there.
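
A minimal sketch of the comparison described above, assuming each reference carries its own "last verified on" date. The `Reference` record and the `needs_recheck` helper are hypothetical names for illustration only, not part of Wikidata or any bot framework, and all dates are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Reference:
    """Hypothetical record: one reference attached to a statement."""
    source: str                 # where the value was looked up
    last_verified_on: datetime  # when it was last compared to the claim


def needs_recheck(statement_last_modified: datetime, ref: Reference) -> bool:
    """True if the statement was edited after this reference was last verified."""
    return statement_last_modified > ref.last_verified_on


# Made-up example: two references with different verification dates.
refs = [
    Reference("external database A", datetime(2015, 6, 1)),
    Reference("external database B", datetime(2015, 6, 5)),
]
last_modified = datetime(2015, 6, 3)  # made-up time of the last edit

# Each reference has its own date, so each one is judged separately.
stale_sources = [r.source for r in refs if needs_recheck(last_modified, r)]
```

Here only the reference verified before the edit is flagged, which is the automatic detection the paragraph above has in mind.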

Regards,

Markus

Dario Taraborelli

7:14 a.m.

New subject: [Spam] Re: No links, wrong data: Scotland's orphans need help

I second this. For a related effort, see:

https://github.com/pav-ontology/pav/

in particular, pav:sourceLastAccessedOn, pav:lastRefreshedOn, pav:lastUpdateOn http://pav-ontology.github.io/pav/#d4e846


Markus Krötzsch

7 Jun 7 Jun

8:18 a.m.

New subject: No links, wrong data: Scotland's orphans need help

Coming back to Magnus's suggestion ... I think the existing property "retrieved" (P813) could be used for this "last verified on" property, that is, for setting the time at which some external reference was last compared to a claim in Wikidata.

Magnus also pointed out that many external IDs are "self-verifying" in that they are their own reference. The situation is somewhat similar for homepages. Should we adopt the practice of giving a single retrieved value (without any further information) as the reference for such cases?

Adding P813 dates more widely would also open up new ways of maintaining data, since one would have a way to filter statements by how long ago they had last been checked.
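
As a rough sketch of such filtering, assuming the P813 dates have already been extracted from each statement's references: the dictionary layout, the item names and the `stale` helper below are made up for illustration and do not reflect the actual Wikidata data format.

```python
from datetime import date, timedelta

# Simplified, made-up statements; "retrieved" holds the P813 dates found
# in a statement's references (empty if it was never checked).
statements = [
    {"item": "example-1", "retrieved": [date(2015, 6, 7)]},
    {"item": "example-2", "retrieved": [date(2014, 1, 15)]},
    {"item": "example-3", "retrieved": []},
]


def stale(statements, today, max_age=timedelta(days=365)):
    """Return statements never checked, or last checked more than max_age ago."""
    result = []
    for s in statements:
        last = max(s["retrieved"], default=None)  # most recent check, if any
        if last is None or today - last > max_age:
            result.append(s)
    return result


to_recheck = [s["item"] for s in stale(statements, today=date(2015, 6, 8))]
```

The one-year threshold is an arbitrary choice; a maintenance bot would presumably pick it per property.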

Best wishes,

Markus


Magnus Manske

9:29 a.m.

New subject: No links, wrong data: Scotland's orphans need help

One question remaining is: Should there be a difference between "human-verified" and "bot-verified"? A bot can check if e.g. the label (or the words in the label) occur on the page at the URL to check, but it can't know for sure. Human review is more reliable, but vastly slower and not likely to happen for many/most such statements. Two different properties could act as different confidence levels. But maybe I'm just over-engineering this ;-)
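
A minimal sketch of the bot-side check and the two confidence levels, under the assumption that the page text has already been fetched. The function names, the level names and the sample text are invented for illustration and do not correspond to existing Wikidata properties.

```python
import re


def label_words_on_page(label: str, page_text: str) -> bool:
    """Heuristic: every word of the label occurs somewhere in the page text."""
    page_words = set(re.findall(r"\w+", page_text.lower()))
    label_words = re.findall(r"\w+", label.lower())
    return bool(label_words) and all(w in page_words for w in label_words)


def confidence(label: str, page_text: str, human_confirmed: bool = False) -> str:
    """Map the available evidence to one of three made-up confidence levels."""
    if human_confirmed:
        return "human-verified"
    if label_words_on_page(label, page_text):
        return "bot-verified"
    return "unverified"


# Made-up page text standing in for the content behind an external ID.
page = "Listed building record: Old Mill Bridge over the river."
level_match = confidence("Old Mill Bridge", page)
level_miss = confidence("New Mill Bridge", page)
```

A word match of this kind is only a guess, which is why the result is kept apart from a human confirmation.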


Markus Krötzsch

2:23 p.m.

New subject: No links, wrong data: Scotland's orphans need help

On 07.06.2015 18:29, Magnus Manske wrote:

...

One question remaining is: Should there be a difference between "human-verified" and "bot-verified"? A bot can check if e.g. the label (or the words in the label) occur on the page at the URL to check, but it can't know for sure. Human review is more reliable, but vastly slower and not likely to happen for many/most such statements. Two different properties could act as different confidence levels. But maybe I'm just over-engineering this ;-)

It depends. For structured data sources, a bot should be able to do a thorough verification (possibly better than a human), e.g., by comparing name, birthdate and deathdate of a person at once. I would focus on these cases first since we have enough of them ;-)
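
For illustration, a structured check of this kind might look like the sketch below. It assumes both sides have already been parsed into comparable records; `PersonRecord` and `records_match` are hypothetical names, and the person is made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PersonRecord:
    """Hypothetical, already-parsed view of a person in either database."""
    name: str
    born: Optional[date]
    died: Optional[date]


def records_match(wikidata: PersonRecord, external: PersonRecord) -> bool:
    """Names must agree (ignoring case and surrounding spaces) and both
    dates must be equal; two missing dates count as equal."""
    return (
        wikidata.name.casefold().strip() == external.name.casefold().strip()
        and wikidata.born == external.born
        and wikidata.died == external.died
    )


ours = PersonRecord("Jane Example", date(1900, 1, 2), date(1980, 3, 4))
theirs_same = PersonRecord("jane example ", date(1900, 1, 2), date(1980, 3, 4))
theirs_other = PersonRecord("Jane Example", date(1901, 1, 2), date(1980, 3, 4))

same_person = records_match(ours, theirs_same)
different_person = records_match(ours, theirs_other)
```

Checking all three fields at once is what makes this stricter than a plain text match on the name.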

For cases where a bot can only make a guess, it might be better to add a human to the loop, as in your (truly amazing!) sourcerer game. The game also shows that it may depend on the items how well this approach works, since text matches are sometimes completely meaningless (e.g., "Human parent taxon homo" cannot be verified by looking for "Homo" since every page that might contain this fact also mentions "Homo sapiens" many times). For such difficult cases, I am not sure if bot-supplied information such as "looked correct, but I am not sure" would really be very helpful. It depends ;-)

Cheers,

Markus


Luca Martinelli

11:07 a.m.

New subject: No links, wrong data: Scotland's orphans need help

On 7 Jun 2015 17:19, "Markus Krötzsch" markus@semantic-mediawiki.org wrote:

...

Coming back to Magnus's suggestion ... I think the existing property "retrieved" (P813) could be used for this "last verified on" property, that is, for setting the time at which some external reference was last compared to a claim in Wikidata.

Magnus also pointed out that many external IDs are "self-verifying" in that they are their own reference. The situation is somewhat similar for homepages. Should we adopt the practice of giving a single retrieved value (without any further information) as the reference for such cases?

Adding P813 dates more widely would also open up new ways of maintaining data, since one would have a way to filter statements by how long ago they had last been checked.

Sounds ok, but how will we do it? And should we wait for the identifier datatype to be ready?

Markus Krötzsch

2:30 p.m.

New subject: No links, wrong data: Scotland's orphans need help

On 07.06.2015 20:07, Luca Martinelli wrote:

Sounds ok, but how will we do it?

As editors, we can just do it from now on. I was always unsure what to use as a reference for ids and homepages. Now I'll use this.

Bot operators can do the same. Magnus is already using P813 in the sourcerer game as well.

I don't know if there is a good place on Wikidata to document such things. I always struggle to find documentation about how to do references (best practices, e.g., how to cite an online news portal correctly).

...

And should we wait for the identifier datatype to be ready?

Time information makes sense for any online reference, so we do not really need to know if the statement we are editing is for an ID property. Whether a statement is "self-verifying" so that a single P813 would already work as reference depends on the context. The properties that this is mainly true for are those of type URL and those where you can get a URL or URI to verify things (i.e., those with properties P1630 or P1921).
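
To illustrate the last point: a formatter URL (P1630) contains the placeholder `$1`, which is replaced by the identifier to obtain a page that can be checked. The sketch below assumes the formatter URL has already been read from the property; the helper name and the example.org URL are made up.

```python
def verification_url(formatter_url: str, identifier: str) -> str:
    """Substitute the identifier for the "$1" placeholder in a formatter URL."""
    if "$1" not in formatter_url:
        raise ValueError("formatter URL has no $1 placeholder")
    return formatter_url.replace("$1", identifier)


# Made-up formatter URL and identifier, for illustration only.
url = verification_url("https://example.org/record/$1", "12345")
```

A bot that verifies a statement this way could store the resulting URL in the reference next to the P813 date.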

Markus

Lydia Pintscher

8 Jun 8 Jun

3:52 a.m.

New subject: No links, wrong data: Scotland's orphans need help

On Sun, Jun 7, 2015 at 11:30 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:

...

I don't know if there is a good place on Wikidata to document such things. I always struggle to find documentation about how to do references (best practices, e.g., how to cite an online news portal correctly).

That'd be https://www.wikidata.org/wiki/Help:Sources

Cheers Lydia

-- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

Bene*

2:16 a.m.

New subject: No links, wrong data: Scotland's orphans need help

On 07.06.2015 at 17:18, Markus Krötzsch wrote:

...

Magnus also pointed out that many external IDs are "self-verifying" in that they are their own reference. The situation is somewhat similar for homepages. Should we adopt the practice of giving a single retrieved value (without any further information) as the reference for such cases?

I'd use the reference URL the value was imported from together with the retrieved value.

Best regards Bene

Markus Krötzsch

5:50 a.m.

New subject: No links, wrong data: Scotland's orphans need help

On 08.06.2015 11:16, Bene* wrote:

...

Hi

I'd use the reference URL the value was imported from together with the retrieved value.

Yes, that's a good point. Even if an ID has a value for "formatter URL" that defines the URL, it might be good to record the URL that was used to verify the data, since the formatter URL might change. Especially bots should add this, since it's no extra work for them. However, a single "retrieved" value is still better than nothing there. For homepages and other URL properties, I would maybe not store the URL again in the reference.

Note that most often the reference is not where the value is "imported from" but simply an external reference. In many cases, we have imported data from Wikipedia but it is then verified from another dataset.

Cheers,

Markus

Andy Mabbett

3 Jun 3 Jun

5:18 a.m.

New subject: [Spam] Re: No links, wrong data: Scotland's orphans need help

On 3 June 2015 at 12:48, Andrew Gray andrew.gray@dunelm.org.uk wrote:

...

The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?

The constraint should be "usually one ID" (i.e. "SHOULD have only one ID"), not "MUST have only one ID".

Wikidata already allows for this, and the constraints are editable.

See also the talk page and report for P496 for an example of a listed exception.
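
A rough sketch of such a "usually one ID" check with listed exceptions, assuming the IDs have already been collected per item. The function, the item names and the ID values are made up and do not reflect how Wikidata's constraint reports are actually implemented.

```python
def single_value_violations(ids_by_item, exceptions=frozenset()):
    """Report items with more than one ID, skipping listed exceptions."""
    return sorted(
        item
        for item, ids in ids_by_item.items()
        if len(ids) > 1 and item not in exceptions
    )


# Made-up data: the bridge legitimately has two IDs and is a listed exception.
ids_by_item = {
    "bridge item": ["1001", "2002"],
    "house item": ["3003"],
    "church item": ["4004", "5005"],
}
report = single_value_violations(ids_by_item, exceptions={"bridge item"})
```

Items in the report are candidates for review rather than errors, which matches the "SHOULD" reading of the constraint.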

-- Andy Mabbett @pigsonthewing http://pigsonthewing.org.uk

Markus Krötzsch

6:48 a.m.

New subject: [Spam] Re: No links, wrong data: Scotland's orphans need help

Thanks, Andrew, for the clarification. This makes perfect sense.

I don't see a problem with one bridge having two IDs in some external database. We already have this for other ID-like properties for other reasons. What is important though is that it still is a single bridge, and should therefore be one item.

Your clarification is reassuring since it suggests that the problem is not overly common after all. Maybe one can just merge these cases manually. Once the (multiple) ids are found in the merged items, avoiding future duplicates will be done as usual (which is still difficult with the Scottish Heritage ids since we have many legit Wikidata items that have the same id -- but this at least is an independent problem).

Regards,

Markus

On 03.06.2015 13:48, Andrew Gray wrote:

...

This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.

Bridges often span a river which is the border for two jurisdictions (in this case, council areas). Each local area counts it as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifier).

The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?

I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.

Andrew.

On 2 June 2015 at 12:12, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL... http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional ids from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.

Regards,

Markus

On 02.06.2015 13:01, Markus Krötzsch wrote:

...
On 02.06.2015 11:30, Magnus Manske wrote:

...
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were two entities in the original dataset, sharing the same ID.

Yes, I noticed such cases too. From the information in Wikidata, it is not clear to me why this is sometimes done and sometimes not done.

For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):

https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185

In many other cases, adjacent houses with the same ID are combined into one item:

https://www.wikidata.org/wiki/Q17806587

(note, however, that the house addresses given in the ID and in the item label do not match, though they overlap on most of the houses.)

Finally, there are also cases where there are different IDs and we have several items, but they have the same labels that merge the contents of the two IDs:

https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137

It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites. However, the coordinates give Historic Scotland as their reference -- I wonder if Historic Scotland might be changing frequently or exist in several versions.

Regards,

Markus

...
On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske <magnusmanske@googlemail.com mailto:magnusmanske@googlemail.com> wrote:
 Update: There appear to be quite a few items with duplicate Scotland
 IDs (not all of them may be erroneous!):
 http://wdq.wmflabs.org/stats?action=doublestring&prop=709

 On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske
 <magnusmanske@googlemail.com <mailto:magnusmanske@googlemail.com>>
 wrote:

     I created (some/most of) these items as part of the Wiki Loves
     Monuments UK 2014 drive, to run the campaign from Wikidata
     rather than from a bespoke database. This allows the community
     (TM) to maintain the data, rather than one poor sod (e.g.,
     myself) having to frantically update all of it every year ;-)

     "Consumer" tool is here:
     https://tools.wmflabs.org/wlmuk/index_wd.html

     These are based on "official" data from National Heritage,
     provided to me via Wikimedia UK. Grade A (or Grade I/II* in
     England) structures should be noteworthy by default.

     It appears (as per your examples) that some of these were
     created as duplicates/with wrong IDs. As I said, this is based
     on "official" data, so it's the best I could do at the time.
     With mass creation, there are bound to be a few strays. If you
     can find some large-scale, systemic issue I'll try to fix it,
     but the one-offs will always fall back to manual fixing. At
     least, with Wikidata, we can fix them together.

     On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler
     <daniel.kinzler@wikimedia.de
     <mailto:daniel.kinzler@wikimedia.de>> wrote:

         On 01.06.2015 at 22:26, Markus Krötzsch wrote:
          > Finally, the technical question is: Why is this even possible? I thought that,
          > in each language, label+description are a key (globally unique), yet here we
          > have many pairs of items with exactly the same label and description. Or is the
          > problem that no description was entered and so the system does not apply the
          > key?

         The uniqueness constraint does indeed not apply if there is no description.

         --
         Daniel Kinzler
         Senior Software Developer

         Wikimedia Deutschland
         Gesellschaft zur Förderung Freien Wissens e.V.


Andrew Gray

7 Jun 7 Jun

10:44 a.m.

New subject: [Spam] Re: [Spam] Re: No links, wrong data: Scotland's orphans need help

A related suggestion... I've wondered before if what we could use for such imports is a "meta" value for the P31 property - something like "instance of: imported unchecked item". When a person has corrected or checked the items, added sitelinks, etc it's easy to remove this value. This would let us easily identify ones that might still need assistance, eg to check for duplicates or to mark them as a part of a larger item, without continually having to go through the list.

Commons does something similar with hidden tracking categories for bulk uploads, and it's quite useful there.

Andrew.

On 3 June 2015 at 14:48, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...

Thanks, Andrew, for the clarification. This makes perfect sense.

I don't see a problem with one bridge having two IDs in some external database. We already have this for other ID-like properties for other reasons. What is important though is that it still is a single bridge, and should therefore be one item.

Your clarification is reassuring since it suggests that the problem is not overly common after all. Maybe one can just merge these cases manually. Once the (multiple) ids are found in the merged items, avoiding future duplicates will be done as usual (which is still difficult with the Scottish Heritage ids since we have many legit Wikidata items that have the same id -- but this at least is an independent problem).

Regards,

Markus

On 03.06.2015 13:48, Andrew Gray wrote:

...
This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.

Bridges often span a river which is the border for two jurisdictions (in this case, council areas). Each local area counts it as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifer).

The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?

I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.

Andrew.

On 2 June 2015 at 12:12, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...
Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be an many-to-many relationship between Wikidata and Historic Scottland. Orphans should receive additional ids from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.

Regards,

Markus

On 02.06.2015 13:01, Markus Krötzsch wrote:

...
On 02.06.2015 11:30, Magnus Manske wrote:

...
Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were as two entities in the original dataset, sharing the same ID.

Yes, I noticed such cases too. From the information Wikidata, it is not clear to me why this is sometimes done and sometimes not done.

For example, these adjacent houses have the same Scotland ID but different items that each have their own coordinates (where did the coordinates come from?):

https://www.wikidata.org/wiki/Q17576211 https://www.wikidata.org/wiki/Q17576182 https://www.wikidata.org/wiki/Q17576185

In many other cases, adjacent houses with the same ID are combined into one item:

https://www.wikidata.org/wiki/Q17806587

(note, however, that the house addresses given in the ID and in the item label do not match, though they overlap on most of the houses.)

Finally, there are also cases where there are different IDs and we have several items, but they have the same labels that merge the contents of the two IDs:

https://www.wikidata.org/wiki/Q17810121 https://www.wikidata.org/wiki/Q17810137

It seems that the data was not taken from the Historic Sites database but from some different source that has its own coordinate data and a different (but seemingly arbitrary) approach to grouping sites. However, the coordinated give Historic Scotland as their reference -- I wonder if Historic Scotland might be changing frequently or exist in several versions.

Regards,

Markus

...
On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske <magnusmanske@googlemail.com mailto:magnusmanske@googlemail.com> wrote:
 Update: There appear to be quite a few items with duplicate
Scotland IDs (not all of them may be erroneous!): http://wdq.wmflabs.org/stats?action=doublestring&prop=709
 On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske
 <magnusmanske@googlemail.com <mailto:magnusmanske@googlemail.com>>
 wrote:

     I created (some/most of) these items as part of the Wiki Loves
     Monuments UK 2014 drive, to run the campaign from Wikidata
     rather than from a bespoke database. This allows the community
     (TM) to maintain the data, rather than one poor sod (e.g.,
     myself) having to frantically update all of it every year ;-)

     "Consumer" tool is here:
     https://tools.wmflabs.org/wlmuk/index_wd.html

     These are based on "official" data from National Heritage,
     provided to me via Wikimedia UK. Grade A (or Grade I/II* in
     England) structures should be noteworthy by default.

     It appears (as per your examples) that some of these were
     created as duplicates/with wrong IDs. As I said, this is based
     on "official" data, so it's the best I could do at the time.
     With mass creation, there are bound to be a few strays. If you
     can find some large-scale, systemic issue I'll try to fix it,
     but the one-offs will always fall back to manual fixing. At
     least, with Wikidata, we can fix them together.

     On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:

         On 01.06.2015 at 22:26, Markus Krötzsch wrote:
          > Finally, the technical question is: Why is this even possible? I thought that,
          > in each language, label+description are a key (globally unique), yet here we
          > have many pairs of items with exactly the same label and description. Or is the
          > problem that no description was entered and so the system does not apply the key?

         The uniqueness constraint does indeed not apply if there is no description.
         --
         Daniel Kinzler
         Senior Software Developer

         Wikimedia Deutschland
         Gesellschaft zur Förderung Freien Wissens e.V.

         _______________________________________________
         Wikidata mailing list
         Wikidata@lists.wikimedia.org
         https://lists.wikimedia.org/mailman/listinfo/wikidata
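
To make Daniel's answer concrete, here is a minimal sketch of how a label+description key might behave when items without a description are skipped. It is not the Wikibase implementation; the function name, the dictionary layout, and the item IDs (Q1, Q2, Q3) are made up for illustration.

```python
from collections import defaultdict


def find_key_collisions(items, enforce_without_description=False):
    """Return groups of item IDs sharing (language, label, description).

    Items with an empty description are skipped unless
    enforce_without_description is True, mirroring the behaviour
    described in the thread (an assumption, not the real Wikibase code).
    """
    groups = defaultdict(list)
    for item in items:
        description = item.get("description", "")
        if not description and not enforce_without_description:
            continue  # the key is not applied when no description exists
        key = (item["lang"], item["label"], description)
        groups[key].append(item["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical items: two share a label and have no description.
items = [
    {"id": "Q1", "lang": "en", "label": "Old Bridge", "description": ""},
    {"id": "Q2", "lang": "en", "label": "Old Bridge", "description": ""},
    {"id": "Q3", "lang": "en", "label": "Town Hall", "description": "building in Fife"},
]

print(find_key_collisions(items))
print(find_key_collisions(items, enforce_without_description=True))
```

With the default setting the two "Old Bridge" items pass unnoticed; only the stricter variant reports them, which is roughly the helper Markus asked about for orphaned items.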

-- - Andrew Gray andrew.gray@dunelm.org.uk

Gerard Meijssen

11:05 a.m.

New subject: [Spam] Re: [Spam] Re: No links, wrong data: Scotland's orphans need help

Hoi, What is wrong with identifying NOTHING with there being a problem?

To me it seems a bit too much. I like to keep tabs on the items that have 0 statements; they are useless in many ways. Thanks, GerardM

On 7 June 2015 at 19:44, Andrew Gray andrew.gray@dunelm.org.uk wrote:

...

A related suggestion... I've wondered before if what we could use for such imports is a "meta" value for the P31 property - something like "instance of: imported unchecked item". When a person has corrected or checked the items, added sitelinks, etc., it's easy to remove this value. This would let us easily identify ones that might still need assistance, e.g. to check for duplicates or to mark them as a part of a larger item, without continually having to go through the list.

Commons does something similar with hidden tracking categories for bulk uploads, and it's quite useful there.
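
As a rough illustration of the marker idea, the sketch below filters items that still carry the marker and removes it once an item has been checked. The marker string, the dictionary layout, and the IDs (Q_A, Q_B) are hypothetical; no such value is known to exist on Wikidata.

```python
import copy

# Hypothetical marker value; not a real Wikidata item.
UNCHECKED = "imported unchecked item"


def needs_review(items):
    """List the IDs of items that still carry the maintenance marker."""
    return [item["id"] for item in items if UNCHECKED in item.get("instance_of", [])]


def mark_checked(item):
    """Return a copy of the item with the marker removed."""
    checked = copy.deepcopy(item)
    checked["instance_of"] = [
        value for value in checked.get("instance_of", []) if value != UNCHECKED
    ]
    return checked


items = [
    {"id": "Q_A", "instance_of": ["house", UNCHECKED]},
    {"id": "Q_B", "instance_of": ["church"]},
]

print(needs_review(items))
print(mark_checked(items[0]))
```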

Andrew.

On 3 June 2015 at 14:48, Markus Krötzsch markus@semantic-mediawiki.org wrote:


Thanks, Andrew, for the clarification. This makes perfect sense.

I don't see a problem with one bridge having two IDs in some external database. We already have this for other ID-like properties for other reasons. What is important though is that it still is a single bridge, and should therefore be one item.

Your clarification is reassuring since it suggests that the problem is not overly common after all. Maybe one can just merge these cases manually. Once the (multiple) ids are found in the merged items, avoiding future duplicates will be done as usual (which is still difficult with the Scottish Heritage ids since we have many legit Wikidata items that have the same id -- but this at least is an independent problem).

Regards,

Markus

On 03.06.2015 13:48, Andrew Gray wrote:

...
This particular case is something of a known problem - we've encountered it with some of the other heritage-building identifier lists as well.

Bridges often span a river which is the border for two jurisdictions (in this case, council areas). Each local area counts it as a historic building, and because the national lists are aggregated from local lists, it gets two entries in the main list, one as Fife and one as Edinburgh. A similar case in Wales is the Menai Suspension Bridge, which is 4049 from the Gwynedd register and 18572 from the Anglesey one (Wikidata, at Q581526, only lists one identifier).

The lack of deduplication is probably intentional rather than a bug, and both entries are "correct". Perhaps one way to handle this for Wikidata would be to, hmm, say something like "if the item is some kind of a bridge, then allow two IDs" in the constraints?

I can't immediately think of any bridges which cross national borders *and* are a heritage building in both countries, but we'd see the same thing there, with it having identifiers from both sides.
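
The "allow two IDs if the item is a bridge" idea could be sketched as below. This is only an illustration under assumptions: the field names, the class labels, and the second example item (Q_house with IDs 111 and 222) are invented, and the check uses a flat class match rather than walking a subclass hierarchy, so it does not cover "some kind of a bridge" in full. The Menai Suspension Bridge values are the ones quoted above.

```python
# Classes exempt from the single-ID rule; "bridge" is a placeholder label.
EXEMPT_CLASSES = frozenset({"bridge"})


def violates_single_id(item, exempt_classes=EXEMPT_CLASSES):
    """True if an item has more than one heritage ID without an exemption."""
    distinct_ids = set(item.get("heritage_ids", []))
    if len(distinct_ids) <= 1:
        return False
    # Flat membership test only; subclasses of bridge are not resolved here.
    return not (set(item.get("instance_of", [])) & exempt_classes)


menai = {
    "id": "Q581526",
    "instance_of": ["bridge"],
    "heritage_ids": ["4049", "18572"],
}
house = {
    "id": "Q_house",  # hypothetical item
    "instance_of": ["house"],
    "heritage_ids": ["111", "222"],  # hypothetical IDs
}

print(violates_single_id(menai))
print(violates_single_id(house))
```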

Andrew.

On 2 June 2015 at 12:12, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:

Another interesting type of Scottish historic orphans are those that are duplicates of items that do have site links. Even very prominent ones are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup) https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL...

Overall, this seems to be an example of an ID that really should not be considered "identity providing" since there seems to be a many-to-many relationship between Wikidata and Historic Scotland. Orphans should receive additional ids from a better source if at all possible. With the great number of seemingly legit non-functional uses of the Scotland IDs, they cannot be used in practice to detect duplicates.
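
A small sketch of what "many-to-many" means in practice: given (item, ID) pairs, it reports IDs used by several items and items carrying several IDs. The pairs below are invented placeholders, not data from Wikidata or Historic Scotland, and the function name is hypothetical.

```python
from collections import defaultdict


def relation_report(claims):
    """Split (item_id, external_id) pairs into the two non-functional cases.

    Returns (shared_ids, multi_id_items):
      shared_ids     maps an external ID to the items that use it,
      multi_id_items maps an item to the external IDs it carries.
    Only entries with more than one member are kept.
    """
    items_by_id = defaultdict(set)
    ids_by_item = defaultdict(set)
    for item_id, external_id in claims:
        items_by_id[external_id].add(item_id)
        ids_by_item[item_id].add(external_id)
    shared_ids = {e: sorted(q) for e, q in items_by_id.items() if len(q) > 1}
    multi_id_items = {q: sorted(e) for q, e in ids_by_item.items() if len(e) > 1}
    return shared_ids, multi_id_items


# Placeholder data: a church and churchyard sharing one ID,
# and a bridge listed under two IDs.
claims = [
    ("Q_church", "ID-1"),
    ("Q_churchyard", "ID-1"),
    ("Q_bridge", "ID-2"),
    ("Q_bridge", "ID-3"),
    ("Q_house", "ID-4"),
]

shared_ids, multi_id_items = relation_report(claims)
print(shared_ids)
print(multi_id_items)
```

Both result sets being non-empty is what rules out using the ID alone as a duplicate detector: a shared ID may be legitimate (church and churchyard), and a single item may legitimately carry two IDs (the bridge case).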

Regards,

Markus

On 02.06.2015 13:01, Markus Krötzsch wrote:

On 02.06.2015 11:30, Magnus Manske wrote:

Update 2: For example, https://www.wikidata.org/wiki/Q17847522 and https://www.wikidata.org/wiki/Q17847537 have the same Scotland ID, but refer to different entities (church and churchyard, respectively). They were listed as two entities in the original dataset, sharing the same ID.


Magnus Manske

12:14 p.m.

New subject: [Spam] Re: [Spam] Re: No links, wrong data: Scotland's orphans need help

Ah, that might be a little bit overkill. It would have to be "re-instanced" on every subsequent edit. Not to mention the "contamination" of P31 with maintenance items.

On Sun, Jun 7, 2015 at 6:45 PM Andrew Gray andrew.gray@dunelm.org.uk wrote:

...

A related suggestion... I've wondered before if what we could use for such imports is a "meta" value for the P31 property - something like "instance of: imported unchecked item". When a person has corrected or checked the items, added sitelinks, etc., it's easy to remove this value. This would let us easily identify ones that might still need assistance, e.g. to check for duplicates or to mark them as a part of a larger item, without continually having to go through the list.

Commons does something similar with hidden tracking categories for bulk uploads, and it's quite useful there.

Andrew.


wikidata@lists.wikimedia.org

32 comments

12 participants


Andrew Gray
Andy Mabbett
Bene*
Daniel Kinzler
Dario Taraborelli
Federico Leva (Nemo)
Gerard Meijssen
Luca Martinelli
Lydia Pintscher
Magnus Manske
Markus Krötzsch
Scott MacLeod