On Tue, Jun 2, 2015 at 12:12 PM Markus Krötzsch <markus@semantic-mediawiki.org> wrote:

Another interesting type of Scottish historic orphans are those that are
duplicates of items that do have site links. Even very prominent ones
are duplicated, such as

https://www.wikidata.org/wiki/Q17569486 (dup)
https://www.wikidata.org/wiki/Q933000 (real item)

Interestingly, they use different Scotland IDs, and it does indeed seem
that Historic Scotland also contains duplicates:

http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL:47778
http://data.historic-scotland.gov.uk/pls/htmldb/f?p=2200:15:0::::BUILDING,HL:49165

Overall, this seems to be an example of an ID that really should not be
considered "identity providing" since there seems to be a many-to-many
relationship between Wikidata and Historic Scotland. Orphans should
receive additional IDs from a better source if at all possible. With the
great number of seemingly legit non-functional uses of the Scotland IDs,
they cannot be used in practice to detect duplicates.
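
To make the "many-to-many" point concrete, here is a minimal sketch of how one could list the non-functional uses from an export of (item, Scotland ID) pairs. The function name, the input shape, and the placeholder identifiers are assumptions for illustration only; they are not taken from Wikidata or from the Historic Scotland data.

```python
from collections import defaultdict


def find_non_functional(pairs):
    """Group (item_id, external_id) pairs in both directions.

    `pairs` is a hypothetical export of Wikidata items and their
    Scotland IDs; how it is obtained is left open here.
    """
    items_by_id = defaultdict(set)
    ids_by_item = defaultdict(set)
    for item_id, external_id in pairs:
        items_by_id[external_id].add(item_id)
        ids_by_item[item_id].add(external_id)

    # External IDs used on more than one item (possible duplicates,
    # or legitimately shared IDs such as church and churchyard).
    shared_ids = {e: sorted(i) for e, i in items_by_id.items() if len(i) > 1}
    # Items carrying more than one external ID.
    multi_id_items = {i: sorted(e) for i, e in ids_by_item.items() if len(e) > 1}
    return shared_ids, multi_id_items


# Placeholder data, not real items or IDs.
example_pairs = [("Q_A", "1"), ("Q_B", "1"), ("Q_C", "2"), ("Q_C", "3")]
shared_ids, multi_id_items = find_non_functional(example_pairs)
```

Both result sets would still need manual review, since a shared ID does not by itself tell a duplicate apart from a legitimate split.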

The IDs are not unique on the Historic Scotland site, but items can still carry the correct IDs on Wikidata, even if those IDs are non-unique. What will be required for this (and other external IDs) in the long run is an automated or semi-automated check against the foreign data corpus, with heuristics highlighting potential issues. This includes new items in the external source (or ones we missed during the initial import).

Given that I received the original data as a CSV from WMUK, who got it from Historic Scotland under Freedom of Information (IIRC), this might prove tricky.
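
As a rough sketch of what the simplest form of such a check could look like, assuming the external data is available as CSV text with an ID column: the column name `id`, the function name, and the sample rows below are hypothetical, not the layout of the actual Historic Scotland file.

```python
import csv
import io


def compare_with_source(csv_text, wikidata_ids, id_column="id"):
    """Compare external IDs from a CSV dump with the IDs used on Wikidata.

    `id_column` is an assumed column name; adjust to the real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    source_ids = {row[id_column].strip() for row in reader}
    used_ids = set(wikidata_ids)
    return {
        # In the external source but on no Wikidata item:
        # new entries, or ones missed during the initial import.
        "missing_in_wikidata": sorted(source_ids - used_ids),
        # On Wikidata but not in the external source:
        # possibly wrong or withdrawn IDs.
        "unknown_in_source": sorted(used_ids - source_ids),
    }


# Made-up sample data for illustration only.
sample_csv = "id,name\n1,Church\n2,Churchyard\n3,House\n"
report = compare_with_source(sample_csv, ["1", "2", "9"])
```

This only covers exact ID matches; label or coordinate heuristics would have to come on top of it.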

Regards,

Markus

On 02.06.2015 13:01, Markus Krötzsch wrote:
> On 02.06.2015 11:30, Magnus Manske wrote:
>> Update 2:
>> For example,
>> https://www.wikidata.org/wiki/Q17847522
>> and
>> https://www.wikidata.org/wiki/Q17847537
>> have the same Scotland ID, but refer to different entities (church and
>> churchyard, respectively). They were listed as two entities in the
>> original dataset, sharing the same ID.
>
> Yes, I noticed such cases too. From the information in Wikidata, it is
> not clear to me why this is sometimes done and sometimes not done.
>
> For example, these adjacent houses have the same Scotland ID but
> different items that each have their own coordinates (where did the
> coordinates come from?):
>
> https://www.wikidata.org/wiki/Q17576211
> https://www.wikidata.org/wiki/Q17576182
> https://www.wikidata.org/wiki/Q17576185
>
> In many other cases, adjacent houses with the same ID are combined into
> one item:
>
> https://www.wikidata.org/wiki/Q17806587
>
> (note, however, that the house addresses given in the ID and in the item
> label do not match, though they overlap on most of the houses.)
>
> Finally, there are also cases where there are different IDs and we have
> several items, but they have the same labels that merge the contents of
> the two IDs:
>
> https://www.wikidata.org/wiki/Q17810121
> https://www.wikidata.org/wiki/Q17810137
>
>
> It seems that the data was not taken from the Historic Sites database
> but from some different source that has its own coordinate data and a
> different (but seemingly arbitrary) approach to grouping sites. However,
> the coordinates give Historic Scotland as their reference -- I wonder if
> Historic Scotland might be changing frequently or exist in several
> versions.
>
> Regards,
>
> Markus
>
>
>>
>> On Tue, Jun 2, 2015 at 10:26 AM Magnus Manske
>> <magnusmanske@googlemail.com> wrote:
>>
>> Update: There appear to be quite a few items with duplicate Scotland
>> IDs (not all of them may be erroneous!):
>> http://wdq.wmflabs.org/stats?action=doublestring&prop=709
>>
>> On Tue, Jun 2, 2015 at 10:23 AM Magnus Manske
>> <magnusmanske@googlemail.com> wrote:
>>
>> I created (some/most of) these items as part of the Wiki Loves
>> Monuments UK 2014 drive, to run the campaign from Wikidata
>> rather than from a bespoke database. This allows the community
>> (TM) to maintain the data, rather than one poor sod (e.g.,
>> myself) having to frantically update all of it every year ;-)
>>
>> "Consumer" tool is here:
>> https://tools.wmflabs.org/wlmuk/index_wd.html
>>
>> These are based on "official" data from National Heritage,
>> provided to me via Wikimedia UK. Grade A (or Grade I/II* in
>> England) structures should be noteworthy by default.
>>
>> It appears (as per your examples) that some of these were
>> created as duplicates/with wrong IDs. As I said, this is based
>> on "official" data, so it's the best I could do at the time.
>> With mass creation, there are bound to be a few strays. If you
>> can find some large-scale, systemic issue I'll try to fix it,
>> but the one-offs will always fall back to manual fixing. At
>> least, with Wikidata, we can fix them together.
>>
>> On Tue, Jun 2, 2015 at 10:01 AM Daniel Kinzler
>> <daniel.kinzler@wikimedia.de> wrote:
>>
>> Am 01.06.2015 um 22:26 schrieb Markus Krötzsch:
>> > Finally, the technical question is: Why is this even
>> possible? I thought that,
>> > in each language, label+description are a key (globally
>> unique), yet here we
>> > have many pairs of items with exactly the same label and
>> description. Or is the
>> > problem that no description was entered and so the system
>> does not apply the
>> > key?
>>
>> The uniqueness constraint does indeed not apply if there is
>> no description.
>>
>> --
>> Daniel Kinzler
>> Senior Software Developer
>>
>> Wikimedia Deutschland
>> Gesellschaft zur Förderung Freien Wissens e.V.
>>
>> _______________________________________________
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> <mailto:Wikidata@lists.wikimedia.org>
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>>
>
