Hi,
Sandra, these issues are the most obvious area for improving quality, not
only at the external sources but also at our end, at Wikidata. The potential is bigger. All these
sources have information and when we have mappings between their statements
and our statements, we can verify if they are the same. This allows us to
automate the process to a large extent. When there are differences, we can
produce reports and these reports are 100% relevant issues. Finding sources
for these items or statements makes sense.
One issue I came across today was that some of our ISNI records are wrong.
How do we deal with that? The point is that we are firmly in data add mode
and quality issues like these are currently handled in a one-off manner.
Another example is that when wiki links are added, it does not follow that
a label is added. Amir has a bot, it runs on request but it could be
automated.
Once we are aware of mappings and the data that is held externally, we can
show a source in green when its information agrees with what we hold and
red when there is a discrepancy. Additional data is initially to be
excluded for all kinds of reasons. A visual indicator like this urges
people to look into the differing data. It is relatively easy to show a
comparison between Wikidata and external sources that hold the same item /
statement.
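
As a rough illustration of the green/red indicator, here is a minimal Python sketch. It assumes a mapping from Wikidata properties to fields of one external source is already known. The function name, the external field names and all example values are hypothetical; only the property IDs P569 (date of birth) and P570 (date of death) are real Wikidata properties.

```python
def compare_statements(ours, theirs, mapping):
    """Return {property: "green" | "red"} for mapped properties held by both sides."""
    status = {}
    for our_prop, their_field in mapping.items():
        # Data held by only one side is excluded for now, as described above.
        if our_prop not in ours or their_field not in theirs:
            continue
        agrees = ours[our_prop] == theirs[their_field]
        status[our_prop] = "green" if agrees else "red"
    return status


# Hypothetical mapping and values, for illustration only.
mapping = {"P569": "birth_year", "P570": "death_year"}
ours = {"P569": "1850", "P570": "1920"}
theirs = {"birth_year": "1850", "death_year": "1921", "occupation": "botanist"}

result = compare_statements(ours, theirs, mapping)
print(result)
```

Every "red" entry is a candidate line for the kind of discrepancy report mentioned earlier. Real values would need normalising (date precision, formatting) before a plain equality test is meaningful.
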
Sandra, quality is in addressing issues where we find them. We do find them
but that is where it typically stops. Continuous quality improvement
requires development. So far it has not been a priority and I can
understand that with all the other low-hanging fruit out there. Maybe 2016
will shine favourably on issues like these :)
Thanks,
Gerard
On 31 December 2015 at 12:22, Sandra Fauconnier <sandra.fauconnier(a)gmail.com>
wrote:
> This exchange with VIAF/OCLC is only one case. We are increasingly
> becoming a hub for authority sources/files/records, and staying mutually
> up-to-date with them is (IMHO) really important.
> And indeed, we don’t have reliable tools/workflows in place yet to take
> good care of this…
>
> I haven’t given it much thought yet either, but here is what comes to mind
> off the top of my head (among other things):
> 1. We have changes and updates on our own side (sometimes in error,
> sometimes correctly)
> 1.1 We add new entries that correspond with not-yet-linked entries in
> external authority files
> 1.2 We remove duplicates in Wikidata
> 1.3 We add errors that may or may not be tracked by the external party
> (e.g. the VIAF case here) -> this feedback is indeed gold to us
>
> 2. There are changes and updates on the authority file’s side
> 2.1 Entries are added on their side, and some or all may correspond to
> items on Wikidata
> 2.2 Entries are removed on their side (RKDartists has deleted many
> entries recently, for instance)
> 2.3 We discover errors, duplicates… in the external authority file -> this
> information is probably gold to the external party too
>
> I have the feeling that this is one large issue that would ideally be
> covered in one overall solution/system. Mix’n’Match is the first step in
> that direction, but is only very rudimentary and doesn’t catch all issues
> described above.
>
> Any thoughts or input on this? (I’m not a developer, just an active
> user/contributor to this…)
>
> Greetings, Sandra/User:Spinster
>
> On 28 Dec 2015, at 20:29, Tom Morris <tfmorris(a)gmail.com>
> wrote:
>
> I think there are at least two uses for information like this. Fixing the
> actual errors is good, but perhaps more important is looking at why they
> happened in the first place. Are there infrastructure/process issues which
> need to be improved? Are there systemic problems with particular tool
> chains, users, domains, etc? What patterns does the data show?
>
> I've attached a munged version of the list in a format which I find a
> little easier to work with and added Wikidata links
>
> Looking at the 30 oldest entities, 22 (!) of the duplicates were added by
> a single user (bot?) who was adding botanists, apparently based on data
> from the International Plant Names Index
> <http://www.ipni.org/ipni/authorsearchpage.do>, without first checking to
> see if they already existed. The user's page indicates that they've had
> 2.5 *million* items deleted or merged (~12% of everything they've added).
> I'd hope to see high volume users/bots/tools in the 99%+ range for quality,
> not <90%.
>
> One pair is not a duplicate, but rather a father
> <https://www.wikidata.org/wiki/Q1228686> & son
> <https://www.wikidata.org/wiki/Q716655> with the same name, apparently
> flagged because they were both born in the 2nd century and died in the 3rd
> century, making them a "match."
>
> A few of the remaining duplicates were created by a variety of bots
> importing Wikipedia entries with incompletely fused sitelinks (not terribly
> surprising when the only structured information is a name and a sitelink).
>
> The last few pairs of duplicates don't really have enough provenance to
> figure out the source of the data. One was created just a couple of weeks
> ago by a bot
> <https://www.wikidata.org/w/index.php?title=Q18603442&action=history>
> using "data from the Rijksmuseum" (no link or other provenance given),
> apparently without checking for existing entries first. A few
> <https://www.wikidata.org/w/index.php?title=Q16825734&action=history>
> others
> <https://www.wikidata.org/w/index.php?title=Q19619143&action=history>
> were created by Widar
> <https://www.wikidata.org/w/index.php?title=Q19933200&action=history>,
> but I can't tell what game, what data source, etc.
>
> Looking at three pairs of entries which were created at nearly the same
> time (min QNumberDelta), each pair was created by a single game/bot,
> indicating inadequate internal duplicate checks on the input data.
>
> It seems like post hoc analysis of merged entries to mine for patterns
> would be a very useful tool to identify systemic issues. Is that something
> that is done currently?
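
A minimal Python sketch of that kind of post hoc analysis might look like the following. It ranks merged pairs by the distance between their Q-numbers (the "QNumberDelta" used above) and counts how often each creator appears. The record layout, the function names and the creator names are hypothetical. The two Q-number pairs come from this thread, and the father/son pair is included only to exercise the arithmetic.

```python
from collections import Counter


def q_number(qid):
    """Numeric part of a Wikidata item ID, e.g. 'Q42' -> 42."""
    return int(qid[1:])


def q_number_delta(pair):
    """Absolute distance between the Q-numbers of the two items in a pair."""
    first, second = pair
    return abs(q_number(first) - q_number(second))


def rank_by_delta(pairs):
    """Pairs created closest together (smallest delta) come first."""
    return sorted(pairs, key=q_number_delta)


def duplicates_per_creator(records):
    """Count how many flagged pairs each creator is responsible for."""
    return Counter(creator for _pair, creator in records)


# Creator names below are made up for illustration.
records = [
    (("Q1228686", "Q716655"), "hypothetical-bot-A"),
    (("Q21518392", "Q21341290"), "hypothetical-bot-B"),
]
ranked = rank_by_delta([pair for pair, _creator in records])
per_creator = duplicates_per_creator(records)
print(ranked[0], q_number_delta(ranked[0]))
```

A small delta suggests both items came out of the same batch, which points at missing duplicate checks inside a single import rather than a clash between two tools.
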
>
> Tom
>
>
>
> On Wed, Dec 23, 2015 at 5:05 PM, Proffitt,Merrilee <proffitm(a)oclc.org>
> wrote:
>
>> Hello colleagues,
>>
>>
>>
>> During the most recent VIAF harvest we encountered a number of duplicate
>> records in Wikidata. Forwarding on in case this is of interest (there is an
>> attached file – not sure if that will go through on this list or not).
>>
>>
>>
>> Some discussion from OCLC colleagues is included below.
>>
>>
>>
>> Merrilee Proffitt, Senior Program Officer
>> OCLC Research
>>
>>
>>
>> *From:* Toves,Jenny
>> *Sent:* Tuesday, December 22, 2015 6:02 AM
>> *To:* Proffitt,Merrilee
>> *Subject:* FW: 201551 vs 201552
>>
>>
>>
>> Good morning Merrilee,
>>
>>
>>
>> You probably know that we harvest wikidata monthly for ingest into VIAF.
>> This month we found 315 pairs of records that appear to be duplicates. That
>> was a jump from previous months. I am not sure who would be interested in
>> this but Thom & I thought you might be. The attached report has 630 lines
>> showing what viaf saw as duplicates. So this pair of lines:
>>
>>
>>
>> WKP|Q21518392 =998 $aCharles du Bois Larbalestier$2WKP|Q21341290$3duplicate
>>
>> WKP|Q21341290 =998 $aCharles du Bois Larbalestier$2WKP|Q21518392$3duplicate
>>
>>
>>
>> Shows that those two wikidata numbers are linked to one another by viaf.
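
For anyone who wants to work with the attached report, a minimal Python sketch for reducing the lines to unique pairs could look like this. It assumes each record sits on a single line in the file (the mail wraps them) and that every line follows the layout of the two samples above. The field layout is inferred from those two lines only, and the function name is hypothetical.

```python
import re

# Layout inferred from the two sample lines:
#   WKP|<item> =998 $a<name>$2WKP|<other item>$3duplicate
LINE_RE = re.compile(r"^WKP\|(Q\d+)\s+=998\s+\$a(.*?)\$2WKP\|(Q\d+)\$3duplicate$")


def unique_pairs(lines):
    """Map each unordered item pair to its name; A->B and B->A count once."""
    pairs = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not fit the assumed layout
        source, name, target = match.groups()
        key = tuple(sorted((source, target), key=lambda qid: int(qid[1:])))
        pairs.setdefault(key, name)
    return pairs


report = [
    "WKP|Q21518392 =998 $aCharles du Bois Larbalestier$2WKP|Q21341290$3duplicate",
    "WKP|Q21341290 =998 $aCharles du Bois Larbalestier$2WKP|Q21518392$3duplicate",
]
pairs = unique_pairs(report)
print(pairs)
```

Because each pair is reported once from each side, the 630 lines in the report correspond to the 315 pairs mentioned above.
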
>>
>>
>>
>> I don’t think we expect you to do anything with this unless you find it
>> interesting. I suspect there are bots to clean this stuff up but maybe not.
>>
>>
>>
>> --Jenny.
>>
>>
>>
>> *From:* Hickey,Thom
>> *Sent:* Monday, December 21, 2015 9:47 PM
>> *To:* Toves,Jenny
>> *Subject:* RE: 201551 vs 201552
>>
>>
>>
>> She probably would be interested.
>>
>>
>>
>> --Th
>>
>>
>>
>>
>> *From: *Toves,Jenny <tovesj(a)oclc.org>
>> *Sent: *Monday, December 21, 2015 9:35 PM
>> *To: *Hickey,Thom <hickey(a)oclc.org>
>> *Subject: *RE: 201551 vs 201552
>>
>>
>>
>> Exact same name + dates. Do you have a list of them? Do you think Merrilee or
>> anyone would be interested?
>>
>>
>>
>> *From:* Hickey,Thom
>> *Sent:* Monday, December 21, 2015 8:04 PM
>> *To:* Toves,Jenny
>> *Subject:* FW: 201551 vs 201552
>>
>>
>>
>> Noticed WKP duplicates went way up
>>
>> --Th
>>
>>
>>
>>
>> *From: *Jenny Toves <toves(a)orhddb01dxdu.dev.oclc.org>
>> *Sent: *Monday, December 21, 2015 5:12 PM
>> *To: *Hickey,Thom <hickey(a)oclc.org>; Toves,Jenny <tovesj(a)oclc.org>
>> *Subject: *201551 vs 201552
>>
>>
>>
>>
>>
>> REPORT for records
>>
>> Changed 13.51%: geographic 3369217.0 -> 3824513.0
>>
>> Change in % of 8: NLR at_least_one_match 16% -> 24%
>>
>> Changed 19.83%: NLR all_matches 181437.0 -> 217423.0
>>
>> Change in % of 88: NLR with_bibs 0% -> 88%
>>
>> Changed 17.99%: WKP geographic 2529990.0 -> 2985194.0
>>
>> Changed -19.95%: WKP corporate 369224.0 -> 295579.0
>>
>>
>>
>> REPORT for matches
>>
>> Changed 12.70%: exact corporate name 1021239.0 -> 1150899.0
>>
>> Changed 14.29%: XR viafid 7.0 -> 8.0
>>
>> Changed -10.42%: XR expression title to sibling 48.0 -> 43.0
>>
>> Changed -16.16%: PTBNP forced 229.0 -> 192.0
>>
>> Changed -37.50%: NSZL forced 8.0 -> 5.0
>>
>> Changed 38.46%: NLP suggested 13.0 -> 18.0
>>
>> No longer zero: NLR standard number 0.0 -> 21479.0
>>
>> No longer zero: NLR exact title 0.0 -> 5166.0
>>
>> No longer zero: NLR partial date and partial title 0.0 -> 618.0
>>
>> No longer zero: NLR name as subject 0.0 -> 62.0
>>
>> No longer zero: NLR partial title and publisher 0.0 -> 88.0
>>
>> No longer zero: NLR title 0.0 -> 5093.0
>>
>> Changed -47.66%: NLR forced single date 37125.0 -> 19430.0
>>
>> Changed 14.29%: NLR viafid 14.0 -> 16.0
>>
>> No longer zero: NLR partial date and publisher 0.0 -> 15894.0
>>
>> No longer zero: NLR joint author 0.0 -> 5228.0
>>
>> Changed -14.49%: LC suggested 7594.0 -> 6494.0
>>
>> Changed 33.33%: CYT viafid 12.0 -> 16.0
>>
>> Changed -21.08%: NLA forced 223.0 -> 176.0
>>
>> Changed 233.33%: LNL forced 3.0 -> 10.0
>>
>> Changed 12.50%: NLB viafid 8.0 -> 9.0
>>
>> Changed 16.67%: NLB ngram corporate name 6.0 -> 7.0
>>
>> Changed 25.71%: VLACC forced 35.0 -> 44.0
>>
>> Changed 19.13%: DNB exact corporate name 315872.0 -> 376304.0
>>
>> Changed 14.29%: DNB expression title to sibling 7.0 -> 8.0
>>
>> Changed 16.67%: BNF expression title to sibling 6.0 -> 7.0
>>
>> Changed 15.91%: ICCU forced 44.0 -> 51.0
>>
>> Changed 25.54%: NTA forced 9699.0 -> 12176.0
>>
>> Changed 28.62%: WKP exact corporate name 224787.0 -> 289112.0
>>
>> Changed 23.73%: WKP longer corporate name 76057.0 -> 94106.0
>>
>> Changed 584.78%: WKP duplicate record 92.0 -> 630.0
>>
>> Changed -18.92%: EGAXA forced 37.0 -> 30.0
>>
>>
>>
>> REPORT for tags
>>
>> Changed 11.56%: NSZL work links (993) 225.0 -> 251.0
>>
>> No longer zero: NLR wrote about (955) 0.0 -> 106.0
>>
>> No longer zero: NLR bibs (999) 0.0 -> 108202.0
>>
>> No longer zero: NLR was a subject (960) 0.0 -> 16423.0
>>
>> No longer zero: NLR relator code (941) 0.0 -> 103950.0
>>
>> No longer zero: NLR language of work (940) 0.0 -> 108193.0
>>
>> No longer zero: NLR issn (902) 0.0 -> 34.0
>>
>> No longer zero: NLR bib title (910) 0.0 -> 107895.0
>>
>> No longer zero: NLR joint corporate author (951) 0.0 -> 24235.0
>>
>> Changed 146.67%: NLR compared (996) 27448.0 -> 67705.0
>>
>> No longer zero: NLR rectype + biblvl (944) 0.0 -> 108194.0
>>
>> No longer zero: NLR country of publication (922) 0.0 -> 108169.0
>>
>> No longer zero: NLR publisher (921) 0.0 -> 93904.0
>>
>> No longer zero: NLR isbn (901) 0.0 -> 78978.0
>>
>> No longer zero: NLR publisher id (920) 0.0 -> 78978.0
>>
>> Changed 50.05%: NLR matched (998) 19864.0 -> 29806.0
>>
>> No longer zero: NLR name from statement of responsibility (930) 0.0 -> 72478.0
>>
>> No longer zero: NLR noise title (912) 0.0 -> 3543.0
>>
>> No longer zero: NLR lc class number (942) 0.0 -> 1.0
>>
>> No longer zero: NLR joint author (950) 0.0 -> 69048.0
>>
>> No longer zero: NLR was a subject (969) 0.0 -> 115.0
>>
>> Changed -14.29%: XA work links (993) 7.0 -> 6.0
>>
>> No longer zero: SRP work links (993) 0.0 -> 1.0
>>
>> Changed 22.50%: BNL work links (993) 551.0 -> 675.0
>>
>> Changed 11.16%: WKP auth title (919) 45779.0 -> 50890.0
>>
>> Changed 12.56%: WKP noise title (912) 8249.0 -> 9285.0
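
The "Changed" figures in the report above are consistent with a plain relative change between the two harvests. A minimal Python sketch that reproduces such a line is given here. The function names are hypothetical, the "Change in % of" lines use a different format that is not covered, and this is a reconstruction from the numbers, not the actual VIAF reporting code.

```python
def percent_change(old, new):
    """Relative change from old to new in percent, or None when old is zero."""
    if old == 0:
        return None  # the report prints "No longer zero" for these rows
    return (new - old) / old * 100.0


def describe(label, old, new):
    """Format one row in the style of the report (assumes new differs from old)."""
    change = percent_change(old, new)
    if change is None:
        return f"No longer zero: {label} {old} -> {new}"
    return f"Changed {change:.2f}%: {label} {old} -> {new}"


line = describe("WKP duplicate record", 92.0, 630.0)
print(line)
```

For the duplicate rows this gives (630 - 92) / 92 = 584.78%, matching the line that prompted this thread.
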
>>
>>
>>
>>
>>
>> _______________________________________________
>> Wikidata mailing list
>> Wikidata(a)lists.wikimedia.org
>>
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
> <VIAF-Wikidata-duplicates.tsv>