Hello colleagues,
During the most recent VIAF harvest we encountered a number of duplicate records in Wikidata. Forwarding on in case this is of interest (there is an attached file – not sure if that will go through on this list or not).
Some discussion from OCLC colleagues is included below.
Merrilee Proffitt, Senior Program Officer
OCLC Research
From: Toves,Jenny
Sent: Tuesday, December 22, 2015 6:02 AM
To: Proffitt,Merrilee
Subject: FW: 201551 vs 201552
Good morning Merrilee,
You probably know that we harvest wikidata monthly for ingest into VIAF. This month we found 315 pairs of records that appear to be duplicates. That was a jump from previous months. I am not sure who would be interested in this but Thom & I thought you might be. The attached report has 630 lines showing what viaf saw as duplicates. So this pair of lines:
WKP|Q21518392 =998 $aCharles du Bois Larbalestier$2WKP|Q21341290$3duplicate
WKP|Q21341290 =998 $aCharles du Bois Larbalestier$2WKP|Q21518392$3duplicate
Shows that those two wikidata numbers are linked to one another by viaf.
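As a rough illustration of how the attached report could be read, here is a minimal Python sketch that collapses the mirrored lines into unordered pairs. The line layout is an assumption inferred only from the two sample lines above, and the names `parse_line` and `duplicate_pairs` are hypothetical helpers, not part of any VIAF tooling.

```python
import re

# Assumed line shape, inferred only from the two sample lines in this message:
#   WKP|<QID> =998 $a<name>$2WKP|<QID>$3duplicate
LINE_RE = re.compile(
    r"^WKP\|(?P<source>Q\d+)\s+=998\s+\$a(?P<name>.*?)"
    r"\$2WKP\|(?P<target>Q\d+)\$3duplicate\s*$"
)


def parse_line(line):
    """Return (source_qid, name, target_qid), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("source"), m.group("name"), m.group("target")


def duplicate_pairs(lines):
    """Collapse mirrored lines (A->B and B->A) into one unordered pair each."""
    pairs = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that does not look like a duplicate line
        source, name, target = parsed
        # Order the two Q-numbers numerically so both directions share a key.
        key = tuple(sorted((source, target), key=lambda q: int(q[1:])))
        pairs.setdefault(key, name)
    return pairs


sample = [
    "WKP|Q21518392 =998 $aCharles du Bois Larbalestier$2WKP|Q21341290$3duplicate",
    "WKP|Q21341290 =998 $aCharles du Bois Larbalestier$2WKP|Q21518392$3duplicate",
]
for (first, second), name in duplicate_pairs(sample).items():
    print(first, second, name)
```

If every duplicate is reported in both directions, as in the sample, a 630-line report would reduce to 315 pairs this way.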
I don’t think we expect you to do anything with this unless you find it interesting. I suspect there are bots to clean this stuff up but maybe not.
--Jenny.
From: Hickey,Thom
Sent: Monday, December 21, 2015 9:47 PM
To: Toves,Jenny
Subject: RE: 201551 vs 201552
She probably would be interested.
--Th
From: Toves,Jenny<mailto:tovesj@oclc.org>
Sent: Monday, December 21, 2015 9:35 PM
To: Hickey,Thom<mailto:hickey@oclc.org>
Subject: RE: 201551 vs 201552
Exact same name + dates. Do you have a list of them? Do you think Merrilee or anyone would be interested?
From: Hickey,Thom
Sent: Monday, December 21, 2015 8:04 PM
To: Toves,Jenny
Subject: FW: 201551 vs 201552
Noticed WKP duplicates went way up
--Th
From: Jenny Toves<mailto:toves@orhddb01dxdu.dev.oclc.org>
Sent: Monday, December 21, 2015 5:12 PM
To: Hickey,Thom<mailto:hickey@oclc.org>; Toves,Jenny<mailto:tovesj@oclc.org>
Subject: 201551 vs 201552
REPORT for records
Changed 13.51%: geographic 3369217.0 -> 3824513.0
Change in % of 8: NLR at_least_one_match 16% -> 24%
Changed 19.83%: NLR all_matches 181437.0 -> 217423.0
Change in % of 88: NLR with_bibs 0% -> 88%
Changed 17.99%: WKP geographic 2529990.0 -> 2985194.0
Changed -19.95%: WKP corporate 369224.0 -> 295579.0
REPORT for matches
Changed 12.70%: exact corporate name 1021239.0 -> 1150899.0
Changed 14.29%: XR viafid 7.0 -> 8.0
Changed -10.42%: XR expression title to sibling 48.0 -> 43.0
Changed -16.16%: PTBNP forced 229.0 -> 192.0
Changed -37.50%: NSZL forced 8.0 -> 5.0
Changed 38.46%: NLP suggested 13.0 -> 18.0
No longer zero: NLR standard number 0.0 -> 21479.0
No longer zero: NLR exact title 0.0 -> 5166.0
No longer zero: NLR partial date and partial title 0.0 -> 618.0
No longer zero: NLR name as subject 0.0 -> 62.0
No longer zero: NLR partial title and publisher 0.0 -> 88.0
No longer zero: NLR title 0.0 -> 5093.0
Changed -47.66%: NLR forced single date 37125.0 -> 19430.0
Changed 14.29%: NLR viafid 14.0 -> 16.0
No longer zero: NLR partial date and publisher 0.0 -> 15894.0
No longer zero: NLR joint author 0.0 -> 5228.0
Changed -14.49%: LC suggested 7594.0 -> 6494.0
Changed 33.33%: CYT viafid 12.0 -> 16.0
Changed -21.08%: NLA forced 223.0 -> 176.0
Changed 233.33%: LNL forced 3.0 -> 10.0
Changed 12.50%: NLB viafid 8.0 -> 9.0
Changed 16.67%: NLB ngram corporate name 6.0 -> 7.0
Changed 25.71%: VLACC forced 35.0 -> 44.0
Changed 19.13%: DNB exact corporate name 315872.0 -> 376304.0
Changed 14.29%: DNB expression title to sibling 7.0 -> 8.0
Changed 16.67%: BNF expression title to sibling 6.0 -> 7.0
Changed 15.91%: ICCU forced 44.0 -> 51.0
Changed 25.54%: NTA forced 9699.0 -> 12176.0
Changed 28.62%: WKP exact corporate name 224787.0 -> 289112.0
Changed 23.73%: WKP longer corporate name 76057.0 -> 94106.0
Changed 584.78%: WKP duplicate record 92.0 -> 630.0
Changed -18.92%: EGAXA forced 37.0 -> 30.0
REPORT for tags
Changed 11.56%: NSZL work links (993) 225.0 -> 251.0
No longer zero: NLR wrote about (955) 0.0 -> 106.0
No longer zero: NLR bibs (999) 0.0 -> 108202.0
No longer zero: NLR was a subject (960) 0.0 -> 16423.0
No longer zero: NLR relator code (941) 0.0 -> 103950.0
No longer zero: NLR language of work (940) 0.0 -> 108193.0
No longer zero: NLR issn (902) 0.0 -> 34.0
No longer zero: NLR bib title (910) 0.0 -> 107895.0
No longer zero: NLR joint corporate author (951) 0.0 -> 24235.0
Changed 146.67%: NLR compared (996) 27448.0 -> 67705.0
No longer zero: NLR rectype + biblvl (944) 0.0 -> 108194.0
No longer zero: NLR country of publication (922) 0.0 -> 108169.0
No longer zero: NLR publisher (921) 0.0 -> 93904.0
No longer zero: NLR isbn (901) 0.0 -> 78978.0
No longer zero: NLR publisher id (920) 0.0 -> 78978.0
Changed 50.05%: NLR matched (998) 19864.0 -> 29806.0
No longer zero: NLR name from statement of responsibility (930) 0.0 -> 72478.0
No longer zero: NLR noise title (912) 0.0 -> 3543.0
No longer zero: NLR lc class number (942) 0.0 -> 1.0
No longer zero: NLR joint author (950) 0.0 -> 69048.0
No longer zero: NLR was a subject (969) 0.0 -> 115.0
Changed -14.29%: XA work links (993) 7.0 -> 6.0
No longer zero: SRP work links (993) 0.0 -> 1.0
Changed 22.50%: BNL work links (993) 551.0 -> 675.0
Changed 11.16%: WKP auth title (919) 45779.0 -> 50890.0
Changed 12.56%: WKP noise title (912) 8249.0 -> 9285.0
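The "Changed" figures in the report above appear to be plain relative changes, (new - old) / old. The sketch below recomputes them from the two counts on each line. It assumes the line layout seen in the report, and `percent_change` and `check_line` are hypothetical names introduced for illustration; the script that actually produced the report is not shown in this thread.

```python
import re

# Assumed shape of a "Changed" line, inferred from the report above:
#   Changed <pct>%: <label> <old> -> <new>
CHANGED_RE = re.compile(
    r"^Changed (?P<pct>-?\d+(?:\.\d+)?)%: (?P<label>.+?) "
    r"(?P<old>\d+(?:\.\d+)?) -> (?P<new>\d+(?:\.\d+)?)$"
)


def percent_change(old, new):
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100.0


def check_line(line):
    """Return (label, reported_pct, recomputed_pct), or None for other line types."""
    m = CHANGED_RE.match(line.strip())
    if m is None:
        return None  # e.g. "No longer zero" lines, where old is 0
    old = float(m.group("old"))
    new = float(m.group("new"))
    return m.group("label"), float(m.group("pct")), round(percent_change(old, new), 2)


print(check_line("Changed 584.78%: WKP duplicate record 92.0 -> 630.0"))
```

For the WKP duplicate record line this gives (630 - 92) / 92 = 5.8478, which matches the reported 584.78%.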
I'm pretty sure ambiguation is not a word, but what are the guidelines on
removing disambiguation information from names/labels of items? Usually
this disambiguation information was added to a Wikipedia article name to
enforce their uniqueness requirements, but Wikidata has no such need and
the presence of the information makes it very difficult to construct things
like placename hierarchies which don't look clumsy/weird.
It appears that, in general, disambiguation text has been removed from
names, but this isn't always the case. Is it safe/recommended to clean up
those that have been missed?
Here are some examples that I've collected:
Linden Park, Massachusetts https://www.wikidata.org/wiki/Q6552338
boulevard carnot cannes https://www.wikidata.org/wiki/Q2921787
Newton Lower Falls, Massachusetts https://www.wikidata.org/wiki/Q7020301
Harrington House (Weston, Massachusetts)
https://www.wikidata.org/wiki/Q14715508
Thomas Fleming House (Sherborn, Massachusetts)
https://www.wikidata.org/wiki/Q14715881
Peabody, Cambridge, Massachusetts https://www.wikidata.org/wiki/Q7157211
Is the standard the "natural" name or the invented name that Wikipedians
came up with?
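To make the two patterns in the examples above concrete, here is a minimal Python sketch that separates a trailing parenthetical qualifier from a label and merely flags comma-style labels for review. It is only a sketch under assumptions: the function names are hypothetical, it does not touch Wikidata, and a comma can be part of a genuine name, so nothing here should be read as a recommended cleanup rule.

```python
import re

# Trailing parenthetical, e.g. "Harrington House (Weston, Massachusetts)".
PAREN_RE = re.compile(r"^(?P<base>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_disambiguation(label):
    """Return (base, qualifier); qualifier is None without a trailing parenthetical."""
    text = label.strip()
    m = PAREN_RE.match(text)
    if m is None:
        return text, None
    return m.group("base"), m.group("qualifier")


def has_comma_qualifier(label):
    """Flag "Place, Region" style labels for manual review instead of rewriting them."""
    base, _ = split_disambiguation(label)
    return "," in base


examples = [
    "Linden Park, Massachusetts",
    "Harrington House (Weston, Massachusetts)",
    "Peabody, Cambridge, Massachusetts",
]
for label in examples:
    print(split_disambiguation(label), has_comma_qualifier(label))
```

The parenthetical form can be split mechanically, while the comma form is left to a human because the sketch cannot tell a qualifier from part of the natural name.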
Tom
Hi Dario,
> Date: Wed, 23 Dec 2015 08:04:33 -0800
> From: Dario Taraborelli <dtaraborelli(a)wikimedia.org>
> To: "Discussion list for the Wikidata project."
> <wikidata(a)lists.wikimedia.org>
> Subject: Re: [Wikidata] [ANNOUNCEMENT] StrepHit IEG project kick-off
> seminar
> Message-ID:
> <CAHSVRZQ4kGCiHJfJYOzbSB8djBwyAB7e0zQ=
> ugVJnBB-mJxsCQ(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Marco,
>
> will the seminar be streamed or recorded?
>
I have to check with FBK's staff, it should be straightforward.
I will take care of sharing the link with everyone.
Cheers,
Marco
>
> Dario
>
> On Wed, Dec 23, 2015 at 8:03 AM, Marco Fossati <fossati(a)spaziodati.eu>
> wrote:
>
> > [Begging pardon if you read this multiple times]
> >
> > Hi everyone,
> >
> > I would like to announce with great pleasure the StrepHit IEG project
> > kick-off seminar.
> > Of course, you are all invited to attend.
> >
> > The event will be held on a special day: Wikipedia's birthday!
> >
> > Below you can find the details.
> >
> > Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
> > Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
> > - http://www.openstreetmap.org/way/28933739
> >
> > Abstract: We kick off StrepHit, a project funded by the Wikimedia
> > Foundation through the Individual Engagement Grants program.
> > StrepHit is a Natural Language Processing pipeline that understands human
> > language, extracts facts from text and produces Wikidata statements with
> > reference URLs.
> > It will enhance the data quality of Wikidata by suggesting references to
> > validate statements, and will help Wikidata become the gold-standard hub of
> > the Open Data landscape.
> >
> > Link:
> > https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Va…
> >
> > Speaker's bio: Marco Fossati is a researcher with a double background in
> > Natural Languages and Information Technologies. He works at the Data and
> > Knowledge Management (DKM) research unit at Fondazione Bruno Kessler,
> > Trento, Italy. He is a member of the DBpedia Association board of trustees,
> > founder and representative of its Italian chapter. He has interdisciplinary
> > skills both in linguistics and in programming. His research focuses on
> > bridging the gap between Natural Language Processing techniques and Large
> > Scale Structured Knowledge Bases in order to drive the Web of Data towards
> > its full potential.
> >
> > See you in Trento and long live Wikipedia!
> > Cheers,
> >
> > Marco
> >
> > _______________________________________________
> > Wikidata mailing list
> > Wikidata(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
> >
> >
>
>
> --
>
>
> *Dario Taraborelli *Head of Research, Wikimedia Foundation
> wikimediafoundation.org • nitens.org • @readermeter
> <http://twitter.com/readermeter>
>
Hey all,
I would like to ask about programming web apps using Wikidata. I'm
completely new to programming, and I just started learning Python. My
question is if I would like to make apps like Mix 'n' Match or the
Wikidata Game, or stuff like Histropedia, Tree of Life, and Sum of All
Paintings, what programming languages/skills do I need to learn? Also, is
it worth learning frameworks like Django or Ruby now, or do
these come later? (And if yes, which framework do you think is suitable?)
Thanks!
Best,
Adam
[Begging pardon if you read this multiple times]
Hi everyone,
I would like to announce with great pleasure the StrepHit IEG project
kick-off seminar.
Of course, you are all invited to attend.
The event will be held on a special day: Wikipedia's birthday!
Below you can find the details.
Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy -
http://www.openstreetmap.org/way/28933739
Abstract: We kick off StrepHit, a project funded by the Wikimedia
Foundation through the Individual Engagement Grants program.
StrepHit is a Natural Language Processing pipeline that understands human
language, extracts facts from text and produces Wikidata statements with
reference URLs.
It will enhance the data quality of Wikidata by suggesting references to
validate statements, and will help Wikidata become the gold-standard hub of
the Open Data landscape.
Link:
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Va…
Speaker's bio: Marco Fossati is a researcher with a double background in
Natural Languages and Information Technologies. He works at the Data and
Knowledge Management (DKM) research unit at Fondazione Bruno Kessler,
Trento, Italy. He is a member of the DBpedia Association board of trustees,
founder and representative of its Italian chapter. He has interdisciplinary
skills both in linguistics and in programming. His research focuses on
bridging the gap between Natural Language Processing techniques and Large
Scale Structured Knowledge Bases in order to drive the Web of Data towards
its full potential.
See you in Trento and long live Wikipedia!
Cheers,
Marco