2015-01-17 4:27 GMT-05:00 Lydia Pintscher lydia.pintscher@wikimedia.de:
The log is at https://meta.wikimedia.org/wiki/IRC_office_hours/Office_hours_2015-01-16 for anyone who couldn't make it.
Denny discusses importing all missing VIAF keys from Freebase using "multichill" (unclear what that is from the context) on the assumption that the error rate is low. It would be worth checking assumptions like that with folks who are familiar with the Freebase data before acting on them.
Here are some things that I think are true about the VIAF keys in Freebase:
- they were assigned by a user, not by Google/Metaweb (not necessarily a bad thing since some of the biggest problems in Freebase were created by G/M and some users have contributed very high quality data)
- the keys were, I believe, assigned based heavily on existing Library of Congress identifiers that had previously been assigned by Metaweb. Those key assignments are not as high quality as other parts of Freebase. One easy thing to check for is people with two LC (and thus two VIAF) keys assigned. In cases where there is more than one key and the extra keys don't represent pseudonyms, this is a clear error.
- Freebase doesn't create separate entities for pseudonyms, unlike the library cataloging world. Depending on what decision Wikidata makes in this regard, it's something to watch out for when reusing Freebase author data (including VIAF keys)
- much Freebase author data was imported from OpenLibrary which has its own set of quality issues. A bunch of this data was later deleted, leaving that portion of the graph somewhat thready and moth-eaten. It's unclear whether that was a net gain or loss in overall bibliographic data quality for Freebase.
- I suspect that most VIAF keys which are in Freebase but not in Wikidata represent entities which are not in Wikidata at all. That means they aren't useful anyway, since Denny wants to focus on creating new links, not new entities (a direction that I'm not sure I agree with, but that's a whole separate discussion).
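The duplicate-key check mentioned in the list above can be sketched as follows. This is only an illustration under stated assumptions: the input is assumed to be an iterable of (entity id, VIAF key) pairs already extracted from a Freebase dump, and the function name and sample identifiers are hypothetical, not real Freebase or VIAF values. Flagged entities still need manual review, since extra keys may legitimately represent pseudonyms.

```python
from collections import defaultdict


def entities_with_multiple_keys(rows):
    """Return {entity_id: sorted list of keys} for entities with more than one distinct key."""
    keys_by_entity = defaultdict(set)
    for entity_id, key in rows:
        # A set ignores exact repeats of the same (entity, key) pair.
        keys_by_entity[entity_id].add(key)
    return {
        entity_id: sorted(keys)
        for entity_id, keys in keys_by_entity.items()
        if len(keys) > 1
    }


# Hypothetical sample data, made up for this sketch.
sample_rows = [
    ("/m/example1", "viaf-0001"),
    ("/m/example1", "viaf-0002"),  # second distinct key: candidate error or pseudonym
    ("/m/example2", "viaf-0003"),
    ("/m/example2", "viaf-0003"),  # exact repeat, not a conflict
]

suspects = entities_with_multiple_keys(sample_rows)
print(suspects)
```

Entities that come out of this check are candidates only; whether a second key is an error or a pseudonym has to be decided case by case.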
One of the key inputs to judging the quality of assertions is their provenance. Fortunately, this is recorded for all assertions in Freebase and it's possible to trace a given fact back to the user, toolchain, or process that added it to the database. Unfortunately, this information is only available through the Freebase API, not the bulk data dump. Hopefully, this will change before Google completely abandons Freebase.
If any Wikidata folk want to discuss VIAF keys in Freebase (or its author data in general), feel free to get in touch.
Tom
Tom Morris, 17/01/2015 17:17:
Denny discusses importing all missing VIAF keys from Freebase using "multichill" (unclear what that is from the context)
Hi Tom,
Tom Morris wrote on 17-1-2015 at 17:17:
2015-01-17 4:27 GMT-05:00 Lydia Pintscher <lydia.pintscher@wikimedia.de>:
The log is at https://meta.wikimedia.org/wiki/IRC_office_hours/Office_hours_2015-01-16 for anyone who couldn't make it.
Denny discusses importing all missing VIAF keys from Freebase using "multichill" (unclear what that is from the context) on the assumption that the error rate is low. It would be worth checking assumptions like that with folks who are familiar with the Freebase data before acting on them.
I guess you are referring to "18:57:55 <vrandecic> If you ask me, I am happy with just letting multichill to upload the VIAFs that are still missing"
That would be me. VIAF is a very good starting point for getting more authority data. If you have a VIAF identifier, you can add other authority control data based on it, so getting more links to VIAF would be nice. I'm not sure how many are still missing. I recently did that for ULAN and NTA ( https://www.wikidata.org/w/index.php?title=Q120609&diff=182583270&ol... / https://www.wikidata.org/w/index.php?title=Q1610938&diff=182636686&o... ) and was able to add over 100,000 new links. I still have to do this for other types of authority control. The more tightly connected things get, the easier it gets to find problems or duplicates.
Maarten
On Sun, Jan 18, 2015 at 9:11 AM, Maarten Dammers maarten@mdammers.nl wrote:
Hi Tom,
Tom Morris wrote on 17-1-2015 at 17:17:
2015-01-17 4:27 GMT-05:00 Lydia Pintscher lydia.pintscher@wikimedia.de:
The log is at https://meta.wikimedia.org/wiki/IRC_office_hours/Office_hours_2015-01-16 for anyone who couldn't make it.
Denny discusses importing all missing VIAF keys from Freebase using "multichill" (unclear what that is from the context) on the assumption that the error rate is low. It would be worth checking assumptions like that with folks who are familiar with the Freebase data before acting on them.
I guess you are referring to "18:57:55 <vrandecic> If you ask me, I am happy with just letting multichill to upload the VIAFs that are still missing"
That would be me. VIAF is a very good starting point for getting more authority data. If you have a VIAF identifier, you can add other authority control data based on it, so getting more links to VIAF would be nice. I'm not sure how many are still missing. I recently did that for ULAN and NTA ( https://www.wikidata.org/w/index.php?title=Q120609&diff=182583270&ol... / https://www.wikidata.org/w/index.php?title=Q1610938&diff=182636686&o... ) and was able to add over 100,000 new links. I still have to do this for other types of authority control. The more tightly connected things get, the easier it gets to find problems or duplicates.
Maarten
Hello All,
It was me who originally imported the roughly 400,000 VIAF links into Wikidata. They were matched using a name-and-date-of-birth matching algorithm against an English Wikipedia dump, and the matching was done by a team that works for VIAF.org. Those matches were then imported into English Wikipedia. Additionally, other projects such as Italian Wikipedia and Commons had done some manual matching. After about a year of manual correction in the Wikipedias, I took the authority control data from (I think about 9) different wikis and imported it into Wikidata. I also later did some of what Maarten/multichill is doing now, which is to do lookups on VIAF and import the resulting data, like sex/gender and alternative names. I'm going to ping the people at VIAF (which is part of OCLC, where I used to work but no longer do) about this to see if they have any thoughts to add as well.
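For readers unfamiliar with this kind of matching, here is a minimal sketch of what a name-and-date-of-birth match could look like. It is not the algorithm the VIAF team used, which is not described in this thread. The record layout, the function names, and the sample values are all hypothetical, and the normalization (accent stripping, case folding, whitespace collapsing) is an assumption chosen for illustration.

```python
import unicodedata


def normalize_name(name):
    """Strip accents, fold case, and collapse whitespace (assumed normalization)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.casefold().split())


def match_records(wiki_people, authority_records):
    """Link a page to an authority id only when exactly one record shares name and birth date."""
    index = {}
    for record_id, name, birth_date in authority_records:
        index.setdefault((normalize_name(name), birth_date), []).append(record_id)

    matches = {}
    for page, name, birth_date in wiki_people:
        candidates = index.get((normalize_name(name), birth_date), [])
        if len(candidates) == 1:  # skip ambiguous or missing candidates
            matches[page] = candidates[0]
    return matches


# Hypothetical sample records, made up for this sketch.
wiki_people = [
    ("Page A", "José  Example", "1900-01-01"),
    ("Page B", "Ann Sample", "1950-05-05"),
]
authority_records = [
    ("id-1", "Jose Example", "1900-01-01"),
    ("id-2", "Ann Sample", "1950-05-05"),
    ("id-3", "ann sample", "1950-05-05"),  # makes Page B ambiguous
]

matches = match_records(wiki_people, authority_records)
print(matches)
```

Skipping ambiguous candidates trades coverage for precision; a sketch like this also ignores name-order variants such as "Last, First", which a real matcher would need to handle.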
Make a great day, Max Klein ‽ http://notconfusing.com/
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l