On Nov 21, 2015, at 10:31, Magnus Manske magnusmanske@googlemail.com wrote:
To address the first point: the auto-matches are just simple label matches. Removing an automatch in mix'n'match only says that this was not the same person, etc., and the entry is moved back to the "unmatched" pool.
This does /not/ mean there isn't a match on Wikidata! You only say that by setting the entry to "not on Wikidata".
Apologies, I was indeed referring to items explicitly flagged as "not on WD", not simply unmerged ones.
And I do occasionally batch-create items for those, usually when all entries are processed. That can cause other issues: for example, an item may have been created in the meantime, and I then create a duplicate.
+1
A solution could be to change the "not on Wikidata" button (or link) to a "create new item" button. The new item would have a label, a description (maybe), a statement with the catalog ID (if there is an associated Wikidata property!), and "instance of:human" if the entry is internally marked as "person", but nothing else.
Would that be welcomed by "mix'n'matchers", and Wikidata people? I think it would make sense, for catalogs with a Wikidata property at least.
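To make the proposed "create new item" behaviour concrete, here is a minimal sketch of what such a new item might contain. It is an illustration only, not mix'n'match's actual code: the function name, the entry fields (`name`, `description`, `external_id`, `type`), and the simplified item structure are all assumptions. P31 ("instance of") and Q5 ("human") are real Wikidata identifiers; the catalog property is passed in by the caller.

```python
def build_new_item(entry, catalog_property=None):
    """Build a minimal item description for an entry flagged as missing.

    `entry` is a hypothetical dict with the keys "name", "external_id",
    and optionally "description" and "type". `catalog_property` is the
    Wikidata property ID for the catalog, or None if there is none.
    """
    statements = []

    # Catalog ID statement, only if the catalog has a Wikidata property.
    if catalog_property is not None:
        statements.append((catalog_property, entry["external_id"]))

    # "instance of: human" only if the entry is marked as a person.
    if entry.get("type") == "person":
        statements.append(("P31", "Q5"))

    item = {"label": entry["name"], "statements": statements}

    # The description is optional ("maybe" in the proposal).
    if entry.get("description"):
        item["description"] = entry["description"]

    # Nothing else is added.
    return item
```

For a person entry in a catalog with a property, the result carries two statements; for a catalog without a property, only the "instance of" statement remains.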
I would strongly support this, with the restrictions you suggest.
As for the second point, I think in most cases the mere existence of a new, better-fitting item (or at least one equally fitting at first glance) will prevent false assignments. Sure, there are some cases, like the one given as an example, which would profit from a P1889 "different from" statement. We have run into that problem with the "merge game" I'm running, where people do a lot of false merges because the items seem identical at first glance.
However, I don't think this is prevalent enough to warrant special treatment in mix'n'match itself. For the few cases where it would help, Wikidata can always be edited manually. Besides, where would we draw the line? "John Smith" returns hundreds of search results; that would translate into tens of thousands of "different from" statements.
I think once your "Giulio Baldigara" example brother is created, and both will show up in search results, that alone will be enough to serve as a "different from" warning in most settings. Mix'n'match automatch, for example, will only match entries where the exact label is unique in labels and aliases; two items with a "Giulio Baldigara" label or alias would not automatch an entry with that name.
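The uniqueness rule described above can be sketched as follows. This is a simplified illustration, not the actual mix'n'match implementation: the function names, the in-memory index, and the item ID `Q_NEW` (standing in for the not-yet-created brother's item) are assumptions made for the example.

```python
from collections import defaultdict


def build_name_index(items):
    """Map every exact label or alias to the set of item IDs that use it.

    `items` is a dict of item ID -> list of labels and aliases.
    """
    index = defaultdict(set)
    for item_id, names in items.items():
        for name in names:
            index[name].add(item_id)
    return index


def automatch(entry_label, index):
    """Return an item ID only if exactly one item carries this exact name."""
    candidates = index.get(entry_label, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    # Zero or several candidates: leave the entry unmatched.
    return None


# Before the brother's item exists, the name is unique and gets automatched.
before = build_name_index({
    "Q1010811": ["Giulio Baldigara", "Giulio Cesare Baldigara"],
})

# Once a second item shares the name, the automatch is suppressed.
after = build_name_index({
    "Q1010811": ["Giulio Baldigara", "Giulio Cesare Baldigara"],
    "Q_NEW": ["Giulio Baldigara"],  # hypothetical ID for the new item
})
```

With the `before` index, `automatch("Giulio Baldigara", before)` yields Q1010811; with the `after` index it yields `None`, which is the implicit "different from" warning the paragraph describes.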
These are valid concerns; I'm happy to withdraw the second part of the proposal. Thanks, Maarten, for pointing me to the right property.
On Sat, Nov 21, 2015 at 5:35 PM Dario Taraborelli dtaraborelli@wikimedia.org wrote:
I finally found the time to play extensively with Mix’n’match and it’s by far one of the most promising models I’ve come across for Wikidata growth. A short conversation with Magnus on Twitter got me thinking on how to best preserve the output of costly human curation.[1]
I spent most of my time manually auditing automatically matched entries from the Dizionario Biografico degli Italiani [2]. These entries are long, unstructured biographical entries and it takes quite a lot of effort to understand if the two individuals referenced by Wikidata and DBI actually are the same person. This is a great example of a task that’s still pretty hard for a machine to perform, no matter how sophisticated the algorithm.
My favorite example? Mix’n’match suggested a match between Giulio Baldigara (Q1010811) and Giulio Baldigara (DBI) which looked totally legitimate: these two individuals are both Italian architects from the 16th century with the same name, they were both born around the same years in the same city, and they were both active in Hungary at the same time. A strong indication that they are the same person, right? It turns out they are brothers, and the full name of the person referenced in Wikidata is Giulio Cesare Baldigara (the least known in a family of architects). I unmatched the suggestion and flagged the DBI entry as not existing in Wikidata.
My question at the moment is: the output of a labor-intensive review of a potential match is currently stored as a volatile flag in a tool hosted on labs, but is invisible in Wikidata. Should something happen to Mix’n’match (god forbid) the result of my work would get lost. Which got me thinking:
- shouldn’t a manually unmatched item be created directly on Wikidata? (After all, DBI is all about notable individuals who would easily pass Wikidata’s notability threshold for biographies.)
- shouldn’t the relation between Giulio (Cesare) Baldigara (Q1010811) and the newly created item for Giulio Baldigara be explicitly represented via a "not the same as" property, to prevent future humans or machines from accidentally remerging the two items based on some kind of heuristics?
Thoughts welcome,
Dario
[1] https://twitter.com/ReaderMeter/status/667214565621432320
[2] https://tools.wmflabs.org/mix-n-match/?mode=catalog&catalog=55&offse...
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata