On Nov 21, 2015, at 10:31, Magnus Manske
<magnusmanske(a)googlemail.com> wrote:
To address the first point:
So the auto-matches are just simple label matches. Removing the automatch in
mix'n'match just says that this was not the same person etc. and the entry is
moved back to the "unmatched" pool.
This does /not/ mean there isn't a match on Wikidata! You only say that by setting
the entry to "not on Wikidata".
Apologies, I was indeed referring to items explicitly flagged as "not on WD",
not simply unmerged ones.
And I do occasionally batch-create items for those,
usually when all entries are processed. That can cause other issues, such as an item
having been created in the meantime, so that I now create a duplicate.
+1
A solution could be to change the "not on
Wikidata" button (or link) to a "create new item" button. The new item
would have a label, a description (maybe), a statement with the catalog ID (if there is an
associated Wikidata property!), and "instance of:human" if the entry is
internally marked as "person", but nothing else.
Would that be welcomed by "mix'n'matchers", and Wikidata people? I
think it would make sense, for catalogs with a Wikidata property at least.
I would strongly support this, with the restrictions you suggest.
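
As a rough illustration of the minimal item proposed above, here is a sketch in Python. It is not mix'n'match code: the function name, the dictionary layout, and the placeholder property "P_CATALOG" and catalog ID are assumptions made for the example. Only P31 ("instance of") and Q5 ("human") are real Wikidata identifiers.

```python
from typing import Optional


def build_minimal_item(
    label: str,
    description: Optional[str] = None,
    catalog_property: Optional[str] = None,  # hypothetical, e.g. "P_CATALOG"
    catalog_id: Optional[str] = None,
    is_person: bool = False,
) -> dict:
    """Assemble the bare-bones item described in the proposal, nothing more."""
    item = {"label": label, "claims": {}}
    if description:
        item["description"] = description
    # The catalog ID statement is only added when the catalog has an
    # associated Wikidata property.
    if catalog_property and catalog_id:
        item["claims"][catalog_property] = catalog_id
    # "instance of: human" only if the entry is marked as a person.
    if is_person:
        item["claims"]["P31"] = "Q5"
    return item


# Placeholder values, not real catalog data.
example = build_minimal_item(
    "Giulio Baldigara",
    catalog_property="P_CATALOG",
    catalog_id="example-id",
    is_person=True,
)
```

A catalog without a Wikidata property would simply yield an item with a label and, for persons, the P31 statement.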
As for the second point, I think in most cases the
mere existence of a new, better-fitting item (or at least one equally fitting at first
glance) will prevent false assignments. Sure, there are some cases, like the one given as
an example, which would profit from a P1889 "different from" statement. We have
run into that problem with the "merge game" I'm running, where people do a
lot of false merges because the items seem identical at first glance.
However, I don't think this is prevalent enough to warrant special treatment in
mix'n'match itself. For the few cases where it would help, Wikidata can always be
edited manually. Besides, where would we draw the line? "John Smith" returns
hundreds of search results; that would translate into tens of thousands of "different
from" statements.
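
To make the "John Smith" estimate concrete, here is a small sketch. It assumes every pair of same-named items would be linked by one "different from" statement, and the figure of 300 search results is an illustrative stand-in for "hundreds", not a measured count.

```python
from math import comb


def different_from_pairs(n_items: int) -> int:
    """Number of unordered pairs among n_items same-named items."""
    return comb(n_items, 2)


# Illustrative only: 300 is an assumed number of "John Smith" results.
pairs = different_from_pairs(300)
print(pairs)  # 44850
```

If the statement were stored on both items of each pair, the count would double.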
I think once an item for the brother in your "Giulio Baldigara" example is created and
both show up in search results, that alone will be enough to serve as a "different
from" warning in most settings.
Mix'n'match automatch, for example, will only match entries where the exact label
is unique in labels and aliases; two items with a "Giulio Baldigara" label or
alias would not automatch an entry with that name.
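
The uniqueness rule described above can be sketched as follows. This is an assumption-laden illustration, not the actual mix'n'match implementation: the data layout, the function names, and the item ID "Q_NEW" are invented, and the label and alias given for Q1010811 are made up for the example.

```python
from collections import defaultdict
from typing import Optional


def build_name_index(items: dict) -> dict:
    """Map every label and alias to the set of item IDs that carry it."""
    index = defaultdict(set)
    for qid, data in items.items():
        for name in [data["label"], *data.get("aliases", [])]:
            index[name].add(qid)
    return index


def automatch(entry_name: str, index: dict) -> Optional[str]:
    """Return an item ID only if exactly one item has this exact name."""
    candidates = index.get(entry_name, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


# Before the brother's item exists, the name is unique and would automatch.
items = {
    "Q1010811": {"label": "Giulio Cesare Baldigara", "aliases": ["Giulio Baldigara"]},
}
before = automatch("Giulio Baldigara", build_name_index(items))

# After a second item with the same label is created ("Q_NEW" is a
# placeholder ID), the name is ambiguous and nothing is automatched.
items["Q_NEW"] = {"label": "Giulio Baldigara"}
after = automatch("Giulio Baldigara", build_name_index(items))
```

Here `before` is "Q1010811" and `after` is `None`, which is the behavior described: a second item sharing the name suppresses the automatch.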
These are valid concerns, happy to withdraw the second part of the proposal. Thanks
Maarten for pointing me to the right property.
On Sat, Nov 21, 2015 at 5:35 PM Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:
I finally found the time to play extensively with
Mix’n’match and it’s by far one of the most promising models I’ve come across for Wikidata
growth. A short conversation with Magnus on Twitter got me thinking on how to best
preserve the output of costly human curation.[1]
I spent most of my time manually auditing automatically matched entries from the
Dizionario Biografico degli Italiani [2]. These entries are long, unstructured
biographical entries and it takes quite a lot of effort to understand if the two
individuals referenced by Wikidata and DBI actually are the same person. This is a great
example of a task that’s still pretty hard for a machine to perform, no matter how
sophisticated the algorithm.
My favorite example? Mix’n’match suggested a match between Giulio Baldigara (Q1010811)
and Giulio Baldigara (DBI) which looked totally legitimate: these two individuals are both
Italian architects from the 16th century with the same name, they were both born around
the same years in the same city, they were both active in Hungary at the same time: strong
indication that they are the same person, right? It turns out they are brothers and the
full name of the person referenced in Wikidata is Giulio Cesare Baldigara (the least known
in a family of architects). I unmatched the suggestion and flagged the DBI entry as non
existing in Wikidata.
My question at the moment is: the output of a labor-intensive review of a potential match
is currently stored as a volatile flag in a tool hosted on labs, but is invisible in
Wikidata. Should something happen to Mix’n’match (god forbid) the result of my work would
get lost. Which got me thinking:
- shouldn’t a manually unmatched item be created directly on Wikidata (after all DBI is
all about notable individuals who would easily pass Wikidata’s notability threshold for
biographies)?
- shouldn’t the relation between Giulio (Cesare) Baldigara (Q1010811) and the newly
created item for Giulio Baldigara be explicitly represented via a "not the same as"
property, to prevent future humans or machines from accidentally remerging the two items
based on some kind of heuristics?
Thoughts welcome,
Dario
[1]
https://twitter.com/ReaderMeter/status/667214565621432320
[2]
https://tools.wmflabs.org/mix-n-match/?mode=catalog&catalog=55&offs…
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata