I like that phrase:
"Is the data going to be used? Data that is not used is exponentially
harder to maintain because less people see it"
To take a specific example, I've been building a name and identifier
recognition system based on data from Freebase that is focused on certain
kinds of spatial regions. I'm going to underline that this is not an
academic project (where, in the worst case, I might proudly announce that
I got 81.2% accuracy and that this beats the last group that got 80.3%) but
a commercial system that (1) needs to be hyperaccurate (at least three
nines if not four) and (2) where I need to fix anything that management or
customers find wrong right away.
Another aspect is that I can get (barely) two-nines accuracy for
entities while resolving only about 40% of the place names that appear
once, because these entities are concentrated in certain places. Many of the
most popular regions need data corrections to resolve correctly because
they tend to be national capitals where there are multiple geographic
entities that occupy the same land area or they are ontologically troubled
islands.
Looking in Freebase, I don't find 100% of the identifiers that are used in
my data set. Another issue is that some containment relationships are
missing because sometimes @fbase couldn't figure out the relative hierarchy
of places.
I address both of those issues by applying "fact patches" to my knowledge
base.
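A minimal sketch of what such fact patches might look like, assuming the
knowledge base can be treated as a set of (subject, predicate, object)
triples. The names `Fact`, `FactPatch` and `apply_patches`, and every `ex:`
identifier, are hypothetical illustrations, not Freebase's schema or the
system described here.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One (subject, predicate, object) assertion."""
    subject: str
    predicate: str
    obj: str


class FactPatch(NamedTuple):
    """A small correction: facts to add and facts to retract."""
    add: frozenset = frozenset()
    remove: frozenset = frozenset()


def apply_patches(base, patches):
    """Return a new fact set with each patch applied in order.

    The base set is left untouched, so the upstream data and the
    local corrections stay separable.
    """
    facts = set(base)
    for patch in patches:
        facts -= patch.remove
        facts |= patch.add
    return facts


# Hypothetical data: one missing identifier, one missing containment link.
base = {Fact("ex:placeA", "ex:name", "Place A")}
patches = [
    FactPatch(add=frozenset({Fact("ex:placeA", "ex:identifier", "ID-001")})),
    FactPatch(add=frozenset({Fact("ex:placeA", "ex:containedBy", "ex:regionB")})),
]
patched = apply_patches(base, patches)
```

Keeping the patches as data rather than editing the base in place means
they can be re-applied whenever a fresh upstream dump is loaded.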
In principle I could push these changes back to @fbase, but since mqlwrite
is broken and @fbase is heading towards EOL, I won't.
There are other problems, though, that I end up addressing in my rule base,
or that require me to add different vocabulary to solve. For instance, I
get a lot of references to "Hong Kong Island", which is not to be confused
with
http://en.wikipedia.org/wiki/Islands_District
It turns out HKI has four administrative districts. With a little more
logic I can probably figure out which district these things are in, but
maybe it doesn't make any real difference to end users, and I'm not sure
mail would be delivered to the "Central and Western District", so I could
make HKI an "honorary" administrative district (something I wouldn't push
back upstream).
So you notice two themes here. Some of my patches are things that would
belong in Wikidata because they fill in fields that Wikidata already
has and follow established conventions.
There are other patches I need to make to reflect requirements of my
application that I'd never want to push upstream because they are "correct"
in the context of my application but "incorrect" or "arguable" in
general.
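One way to keep the two kinds of patch apart is to tag each one with a
scope. This is only a sketch: the `Scope` and `Patch` names, the field
layout and the `ex:` identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    UPSTREAM = "upstream"  # fills a field the upstream source already defines
    LOCAL = "local"        # correct only in the context of this application


@dataclass(frozen=True)
class Patch:
    subject: str
    predicate: str
    value: str
    scope: Scope
    note: str = ""


def upstream_candidates(patches):
    """Return only the patches that would be safe to contribute upstream."""
    return [p for p in patches if p.scope is Scope.UPSTREAM]


# Hypothetical patches, one of each kind.
patches = [
    Patch("ex:placeA", "ex:containedBy", "ex:regionB",
          Scope.UPSTREAM, "missing containment relationship"),
    Patch("ex:HongKongIsland", "ex:type", "ex:AdministrativeDistrict",
          Scope.LOCAL, "honorary district, application-specific"),
]
to_contribute = upstream_candidates(patches)
```

Both kinds are applied locally; the scope only decides which ones are ever
offered back to the shared database.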
-----
One of the troubles people have consistently had with DBpedia has been
trying to get a list of the top cities in the world (by one or another
metric). It's hard to do for two reasons:
(1) some facts are absent in DBpedia, and
(2) many of the biggest/most important "cities" in the world such as London
and Tokyo are not, technically, cities.
Success in this project, therefore, requires patching absent or incorrect
facts in DBpedia, but also the creation of a vernacular concept of "city"
which reflects the "common sense" perception here.
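As a rough sketch of how the two fixes combine, assume each entity is a
record with an id, a type and a population. The function name, the record
layout, the `ex:` identifiers and all the numbers below are placeholders
invented for illustration; none of them are DBpedia data.

```python
def top_cities(entities, population_patches, city_types, honorary_cities, n):
    """Rank entities that count as cities in the vernacular sense.

    population_patches: id -> value, overriding absent or wrong facts.
    city_types: types that are technically cities.
    honorary_cities: ids treated as cities regardless of their type.
    """
    ranked = []
    for entity in entities:
        # Fix 1: patch absent or incorrect facts.
        population = population_patches.get(entity["id"], entity.get("population"))
        # Fix 2: the vernacular notion of "city".
        is_city = entity["type"] in city_types or entity["id"] in honorary_cities
        if is_city and population is not None:
            ranked.append((population, entity["id"]))
    ranked.sort(reverse=True)
    return [entity_id for _, entity_id in ranked[:n]]


# Placeholder records: ex:Y is not technically a city and lacks a population.
entities = [
    {"id": "ex:X", "type": "ex:City", "population": 300},
    {"id": "ex:Y", "type": "ex:Region", "population": None},
    {"id": "ex:Z", "type": "ex:Village", "population": 50},
]
result = top_cities(entities, {"ex:Y": 900}, {"ex:City"}, {"ex:Y"}, n=2)
```

Without the patch and the honorary entry, `ex:Y` would drop out of the
ranking entirely, which is the failure people keep running into.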
On Sat, Jan 3, 2015 at 9:48 AM, Lydia Pintscher <
lydia.pintscher(a)wikimedia.de> wrote:
Hey folks :)
Happy new year everyone. It is surely going to be an exciting one for
Wikidata. Over the last weeks I've been thinking a lot about the year
ahead of us. One thing is clear to me: It will be about successfully
scaling Wikidata and keeping all the amazing things we have achieved
in the process.
I've written down my thoughts on the subject in a blog post to kick
off some thinking and discussions:
http://blog.wikimedia.de/2015/01/03/scaling-wikidata-success-means-making-t…
Cheers
Lydia
--
Lydia Pintscher -
http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254 paul.houle on Skype ontology2(a)gmail.com
http://legalentityidentifier.info/lei/lookup