I like that phrase

"Is the data going to be used? Data that is not used is exponentially harder to maintain because less people see it"

To take a specific example, I've been building a name and identifier recognition system based on data from Freebase that is focused on certain kinds of spatial regions.  I want to underline that this is not an academic project (where, in the worst case, I might proudly announce that I got 81.2% accuracy and that this beats the last group, which got 80.3%) but a commercial system that (1) needs to be hyperaccurate (at least three nines, if not four) and (2) where I need to fix anything that management or customers find wrong right away.

Another aspect of it is that I can get (barely) two-nines accuracy for entities while resolving only about 40% of the place names that appear once, because these entities are concentrated in certain places.  Many of the most popular regions need data corrections to resolve correctly, either because they are national capitals, where multiple geographic entities occupy the same land area, or because they are ontologically troubled islands.
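The gap between those two numbers is just arithmetic on a skewed distribution, and a small sketch may make it concrete. The function name, the place names and the counts below are all made up for illustration, and the sketch treats "resolved" as a stand-in for "correct", which is a simplification.

```python
from collections import Counter
from typing import Set, Tuple


def resolution_rates(mention_counts: Counter, resolved: Set[str]) -> Tuple[float, float]:
    """Return (rate weighted by mentions, rate over names seen exactly once)."""
    total_mentions = sum(mention_counts.values())
    resolved_mentions = sum(
        count for name, count in mention_counts.items() if name in resolved
    )
    singletons = [name for name, count in mention_counts.items() if count == 1]
    resolved_singletons = [name for name in singletons if name in resolved]

    by_mention = resolved_mentions / total_mentions if total_mentions else 0.0
    by_singleton = len(resolved_singletons) / len(singletons) if singletons else 0.0
    return by_mention, by_singleton


# Made-up counts: two popular places dominate, five names appear once each.
counts = Counter({"place_a": 90, "place_b": 5, "place_c": 1, "place_d": 1,
                  "place_e": 1, "place_f": 1, "place_g": 1})
resolved_names = {"place_a", "place_b", "place_c", "place_d"}

by_mention, by_singleton = resolution_rates(counts, resolved_names)
print(f"by mention: {by_mention:.2f}, singletons: {by_singleton:.2f}")
```

With these invented numbers the mention-weighted rate is high even though most of the rare names go unresolved, which is the same shape of result described above.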

Looking in Freebase, I don't find 100% of the identifiers that are used in my data set.  Another issue is that some containment relationships are missing, because sometimes @fbase couldn't figure out the relative hierarchy of places.

I address both of those issues by applying "fact patches" to my knowledge base.
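As a rough illustration of the idea (not the actual system), a fact patch can be modeled as an "add" or "remove" operation on a set of subject/predicate/object triples. Everything named here is hypothetical: the `FactPatch` type, the `apply_patches` function, the predicate names and the `ex:` identifiers are placeholders, not Freebase identifiers.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class FactPatch(NamedTuple):
    op: str        # "add" or "remove"
    triple: Triple


def apply_patches(kb: Set[Triple], patches: Iterable[FactPatch]) -> Set[Triple]:
    """Return a patched copy of the knowledge base; the input is left untouched."""
    patched = set(kb)
    for patch in patches:
        if patch.op == "add":
            patched.add(patch.triple)
        elif patch.op == "remove":
            patched.discard(patch.triple)  # removing an absent fact is a no-op
        else:
            raise ValueError(f"unknown patch operation: {patch.op!r}")
    return patched


# Hypothetical upstream data with a missing identifier and a missing containment.
upstream = {
    ("ex:region_1", "name", "Region One"),
    ("ex:country_1", "name", "Country One"),
}
patches = [
    FactPatch("add", ("ex:region_1", "identifier", "R1")),
    FactPatch("add", ("ex:region_1", "contained_by", "ex:country_1")),
]

kb = apply_patches(upstream, patches)
print(len(upstream), len(kb))
```

Keeping the patches as data separate from the upstream dump means they can be re-applied whenever a new dump is loaded.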

In principle I could push these changes back to @fbase,  but since mqlwrite is broken and @fbase is heading towards EOL,  I won't.

There are other problems, though, that I end up addressing in my rule base, or for which I have to add different vocabulary if I want to solve them.  For instance, I get a lot of references to "Hong Kong Island", which is not to be confused with

http://en.wikipedia.org/wiki/Islands_District

It turns out HKI has four administrative districts. With a little more logic I can probably figure out which district these things are in, but maybe it doesn't make any real difference to end users, and I'm not sure mail would be delivered to the "Central and Western District", so I could make HKI an "honorary" administrative district (something I wouldn't push upstream).

So you notice two themes here.  Some of my patches are things that would belong in Wikidata, because they fill in fields that Wikidata already has and follow established conventions.

There are other patches I need to make to reflect requirements of my application that I'd never want to push upstream because they are "correct" in the context of my application but "incorrect" or "arguable" in general.
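One way to keep the two kinds of patch apart is to tag each one with a scope, so that only the generally correct ones are ever considered for upstream. This is a sketch under assumptions: the `Patch` class, the `scope` values, the predicate names and the `ex:` identifiers are invented for illustration; only the "honorary district" treatment of Hong Kong Island comes from the text above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Patch:
    subject: str
    predicate: str
    obj: str
    scope: str  # "upstream": correct in general; "local": correct only for this application


def upstream_candidates(patches: Iterable[Patch]) -> List[Patch]:
    """Patches that fill gaps using fields and conventions the upstream source already has."""
    return [p for p in patches if p.scope == "upstream"]


def local_only(patches: Iterable[Patch]) -> List[Patch]:
    """Patches that reflect application requirements and should never leave the local KB."""
    return [p for p in patches if p.scope == "local"]


all_patches = [
    # Hypothetical missing containment fact: a candidate for upstream.
    Patch("ex:region_1", "contained_by", "ex:country_1", scope="upstream"),
    # Application-specific modeling choice: stays local.
    Patch("Hong Kong Island", "treated_as", "administrative_district", scope="local"),
]

print([p.subject for p in upstream_candidates(all_patches)])
print([p.subject for p in local_only(all_patches)])
```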

-----

One of the troubles people have consistently had with DBpedia has been trying to get a list of the top cities in the world (by one metric or another).  It's hard to do for two reasons:

(1) some facts are absent in DBpedia,  and
(2) many of the biggest/most important "cities" in the world such as London and Tokyo are not,  technically,  cities.

Success in this project, therefore, requires not only patching absent or incorrect facts in DBpedia, but also creating a vernacular concept of "city" that reflects the "common sense" perception.
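A small sketch of how the two fixes combine when ranking: a set of names that count as cities in the vernacular sense, plus patches for missing metric values. The record layout, the type labels, the function name and all the numbers are made up for illustration and do not come from DBpedia.

```python
from typing import Dict, List, Optional, Set


def top_cities(
    records: Dict[str, Dict[str, Optional[object]]],
    vernacular_cities: Set[str],
    metric_patches: Dict[str, float],
    n: int,
) -> List[str]:
    """Rank places that are cities formally OR by common-sense usage, using patched metrics."""
    candidates = []
    for name, record in records.items():
        is_city = record.get("type") == "city" or name in vernacular_cities
        # A patch, when present, overrides a missing or wrong upstream value.
        metric = metric_patches.get(name, record.get("metric"))
        if is_city and metric is not None:
            candidates.append((name, metric))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in candidates[:n]]


# Placeholder data: "Place B" is not formally a city and has no metric upstream.
records = {
    "Place A": {"type": "city", "metric": 5.0},
    "Place B": {"type": "metropolitan_area", "metric": None},
    "Place C": {"type": "city", "metric": 3.0},
    "Place D": {"type": "village", "metric": 9.0},
}

ranking = top_cities(
    records,
    vernacular_cities={"Place B"},
    metric_patches={"Place B": 8.0},
    n=2,
)
print(ranking)
```

Without the vernacular set and the metric patch, "Place B" would drop out of the ranking entirely, which is the failure described above for places like London and Tokyo.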


On Sat, Jan 3, 2015 at 9:48 AM, Lydia Pintscher <lydia.pintscher@wikimedia.de> wrote:
Hey folks :)

Happy new year everyone. It is surely going to be an exciting one for
Wikidata. Over the last weeks I've been thinking a lot about the year
ahead of us. One thing is clear to me: It will be about successfully
scaling Wikidata and keeping all the amazing things we have achieved
in the process.

I've written down my thoughts on the subject in a blog post to kick
off some thinking and discussions:
http://blog.wikimedia.de/2015/01/03/scaling-wikidata-success-means-making-the-pie-bigger/


Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as a non-profit
by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.



--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontology2@gmail.com
http://legalentityidentifier.info/lei/lookup