On Tue, Jun 16, 2015 at 2:43 PM, Thomas Pellissier-Tanon thomaspt@google.com wrote:
I have only added the 1000 most used properties in order not to hide important properties behind less important ones. But feel free to add properties that are not listed there.
Most frequent isn't the same as most important. Has this been reviewed by anyone who's familiar with the Freebase schema? Google knows all the stuff I've listed below and could save a lot of wasted effort by people who aren't familiar with the schema.
Most of the stuff in the /base/* domains should probably be ignored for a first pass and some, like /base/schemastaging, should be ignored permanently unless they've also got an alias in the commons namespace due to being promoted. Ditto for the /user/* domains. I don't see anything from the /authority namespace, which is arguably the most important part of Freebase -- all its reconciled strong identifiers for IMDB, Library of Congress, New York Times, etc.
Some of the properties which are included have been replaced by keys in the /authority namespace tree, e.g.
https://www.freebase.com/book/author/openlibrary_id https://www.freebase.com/user/narphorium/people/nndb_person/nndb_id
These can be identified programmatically by looking at the schema, where the /type/property/enumeration property will point at the namespace where the identifier is stored (/authority/openlibrary/author & /authority/nndb, respectively). Note that, for historical reasons, some of the earlier key namespaces have aliases outside of the tree rooted at /authority. For example, the property https://www.freebase.com/biology/organism_classification/itis_tsn enumerates its identifiers in a namespace which is aliased as both /biology/itis and /authority/itis. https://www.freebase.com/biology/itis?keys=
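A rough sketch of that check, assuming the schema has already been exported into a list of records. The record layout (the "id" and "enumeration" keys), the function name, and the sample list are hypothetical; only the property and namespace ids come from the examples above, plus /people/person/date_of_birth as an ordinary property for contrast.

```python
# Sketch only: the record layout ("id", "enumeration") is an assumed export
# format, not an actual Freebase API response.

def split_key_backed(properties):
    """Partition property records by whether /type/property/enumeration is set.

    Returns (key_backed, ordinary): key_backed maps a property id to the
    namespace that holds its identifiers; ordinary lists the remaining ids.
    """
    key_backed = {}
    ordinary = []
    for prop in properties:
        namespace = prop.get("enumeration")
        if namespace:
            key_backed[prop["id"]] = namespace
        else:
            ordinary.append(prop["id"])
    return key_backed, ordinary


# Illustrative records built from the examples in this message.
SAMPLE_PROPERTIES = [
    {"id": "/book/author/openlibrary_id",
     "enumeration": "/authority/openlibrary/author"},
    {"id": "/user/narphorium/people/nndb_person/nndb_id",
     "enumeration": "/authority/nndb"},
    # This namespace is aliased as both /biology/itis and /authority/itis.
    {"id": "/biology/organism_classification/itis_tsn",
     "enumeration": "/biology/itis"},
    {"id": "/people/person/date_of_birth", "enumeration": None},
]

if __name__ == "__main__":
    key_backed, ordinary = split_key_backed(SAMPLE_PROPERTIES)
    for prop_id, namespace in key_backed.items():
        print(prop_id, "->", namespace)
```

Properties that land in key_backed would then be mapped through their key namespace rather than treated as plain string properties.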
All hidden properties should probably be ignored. Most (all?) deprecated properties should probably be ignored. There was a discussion about ISBN, but these can be identified by introspecting the schema: https://www.freebase.com/book/book_edition/ISBN
There's a bunch of internal bookkeeping cruft included in that list that should be excluded, e.g.:
https://www.freebase.com/dataworld/gardening_hint/split_to https://www.freebase.com/dataworld/mass_data_operation/authority https://www.freebase.com/dataworld/mass_data_operation/ended_operation https://www.freebase.com/dataworld/mass_data_operation/estimated_primitive_c... https://www.freebase.com/dataworld/mass_data_operation/operator https://www.freebase.com/dataworld/mass_data_operation/software_tool_used https://www.freebase.com/dataworld/mass_data_operation/started_operation https://www.freebase.com/dataworld/mass_data_operation/using_account https://www.freebase.com/dataworld/provenance/data_operation https://www.freebase.com/dataworld/provenance/tool https://www.freebase.com/dataworld/software_tool/provenances
https://www.freebase.com/freebase/acre_doc/based_on https://www.freebase.com/freebase/acre_doc/handler https://www.freebase.com/freebase/domain_profile/expert_group https://www.freebase.com/freebase/domain_profile/featured_views https://www.freebase.com/freebase/domain_profile/hidden https://www.freebase.com/freebase/domain_profile/show_commons https://www.freebase.com/freebase/flag_judgment/flag https://www.freebase.com/freebase/flag_judgment/item https://www.freebase.com/freebase/flag_judgment/vote https://www.freebase.com/freebase/flag_kind/flags https://www.freebase.com/freebase/flag_vote/judgments
https://www.freebase.com/freebase/review_flag/item https://www.freebase.com/freebase/review_flag/judgments https://www.freebase.com/freebase/review_flag/kind
https://www.freebase.com/freebase/type_profile/instance_count https://www.freebase.com/freebase/user_activity/primitives_live https://www.freebase.com/freebase/user_activity/primitives_written https://www.freebase.com/freebase/user_activity/topics_live https://www.freebase.com/freebase/user_activity/types_live https://www.freebase.com/freebase/user_activity/user
https://www.freebase.com/pipeline/delete_task/delete_guid https://www.freebase.com/pipeline/task/status https://www.freebase.com/pipeline/task/votes https://www.freebase.com/pipeline/vote/vote_value
The properties below are for text and images which were uploaded and are of questionable provenance/rights status, so can probably be ignored (and aren't made available by Google in current data dumps):
https://www.freebase.com/type/content/blob_id https://www.freebase.com/type/content_import/content https://www.freebase.com/type/content_import/header_blob_id https://www.freebase.com/type/content_import/uri https://www.freebase.com/type/content/language https://www.freebase.com/type/content/length https://www.freebase.com/type/content/media_type https://www.freebase.com/type/content/source https://www.freebase.com/type/content/text_encoding https://www.freebase.com/type/content/uploaded_by
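The exclusions above could be applied as a first-pass filter along these lines. This is a minimal sketch: the record layout (the "id", "hidden" and "deprecated" keys) and the function name are hypothetical, and excluding whole prefixes such as /dataworld/, /freebase/ and /pipeline/ is a simplifying assumption, since the lists above name individual properties rather than entire domains.

```python
# Sketch only: prefix-based exclusion is an assumption; the lists above
# give examples of cruft, not a statement that every property under these
# prefixes should go.
EXCLUDED_PREFIXES = (
    "/base/",                 # ignore for a first pass
    "/user/",                 # ditto
    "/dataworld/",            # internal bookkeeping
    "/freebase/",             # internal bookkeeping
    "/pipeline/",             # internal bookkeeping
    "/type/content/",         # uploaded text/images, unclear rights status
    "/type/content_import/",  # same
)


def keep_for_first_pass(prop):
    """Return True if a property record survives the exclusion rules.

    `prop` is a dict with an "id" and optional boolean "hidden" and
    "deprecated" flags (hypothetical field names).
    """
    if prop.get("hidden") or prop.get("deprecated"):
        return False
    # str.startswith accepts a tuple and matches any of its entries.
    return not prop["id"].startswith(EXCLUDED_PREFIXES)


if __name__ == "__main__":
    candidates = [
        {"id": "/dataworld/provenance/tool"},
        {"id": "/pipeline/task/status"},
        {"id": "/type/content/media_type"},
        {"id": "/book/book_edition/ISBN", "deprecated": True},
        {"id": "/biology/organism_classification/itis_tsn"},
    ]
    for prop in candidates:
        if keep_for_first_pass(prop):
            print(prop["id"])
```

Note that a blanket /user/ exclusion would also drop key-backed properties such as /user/narphorium/people/nndb_person/nndb_id, so those would need to be handled through their /authority namespace instead.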
Overall my impression is that there's still a significant amount of very, very basic groundwork to be completed before it's reasonable to ask people to contribute their effort to doing/reviewing the mappings. Have you asked Google staffers familiar with the Freebase schema to review?
I'll second Marco's questions about the status of the previous attempts at automated mappings and automated mappings in general. Has any effort been made to do an initial automated mapping that humans could then review?
Tom