Kingsley,
Wanted to thank you very much for your valuable post! It's a great introduction to making the transition from a table/Excel/spreadsheet view of data over to, as you say, "a collection of RDF statements grouped by statement Predicate".
Those of us working on the Company Data project typically come with that table orientation background. Having a "learning path" laid out transitioning to the SPARQL world is very helpful.
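To make that shift concrete for table-oriented readers, here is a toy sketch of my own (not from Kingsley's post). All the IRIs, prefixes, and names below are made-up placeholders, not real Wikidata terms. First, one spreadsheet row re-expressed as RDF statements:

```turtle
# Hypothetical spreadsheet row:
#   company | quarter | totalRevenue
#   ACME    | 2016Q4  | 1000000
# re-expressed as RDF statements (subject, predicate, object).
@prefix ex: <http://example.org/> .

ex:ACME         ex:reportsStatement  ex:ACME-2016Q4 .
ex:ACME-2016Q4  ex:quarter           "2016Q4" .
ex:ACME-2016Q4  ex:totalRevenue      1000000 .
```

And a matching SPARQL sketch, where the table's column tests become triple patterns over predicates:

```sparql
PREFIX ex: <http://example.org/>
SELECT ?revenue WHERE {
  ex:ACME ex:reportsStatement ?stmt .
  ?stmt   ex:quarter          "2016Q4" ;
          ex:totalRevenue     ?revenue .
}
```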
I'm still very fuzzy on how basic "inheritance" works here at Wikidata.
For example Company->Financial Statements->Income Statement for 2016Q4->total revenue->some number
* Total revenue needs the *time period* attached to it (here, start and end dates for the quarter); other figures need point-in-time measurement (e.g., as of 12/31/2016).
* Total revenue needs an associated *currency* attached to it.
* The Income Statement for 2016Q4 needs a specific *accounting standard* attached to it (for example US GAAP 2017 or IFRS 2016; more at https://www.sec.gov/info/edgar/edgartaxonomies.shtml, and more outside the U.S.). The accounting standard followed in preparing the numbers must be very specific to help with concordance across different standards (especially across countries).
* The company needs a "dominant" or "default" *industry code* attached to it. Wikidata might best go with the 56 industries classified according to the International Standard Industrial Classification revision 4 (ISIC Rev. 4). This is the set used by the World Input-Output tables (http://www.wiod.org/home), which take data from all 28 EU countries and 15 other major countries in the world and transform it to be comparable using these industries. It's the broadest "nearly global" coverage I can find. It would also be advisable to accommodate multiple industry assignments per entity/establishment, each recording the standard and year that were followed, applied from a specifically enumerated list. For example, in North America data will often be available according to the most current and highly granular 2017 NAICS system (https://www.census.gov/eos/www/naics/), and there are concordances between versions; see https://www.census.gov/eos/www/naics/concordances/concordances.html and https://unstats.un.org/unsd/cr/registry/isic-4.asp. Looking toward a future where large amounts of company data are machine imported, it would be best to preserve the original, most detailed industry codes available (such as the 6-digit NAICS code) along with the standard and year associated with the assigned code(s).
Given the year and the detail, the concordances can later be used to machine-add different codes as needed. Granular users are then accommodated, and people looking to do cross-country / global analysis (at the 56-industry level) are also accommodated.
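For what it's worth, here is a rough sketch of how the qualifiers above might hang together in RDF. Every prefix, property, and value name is a made-up placeholder of mine (not an actual Wikidata property), just to show the shape of the data:

```turtle
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:ACME-IS-2016Q4
    a                     ex:IncomeStatement ;
    ex:accountingStandard ex:US-GAAP-2017 ;    # specific standard + year
    ex:totalRevenue       ex:ACME-rev-2016Q4 .

ex:ACME-rev-2016Q4
    ex:amount    "1000000"^^xsd:decimal ;
    ex:currency  ex:USD ;                      # currency qualifier
    ex:startDate "2016-10-01"^^xsd:date ;      # time-period qualifiers
    ex:endDate   "2016-12-31"^^xsd:date .

ex:ACME
    ex:industryCode [                          # one of possibly several
        ex:code     "541511" ;                 # granular 6-digit code
        ex:standard ex:NAICS ;
        ex:year     "2017"^^xsd:gYear
    ] .
```

The point is that each number carries its qualifiers (period, currency, standard, code system and year) as statements of its own, so concordances can be applied by machine later.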
When I look at the above challenge I think of your prescription of how to make RDF collections easier to read.
1. Addition of annotation relations, esp. the likes of rdfs:label, skos:prefLabel, skos:altLabel, schema:name, foaf:name, rdfs:comment, schema:description, etc.
2. Addition (where possible) of relations such as foaf:depiction, schema:image, etc.
Adhering to the above *leads to RDF statement collections that are easier to read*, without the confusing nature of the term "graph" getting in the way. At the end of the day, RDF is simply an abstract language for creating structured data using a variety of notations (RDF-Turtle, RDF-NTriples, JSON-LD, RDF/XML, etc.). *It isn't a format, but sadly that's how it is still perceived* by most, circa 2017 (even though the initial RDF definition snafu on this front occurred around 2000).
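A minimal sketch of points 1 and 2 above, as I understand them: the same entity IRI, decorated with human-readable annotation and depiction relations. The entity IRI and image URL are placeholders of mine; the vocabulary prefixes are the standard ones.

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix ex:     <http://example.org/> .

ex:ACME
    rdfs:label      "ACME Corporation"@en ;    # annotation relations
    skos:prefLabel  "ACME Corporation"@en ;
    skos:altLabel   "ACME Corp."@en ;
    schema:name     "ACME Corporation" ;
    rdfs:comment    "Fictitious example company."@en ;
    foaf:depiction  <http://example.org/acme-logo.png> .   # depiction relation
```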
And I can't help but be intensely curious: what exactly happened in that initial RDF definition snafu around 2000?
Rick
On 3/2/2017 2:23 PM, Kingsley Idehen wrote:
On 3/2/17 11:48 AM, Rick Labs wrote:
Perhaps high quality documentation already exists? It would be great to have at least a syllabus (learn this first, then move on to this, then on to...). Might be good to also have common / high value "use-case" scenarios with pointers to documentation/tutorials that cover them. Existing example queries are very helpful, but many are complex. For training purposes we need a graduated set of examples, designed step-by-step to teach how to construct queries.
The trouble here isn't really SQL to SPARQL etc.. In my experience, it's more to do with understanding what data is and the nature of data representation. Having arrived at the aforementioned conclusion over the years, I published a presentation titled "Understanding Data" as an aid in this area [1].
SQL and SPARQL aren't very good starting points because literature associated with both assume some fundamental understanding about the nature of data (relations) against which they operate.
If one starts the journey with data representation comprehension combined with clarity about RDF as a language, my hope is that folks reach a point where creating RDF statements always includes (so SPARQL compliant servers don't need to inject workarounds for label injection into query solutions):
- Addition of annotation relations esp., the likes of rdfs:label,
skos:prefLabel, skos:altLabel, schema:name, foaf:name, rdfs:comment, schema:description etc..
- Addition (where possible) use of relations such as foaf:depiction,
schema:image etc..
Adhering to the above leads to RDF statement collections that are easier to read, without the confusing nature of the term "graph" getting in the way. At the end of the day, RDF is simply an abstract language for creating structured data using a variety of notations (RDF-Turtle, RDF-NTriples, JSON-LD, RDF/XML, etc.). It isn't a format, but sadly that's how it is still perceived by most, circa 2017 (even though the initial RDF definition snafu on this front occurred around 2000).
SPARQL is a Query Language for operating on data represented as a collection of RDF statements grouped by statement Predicate, as opposed to SQL which is oriented towards data represented as Records grouped by Table.
Links:
[1] https://www.slideshare.net/kidehen/understanding-29894555 -- Understanding Data
[2] http://www.openlinksw.com/data/turtle/general/GlossaryOfTerms.ttl -- Glossary that might also help with terminology
[3] https://www.quora.com/What-is-the-Semantic-Web/answer/Kingsley-Uyi-Idehen
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata