Re: [Wikidata-l] Data values

19 Dec 2012

On 19.12.2012 15:12, Gregor Hagedorn wrote:
...
   I see no point in storing the unit used for input.  
 I think you plan to store the unit (which would be meter), so you
 don't want to store prefixes, correct?

 Please argue why you don't see a point. You want to store the size of
 the universe, the distance to New York, and the size of the proton all in "meter"?  
Yes. Otherwise, they would not be comparable in a database query. It's generally
a good idea to normalize all values of a given dimension to use the same unit,
for the same reason that it's best to convert dates in a database to UTC, instead
of storing time zone info for each entry.
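
A minimal sketch of that normalization, in Python. The conversion table and the name to_meters are illustrative assumptions, not existing Wikidata code:

```python
from decimal import Decimal

# Hypothetical conversion table: factor that turns one input unit into meters.
TO_METERS = {
    "nm": Decimal("1e-9"),
    "mm": Decimal("1e-3"),
    "m": Decimal("1"),
    "km": Decimal("1e3"),
    "ft": Decimal("0.3048"),
}

def to_meters(amount: str, unit: str) -> Decimal:
    """Normalize a length entered in any known unit to meters."""
    return Decimal(amount) * TO_METERS[unit]

# Values entered in different units become directly comparable once stored
# in meters, which is what a database query needs.
entered = [("40000", "km"), ("1200", "ft"), ("5", "km")]
normalized = sorted(to_meters(amount, unit) for amount, unit in entered)
```

After normalization the three inputs sort by actual length, regardless of the unit they were typed in.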

...
  If
 not, with which algorithm will you restore the SI prefix, or rather,
 recognize which SI prefix is usable? We do not use Mm in common
 language, so we do give the circumference of the earth as roughly 40
 000 km and not as 40 Mm. We don't write 4*10^7 m either. 
This would be done based on the accuracy. If the accuracy is ~1000m, the
heuristic would decide that km is probably a good unit, with no digits after the
decimal point. Besides that, the desired unit and precision can be specified
when rendering values on a wiki page.
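
Roughly, such a heuristic could look like the sketch below. The list of "usable" prefixes and the function render_length are assumptions made up for illustration; Mm is left out on purpose because it is not used in common language:

```python
from decimal import Decimal

# Assumed list of prefixed units that are acceptable for display,
# ordered from largest to smallest.
UNITS = [
    (Decimal("1e3"), "km"),
    (Decimal("1"), "m"),
    (Decimal("1e-3"), "mm"),
    (Decimal("1e-6"), "µm"),
    (Decimal("1e-9"), "nm"),
]

def render_length(value_m: str, accuracy_m: str) -> str:
    """Render a length stored in meters, picking the unit from the accuracy."""
    value, accuracy = Decimal(value_m), Decimal(accuracy_m)
    # Largest unit that is not coarser than the accuracy; if the accuracy is
    # finer than every unit in the list, fall back to the smallest one.
    factor, symbol = next(((f, s) for f, s in UNITS if f <= accuracy), UNITS[-1])
    # Digits after the decimal point needed to show the accuracy in that unit.
    digits = max(0, -(accuracy / factor).adjusted())
    return f"{value / factor:.{digits}f} {symbol}"

earth = render_length("40000000", "1000")  # accuracy of about 1000 m
```

With an accuracy of 1000 m this picks km and no digits after the decimal point. An explicit unit and precision given on the wiki page would simply bypass the heuristic.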

...
 >> it is probably necessary to store the number of
 >> significant decimals. 
Yes, that *is* the accuracy value I mean.

...
   That's how
Denny proposed to calculate the default accuracy. If the accuracy is
 given by a complex model (e.g. a gamma distribution), then it might be handy to
 have a simple value that tells us the significant digits.

 Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more
 detailed information (standard deviation, whatever) as *additional* information
 about the accuracy (could be modelled as a qualifier internally).  
 I fear that is two separate levels of precision of giving a measure of
 measurement _precision_ (I believe "accuracy" is the wrong term here,
 precision and accuracy are related but distinct concepts). 
Ok, there's some terminology confusion here. I'm using "accuracy" to refer to
the accuracy of measurement (e.g. standard deviation), and "precision" to refer
to the precision of presentation (e.g. significant digits). We need these two
things at least, and words for them. I don't care much which words we use.

...
  So 4.10
 means that the last digit is significant, i.e. the best estimate is at
 least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005
 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004,
 4.10 +/- 0.003,  4.10 +/- 0.002 etc. 
Yes, all this should be handled by the component responsible for parsing user
input for quantity values.
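
A sketch of what that parsing step might do, assuming a made-up function parse_quantity and only the two input forms discussed above (a bare number, or "value +/- n"):

```python
from decimal import Decimal
from typing import Tuple

def parse_quantity(text: str) -> Tuple[Decimal, Decimal]:
    """Return (value, uncertainty) for input like '4.10' or '4.10 +/- 0.004'."""
    if "+/-" in text:
        value_text, uncertainty_text = (part.strip() for part in text.split("+/-"))
        return Decimal(value_text), Decimal(uncertainty_text)
    value = Decimal(text.strip())
    # No explicit uncertainty: assume half a step of the last written digit,
    # e.g. '4.10' has exponent -2, so the default is 0.005.
    half_step = Decimal(5).scaleb(value.as_tuple().exponent - 1)
    return value, half_step

value, uncertainty = parse_quantity("4.10")
```

Decimal is used here instead of a binary float because it keeps the trailing zero of "4.10", which is exactly the information the default uncertainty is derived from.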

...
  Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision
 of measurement and the measure of variance and dispersion are
 separate concepts. 
True, though I think that we can lump them together for the sake of comparison,
picking appropriate units, etc. The finer points can be handled by qualifiers
(of the value, or even of the accuracy).

...
   No, it would return using whatever system of measurement the user has selected
 in their preferences.  
 then you have lost the information. There is no "user selection" in
 this in science. 
I have lost the information which unit was used when originally entering the
data. That's it.

...
    Complex heuristic
 may "guess" when to use the scientific SI prefixes instead. The
 trailing zero cannot be reproduced however when completely relying on
 IEEE floating-point. 
 We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The 

 (I believe there is no such thing as a "secondary unit", did you make
 that term up? Only "m" is a unit of measurement, the n or k are
 prefixes see http://en.wikipedia.org/wiki/SI_prefix ) 
I made up the term, yes :) Whatever you call "km", if "m" is the real unit and
"k" is the prefix.

...
  I agree that the system should allow explicit conversion in infoboxes.
 I disagree that you should create an artificial intelligence system for
 wikidata that knows more about unit usage than the authors. To store
 the wisdom of authors, storing both unit and original unit prefix is
 necessary. 
It can be stored as an auxiliary data point, that is, as a qualifier ("measured
in feet"). It should not IMHO be part of the data value as such, because that
would make it extremely hard to use the values in a database.
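
To illustrate the split, here is a rough sketch with hypothetical class names (QuantityValue, Statement); it is not the actual Wikidata data model, only one way the separation could look:

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Dict

@dataclass
class QuantityValue:
    """The data value proper: always in the base unit, so it can be compared."""
    amount_m: Decimal    # length normalized to meters
    accuracy_m: Decimal  # "+/-n", also in meters

@dataclass
class Statement:
    """A value plus free-form qualifiers that queries on the value can ignore."""
    value: QuantityValue
    qualifiers: Dict[str, str] = field(default_factory=dict)

# Entered by the author as 100 ft +/- 0.5 ft, stored as 30.48 m +/- 0.1524 m.
# The input unit survives only as a qualifier.
tower = Statement(
    QuantityValue(Decimal("30.48"), Decimal("0.1524")),
    {"measured in": "feet"},
)
bridge = Statement(QuantityValue(Decimal("25"), Decimal("0.5")))

# A query compares plain numbers and never has to look at input units.
longer_than_26m = [s for s in (tower, bridge) if s.value.amount_m > 26]
```

The author's choice of unit is still available for rendering, but the comparison in the last line works on the normalized amounts alone.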

...
  You write "The Precision can be derived from the accuracy and vice
 versa, using appropriate heuristics."

 I _terribly strongly_ doubt that. Can you give any proof of that? For
 precision I can use statistics, for accuracy I need an indirect,
 separate and precise method to estimate accuracy. If you have a
 laser-distance measurement device, the precision can be estimated by
 yourself by repeated measurements at various times, temperatures, etc.
 But unless you have an objective distance standard, you have no means
 to determine whether the accuracy of the device is always off by 10 cm
 because someone screwed up the software program inside the device. 
Sorry, I didn't mean to say that you can calculate one from the other, but that
you can make a decent guess that is useful for parsing and display. e.g. if a
value is given as 5km, we can assume that it wasn't measured with millimeter
precision. And if something is declared to have millimeter accuracy, it probably
makes sense to use a precision of 3 or 4 digits after the decimal point when
converting to feet.

This is just heuristics, and yes, we shouldn't try to be too smart here. Just
smart enough to be useful.
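
As a sketch of the "converting to feet" case, assuming a made-up helper to_feet that takes the stored value and accuracy in meters:

```python
from decimal import Decimal

METERS_PER_FOOT = Decimal("0.3048")

def to_feet(value_m: str, accuracy_m: str) -> str:
    """Convert meters to feet, guessing the display precision from the accuracy."""
    value_ft = Decimal(value_m) / METERS_PER_FOOT
    accuracy_ft = Decimal(accuracy_m) / METERS_PER_FOOT
    # Show digits down to the position of the leading digit of the accuracy,
    # and never fewer than zero digits after the decimal point.
    digits = max(0, -accuracy_ft.adjusted())
    return f"{value_ft:.{digits}f} ft"

fine = to_feet("1", "0.001")       # millimeter accuracy
coarse = to_feet("5000", "1000")   # "5km", i.e. roughly kilometer accuracy
```

Millimeter accuracy comes out with 3 digits after the decimal point in feet, while the 5 km value is shown as a whole number of feet. It is only a guess at a sensible presentation, nothing more.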

...
   But they are not the same. IMHO, the accuracy should always be stored with the
 value, the precision never.  
 I fear that is a view of how data in a perfect world should be known,
 not a reflection of the kind of data that people need to store in
 Wikidata. Very often only the precision will be known or available to
 its authors, or worse, the source may not say which it is. 
Ok, so my use of precision is not precise :) We are stumbling over words here.
What would you call the level of detail used for presentation, if not
"precision"?

-- daniel

-- 
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
