Re: [Wikidata-l] Data values

19 Dec 2012

On 19.12.2012 08:53, Gregor Hagedorn wrote:
...
   Displaying the numbers is another question. There I have to agree that it
 always makes sense to also store a typically used unit for that type of data.
 I agree. What I propose is that the user interface supports entering
 and proofreading "10.6 nm" as "10.6" plus "n" (= nano) plus "meter".
Yes, absolutely.
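
To illustrate the split of "10.6 nm" into number, prefix and unit, here is a minimal Python sketch. The lookup tables `SI_PREFIXES` and `BASE_UNITS` and the function `split_quantity` are hypothetical names made up for this example; a real input parser would need far more units and locale handling.

```python
import re
from decimal import Decimal

# Hypothetical lookup tables; deliberately tiny for the sake of the example.
SI_PREFIXES = {"k": 3, "m": -3, "µ": -6, "n": -9}
BASE_UNITS = {"m": "meter", "g": "gram", "s": "second"}

def split_quantity(text):
    """Split e.g. '10.6 nm' into (Decimal('10.6'), 'n', 'meter')."""
    match = re.fullmatch(r"\s*([-+]?\d+(?:\.\d+)?)\s*(\S+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse {text!r}")
    number, symbol = match.groups()
    # Try the bare unit first, so that 'm' is read as meter, not as a prefix.
    if symbol in BASE_UNITS:
        return Decimal(number), "", BASE_UNITS[symbol]
    prefix, unit = symbol[0], symbol[1:]
    if prefix in SI_PREFIXES and unit in BASE_UNITS:
        return Decimal(number), prefix, BASE_UNITS[unit]
    raise ValueError(f"unknown unit {symbol!r}")
```

The number is kept as a `Decimal` built from the typed string, so the digits the user entered survive for proofreading.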

...
  How the value is stored in the data property, whether as 10.6 floating
 point or as 1.06e-8, is a second issue -- the latter is probably
 preferable.
I think neither is sufficient: we need a representation that allows for
arbitrary (or at least very great) precision, and can still be indexed and
compared natively by (different!) database systems. Fixed length strings can
easily do that, if they are long enough. That's pretty inefficient, though.

IEEE floats work natively, but don't guarantee enough precision (well, maybe 128
bit floats come close?). The SQL "decimal" type might be sufficient: in MySQL, it
allows up to 65 digits in total, of which at most 30 may follow the decimal
point. But 30 fractional digits are still not enough to express a Planck length
in meters.
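
As a rough illustration of the fixed-length string idea, the following Python sketch turns a decimal value into a zero-padded digit string whose plain string order matches numeric order, so any database can index and compare it. The widths of 40 digits on each side and the name `sortable_key` are assumptions made for this example, and negative values are left out.

```python
from decimal import Decimal, getcontext

getcontext().prec = 80  # assumed working precision, generous for this sketch

INT_DIGITS = 40   # assumed number of digits before the decimal point
FRAC_DIGITS = 40  # assumed number of digits after the decimal point

def sortable_key(value):
    """Fixed-length digit string whose string order matches numeric order.

    Sketch only: handles non-negative values that fit the assumed widths.
    """
    value = Decimal(value)
    if value < 0:
        raise ValueError("negative values are not handled in this sketch")
    # Shift the decimal point right so the value becomes an integer.
    scaled = value.scaleb(FRAC_DIGITS)
    if scaled != scaled.to_integral_value():
        raise ValueError("more fractional digits than FRAC_DIGITS")
    digits = str(int(scaled))
    if len(digits) > INT_DIGITS + FRAC_DIGITS:
        raise ValueError("more integer digits than INT_DIGITS")
    return digits.zfill(INT_DIGITS + FRAC_DIGITS)
```

With these widths, both `sortable_key("1.616e-35")` (roughly a Planck length in meters) and `sortable_key("8.8e26")` fit into 80 characters and sort correctly against each other, at the cost of 80 bytes per value.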

...
  In addition to a storage option of the desired unit prefix (this may
 be considered an original prefix, since naturally re-users may wish to
 reformat this).
I see no point in storing the unit used for input.

...
  it is probably necessary to store the number of
 significant decimals. 
That's how Denny proposed to calculate the default accuracy. If the accuracy is
given by a complex model (e.g. a gamma distribution), then it might be handy to
have a simple value that tells us the significant digits.

Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more
detailed information (standard deviation, whatever) as *additional* information
about the accuracy (could be modelled as a qualifier internally).
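
A minimal Python sketch of deriving a default "+/-n" from the digits that were typed. The rule used here (one unit of the last digit entered) is only an assumption for illustration, not necessarily the rule Denny proposed, and `default_accuracy` is a made-up name.

```python
from decimal import Decimal

def default_accuracy(text):
    """Guess a +/- accuracy from the digits the user typed.

    Assumption: the accuracy is one unit of the last digit entered,
    so '10.20' gives 0.01 and '1200' gives 1.
    """
    # The exponent of the parsed decimal is the position of the last typed digit.
    exponent = Decimal(text).as_tuple().exponent
    return Decimal(1).scaleb(exponent)
```

More detailed models (a standard deviation, a distribution) would then sit next to this simple value instead of replacing it.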

...
  I believe in the user interface this need not be a visible setting; the
 number of digits can simply be preserved. Without this it is impossible
 to store and reproduce information like "10.20 nm"; it would be
 returned as 1.02 * 10^-8 m.
No, it would return using whatever system of measurement the user has selected
in their preferences.

...
  Complex heuristics may "guess" when to use the scientific SI prefixes
 instead. The trailing zero cannot be reproduced, however, when relying
 completely on IEEE floating point.
We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The
general rule could be to pick a unit so that the actual value is between 1 and
10, with some additional rules for dealing with cultural specialities (the
decimeter is rarely used, the hectoliter however is pretty common, the decagram
is commonly used in Austria only, etc.).
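
A sketch of such a heuristic in Python, under some assumptions: the prefix table below only contains steps of 1000 and skips rarely used prefixes such as deci and deca, so the scaled value ends up between 1 and 1000 rather than strictly between 1 and 10. The table contents and the name `pick_prefix` are illustrative only.

```python
from decimal import Decimal

# (symbol, power of ten), largest first. The selection is an assumption;
# cultural rules would add or remove entries per unit and locale.
PREFIXES = [("k", 3), ("", 0), ("m", -3), ("µ", -6), ("n", -9)]

def pick_prefix(value_in_base_unit):
    """Return (scaled value, prefix), using the largest prefix that keeps |value| >= 1."""
    value = Decimal(value_in_base_unit)
    if value == 0:
        return value, ""
    for symbol, power in PREFIXES:
        scaled = value.scaleb(-power)
        if abs(scaled) >= 1:
            return scaled, symbol
    # Smaller than the smallest known prefix: fall back to that prefix.
    symbol, power = PREFIXES[-1]
    return value.scaleb(-power), symbol
```

For example, `pick_prefix("1.020e-8")` yields the value `10.20` with the prefix `n`; because the shift is done on a decimal and not on a float, the trailing zero is kept.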

Note that for rendering of values in infoboxes, the desired unit and precision
can always be given explicitly.

Note "precision" vs "accuracy" here: the precision controls how many digits are
shown, while the accuracy indicates how exact our knowledge is. The precision
can be derived from the accuracy and vice versa, using appropriate heuristics.
But they are not the same. IMHO, the accuracy should always be stored with the
value, the precision never.
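
One possible heuristic for deriving the display precision from a stored accuracy, as a Python sketch. The rule (show digits down to the position of the accuracy's leading digit) is an assumption chosen for illustration, and `render` is a made-up name.

```python
from decimal import Decimal

def render(value, accuracy):
    """Format value with as many digits as the +/- accuracy justifies.

    Assumption: digits are shown down to the position of the
    accuracy's leading digit, e.g. +/-0.01 means two decimals.
    """
    value, accuracy = Decimal(value), Decimal(accuracy)
    # adjusted() is the exponent of the leading digit: 0.01 -> -2, 20 -> 1.
    place = accuracy.adjusted()
    rounded = value.quantize(Decimal(1).scaleb(place))
    # 'f' avoids scientific notation when rounding to tens or hundreds.
    return format(rounded, "f")
```

Only the value and the accuracy are stored; the number of digits shown is computed at rendering time, e.g. `render("10.2", "0.01")` gives `10.20`.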

-- daniel

-- 
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
