Re: [Wikidata-l] Data values

19 Dec 2012


      On 19.12.2012 15:12, Gregor Hagedorn wrote:
...
...
I see no point in storing the unit used for input.
I think you plan to store the unit (which would be meter), so you
don't want to store prefixes, correct?
Please argue why you don't see a point. You want to store the size of
the universe, the distance to New York, and the size of the proton all in "meter"?
Yes. Otherwise, they would not be comparable in a database query. It's generally
a good idea to normalize all values of a given dimension to use the same unit,
for the same reason that it's best to convert dates in a database to UTC, instead
of storing time zone info for each entry.
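To illustrate the normalization argument, here is a minimal sketch, assuming a hypothetical conversion table `TO_METERS` and helper `to_meters` (neither is part of any Wikidata code). It only shows that values entered in different units become directly sortable once they share a base unit.

```python
from decimal import Decimal

# Hypothetical conversion table: how many meters one of each unit is.
TO_METERS = {
    "nm": Decimal("1e-9"),
    "mm": Decimal("1e-3"),
    "m": Decimal("1"),
    "km": Decimal("1e3"),
    "ft": Decimal("0.3048"),
}

def to_meters(amount: str, unit: str) -> Decimal:
    """Normalize an (amount, unit) pair to meters."""
    return Decimal(amount) * TO_METERS[unit]

# Values entered in different units can be compared once normalized.
entered = [("40000", "km"), ("1.7", "nm"), ("5280", "ft")]
normalized = sorted(to_meters(amount, unit) for amount, unit in entered)
```

After normalization, a query such as "everything longer than 1 km" is a plain numeric comparison against 1000.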
...
If not, with which algorithm will you restore the SI prefix, or rather,
recognize which SI prefix is usable? We do not use Mm in common
language, so we give the circumference of the earth as roughly
40 000 km and not as 40 Mm. We don't write 4*10^7 m either.
This would be done based on the accuracy. If the accuracy is ~1000m, the
heuristic would decide that km is probably a good unit, with no digits after the
decimal point. Besides that, the desired unit and precision can be specified
when rendering values on a wiki page.
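One possible reading of this heuristic, as a rough sketch only: pick the largest common prefix that does not exceed the order of magnitude of the accuracy, and show no decimals. The prefix list, `pick_unit` and `render` are assumptions made for illustration. Mm is deliberately left out of the list, since it is not used in common language.

```python
from decimal import Decimal

# Hypothetical list of commonly used length units, as (power of ten, symbol),
# ordered from largest to smallest. "um" stands in for micrometer.
PREFIXES = [(3, "km"), (0, "m"), (-3, "mm"), (-6, "um"), (-9, "nm")]

def pick_unit(accuracy_m: Decimal):
    """Return (exponent, symbol, decimals) for a given accuracy in meters."""
    acc_exp = accuracy_m.adjusted()  # order of magnitude of the accuracy
    for exp, symbol in PREFIXES:
        if exp <= acc_exp:
            return exp, symbol, 0  # unit is no finer than the accuracy
    # Accuracy is finer than the smallest unit: keep that unit, add decimals.
    exp, symbol = PREFIXES[-1]
    return exp, symbol, exp - acc_exp

def render(value_m: Decimal, accuracy_m: Decimal) -> str:
    """Format a value in meters using the unit suggested by its accuracy."""
    exp, symbol, decimals = pick_unit(accuracy_m)
    scaled = value_m.scaleb(-exp)
    rounded = scaled.quantize(Decimal(1).scaleb(-decimals))
    return f"{rounded:f} {symbol}"

print(render(Decimal("40075017"), Decimal("1000")))
```

With an accuracy of about 1000 m, the circumference of the earth comes out in km without decimals. The sketch does not round more coarsely than one unit, so an accuracy of 10 km would still print every km digit; a real renderer would also honor an explicitly requested unit and precision.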
...
...
...
it is probably necessary to store the number of
significant decimals.
Yes, that *is* the accuracy value I mean.
...
...
That's how Denny proposed to calculate the default accuracy. If the accuracy is
given by a complex model (e.g. a gamma distribution), then it might be handy to
have a simple value that tells us the significant digits.
Hm... perhaps it's best to always express accuracy as "+/-n", and allow for more
detailed information (standard deviation, whatever) as *additional* information
about the accuracy (could be modelled as a qualifier internally).
I fear that is two separate levels of precision of giving a measure of
measurement _precision_ (I believe "accuracy" is the wrong term here,
precision and accuracy are related but distinct concepts).
Ok, there's some terminology confusion here. I'm using "accuracy" to refer to
the accuracy of measurement (e.g. standard deviation), and "precision" to refer
to the precision of presentation (e.g. significant digits). We need these two
things at least, and words for them. I don't care much which words we use.
...
So 4.10
means that the last digit is significant, i.e. the best estimate is at
least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005
means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004,
4.10 +/- 0.003, 4.10 +/- 0.002 etc.
Yes, all this should be handled by the component responsible for parsing user
input for quantity values.
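A minimal sketch of what such a parsing component could do, assuming a hypothetical `parse_quantity` function and the plain "+/-" notation used in this thread: an explicit "+/- n" is taken as given, otherwise half a unit of the last written digit is assumed. Using `Decimal` rather than a binary float keeps the trailing zero of "4.10".

```python
from decimal import Decimal

def parse_quantity(text: str):
    """Parse '4.10' or '4.10 +/- 0.004' into (value, uncertainty).

    Assumption: without an explicit '+/-', the uncertainty defaults to
    half a unit of the last digit written by the user.
    """
    if "+/-" in text:
        value_text, unc_text = (part.strip() for part in text.split("+/-", 1))
        return Decimal(value_text), Decimal(unc_text)
    value = Decimal(text.strip())
    last_digit_exp = value.as_tuple().exponent  # -2 for "4.10"
    return value, Decimal(5).scaleb(last_digit_exp - 1)

value, uncertainty = parse_quantity("4.10")
print(value, uncertainty)
```

This default cannot tell whether the trailing zeros of an integer such as 40000 are significant, so a real parser would need an additional rule or explicit input for that case.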
...
Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision
of measurement and the measure of variance and dispersion are
separate concepts.
True, though I think that we can lump them together for the sake of comparison,
picking appropriate units, etc. The finer points can be handled by qualifiers
(of the value, or even of the accuracy).
...
...
No, it would return using whatever system of measurement the user has selected
in their preferences.
then you have lost the information. There is no "user selection" in
this in science.
I have lost the information which unit was used when originally entering the
data. That's it.
...
...
...
Complex heuristic
may "guess" when to use the scientific SI prefixes instead. The
trailing zero cannot be reproduced however when completely relying on
IEEE floating-point.
We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The
(I believe there is no such thing as a "secondary unit", did you make
that term up? Only "m" is a unit of measurement, the n or k are
prefixes see http://en.wikipedia.org/wiki/SI_prefix )
I made up the term, yes :) Whatever you call "km", if "m" is the real unit and
"k" is the prefix.
...
I agree that the system should allow explicit conversion in infoboxes.
I disagree that you should create an artificial intelligence system for
Wikidata that knows more about unit usage than the authors. To store
the wisdom of authors, storing both unit and original unit prefix is
necessary.
It can be stored as an auxiliary data point, that is, as a qualifier ("measured
in feet"). It should not IMHO be part of the data value as such, because that
would make it extremely hard to use the values in a database.
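As a rough sketch of that separation, assuming hypothetical `QuantityValue` and `Statement` classes (these are not the actual Wikibase data model): the value itself holds only the normalized amount and its accuracy, while the unit used at input time travels alongside it as a qualifier.

```python
from dataclasses import dataclass, field
from decimal import Decimal

@dataclass(frozen=True)
class QuantityValue:
    """The data value proper: always expressed in the base unit (meters)."""
    amount_m: Decimal
    accuracy_m: Decimal

@dataclass
class Statement:
    """A value plus auxiliary information that is not part of the value."""
    value: QuantityValue
    qualifiers: dict = field(default_factory=dict)

# One mile, entered in feet: the value is normalized, the input unit is kept
# only as a qualifier and does not affect comparisons.
mile = Statement(
    QuantityValue(Decimal("1609.344"), Decimal("0.3048")),
    qualifiers={"measured in": "feet"},
)
```

Queries and comparisons only ever look at `value`; a renderer that wants to respect the author's choice can consult the qualifier.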
...
You write "The Precision can be derived from the accuracy and vice
versa, using appropriate heuristics."
I _very strongly_ doubt that. Can you give any proof of that? For
precision I can use statistics; for accuracy I need an indirect,
separate and precise method to estimate accuracy. If you have a
laser-distance measurement device, the precision can be estimated by
yourself by repeated measurements at various times, temperatures, etc.
But unless you have an objective distance standard, you have no means
to determine whether the accuracy of the device is always off by 10 cm
because someone screwed up the software program inside the device.
Sorry, I didn't mean to say that you can calculate one from the other, but that
you can make a decent guess that is useful for parsing and display. E.g. if a
value is given as 5km, we can assume that it wasn't measured with millimeter
precision. And if something is declared to have millimeter accuracy, it probably
makes sense to use a precision of 3 or 4 digits after the decimal point when
converting to feet.
This is just heuristics, and yes, we shouldn't try to be too smart here. Just
smart enough to be useful.
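The feet example can be sketched as follows, assuming a hypothetical `to_feet` helper: the accuracy is converted along with the value, and its order of magnitude in the target unit decides how many decimals to show. This is only a display guess, not a way to compute accuracy from precision.

```python
from decimal import Decimal

METERS_PER_FOOT = Decimal("0.3048")  # exact by definition of the foot

def to_feet(value_m: Decimal, accuracy_m: Decimal) -> str:
    """Convert meters to feet, choosing display decimals from the accuracy."""
    value_ft = value_m / METERS_PER_FOOT
    accuracy_ft = accuracy_m / METERS_PER_FOOT
    # Millimeter accuracy is about 0.003 ft, which suggests 3 decimals.
    decimals = max(0, -accuracy_ft.adjusted())
    return f"{value_ft:.{decimals}f} ft"

print(to_feet(Decimal("1.524"), Decimal("0.001")))
```

For a value with an accuracy of about 1 km, the same rule shows no decimals at all.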
...
...
But they are not the same. IMHO, the accuracy should always be stored with the
value, the precision never.
I fear that is a view of how data in a perfect world should be known,
not a reflection of the kind of data that people need to store in
Wikidata. Very often only the precision will be known or available to
its authors, or worse, the source may not say which it is.
Ok, so my use of precision is not precise :) We are stumbling over words here.
What would you call the level of detail used for presentation, if not "precision"?
-- daniel
-- 
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
