On Fri, Dec 21, 2012 at 2:47 PM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
On 20.12.2012 20:59, Avenue wrote:
> I am increasingly wondering if "uncertainty" will be overloaded here. People
> seem to want to use it for various types of measurement uncertainty (e.g. the
> standard error), ranges with no defined central value, and distributional
> summaries (e.g. max and min), as well as for the precision with which a value is
> entered (as in the "auto-certainty" value in the prototype). These are all
> quite different beasts, and conflating them will probably lead to problems -

The idea is to allow for the detailed information as qualifiers, while using the
"conflated" uncertainty for query answering. Ideally, for queries, we need a
single value. The current solution would be to use a single value plus a range of
uncertainty for answering queries, so "Primates taller than 1m" will include a
species said to be 90+/-20cm or something.

Well, the problem with that approach is that the query results you get when the various "uncertainty" values are conflated will make little sense. For example, suppose monkey species A is recorded as having an average height of 90 cm, converted to 90 +/- 0.5 cm by the "autocertainty" rule; species B is recorded as having an average height of 90 +/- 6 cm, based on a small sample; and species C is recorded as having heights in the range 90 +/- 20 cm, with a footnote in Wikipedia saying that males' heights are in the range 100 +/- 10 cm and females 80 +/- 10 cm (based on a larger sample). Queries for "Primates taller than 1 m", "Primates taller than 95 cm", and "Primates taller than 90 cm" would return C, B+C, and A+B+C respectively. Do you think this is sensible? What if I tell you that these three species were actually all just the same species, only with its height distribution summarised in different ways?
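
To make the example concrete, here is a minimal Python sketch of the three queries. It assumes that a "taller than X" query matches any record whose upper bound (value plus uncertainty) exceeds X, which is my reading of the proposal and not the actual query implementation. The record layout and the `taller_than` function are hypothetical names made up for illustration.

```python
# Hypothetical records for the three descriptions of the same species:
# (label, central value in cm, +/- "uncertainty" in cm)
RECORDS = [
    ("A", 90.0, 0.5),   # bare "90 cm", widened by the auto-certainty rule
    ("B", 90.0, 6.0),   # measurement uncertainty from a small sample
    ("C", 90.0, 20.0),  # range of heights across males and females
]


def taller_than(threshold_cm, records=RECORDS):
    """Return labels whose upper bound (value + uncertainty) exceeds the threshold.

    Assumption: this is how a query would use the conflated uncertainty.
    """
    return [label for label, value, unc in records if value + unc > threshold_cm]


if __name__ == "__main__":
    for threshold in (100, 95, 90):
        print(f"taller than {threshold} cm:", taller_than(threshold))
```

Under that assumption the thresholds of 100, 95 and 90 cm match C, then B and C, then A, B and C, even though all three records describe the same animals.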

If queries are going to incorporate uncertainty values, I think they need to do it more carefully than this.

Avenue