Hi all,
wow! Thanks for all the input. I have read through all of it and am currently trying to digest it into a new draft of the data model for the data values under discussion. I will try to address some questions here. Please be kind if I attribute something to the wrong person in one place or another.
Whenever I refer to the "current model", I mean the version as it was during this discussion < http://meta.wikimedia.org/w/index.php?title=Wikidata/Development/Representin...
The term "updated model" refers to the new one, which is not published yet. I hope I can do that soon.
== General comments ==
I want to remind everyone of the Wikidata requirements: <http://meta.wikimedia.org/wiki/Wikidata/Notes/Requirements>
Here especially:
* The expressiveness of Wikidata will be limited. There will always be examples of knowledge that Wikidata will not be able to convey. We hope that this expressiveness can increase over time.
* The first goal of Wikidata is to serve actual use cases in Wikipedia, not to enable some form of hypothetical perfection in knowledge representation.
* Wikidata has to balance ease of use and expressiveness of statements. The user interface should not get complicated to merely cover a few exceptional edge cases.
* What is an exceptional case, and what is not, will be defined by how often they appear in Wikipedia. Instead of anecdotal evidence or hypothetical examples we will analyse Wikipedia and see how frequent specific cases are.
In general this means that we cannot express everything that is expressible. A statement is not intended to reflect the source as closely as possible, but rather to be *supported* by the source. I.e. if the source says "He died during the early days of 1876", this would also support a statement like "died in - 19th century". It does not have to be more exact than that.
Martynas, there is no mention here of XSD etc. because it is not relevant on this level of discussion. For exporting the data we will obviously use XSD datatypes. This is so obvious that I didn't think it needed to be explicitly stated.
Tom, thanks for the links to EDTF and the Freebase work, this was certainly very enlightening.
Friedrich, the term "query answering" simply means the ability to answer queries against the database in Phase 3, e.g. the list of cities located in Ghana with a population over 25,000 ordered by population.
A query system that deals well with intervals -- I would need a pointer for that. So far I have always assumed that we use a single value internally to answer such queries. If the value is 90 +- 20, then the query >100? would not return that result. Sucks, but I don't know of any better system.
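To make the single-value behaviour concrete, here is a minimal sketch. The names `Quantity` and `greater_than` are made up for illustration and are not part of the model; the only assumption is that a query compares against the central value and ignores the uncertainty.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quantity:
    """Hypothetical quantity: a central value plus a symmetric uncertainty."""
    value: Decimal        # the single value used for query answering
    uncertainty: Decimal  # the +- part, ignored by the query below


def greater_than(quantities, threshold):
    """Return the quantities whose central value exceeds the threshold."""
    return [q for q in quantities if q.value > threshold]


candidates = [
    Quantity(Decimal("90"), Decimal("20")),   # 90 +- 20 could be up to 110
    Quantity(Decimal("120"), Decimal("5")),
]
result = greater_than(candidates, Decimal("100"))
```

Here `result` contains only the 120 +- 5 entry. The 90 +- 20 entry is left out even though its upper end reaches 110, which is exactly the limitation described above.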
We do not rely on floats anywhere (except in internal representations), but always use decimals. Floats cannot represent some decimal numbers exactly, and those numbers could be of interest to us.
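A small illustration of the float problem, using Python's standard `decimal` module purely as an example; nothing here says which decimal implementation Wikidata will use.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)               # False

# Decimals built from strings keep the digits that were entered.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True
```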
== Time ==
Marco suggested marking some components of a date as N/A. This is partially the idea of the "precision" attribute in the current model. Anything below the precision would be N/A. It would not be possible to N/A the year when the month or day is known though, as Friedrich suggested.
Friedrich also suggested to use a value like April-July 1567 for uncertain time instead of the current precision model. I prefer his suggestion to the current one and will include that in the updated model.
The accuracy, though, has to be in the unit given by the precision. We cannot just take seconds, since there is no well-defined number of seconds in a month or a year, or in almost anything, actually.
Note though that the intervals that Sven mentioned -- useful for e.g. reigns or office periods -- are different beasts and should have uncertainty entries both for the start and end date. We have intervals in the data model, and plan to implement them later -- it is just that they are not such a high priority (dates appear 2.5 million times in infoboxes, intervals only 80,000 times).
I am completely unsure what to do with a value like "about 1850" if not to interpret it as something like 1850 +- 50, but Sven seems to dislike that.
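As a rough sketch of what "accuracy in the unit of the precision" could look like for month precision: the class `MonthValue` and its fields `before` and `after` are hypothetical names chosen for this example and are not taken from the current or the updated model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthValue:
    """Hypothetical time value with month precision."""
    year: int
    month: int       # 1..12
    before: int = 0  # uncertainty before the value, counted in months
    after: int = 0   # uncertainty after the value, counted in months

    def bounds(self):
        """Return ((year, month), (year, month)) for the earliest and latest month."""
        def shift(delta):
            # Count months from year 0 so that year boundaries are handled.
            index = self.year * 12 + (self.month - 1) + delta
            return index // 12, index % 12 + 1
        return shift(-self.before), shift(self.after)


# "April-July 1567" expressed as April 1567 with three months of slack afterwards.
value = MonthValue(1567, 4, before=0, after=3)
earliest, latest = value.bounds()
```

Counting the uncertainty in months sidesteps the question of how many seconds a month has, since the arithmetic never leaves the unit of the precision.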
== Location ==
After the discussion, I decided to drop altitude / elevation from the Geolocation. It can still be expressed through a property, and have all the flexibility of a normal property (including qualifiers etc.)
In a Geolocation, neither the lat nor the long is optional (sorry Nikola). The Geolocation as a whole can be optional (i.e. unknown), but not just one of the two coordinates.
For the uncertainty of geolocations I would like to use the same uncertainty model as for Quantity values and now for time. I know that "meters" have been suggested instead of degrees, but that would be kind of ugh considering that the biggest reason why we need the uncertainty is for converting units, in this case from decimal degrees to degrees-minutes-seconds.
== Quantity values ==
Sorry to disagree with Daniel here, but we will definitely store a quantity value in the unit that the editor used for input. We will then internally normalize it for indexing etc., but the editor won't be bothered with that as long as they do not ask for a conversion. Storing it with the original unit is important for a number of reasons, most of which Gregor already alluded to.
I very much like Gregor's suggestion: rename the lower uncertainty and upper uncertainty to something with less semantic baggage. What about upper and lower bound? Or just upper and lower? And then leave the interpretation to others.
Gregor, an infinitely precise number (the number of apostles, e.g.) would be handled trivially by +- 0.
Also, I am taking the hint from Avenue and others and will drop confidence. I don't think it is useful to have it so deeply embedded in the data model; it should rather be handled through qualifiers.
Regarding the height of the Eiffel tower: 324 m +- 1 m is exactly what I would like to see here if the source states 324 meters. I know the source doesn't say +- 1 m, but this is certainly supported by the source. Think about why we need this +- 1 m: it is simply so we can give a useful transformation into feet. Otherwise we cannot convert units. The +- 1 m would not usually be displayed.
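To illustrate why the uncertainty matters for conversion, here is a minimal sketch. The function `meters_to_feet` and its rounding rule (round to the position of the leading digit of the converted uncertainty) are assumptions made for this example, not a decided part of the model.

```python
from decimal import Decimal, ROUND_HALF_UP

METERS_PER_FOOT = Decimal("0.3048")  # the international foot is defined as 0.3048 m


def meters_to_feet(value_m, uncertainty_m):
    """Convert to feet and round to the leading digit of the uncertainty."""
    value_ft = value_m / METERS_PER_FOOT
    uncertainty_ft = uncertainty_m / METERS_PER_FOOT
    if uncertainty_ft == 0:
        # An exact value (+- 0) gives no hint on where to round.
        return value_ft, uncertainty_ft
    # adjusted() is the exponent of the most significant digit,
    # e.g. 0 for 3.28..., so the quantum becomes 1 (whole feet).
    quantum = Decimal(1).scaleb(uncertainty_ft.adjusted())
    return (
        value_ft.quantize(quantum, rounding=ROUND_HALF_UP),
        uncertainty_ft.quantize(quantum, rounding=ROUND_HALF_UP),
    )


height_ft, plus_minus_ft = meters_to_feet(Decimal("324"), Decimal("1"))
```

With 324 m +- 1 m this yields 1063 ft +- 3 ft. Without the uncertainty the raw quotient would be 1062.99212598... ft, and nothing in the data would tell us how many of those digits are meaningful.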
== Units ==
I sense consensus that we should allow units to be declared in the wiki rather than hardcoding them in the software. Having discussed the various options and in light of the discussion here, the current suggestion would be to create a page for every quantity unit including the appropriate factors (for linear conversions). This is similar to the way Freebase does it, as sent around by Tom, and to what John McClure suggested.
Then on a given property, the property points to a quantity unit and furthermore lists the "usual units" for the given property (pointing to the given items), which is used for display.
Internally, for indexing, sorting, and query answering, we would always transform the input to the quantity unit so they are comparable. But this is usually neither exposed nor a useful number (e.g. it might have too many significant digits etc.)
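A minimal sketch of this normalization, assuming a hypothetical table `UNIT_FACTORS` that stands in for the factors declared on the unit pages, with the metre as the reference unit. The class name `StoredQuantity` is likewise made up for the example.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical stand-in for factors declared on wiki pages (reference unit: metre).
UNIT_FACTORS = {
    "metre": Decimal("1"),
    "kilometre": Decimal("1000"),
    "foot": Decimal("0.3048"),
}


@dataclass(frozen=True)
class StoredQuantity:
    """A value kept in the unit the editor entered."""
    amount: Decimal
    unit: str

    def normalized(self):
        """Value in the reference unit; used for indexing and sorting only."""
        return self.amount * UNIT_FACTORS[self.unit]


heights = [
    StoredQuantity(Decimal("1063"), "foot"),
    StoredQuantity(Decimal("0.3"), "kilometre"),
    StoredQuantity(Decimal("324"), "metre"),
]
ordered = sorted(heights, key=lambda q: q.normalized())
```

The stored entries keep their original amounts and units. Only the sort key is normalized, and a key such as 324.0024 for the value entered in feet shows why that number is not something one would want to display.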
This would allow the use of historic units like Li or historic miles even though we do not know how to convert them to other units (but not by the same property).
This would also allow for other units, like Avenue has pointed out. Those are important.
Nikola, we will not have special handling for money for now. This would require a whole different spec, I am afraid. Currencies appear 200,000 times in Wikipedia -- that is often, but not often enough to be a high priority.
I hope that I managed to digest the whole discussion and bring it together.
Cheers, Denny