A related question: how many WD statements are already part of a time
series? Let's say this means statements qualified by point in time (P585
<https://www.wikidata.org/wiki/Property:P585>) where the same property
has at least four other statements with a point-in-time qualifier.
This query
<https://query.wikidata.org/#%23defaultView%3AAreaChart%0ASELECT%20%3Fst%20%3Fct%20%7B%0A%20%20FILTER%20%28%3Fst%20%3E%205%29%0A%20%20%7B%0A%20%20%20BIND%20%280%20AS%20%3Fct%29%0A%20%20%20BIND%20%280%20AS%20%3Fst%29%0A%20%20%7D%0A%20%20UNION%20%7B%0A%20%20%20%20SELECT%20%3Fst%20%28COUNT%28%2a%29%20as%20%3Fct%29%20%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%3Fitem%20wdt%3AP585%20%3Fvalue%20%3B%20wikibase%3Astatements%20%3Fst%0A%20%20%20%20%7D%0A%20%20%20%20GROUP%20BY%20%3Fst%0A%20%20%20%20ORDER%20BY%20%3Fst%0A%20%20%7D%0A%7D>
suggests roughly 800K such statements in all, with around 400 outliers
that have over 400 of them.
This is common enough, particularly for high-interest, high-traffic
entities, that it would be nice to have a more explicit way of handling
it.
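
To make the working definition above concrete, here is a minimal sketch
on in-memory data. It assumes "instances of that property" means
statements of the same property on the same item; the tuple layout, the
function name and the sample values are all hypothetical placeholders,
not real Wikidata figures.

```python
from collections import Counter

# Hypothetical, simplified statements: (item, property, point_in_time, value).
# point_in_time is None when the statement has no P585 qualifier.
# All values are made-up placeholders, not real Wikidata data.
STATEMENTS = [
    ("Q189", "P1082", 2015, 100),
    ("Q189", "P1082", 2016, 101),
    ("Q189", "P1082", 2017, 102),
    ("Q189", "P1082", 2018, 103),
    ("Q189", "P1082", 2019, 104),
    ("Q189", "P2131", 2018, 200),
    ("Q189", "P2131", 2019, 201),
    ("Q189", "P36", None, "placeholder"),
]


def time_series_statements(statements, min_size=5):
    """Return the dated statements whose (item, property) group has at
    least min_size dated statements, i.e. each one has at least
    min_size - 1 dated siblings."""
    dated = [s for s in statements if s[2] is not None]
    sizes = Counter((item, prop) for item, prop, _, _ in dated)
    return [s for s in dated if sizes[(s[0], s[1])] >= min_size]


in_series = time_series_statements(STATEMENTS)
print(len(in_series), "of", len(STATEMENTS), "statements are in a time series")
```

With the default threshold of five, only the P1082 group qualifies here;
the two P2131 statements are dated but too few to count as a series.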
One suggestion: a norm of having a single most-recent value for each
time-series property, and a time-series property-space used exclusively
for historical values. This would support explicitly noting where a time
series is intended, allow for cleaner edit histories for that work, allow
for including other time-series data that is in active use on Wikipedia,
and help optimize queries for the most recent data.
For instance: Iceland <https://www.wikidata.org/wiki/Q189> currently has
over 2x as many statements as its main entry needs. It has
* 17 statements for life expectancy,
* 60 statements for population,
* 57 statements for nominal GDP,
* 57 statements for nominal GDP per capita, &c.
(each qualified by point in time and a reference).
Instead it could have a single statement for the latest value of each of
these (qualified by point in time: *date*, reference: *URL*, and
*time-series*: *start date - end date*), and an associated entity like
*Q189/historical* could hold the time series, with the ~400 individual
historical statements. Most queries and views could touch only the
non-time-series statements, reflecting the most common uses of this data on
the projects.
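
A rough sketch of that split, for one property on one item. Everything
here is hypothetical: the dict layout, the `time_series` key standing in
for the proposed *time-series* qualifier, and the sample values. It also
assumes the qualifier's range runs from the earliest historical date to
the date of the latest value.

```python
from datetime import date


def split_latest(statements):
    """Split one property's dated statements into (latest, historical).

    latest is a copy of the most recent statement, annotated with a
    hypothetical time_series range when older values exist; historical
    is the remaining statements, oldest first.
    """
    ordered = sorted(statements, key=lambda s: s["point_in_time"])
    latest = dict(ordered[-1])  # copy, so the input is left untouched
    historical = ordered[:-1]
    if historical:
        latest["time_series"] = (
            historical[0]["point_in_time"],
            latest["point_in_time"],
        )
    return latest, historical


# Made-up placeholder values, not real figures for Iceland.
population = [
    {"point_in_time": date(2018, 1, 1), "value": 101, "reference": "URL"},
    {"point_in_time": date(2019, 1, 1), "value": 102, "reference": "URL"},
    {"point_in_time": date(2017, 1, 1), "value": 100, "reference": "URL"},
]

latest, historical = split_latest(population)
# latest would stay on the main item (e.g. Q189); historical would move to
# the associated entity (e.g. the hypothetical Q189/historical).
print(latest)
print(len(historical), "historical statements")
```

A query for current values would then read only `latest`, and would need
the historical entity only when the full series is wanted.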
SJ
On Fri, Apr 10, 2020 at 10:13 AM Samuel Klein <meta.sj(a)gmail.com> wrote:
There are many highly used templates on WP with time-series data about
COVID spread: cases, tests, health outcomes, by region + per day. Each
cell has a source and some context (caveats, multiple slightly conflicting
or time-offset sources, commentary about that data point), and would
benefit from being explicitly versioned in Wikidata.
What's the right way to capture this in Wikidata - currently, and in the
future? EN Wikipedia tends to have one footnote about sourcing per
geography, with occasional footnotes about how some of those sources have
changed over time. I don't know of any of these templates that are drawing
from Wikidata.
SJ
--
Samuel Klein @metasj w:user:sj +1 617 529 4266