Hi!
There is always caching.
We have 14M+ entries, so keeping them all in memory isn't realistic,
and repetitiveness of access is pretty low - each entry would be
accessed only once per inferred relationship. The cache would work well
only if we were somehow lucky enough to process related entities in
clusters, so that both ends of a relationship are processed within a
short time of each other, and I don't see why we can count on such luck.
Besides that, caching would only save the time required to load the data
from the database, not the time to actually process all the inferences.
If we're talking about 15M entities, every 1ms of extra processing time
per entity adds about 4 hours to dump processing. Granted, with modern
CPUs you can do a lot in 1ms, but we should keep the costs in mind.
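The back-of-the-envelope math above checks out; here's the arithmetic spelled out (using the rounded 15M figure from the previous sentence):

```python
# Cost of 1 ms of extra per-entity work over ~15M entities.
entities = 15_000_000
extra_ms_per_entity = 1

total_seconds = entities * extra_ms_per_entity / 1000
total_hours = total_seconds / 3600

print(round(total_hours, 2))  # ~4.17 hours
```

So even a "cheap" per-entity step dominates once it's multiplied by the full dump.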
Also, there's another thing. Suppose we have Q345 -> spouse -> Q123 but
not Q123 -> spouse -> Q345, and we process entities, without loss of
generality, in order of ascending IDs. When we generate data for Q123,
we don't yet know that Q345 links to it. So to infer Q123 -> spouse ->
Q345, we can't just load Q345 (we'd need to load it later anyway to get
the qualifiers, etc.), since we don't know we'd need it; we'd probably
have to query the database (if we have a suitable links table?) for
every entry that has Q123 on the other end of "spouse". I'm not even
sure that's currently possible on Wikidata (the query service can
easily do it, but not within 1ms), and even if it is, I don't see how
it's cacheable, and doing this for every entity across multiple
relationships may be quite expensive.
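To make the reverse-lookup cost concrete, here's a minimal sketch of the per-entity query this would require, assuming a hypothetical links table (entity_id, property, target_id) indexed on the target side -- the schema and names here are invented for illustration and don't reflect Wikidata's actual storage:

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links (entity_id TEXT, property TEXT, target_id TEXT)")
# Without an index on the target side, every reverse lookup is a full scan.
conn.execute("CREATE INDEX idx_target ON links (target_id, property)")
conn.execute("INSERT INTO links VALUES ('Q345', 'spouse', 'Q123')")

def reverse_links(target, prop):
    """Find every entity pointing at `target` via `prop` -- the query
    we'd have to run once per entity, per inferred relationship."""
    rows = conn.execute(
        "SELECT entity_id FROM links WHERE target_id = ? AND property = ?",
        (target, prop))
    return [r[0] for r in rows]

print(reverse_links("Q123", "spouse"))  # ['Q345']
```

Even with the index, that's one extra indexed query per entity per relationship, which is exactly the kind of per-entity cost the 1ms-per-entity estimate above is warning about.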
--
Stas Malyshev
smalyshev(a)wikimedia.org