Hi all!
tl;dr: How to best handle the situation of an old parser cache entry not
containing all the info expected by a newly deployed version of code?
We are currently working to improve our usage of the parser cache for
Wikibase/Wikidata. For example, we are attaching additional information related to
language links to the ParserOutput, so we can use it in the skin when generating
the sidebar.
However, when we change what gets stored in the parser cache, we still need to
deal with old cache entries that do not yet have the desired information
attached. Here are a few options we have if the expected info isn't in the cached
ParserOutput:
1) ...then generate it on the fly, on every page view, until the parser cache is
purged. This seems bad, especially if generating the required info means hitting
the database.
2) ...then invalidate the parser cache for this page, and then a) just live with
this request missing a bit of output, b) generate it on the fly, or c) trigger a
self-redirect.
3) ...then generate it, attach it to the ParserOutput, and push the updated
ParserOutput object back into the cache. This seems nice, but I'm not sure how
to do that.
4) ...then force a full re-rendering and re-caching of the page, then continue.
I'm not sure how to do this cleanly.
So, the simplest solution seems to be 2, but it means that we potentially
invalidate the parser cache of *every* page on the wiki (though we will not hit
the long tail of rarely viewed pages immediately). It effectively means that any
such change requires all pages to be re-rendered eventually. Is that acceptable?
Solution 3 seems nice and surgical, just injecting the new info into the cached
object. Is there a nice and clean way to *update* a parser cache entry like
that, without re-generating it in full? Do you see any issues with this
approach? Is it worth the trouble?
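To make option 3 a bit more concrete, here is roughly what I have in mind
(untested sketch; the 'wikibase-languagelinks' key and the
generateLanguageLinkData() helper are just placeholders, and I'm assuming we
have the WikiPage and ParserOptions at hand):

  use MediaWiki\MediaWikiServices;

  // Untested sketch of option 3: take the cached ParserOutput, attach the
  // missing data, and push the updated object back into the parser cache.
  $parserCache = MediaWikiServices::getInstance()->getParserCache();
  $parserOutput = $parserCache->get( $wikiPage, $parserOptions );

  if ( $parserOutput && $parserOutput->getExtensionData( 'wikibase-languagelinks' ) === null ) {
      // Generate the missing info (this may hit the database once).
      $links = $this->generateLanguageLinkData( $wikiPage->getTitle() );

      // Attach it and write the updated ParserOutput back to the cache.
      $parserOutput->setExtensionData( 'wikibase-languagelinks', $links );
      $parserCache->save( $parserOutput, $wikiPage, $parserOptions );
  }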
Any input would be great!
Thanks,
daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi,
The JSON-only dumps are at http://dumps.wikimedia.org/other/wikidata/. Is
this a preliminary location?
I am asking since this does not follow the directory structure
conventions of the rest of the WMF dump site, which is normally
something like
http://some-base-url/projectname/filename
The projectname here would be "wikidatawiki", but even
http://dumps.wikimedia.org/other/wikidatawiki/ would obviously not be a
good location if more projects were to export JSON.
For example, where would Commons put its JSON in the future?
Cheers,
Markus
WikibaseLib is a horrible kitchen sink, and I don't want to add more to the
mess. So I want to put the usage tracking code into sensible packages. However,
I'm a bit at a loss as to how to best split the different responsibilities into
packages. Here are some of the communication needs we have, implying which code
needs to be shared between repo and client:
The client needs to:
* load entity data
** need to share entity storage code
** but should not know about EntityContent
** and should have no write access
* look up properties by label, and look up labels of items
** need to share term storage code
** no need for write access
** no need for code for constraints checks, etc.
** should not have related maintenance scripts or schema update code
* look up data types for properties
** need to share property info storage code
** no need for write access
** should not have related maintenance scripts or schema update code
* load change details
** need to share change table storage code and value objects
** no need for write access
** no need for dispatching logic
** also should not have schema update code
* look up sitelinks by page title
** need to share link table storage code
** no need for write access
** should not have related maintenance scripts or schema update code
* update notification subscriptions
** need to share subscription storage code
** should not have related maintenance scripts or schema update code
So, there are 6 things the client and the repo both need to access. But the
write logic, or at least the maintenance logic, should not be bundled with the
leaner "read only" package. So I see 12 new packages... dependency hell.
So, what to do? Have 6 read-only packages, and stuff the maintenance logic into
a single package not used by the client? Also ugly.
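To illustrate what I mean by a lean "read only" package versus a repo-only
package, here is a tiny sketch (the interface and method names are made up,
just to show the split; the real names would of course differ):

  // Shared, client-safe package: read access only.
  interface LabelTermLookup {

      /**
       * Looks up the entities that have the given label.
       *
       * @param string $label
       * @param string $languageCode
       *
       * @return EntityId[]
       */
      public function getEntityIdsForLabel( $label, $languageCode );

  }

  // Repo-only package: write access; the related maintenance scripts
  // and schema update code would live next to this.
  interface LabelTermStore extends LabelTermLookup {

      public function saveTermsOfEntity( EntityDocument $entity );

      public function deleteTermsOfEntity( EntityId $entityId );

  }

The client would then only ever depend on the package that contains
LabelTermLookup (plus an implementation), while the repo pulls in the full
store.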
Ideas?
-- daniel
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Hi,
Some questions on the new dump options. I noticed that the XML dump
files declare exactly the same content model and format for the new data
model as they did for the old one. This is not so great, since it greatly
reduces the utility of the <model> information if the same model is used for
incompatible content. I am now trying to find a way to write code that
supports both old and new dumps. Hence my questions:
(1) The most recent full dump that is available contains the old format.
The most recent current dump that is available contains the new format.
Is it possible that a single dump contains both formats?
(2a) If the answer to (1) is no: what are/will be the first (or last)
full/current/daily dump files that use the new format?
(2b) If the answer to (1) is yes: what is the revision number at which
the change was made (i.e., what is the largest revision number that is
still in the old format)?
Many thanks,
Markus
Hi,
The new JSON dump format uses JSON maps for aliases (and many other
things). A JSON map is a thing of the form { ... }. However, if an item
has no aliases, the JSON dump sets its value to the empty *list* [].
This is bad since it makes object-model-based JSON parsers, such as
Jackson, trip up: the parser must know what type the parsed output should
have, and [] cannot be converted into a map.
It is possible that the problem occurs with other fields as well
(labels, descriptions, etc.).
There are two possible fixes:
* Correctly export empty maps as {}.
* Do not export empty maps at all (leave away the key).
The second is what happens in the API. It would be nice if the JSON
exported by the API were the same as the JSON exported in the dumps.
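For what it's worth, my guess is that this happens because PHP does not
distinguish an empty map from an empty list, so json_encode() falls back to
[]. A quick illustration of the issue and of the first fix (I have not looked
at the actual serialization code, so this is only a guess):

  // PHP cannot tell an empty map from an empty list, so an empty
  // (associative) array serializes as []:
  echo json_encode( [ 'en' => [ 'Foo', 'Bar' ] ] );  // {"en":["Foo","Bar"]}
  echo json_encode( [] );                            // []
  // Fix (1) amounts to emitting an explicit (empty) object instead:
  echo json_encode( new \stdClass() );               // {}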
If this is news to you, I can also create a bug report.
Cheers,
Markus