Hi Everyone,
Below are some chat logs between Daniel and me.
<hoo> aude: Around?
<DanielK_WMDE__> hoo: aude is having dinner with the multimedia team
<hoo> I feared that :S
<hoo> DanielK_WMDE__: Did you talk about the Q183 issues today?
* hoo came home late today...
<DanielK_WMDE__> no.
<DanielK_WMDE__> hoo: actually - Thiemo was doing some benchmarking earlier, together with aude. Might have been that
<hoo> I see... but the issue is (sadly) complexer
<DanielK_WMDE__> regular work is pretty much zero right now though. we have been in sessions with the multimedia folks all day
<hoo> some revisions of that item segfault, some fatal and some throw an exception
<DanielK_WMDE__> bah
<DanielK_WMDE__> this happens since the switch to DataModel 1.0, right?
<hoo> That's a good question, actually
<DanielK_WMDE__> sigh
<hoo> not sure why shit hit the fan exactly now and not earlier
<DanielK_WMDE__> JeroenDeDauw: care to look into that?
<DanielK_WMDE__> hoo: can you collect your findings somewhere, and mail a link to wikidata-tech?
<hoo> DanielK_WMDE__: Well, everything is bugzillad
<hoo> just search for Q183
<hoo> might be that it's spread across various products, though (Wikimedia and Wikidata repo)
<hoo> This problem has so many parts :S
<hoo> But the one that old revisions sometimes can't be viewed/ unserialized(?) is the most minor one IMO
* James_F|Away is now known as James_F
<DanielK_WMDE__> but it seems like all of them are related to the change in the DataModel and/or the serialization format
<DanielK_WMDE__> hoo: my feelign is that we should (partially) go back to deferred unstubbing: have stub implementations of StatementList, etc, that would only instantiate the full structure when needed.
<DanielK_WMDE__> JeroenDeDauw: what do you think?
<hoo> Probably, yes
[...]
I'm posting them here to help coordinate the efforts to solve this. So if you investigate anything, have any findings, or are working on something, please let the others (and especially me) know.
Right now Tim is trying to help with the segfaults, but that's only one of the problems we see here.
Also:
<TimStarling> "value":"\u0b9a\u0bc6\u0bb0\u0bc1\u0bae\u0ba9\u0bbf"
<TimStarling> very efficient encoding there
<TimStarling> well, using UTF-8 only reduces it from 736163 to 721591
<ori> it's JSON, though the JSON specification doesn't require that you encode code points outside the ASCII range; it simply allows it
<ori> http://php.net/manual/en/json.constants.php#constant.json-unescaped-unicode
<ori> I think FormatJson::encode() supports a similar option, and it's compatible with PHP 5.3
<TimStarling> yes, that's what I just tried, reduces oldid 158433886 from 736163 to 721591 bytes
<TimStarling> I haven't reproduced that crash in eval.php yet
<ori> hoo: FormatJson::encode( $value, /* $pretty = */ false, FormatJson::UTF8_OK );
I think we should do this, if all major JSON implementations support it (which I guess they do). But, as Tim points out, this is not going to help very much on its own.
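For illustration, a minimal sketch of the size difference (plain PHP; note that JSON_UNESCAPED_UNICODE needs PHP 5.4+, which is why ori points to FormatJson::UTF8_OK for PHP 5.3 compatibility):

<?php
// The Tamil label Tim quoted ("Germany"), escaped vs. raw UTF-8.
$value = array( 'value' => 'செருமனி' );

$escaped   = json_encode( $value );                          // "\u0b9a\u0bc6..." (6 bytes per code point)
$unescaped = json_encode( $value, JSON_UNESCAPED_UNICODE );  // raw UTF-8 (3 bytes per code point here)

printf( "escaped: %d bytes, unescaped: %d bytes\n", strlen( $escaped ), strlen( $unescaped ) );

// MediaWiki equivalent, as suggested in the chat:
// FormatJson::encode( $value, /* $pretty = */ false, FormatJson::UTF8_OK );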
Cheers,
Marius
Hey,
What leads people to think this is related to the recent serialization format changes? And are there any more concrete suspicions on where exactly the problem occurs?
I'm not aware of any code in DataModel or the serialization components being unacceptably slow. There are various places where improvements could be made, though without data indicating that tackling such an instance would help us, this is really shooting in the dark. My suspicion is that the problem lies with the code that uses these components. For instance, there might be some code looping over a bunch of serializations to get a single sitelink, using the regular entity deserializer for each item. That would be terribly inefficient. The issue would then be that the code uses the regular entity deserializer rather than a suitable dedicated approach, not that the general entity deserializer is not tailored for this particular use case. Again, actual data on the problem, rather than suspicions, is needed to solve the issue.
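To make that suspicion concrete, here is a rough sketch of the two approaches. $serializations (one JSON string per item) and $entityDeserializer are placeholders, and the DataModel method calls are illustrative, not actual Wikibase code:

<?php
// Wasteful: build a full DataModel object graph per item just to read one field.
function getPageNamesSlow( array $serializations, $entityDeserializer, $siteId ) {
	$titles = array();
	foreach ( $serializations as $json ) {
		$item = $entityDeserializer->deserialize( json_decode( $json, true ) );
		$titles[] = $item->getSiteLinkList()->getBySiteId( $siteId )->getPageName();
	}
	return $titles;
}

// Dedicated: pick the one value straight out of the decoded array.
function getPageNamesFast( array $serializations, $siteId ) {
	$titles = array();
	foreach ( $serializations as $json ) {
		$data = json_decode( $json, true );
		if ( isset( $data['sitelinks'][$siteId]['title'] ) ) {
			$titles[] = $data['sitelinks'][$siteId]['title'];
		}
	}
	return $titles;
}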
Thiemo was doing some benchmarking earlier, together with aude
And the results of this were?
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Hey,
I got some data to share. Walking through the dump for the first 1000 entities (~19 MB) took 0.008 seconds per item, where in each step the following things were done:
* read line from file
* json_decode the line
* use the EntityDeserializer to turn the array into DataModel objects
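For reference, a minimal sketch of a measurement loop along these lines, assuming a line-per-entity JSON dump at a hypothetical path and a hypothetical $entityDeserializer; this only illustrates the three steps above, it is not the actual benchmark script:

<?php
// Sketch only: '/tmp/wikidata-dump.json' and $entityDeserializer are placeholders.
$handle = fopen( '/tmp/wikidata-dump.json', 'r' );
$count = 0;
$start = microtime( true );

while ( $count < 1000 && ( $line = fgets( $handle ) ) !== false ) {   // step 1: read line
	$json = rtrim( trim( $line ), ',' );
	if ( $json === '[' || $json === ']' || $json === '' ) {
		continue;                                                     // skip array wrapper lines
	}
	$data = json_decode( $json, true );                               // step 2: JSON -> array
	$entity = $entityDeserializer->deserialize( $data );              // step 3: array -> DataModel objects
	$count++;
}

fclose( $handle );
printf( "%d entities, %.4f seconds per item\n", $count, ( microtime( true ) - $start ) / $count );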
Given that these entities are on average a lot bigger than the typical one found in Wikidata, it looks like the average deserialization time is a few milliseconds. So now I really wonder why people are blaming DataModel 1.0. Everything seems to indicate most time is spent in Wikibase.git and MediaWiki.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size. On Oct 7, 2014 6:47 AM, "Jeroen De Dauw" jeroendedauw@gmail.com wrote:
Hey,
I got some data to share. Walking through the dump for the first 1000 entities (~19 MB) took 0.008 seconds per item, where in each step the following things were done:
- read line from file
- json_decode the line
- use the EntityDeserializer to turn the array into DataModel objects
Given that these entities are on average a lot bigger than the typical one found in Wikidata, it looks like the average deserialization time is a few milliseconds. So now I really wonder why people are blaming DataModel 1.0. Everything seems to indicate most time is spent in Wikibase.git and MediaWiki.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
It appears we are also hitting an uncaught exception when trying to view Q183.
We have narrowed down where this occurs. For most items, it does not occur. It could be that, due to the size of the item, we are hitting some PHP bug or issue.
details: https://bugzilla.wikimedia.org/show_bug.cgi?id=71519#c24
Cheers, Katie
On Tue, Oct 7, 2014 at 4:23 PM, Denny Vrandečić vrandecic@gmail.com wrote:
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size. On Oct 7, 2014 6:47 AM, "Jeroen De Dauw" jeroendedauw@gmail.com wrote:
Hey,
I got some data to share. Walking through the dump for the first 1000 entities (~19 MB) took 0.008 seconds per item, where in each step the following things were done:
- read line from file
- json_decode the line
- use the EntityDeserializer to turn the array into DataModel objects
Given that these entities are on average a lot bigger than the typical one found in Wikidata, it looks like the average deserialization time is a few milliseconds. So now I really wonder why people are blaming DataModel 1.0. Everything seems to indicate most time is spent in Wikibase.git and MediaWiki.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Hey,
Thank you for making the measurements. Can you estimate the time for item
Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size.
Good point - I did not realize the outliers are that big. Q183 takes ~415ms, which is rather long: ~25ms for json_decode and ~390ms for the array -> objects conversion. In itself that is not a problem, though perhaps something to look at after we have fixed the critical performance issues. This also illustrates that one should be careful not to fully deserialize entities when that is not needed, and that fully deserializing a collection of entities in one request is something to avoid.
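For anyone reproducing these numbers, a minimal sketch of timing the two phases separately; $json stands for the serialization of one revision of Q183 and $entityDeserializer is a placeholder for the deserializer being measured:

<?php
// Placeholders: $json is the raw entity JSON, $entityDeserializer the code under test.
$t0 = microtime( true );
$data = json_decode( $json, true );                   // phase 1: JSON -> array
$t1 = microtime( true );
$entity = $entityDeserializer->deserialize( $data );  // phase 2: array -> DataModel objects
$t2 = microtime( true );

printf( "json_decode: %.1f ms, deserialize: %.1f ms\n",
	( $t1 - $t0 ) * 1000, ( $t2 - $t1 ) * 1000 );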
Do we have code that falls in the latter category? Even if we do only partial deserialization, this is still going to be too costly for an action done dozens of times during a request. We should also not simply assume this is the case now and stop looking for what the critical issues are.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Btw, when doing such performance measures, it would be great to get some memory statistics from PHP as well. From my past as an SMW developer, I remember seeing incredible memory footprints of apparently simple PHP objects. OoM would be one of the most common causes for blank pages, much more common than timeouts, and even a single object in PHP can take up huge amounts of memory.
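As a concrete starting point, here is a minimal sketch of how such memory numbers could be collected around the deserialization step; $data and $entityDeserializer are placeholders for the decoded array and whichever deserializer is being measured:

<?php
// Placeholders: $data is a decoded entity array, $entityDeserializer the code under test.
$memBefore  = memory_get_usage( true );
$peakBefore = memory_get_peak_usage( true );

$entity = $entityDeserializer->deserialize( $data );

$memAfter  = memory_get_usage( true );
$peakAfter = memory_get_peak_usage( true );

printf(
	"retained: %.1f MB, extra peak during deserialization: %.1f MB\n",
	( $memAfter - $memBefore ) / 1048576,
	( $peakAfter - $peakBefore ) / 1048576
);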
Markus
On 07.10.2014 23:44, Jeroen De Dauw wrote:
Hey,
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size.
Good point - I did not realize the outliers are that big. Q183 takes ~415ms, which is rather long: ~25ms for json_decode and ~390ms for the array -> objects conversion. In itself that is not a problem, though perhaps something to look at after we have fixed the critical performance issues. This also illustrates that one should be careful not to fully deserialize entities when that is not needed, and that fully deserializing a collection of entities in one request is something to avoid.
Do we have code that falls in the latter category? Even if we do only partial deserialization, this is still going to be too costly for an action done dozens of times during a request. We should also not simply assume this is the case now and stop looking for what the critical issues are.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
I'm also more worried about memory consumption than speed.
By far the biggest performance issue regarding speed is the fact that we load entire entities just to look up a single label. This has been known for a while.
But with the new data model, we no longer do deferred unstubbing. Everything is unserialized right away, always, even if, in the end, all we need is a single label of the entity. That's especially bad if there are a lot of referenced entities, of course.
On top of that, PHP seems to "sometimes" get confused when memory is running low. This seems "somehow" connected to ArrayObject. These effects are hard to reproduce, though; we are not sure what exactly is going on.
In any case, we should try to be less wasteful with memory. Having a stub implementation for StatementList would already help a lot. I'll be working on removing the need to load so many entities in the first place (we already had TermsLookup in the sprint, but didn't get around to working on it - partially due to the problems on the live site).
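To illustrate the deferred-unstubbing idea, here is a rough sketch of what a stub StatementList could look like; the class and method names are hypothetical and only meant to show the pattern of keeping the raw serialization around until the statements are actually accessed:

<?php
// Hypothetical sketch only: holds the raw statements serialization and defers
// the expensive array -> object conversion until the statements are accessed.
class StubStatementList {

	private $serialization;
	private $statementListDeserializer;
	private $statements = null;

	public function __construct( array $serialization, $statementListDeserializer ) {
		$this->serialization = $serialization;
		$this->statementListDeserializer = $statementListDeserializer;
	}

	public function isEmpty() {
		// Cheap question that can be answered without unstubbing.
		return $this->serialization === array();
	}

	public function toStatementList() {
		if ( $this->statements === null ) {
			// First real access: instantiate the full object structure.
			$this->statements = $this->statementListDeserializer->deserialize( $this->serialization );
		}
		return $this->statements;
	}
}

A real implementation would of course have to expose the same interface as StatementList itself; the sketch only shows the lazy-instantiation pattern.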
On 08.10.2014 09:43, Markus Krötzsch wrote:
Btw, when doing such performance measures, it would be great to get some memory statistics from PHP as well. From my past as an SMW developer, I remember seeing incredible memory footprints of apparently simple PHP objects. OoM would be one of the most common causes for blank pages, much more common than timeouts, and even a single object in PHP can take up huge amounts of memory.
Markus
On 07.10.2014 23:44, Jeroen De Dauw wrote:
Hey,
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size.
Good point - I did not realize the outliers are that big. Q183 takes ~415ms, which is rather long: ~25ms for json_decode and ~390ms for the array -> objects conversion. In itself that is not a problem, though perhaps something to look at after we have fixed the critical performance issues. This also illustrates that one should be careful not to fully deserialize entities when that is not needed, and that fully deserializing a collection of entities in one request is something to avoid.
Do we have code that falls in the latter category? Even if we do only partial deserialization, this is still going to be too costly for an action done dozens of times during a request. We should also not simply assume this is the case now and stop looking for what the critical issues are.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3