Hi Everyone,
Below are some chat logs between Daniel and me.
<hoo> aude: Around?
<DanielK_WMDE__> hoo: aude is having dinner with the multimedia team
<hoo> I feared that :S
<hoo> DanielK_WMDE__: Did you talk about the Q183 issues today?
* hoo came home late today...
<DanielK_WMDE__> no.
<DanielK_WMDE__> hoo: actually - Thiemo was doing some benchmarking earlier, together with aude. Might have been that
<hoo> I see... but the issue is (sadly) complexer
<DanielK_WMDE__> regular work is pretty much zero right now though. we have been in sessions with the multimedia folks all day
<hoo> some revisions of that item segfault, some fatal and some throw an exception
<DanielK_WMDE__> bah
<DanielK_WMDE__> this happens since the switch to DataModel 1.0, right?
<hoo> That's a good question, actually
<DanielK_WMDE__> sigh
<hoo> not sure why shit hit the fan exactly now and not earlier
<DanielK_WMDE__> JeroenDeDauw: care to look into that?
<DanielK_WMDE__> hoo: can you collect your findings somewhere, and mail a link to wikidata-tech?
<hoo> DanielK_WMDE__: Well, everything is bugzillad
<hoo> just search for Q183
<hoo> might be that it's spread across various products, though (Wikimedia and Wikidata repo)
<hoo> This problem has so many parts :S
<hoo> But the one that old revisions sometimes can't be viewed/ unserialized(?) is the most minor one IMO
* James_F|Away is now known as James_F
<DanielK_WMDE__> but it seems like all of them are related to the change in the DataModel and/or the serialization format
<DanielK_WMDE__> hoo: my feelign is that we should (partially) go back to deferred unstubbing: have stub implementations of StatementList, etc, that would only instantiate the full structure when needed.
<DanielK_WMDE__> JeroenDeDauw: what do you think?
<hoo> Probably, yes
[...]
I'm posting them here to help coordinate the efforts to solve this. So if you investigate anything, have any findings, or are working on something, please let the others (and especially me) know.
Right now Tim is trying to help with the segfaults, but that's only one of the problems we see here.
Also:
<TimStarling> "value":"\u0b9a\u0bc6\u0bb0\u0bc1\u0bae\u0ba9\u0bbf"
<TimStarling> very efficient encoding there
<TimStarling> well, using UTF-8 only reduces it from 736163 to 721591
<ori> it's JSON, though the JSON specification doesn't require that you encode code points outside the ASCII range; it simply allows it
<ori> http://php.net/manual/en/json.constants.php#constant.json-unescaped-unicode
<ori> I think FormatJson::encode() supports a similar option, and it's compatible with PHP 5.3
<TimStarling> yes, that's what I just tried, reduces oldid 158433886 from 736163 to 721591 bytes
<TimStarling> I haven't reproduced that crash in eval.php yet
<ori> hoo: FormatJson::encode( $value, /* $pretty = */ false, FormatJson::UTF8_OK );
I think we should do this, if all major JSON implementations support it (which I guess they do). But, as Tim points out, this is not going to help very much on its own.
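For illustration, a minimal sketch of the size difference (plain PHP; note that JSON_UNESCAPED_UNICODE needs PHP 5.4+, which is why ori points to FormatJson::UTF8_OK for PHP 5.3 compatibility):

<?php
// The Tamil label Tim quoted ("Germany"), escaped vs. raw UTF-8.
$value = array( 'value' => 'செருமனி' );

$escaped   = json_encode( $value );                          // "\u0b9a\u0bc6..." (6 bytes per code point)
$unescaped = json_encode( $value, JSON_UNESCAPED_UNICODE );  // raw UTF-8 (3 bytes per code point here)

printf( "escaped: %d bytes, unescaped: %d bytes\n", strlen( $escaped ), strlen( $unescaped ) );

// MediaWiki equivalent, as suggested in the chat:
// FormatJson::encode( $value, /* $pretty = */ false, FormatJson::UTF8_OK );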
Cheers,
Marius
Hey,
What leads people to think this is related to the recent serialization format changes? And are there any more concrete suspicions on where exactly the problem occurs?
I'm not aware of any code in DataModel or the serialization components being unacceptably slow. There are various places where improvements could be made, though without data indicating that tackling such an instance would help us, this is really shooting in the dark. My suspicion is that the problem lies with the code that uses these components. For instance, there might be some code looping over a bunch of serializations to get a single sitelink, using the regular entity deserializer for each item. That would be terribly inefficient. The issue would then be that the code uses the regular entity deserializer rather than a suitable dedicated approach, not that the general entity deserializer is not tailored for this particular use case. Again, actual data on the problem, rather than suspicions, is needed to solve the issue.
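To make that suspicion concrete, here is a rough sketch of the two approaches. $serializations (one JSON string per item) and $entityDeserializer are placeholders, and the DataModel method calls are illustrative, not actual Wikibase code:

<?php
// Wasteful: build a full DataModel object graph per item just to read one field.
function getPageNamesSlow( array $serializations, $entityDeserializer, $siteId ) {
	$titles = array();
	foreach ( $serializations as $json ) {
		$item = $entityDeserializer->deserialize( json_decode( $json, true ) );
		$titles[] = $item->getSiteLinkList()->getBySiteId( $siteId )->getPageName();
	}
	return $titles;
}

// Dedicated: pick the one value straight out of the decoded array.
function getPageNamesFast( array $serializations, $siteId ) {
	$titles = array();
	foreach ( $serializations as $json ) {
		$data = json_decode( $json, true );
		if ( isset( $data['sitelinks'][$siteId]['title'] ) ) {
			$titles[] = $data['sitelinks'][$siteId]['title'];
		}
	}
	return $titles;
}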
Thiemo was doing some benchmarking earlier, together with aude
And the results of this were?
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Hey,
I got some data to share. Walking through the dump for the first 1000 entities (~19 MB) took 0.008 seconds per item, where in each step the following things were done:
* read line from file
* json_decode the line
* use the EntityDeserializer to turn the array into DataModel objects
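For reference, a minimal sketch of a measurement loop along these lines, assuming a line-per-entity JSON dump at a hypothetical path and a hypothetical $entityDeserializer; this only illustrates the three steps above, it is not the actual benchmark script:

<?php
// Sketch only: '/tmp/wikidata-dump.json' and $entityDeserializer are placeholders.
$handle = fopen( '/tmp/wikidata-dump.json', 'r' );
$count = 0;
$start = microtime( true );

while ( $count < 1000 && ( $line = fgets( $handle ) ) !== false ) {   // step 1: read line
	$json = rtrim( trim( $line ), ',' );
	if ( $json === '[' || $json === ']' || $json === '' ) {
		continue;                                                     // skip array wrapper lines
	}
	$data = json_decode( $json, true );                               // step 2: JSON -> array
	$entity = $entityDeserializer->deserialize( $data );              // step 3: array -> DataModel objects
	$count++;
}

fclose( $handle );
printf( "%d entities, %.4f seconds per item\n", $count, ( microtime( true ) - $start ) / $count );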
Given that these entities are on average a lot bigger than the typical one found in Wikidata, it looks like the average deserialization time is a few milliseconds. So now I really wonder why people are blaming DataModel 1.0. Everything seems to indicate most time is spent in Wikibase.git and MediaWiki.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size. On Oct 7, 2014 6:47 AM, "Jeroen De Dauw" jeroendedauw@gmail.com wrote:
Hey,
I got some data to share. Walking through the dump for the first 1000 entities (~19 MB) took 0.008 seconds per item, where in each step the following things were done:
- read line from file
- json_decode the line
- use the EntityDeserializer to turn the array into DataModel objects
Given that these entities are on average a lot bigger than the typical one found in Wikidata, it looks like the average deserialization time is a few milliseconds. So now I really wonder why people are blaming DataModel 1.0. Everything seems to indicate most time is spent in Wikibase.git and MediaWiki.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
It appears we are also hitting an uncaught exception when trying to view Q183.
We have narrowed down where this occurs. For most items, it does not occur. It could be that, due to the size of the item, we are hitting some PHP bug or issue.
details: https://bugzilla.wikimedia.org/show_bug.cgi?id=71519#c24
Cheers, Katie
On Tue, Oct 7, 2014 at 4:23 PM, Denny Vrandečić vrandecic@gmail.com wrote:
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size. On Oct 7, 2014 6:47 AM, "Jeroen De Dauw" jeroendedauw@gmail.com wrote:
Hey,
I got some data to share. Walking through the dump for the first 1000 entities (~19 MB) took 0.008 seconds per item, where in each step the following things were done:
- read line from file
- json_decode the line
- use the EntityDeserializer to turn the array into DataModel objects
Given that these entities are on average a lot bigger than the typical one found in Wikidata, it looks like the average deserialization time is a few milliseconds. So now I really wonder why people are blaming DataModel 1.0. Everything seems to indicate most time is spent in Wikibase.git and MediaWiki.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Hey,
Thank you for making the measurements. Can you estimate the time for item
Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size.
Good point - I did not realize the outliers are that big. Q183 takes ~415ms, which is rather long: ~25ms for json_decode and ~390ms for the array -> objects conversion. In itself that is not a problem, though perhaps something to look at after we have fixed the critical performance issues. This also illustrates that one should be careful not to fully deserialize entities when that is not needed, and that fully deserializing a collection of entities in one request is something to avoid.
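For anyone reproducing these numbers, a minimal sketch of timing the two phases separately; $json stands for the serialization of one revision of Q183 and $entityDeserializer is a placeholder for the deserializer being measured:

<?php
// Placeholders: $json is the raw entity JSON, $entityDeserializer the code under test.
$t0 = microtime( true );
$data = json_decode( $json, true );                   // phase 1: JSON -> array
$t1 = microtime( true );
$entity = $entityDeserializer->deserialize( $data );  // phase 2: array -> DataModel objects
$t2 = microtime( true );

printf( "json_decode: %.1f ms, deserialize: %.1f ms\n",
	( $t1 - $t0 ) * 1000, ( $t2 - $t1 ) * 1000 );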
Do we have code that falls in the latter category? Even if we do only partial deserialization, this is still going to be too costly for an action done dozens of times during a request. We should also not simply assume this is the case now and stop looking for what the critical issues are.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
Btw, when doing such performance measures, it would be great to get some memory statistics from PHP as well. From my past as an SMW developer, I remember seeing incredible memory footprints of apparently simple PHP objects. OoM would be one of the most common causes for blank pages, much more common than timeouts, and even a single object in PHP can take up huge amounts of memory.
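As a concrete starting point, here is a minimal sketch of how such memory numbers could be collected around the deserialization step; $data and $entityDeserializer are placeholders for the decoded array and whichever deserializer is being measured:

<?php
// Placeholders: $data is a decoded entity array, $entityDeserializer the code under test.
$memBefore  = memory_get_usage( true );
$peakBefore = memory_get_peak_usage( true );

$entity = $entityDeserializer->deserialize( $data );

$memAfter  = memory_get_usage( true );
$peakAfter = memory_get_peak_usage( true );

printf(
	"retained: %.1f MB, extra peak during deserialization: %.1f MB\n",
	( $memAfter - $memBefore ) / 1048576,
	( $peakAfter - $peakBefore ) / 1048576
);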
Markus
On 07.10.2014 23:44, Jeroen De Dauw wrote:
Hey,
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size.
Good point - I did not realize the outliers are that big. Q183 takes ~415ms, which is rather long: ~25ms for json_decode and ~390ms for the array -> objects conversion. In itself that is not a problem, though perhaps something to look at after we have fixed the critical performance issues. This also illustrates that one should be careful not to fully deserialize entities when that is not needed, and that fully deserializing a collection of entities in one request is something to avoid.
Do we have code that falls in the latter category? Even if we do only partial deserialization, this is still going to be too costly for an action done dozens of times during a request. We should also not simply assume this is the case now and stop looking for what the critical issues are.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3
I'm also more worried about memory consumption than speed.
By far the biggest performance issue regarding speed is the fact that we load entire entities just to look up a single label. This has been known for a while.
But with the new data model, we no longer do deferred unstubbing. Everything is unserialized right away, always, even if, in the end, all we need is a single label of the entity. That's especially bad if there are a lot of referenced entities, of course.
On top of that, PHP seems to "sometimes" get confused when memory is running low. This seems "somehow" connected to ArrayObject. These effects are hard to reproduce, though; we are not sure what exactly is going on.
In any case, we should try to be less wasteful with memory. Having a stub implementation for StatementList would already help a lot. I'll be working on removing the need to load so many entities in the first place (we already had TermsLookup in the sprint, but didn't get around to working on it - partially due to the problems on the live site).
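To illustrate the deferred-unstubbing idea, here is a rough sketch of what a stub StatementList could look like; the class and method names are hypothetical and only meant to show the pattern of keeping the raw serialization around until the statements are actually accessed:

<?php
// Hypothetical sketch only: holds the raw statements serialization and defers
// the expensive array -> object conversion until the statements are accessed.
class StubStatementList {

	private $serialization;
	private $statementListDeserializer;
	private $statements = null;

	public function __construct( array $serialization, $statementListDeserializer ) {
		$this->serialization = $serialization;
		$this->statementListDeserializer = $statementListDeserializer;
	}

	public function isEmpty() {
		// Cheap question that can be answered without unstubbing.
		return $this->serialization === array();
	}

	public function toStatementList() {
		if ( $this->statements === null ) {
			// First real access: instantiate the full object structure.
			$this->statements = $this->statementListDeserializer->deserialize( $this->serialization );
		}
		return $this->statements;
	}
}

A real implementation would of course have to expose the same interface as StatementList itself; the sketch only shows the lazy-instantiation pattern.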
On 08.10.2014 09:43, Markus Krötzsch wrote:
Btw, when doing such performance measures, it would be great to get some memory statistics from PHP as well. From my past as an SMW developer, I remember seeing incredible memory footprints of apparently simple PHP objects. OoM would be one of the most common causes for blank pages, much more common than timeouts, and even a single object in PHP can take up huge amounts of memory.
Markus
On 07.10.2014 23:44, Jeroen De Dauw wrote:
Hey,
Thank you for making the measurements. Can you estimate the time for item Q183 specifically? Since it is 1000 entities weighing 19 MB, this means that on average the entities were 19 KB. Germany on the other hand is much larger, and it makes me wonder how it scales to that size.
Good point - I did not realize the outliers are that big. Q183 takes ~415ms, which is rather long: ~25ms for json_decode and ~390ms for the array -> objects conversion. In itself that is not a problem, though perhaps something to look at after we have fixed the critical performance issues. This also illustrates that one should be careful not to fully deserialize entities when that is not needed, and that fully deserializing a collection of entities in one request is something to avoid.
Do we have code that falls in the latter category? Even if we do only partial deserialization, this is still going to be too costly for an action done dozens of times during a request. We should also not simply assume this is the case now and stop looking for what the critical issues are.
Cheers
-- Jeroen De Dauw - http://www.bn2vs.com Software craftsmanship advocate Evil software architect at Wikimedia Germany ~=[,,_,,]:3