[Re-posting with fixed links. Thanks for pointing this out Cormac!]
This is the weekly TechCom board review. Remember that there is no meeting on Wednesday; any discussion should happen via email. For individual RFCs, please keep discussion to the Phabricator tickets.
Activity since Monday 2020-10-26 on the following boards:
https://phabricator.wikimedia.org/tag/techcom/
https://phabricator.wikimedia.org/tag/techcom-rfc/
Committee board activity:
* T175745 https://phabricator.wikimedia.org/T175745 *"overwrite edits when conflicting with self"* has once again come up while working on EditPage. There no longer seems to be any reason for this behavior. I think it does more harm than good. We should just remove it.
RFCs:
Phase progression:
* T266866 https://phabricator.wikimedia.org/T266866 *"Bump basic supported browsers (grade C) to require TLS 1.2"*: newly filed, lively discussion. Phase 1 for now.
* T263841 https://phabricator.wikimedia.org/T263841 *"Expand API title generator to support other generated data"*: dropped back to phase 2 because resourcing is unclear.
* T262946 https://phabricator.wikimedia.org/T262946 *"Bump Firefox version in basic support to 3.6 or newer"*: last call ending on Wednesday, November 4. Some comments, no objections.
Other RFC activity:
* T250406 https://phabricator.wikimedia.org/T250406 *"Hybrid extension management"*: Asked for clarification of expectations for WMF to publish extensions to Packagist. Resourcing is being discussed in the platform team.
Cheers, Daniel
On 02.11.20 at 19:24, Daniel Kinzler wrote:
T262946 https://phabricator.wikimedia.org/T262946 *"Bump Firefox version in basic support to 3.6 or newer"*: last call ending on Wednesday, November 4. Some comments, no objections.
Since we are not having a meeting on Wednesday, I guess we should try and get quorum to approve by mail.
I'm in favor.
On Tue, Nov 3, 2020 at 4:38 AM Daniel Kinzler dkinzler@wikimedia.org wrote:
On 02.11.20 at 19:24, Daniel Kinzler wrote:
T262946 https://phabricator.wikimedia.org/T262946 *"Bump Firefox version in basic support to 3.6 or newer"*: last call ending on Wednesday, November 4. Some comments, no objections.
Since we are not having a meeting on Wednesday, I guess we should try and get quorum to approve by mail.
I'm in favor.
+1
On Thu, 5 Nov 2020 at 18:35, Dan Andreescu dandreescu@wikimedia.org wrote:
On Tue, Nov 3, 2020 at 4:38 AM Daniel Kinzler dkinzler@wikimedia.org wrote:
On 02.11.20 at 19:24, Daniel Kinzler wrote:
T262946 https://phabricator.wikimedia.org/T262946 *"Bump Firefox version in basic support to 3.6 or newer"*: last call ending on Wednesday, November 4. Some comments, no objections.
Since we are not having a meeting on Wednesday, I guess we should try and get quorum to approve by mail.
I'm in favor.
+1
LGTM ×3.
-- Timo
On 02.11.20 at 19:24, Daniel Kinzler wrote:
[Re-posting with fixed links. Thanks for pointing this out Cormac!]
This is the weekly TechCom board review. Remember that there is no meeting on Wednesday; any discussion should happen via email. For individual RFCs, please keep discussion to the Phabricator tickets.
That's another issue I wanted to raise: Platform Engineering is working on switching ParserCache to JSON. For that, we have to make sure extensions only put JSON-serializable data into ParserOutput objects, via setProperty() and setExtensionData(). We are currently trying to figure out how to best do that for TemplateData.
TemplateData already uses JSON serialization, but then compresses the JSON output, to make the data fit into the page_props table. This results in binary data in ParserOutput, which we can't directly put into JSON. There are several solutions under discussion, e.g.:
* Don't write the data to page_props, treat it as extension data in ParserOutput. Compression would become unnecessary. However, batch loading of the data becomes much slower, since each ParserOutput needs to be loaded from ParserCache. Would it be too slow?
* Apply compression for page_props, but not for the data in ParserOutput. We would have to introduce some kind of serialization mechanism into PageProps and LinksUpdate. Do we want to encourage this use of page_props?
* Introduce a dedicated database table for templatedata. Cleaner, but schema changes and data migration take a long time.
* Put templatedata into the BlobStore, and just the address into page_props. Makes loading slower, maybe even slower than the solution that relies on ParserCache.
* Convert TemplateData to MCR. This is the cleanest solution, but would require us to create an editing interface for templatedata, and to migrate existing data out of wikitext. This is a long-term perspective.
To unblock migration of ParserCache to JSON, we need at least a temporary solution that can be implemented quickly. A somewhat hacky solution I can see is:
* detect binary page properties and apply base64 encoding to them when serializing ParserOutput to JSON. This is possible because page properties can only be scalar values. So we can convert to something like { _encoding_: "base64", data: "34c892ur3d40" } and recognize that structure when decoding. This wouldn't work for data set with setTemplateData, since that could already be an arbitrary structure.
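To make the envelope idea concrete, here is a minimal PHP sketch (the helper names are hypothetical, not the actual ParserOutput serialization code):

    // Wrap binary page property values in a small envelope so the overall
    // structure stays JSON-serializable; unwrap it again when deserializing.
    function encodePagePropertyValue( $value ) {
        if ( is_string( $value ) && !mb_check_encoding( $value, 'UTF-8' ) ) {
            return [ '_encoding_' => 'base64', 'data' => base64_encode( $value ) ];
        }
        return $value;
    }

    function decodePagePropertyValue( $value ) {
        if ( is_array( $value ) && ( $value['_encoding_'] ?? null ) === 'base64' ) {
            return base64_decode( $value['data'] );
        }
        return $value;
    }

Since page properties can only ever be scalars today, an array with an _encoding_ key can be recognized unambiguously on the way back in.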
I don't know enough about the parser cache to give Daniel good advice on his question:
That's another issue I wanted to raise: Platform Engineering is working on switching ParserCache to JSON. For that, we have to make sure extensions only put JSON-serializable data into ParserOutput objects, via setProperty() and setExtensionData(). We are currently trying to figure out how to best do that for TemplateData.
TemplateData already uses JSON serialization, but then compresses the JSON output, to make the data fit into the page_props table. This results in binary data in ParserOutput, which we can't directly put into JSON. There are several solutions under discussion, e.g.: [...(see Daniel's original message for the list of ideas or propose your own)...]
But I see some people hiding in the back who might have some good ideas :) This is just a bump to invite them to respond.
I saw in the patch for https://phabricator.wikimedia.org/T266200 a strategy was devised to base64-encode page prop values that aren't strictly UTF-8. If I understand correctly, this means TemplateData extension code and page props interfaces require no change while the JSONification of Parser Cache output proceeds. Is that right? It's a clever solution.
Now, one thing I've been wondering about: might there be ways to query the database component of Parser Cache with relatively fresh results at the command line without deployer rights? And will it be possible, if not encouraged, to drop stringified JSON into the Parser Cache values?
The page_props table tends to be useful for content analysis for UX interventions, and part of its usefulness has stemmed from being able to do simple MySQL queries (when the payload is encoded as JSON, and even if it were compress()'d, it can still be trivial to use MySQL's JSON built-ins). The more, shall we say, creative uses of page props aren't great for scaling, I'm told, but I'm wondering: how can we get some of the capabilities of querying derived data via another straightforward SQL mechanism, on a replicated persistence store off the serving code path?
I hope those questions made sense! Maybe something exists already in Hadoop or the replicas, but I couldn't quite figure it out. I do look forward to other application-layer and firehose mechanisms in the works from different teams, although I am most interested right now in the content analysis use case for some of our forthcoming Wikifunctions / Wikilambda and Abstract Wikipedia work.
Thanks! -Adam
On Fri, Nov 6, 2020 at 3:24 PM Dan Andreescu dandreescu@wikimedia.org wrote:
I don't know enough about the parser cache to give Daniel good advice on his question:
That's another issue I wanted to raise: Platform Engineering is working on switching ParserCache to JSON. For that, we have to make sure extensions only put JSON-serializable data into ParserOutput objects, via setProperty() and setExtensionData(). We are currently trying to figure out how to best do that for TemplateData.
TemplateData already uses JSON serialization, but then compresses the JSON output, to make the data fit into the page_props table. This results in binary data in ParserOutput, which we can't directly put into JSON. There are several solutions under discussion, e.g.: [...(see Daniel's original message for the list of ideas or propose your own)...]
But I see some people hiding in the back who might have some good ideas :) This is just a bump to invite them to respond.
Maybe something exists already in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck. In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading and it's very easy https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505 to add a table. We're looking at just replicating the whole db on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
Dan Andreescu dandreescu@wikimedia.org wrote:
Maybe something exists already in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck. In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading and it's very easy https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505 to add a table. We're looking at just replicating the whole db on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
Yes, I like that and all of the other wmf_raw goodies! I'll follow up off thread on accessing the parser cache DBs (they're in site.pp and db-eqiad.php, but I don't think those are presently represented by refinery.util as they're not in .dblist files).
On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler dkinzler@wikimedia.org wrote:
TemplateData already uses JSON serialization, but then compresses the JSON output, to make the data fit into the page_props table. This results in binary data in ParserOutput, which we can't directly put into JSON.
I'm not sure I understand the problem. Binary data can be trivially represented as JSON, by treating it as a string. Is it an issue of storage size? JSON escaping of the control characters is (assuming binary data with a somewhat random distribution of bytes) an ~50% size increase, UTF-8 encoding the top half of bytes is another 50%, so it will approximately double the length - certainly worse than the ~33% increase for base64, but not tragic. (And if size increase matters that much, you probably shouldn't be using base64 either.)
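For what it's worth, a quick back-of-the-envelope check in PHP (assuming "treating it as a string" means mapping each byte to a code point, i.e. interpreting the blob as Latin-1 before encoding):

    // Compare JSON-escaped "binary as text" against base64 for random bytes.
    // The arithmetic above predicts roughly a 2x blow-up for the JSON string
    // (UTF-8 expansion of the high bytes plus \u00XX escapes for the control
    // bytes) versus ~1.33x for base64.
    $binary = random_bytes( 100000 );
    $asText = mb_convert_encoding( $binary, 'UTF-8', 'ISO-8859-1' );
    $json = json_encode( $asText, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES );
    printf( "raw %d, json %d, base64 %d\n",
        strlen( $binary ), strlen( $json ), strlen( base64_encode( $binary ) ) );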
* Don't write the data to page_props, treat it as extension data in
ParserOutput. Compression would become unnecessary. However, batch loading of the data becomes much slower, since each ParserOutput needs to be loaded from ParserCache. Would it be too slow?
It would also mean that fetching template data or some other page property might require a parse, as parser cache entries expire. It would also mean the properties could not be searched, which I think is a dealbreaker.
* Apply compression for page_props, but not for the data in ParserOutput.
We would have to introduce some kind of serialization mechanism into PageProps and LinksUpdate. Do we want to encourage this use of page_props?
IMO we don't want to. page_props is for page *properties*, not arbitrary structured data. Also it's somewhat problematic in that it is per-page data but it represents the result of a parse, so it doesn't necessarily match the current revision, nor what a user with non-canonical parser options sees. New features should probably use MCR for structured data.
* Introduce a dedicated database table for templatedata. Cleaner, but
schema changes and data migration take a long time.
That seems like a decent solution to me, and probably the one I would pick (unless there are more extensions in a similar situation). This is secondary data, so it doesn't really need to be migrated; just make TemplateData write to the new table and fall back to the old one when reading. Creating new tables should also not be time-consuming.
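For illustration, a rough sketch of what the read path could look like under that scheme (the table and column names are made up, and the page_props fallback assumes the current gzipped JSON payload under a 'templatedata' property key):

    use Wikimedia\Rdbms\IDatabase;

    // Hypothetical: prefer a dedicated templatedata table, and fall back to
    // the legacy compressed page_props row while old rows still exist.
    function loadTemplateData( IDatabase $dbr, int $pageId ) {
        $row = $dbr->selectRow( 'templatedata', [ 'td_data' ],
            [ 'td_page' => $pageId ], __METHOD__ );
        if ( $row ) {
            return json_decode( $row->td_data, true );
        }
        $legacy = $dbr->selectRow( 'page_props', [ 'pp_value' ],
            [ 'pp_page' => $pageId, 'pp_propname' => 'templatedata' ], __METHOD__ );
        return $legacy ? json_decode( gzdecode( $legacy->pp_value ), true ) : null;
    }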
* Put templatedata into the BlobStore, and just the address into
page_props. Makes loading slower, maybe even slower than the solution that relies on ParserCache.
Doesn't BlobStore support batch loading, unlike ParserCache?
* Convert TemplateData to MCR. This is the cleanest solution, but would
require us to create an editing interface for templatedata, and migrate out existing data from wikitext. This is a long term perspective.
MCR has fairly different semantics from parser metadata. There are many ways TemplateData data can be generated for a page without having a <templatedata> tag in the wikitext (e.g. a doc subpage, or a template which generates both documentation HTML and hidden TemplateData). Switching to MCR should be thought of as a workflow adjustment for contributors, not just a data migration.
On Tue, Nov 10, 2020 at 5:50 PM Gergo Tisza gtisza@wikimedia.org wrote:
On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler dkinzler@wikimedia.org wrote:
TemplateData already uses JSON serialization, but then compresses the JSON output, to make the data fit into the page_props table. This results in binary data in ParserOutput, which we can't directly put into JSON.
I'm not sure I understand the problem. Binary data can be trivially represented as JSON, by treating it as a string. Is it an issue of storage size? JSON escaping of the control characters is (assuming binary data with a somewhat random distribution of bytes) an ~50% size increase, UTF-8 encoding the top half of bytes is another 50%, so it will approximately double the length - certainly worse than the ~33% increase for base64, but not tragic. (And if size increase matters that much, you probably shouldn't be using base64 either.)
The binary aspect here refers to the gzip output buffer. While this is represented in PHP as a string, the string is not valid UTF-8 and so cannot be encoded as JSON. Attempting to do so results in a PHP JSON error, with boolean false returned.
Condensed example: https://3v4l.org/cJttU
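In short, something along these lines:

    // gzip output starts with bytes that are not valid UTF-8, so json_encode()
    // bails out and returns false.
    $blob = gzencode( json_encode( [ 'params' => [] ] ) );
    var_dump( json_encode( $blob ) );                    // bool(false)
    var_dump( json_last_error() === JSON_ERROR_UTF8 );   // bool(true)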
*RFC: Expiring watch list entries* https://phabricator.wikimedia.org/T124752
This just missed the triage window, but it looks like this was implemented and deployed in the meantime (it was in Phase 3). I'm proposing we put this on Last Call for wider awareness, so that the team can answer any questions people might have, and so that any concerns based on reviewing the approach the team has chosen can be addressed.
-- Timo
On Mon, Nov 2, 2020 at 6:24 PM Daniel Kinzler dkinzler@wikimedia.org wrote:
[Re-posting with fixed links. Thanks for pointing this out Cormac!]
This is the weekly TechCom board review. Remember that there is no meeting on Wednesday; any discussion should happen via email. For individual RFCs, please keep discussion to the Phabricator tickets.
Activity since Monday 2020-10-26 on the following boards:
https://phabricator.wikimedia.org/tag/techcom/
https://phabricator.wikimedia.org/tag/techcom-rfc/
Committee board activity:
T175745 https://phabricator.wikimedia.org/T175745 *"overwrite edits when conflicting with self"* has once again come up while working on EditPage. There no longer seems to be any reason for this behavior. I think it does more harm than good. We should just remove it.
RFCs:
Phase progression:
- T266866 https://phabricator.wikimedia.org/T266866 *"Bump basic
supported browsers (grade C) to require TLS 1.2"*: newly filed, lively discussion. Phase 1 for now.
- T263841 https://phabricator.wikimedia.org/T263841 *"Expand API title generator to support other generated data"*: dropped back to phase 2 because resourcing is unclear.
- T262946 https://phabricator.wikimedia.org/T262946 *"Bump Firefox
version in basic support to 3.6 or newer"*: last call ending on Wednesday, November 4. Some comments, no objections.
Other RFC activity:
- T250406 https://phabricator.wikimedia.org/T250406 *"Hybrid
extension management"*: Asked for clarification of expectations for WMF to publish extensions to Packagist. Resourcing is being discussed in the platform team.
Cheers, Daniel
On Wed, Nov 4, 2020 at 12:23 AM Krinkle krinklemail@gmail.com wrote:
*RFC: Expiring watch list entries* https://phabricator.wikimedia.org/T124752
This just missed the triage window, but it looks like this was implemented and deployed in the meantime (it was in Phase 3). I'm proposing we put this on Last Call for wider awareness, so that the team can answer any questions people might have, and so that any concerns based on reviewing the approach the team has chosen can be addressed.
+1 to this as well