I usually use bash or python scrips to query data from multiple wikis at once, but I wanted something "simple" in Quarry such as a list of users who are active in two wikis, and I felt very stupid.
I ended up using two separate queries, a subquery and a mix of GROUP_CONCAT, TO_BASE64 and SET SESSION group_concat_max_len (to make a sort of temporary table in a pseudo-array string): https://quarry.wmflabs.org/query/23956
This was reasonably fast, but is there a better way?
Federico
SELECT ... FROM `<database>`.`<table>`, like in https://quarry.wmflabs.org/query/24212. This should work in the foreseeable future, during which all the replica databases are accessible on the same server.
Zhuyifei1999
2018-01-17 14:11 GMT-06:00 Federico Leva (Nemo) nemowiki@gmail.com:
I usually use bash or python scrips to query data from multiple wikis at once, but I wanted something "simple" in Quarry such as a list of users who are active in two wikis, and I felt very stupid.
I ended up using two separate queries, a subquery and a mix of GROUP_CONCAT, TO_BASE64 and SET SESSION group_concat_max_len (to make a sort of temporary table in a pseudo-array string): https://quarry.wmflabs.org/query/23956
This was reasonably fast, but is there a better way?
Federico
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
On Wed, Jan 17, 2018 at 5:50 PM, YiFei zhuyifei1999@gmail.com wrote:
SELECT ... FROM `<database>`.`<table>`, like in https://quarry.wmflabs.org/query/24212. This should work in the foreseeable future, during which all the replica databases are accessible on the same server.
This is bad advice. Just last week[1] one of our DBAs warned this list that that ability is not guaranteed or supported, and changing that is indeed a possibility.
[1]: https://lists.wikimedia.org/pipermail/cloud/2018-January/000169.html
Well, you are free to suggest a better advice. In the case of shards splitting into different servers, Quarry will have to implement a server / database selector [1], and processing data from two databases in a single query may not be possible at all, like currently in the case of tools-db and replicas, where all inter-host joins must happen in application space.
In any case, the question is how to do cross-wiki joins properly instead of emulating them with temporary variables, which what I have suggested may be how one would do it properly, when it works at least //in the foreseeable future//. (Yes, changing is a possibility but it’s not currently foreseeable whether or when it will happen.)
[1] https://phabricator.wikimedia.org/T76466
Zhuyifei1999
On Thu, Jan 18, 2018 at 9:39 AM Brad Jorsch (Anomie) bjorsch@wikimedia.org wrote:
On Wed, Jan 17, 2018 at 5:50 PM, YiFei zhuyifei1999@gmail.com wrote:
SELECT ... FROM `<database>`.`<table>`, like in https://quarry.wmflabs.org/query/24212. This should work in the foreseeable future, during which all the replica databases are accessible on the same server.
This is bad advice. Just last week[1] one of our DBAs warned this list that that ability is not guaranteed or supported, and changing that is indeed a possibility.
-- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation _______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
A lot of things were not really foreseeable, like not being able to create user tables on replica servers. This killed projects. You better heed the comment of "this should not be done". Otherwise you'll end up in the same spot as me. It was unforeseeable that that ability would be taken away, many people used it (it was the only way to get the desired results), yet "we shouldn't have done it"... Don't fall into the same trap.
On Thu, Jan 18, 2018 at 10:28 AM YiFei zhuyifei1999@gmail.com wrote:
Well, you are free to suggest a better advice. In the case of shards splitting into different servers, Quarry will have to implement a server / database selector [1], and processing data from two databases in a single query may not be possible at all, like currently in the case of tools-db and replicas, where all inter-host joins must happen in application space.
In any case, the question is how to do cross-wiki joins properly instead of emulating them with temporary variables, which what I have suggested may be how one would do it properly, when it works at least //in the foreseeable future//. (Yes, changing is a possibility but it’s not currently foreseeable whether or when it will happen.)
[1] https://phabricator.wikimedia.org/T76466
Zhuyifei1999
On Thu, Jan 18, 2018 at 9:39 AM Brad Jorsch (Anomie) < bjorsch@wikimedia.org> wrote:
On Wed, Jan 17, 2018 at 5:50 PM, YiFei zhuyifei1999@gmail.com wrote:
SELECT ... FROM `<database>`.`<table>`, like in https://quarry.wmflabs.org/query/24212. This should work in the foreseeable future, during which all the replica databases are accessible on the same server.
This is bad advice. Just last week[1] one of our DBAs warned this list that that ability is not guaranteed or supported, and changing that is indeed a possibility.
-- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation _______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
Yeah, when it's "the only way you can get the results", you may have no option but to use it. "This should not be done" is just a //should//, and applies only if you have another better option. Here, you don't.
When someone asks how, don't answer how not; it will not solve the problem. A solution is a solution, and if you can't find a better method for someone else, don't say not to do so without a very good reason (such as a foreseeable breakage), because people want to solve problems rather than not staying in a puddle of more problems. Do propose a different path if you oppose one [1].
[1] https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team/Team_Social_Nor...
Zhuyifei1999
2018-01-18 12:16 GMT-06:00 Daniel Schwen lists@schwen.de:
A lot of things were not really foreseeable, like not being able to create user tables on replica servers. This killed projects. You better heed the comment of "this should not be done". Otherwise you'll end up in the same spot as me. It was unforeseeable that that ability would be taken away, many people used it (it was the only way to get the desired results), yet "we shouldn't have done it"... Don't fall into the same trap.
On Thu, Jan 18, 2018 at 10:28 AM YiFei zhuyifei1999@gmail.com wrote:
Well, you are free to suggest a better advice. In the case of shards splitting into different servers, Quarry will have to implement a server / database selector [1], and processing data from two databases in a single query may not be possible at all, like currently in the case of tools-db and replicas, where all inter-host joins must happen in application space.
In any case, the question is how to do cross-wiki joins properly instead of emulating them with temporary variables, which what I have suggested may be how one would do it properly, when it works at least //in the foreseeable future//. (Yes, changing is a possibility but it’s not currently foreseeable whether or when it will happen.)
[1] https://phabricator.wikimedia.org/T76466
Zhuyifei1999
On Thu, Jan 18, 2018 at 9:39 AM Brad Jorsch (Anomie) < bjorsch@wikimedia.org> wrote:
On Wed, Jan 17, 2018 at 5:50 PM, YiFei zhuyifei1999@gmail.com wrote:
SELECT ... FROM `<database>`.`<table>`, like in https://quarry.wmflabs.org/query/24212. This should work in the foreseeable future, during which all the replica databases are accessible on the same server.
This is bad advice. Just last week[1] one of our DBAs warned this list that that ability is not guaranteed or supported, and changing that is indeed a possibility.
000169.html
-- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation _______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud
On Thu, Jan 18, 2018 at 12:28 PM, YiFei zhuyifei1999@gmail.com wrote:
when it works at least //in the foreseeable future//. (Yes, changing is a possibility but it’s not currently foreseeable whether or when it will happen.)
Not being able to create user tables on the replicas wasn't foreseeable. It was possible, and then it was decided that that couldn't be supported anymore in combination with other reliability features that were desired.
Not being able to select any wiki's database from any shard is foreseeable, as in our DBAs have //already foreseen// that it's a likely enough possibility to tell us not to count on it remaining available.