Is it feasible to do a log analysis of the database servers to find out which tools are
(or were) using cross-wiki joins? At a minimum, that would let all the tool owners be
contacted directly to make sure they know this is happening.
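As a rough sketch of what such a check could look like in Python, assuming the logged query text is available and that queries name replica databases in the "<dbname>_p.<table>" form (for example enwiki_p.page): the function names are made up for illustration, and queries that select a database with USE, or that use a table alias ending in "_p", would be misclassified.

```python
import re

# Assumption: replica databases appear qualified as "<dbname>_p.<table>".
DB_REF = re.compile(r"\b([a-z0-9_]+_p)\.[a-z_]+", re.IGNORECASE)


def databases_referenced(sql):
    """Return the set of "<dbname>_p" names that qualify a table in the query."""
    return {name.lower() for name in DB_REF.findall(sql)}


def is_cross_wiki(sql):
    """True when the query text references more than one replica database."""
    return len(databases_referenced(sql)) > 1
```

Running is_cross_wiki over each logged query, grouped by the account that issued it, would give a first list of tools to contact.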
On Mar 31, 2021, at 3:46 PM, Joaquin Oltra Hernandez
<jhernandez(a)wikimedia.org> wrote:
Hi Fastily, we are aware of the use case for matching commons pages/images/sha1s between
commons/big wikis and other wikis, as it has come up many times. I'm cataloging all
the comments and examples that have come up in the last 5 months in order to provide
categorized input to the parent task <https://phabricator.wikimedia.org/T215858> so
that the engineering teams can think of solutions. I'll share it publicly once it is
in a presentable state.
We did some exploration a while ago (based on Huji's examples); you can see some
notebooks with Python approaches here
<https://phabricator.wikimedia.org/T267992#6637250>. There is a lot of data, though, so
doing the same in code takes a very long time and can be impractical. If you want to give
it a try, have a look at the notebooks. I don't think the code is too memory-intensive,
especially in bd808's notebook using the API, so a Raspberry Pi could maybe handle it.
It is more complex and error-prone, for sure, so disabling those reports and waiting is
sadly the option right now, until a suitable solution for this is found.
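For illustration only (this is not the code from the notebooks), a "join" done in code could be sketched in Python as below. It reads file names from one database in batches and looks each batch up in the other, so memory use stays bounded by the batch size. The in-memory sqlite3 databases are stand-ins for two separate replica connections, and shared_file_names, batched and make_demo_db are hypothetical names. The image table and img_name column follow the MediaWiki schema.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def shared_file_names(local_conn, commons_conn, batch_size=500):
    """Return file names present in both databases, one batch at a time."""
    shared = []
    local_rows = local_conn.execute("SELECT img_name FROM image ORDER BY img_name")
    for chunk in batched((row[0] for row in local_rows), batch_size):
        # One parameterised IN (...) lookup per batch replaces the SQL join.
        placeholders = ",".join("?" * len(chunk))
        query = f"SELECT img_name FROM image WHERE img_name IN ({placeholders})"
        shared.extend(row[0] for row in commons_conn.execute(query, chunk))
    return sorted(shared)


def make_demo_db(names):
    """Build an in-memory stand-in for one wiki's image table (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO image VALUES (?)", [(n,) for n in names])
    return conn
```

With millions of rows, the cost is one round trip per batch, which is where the "very long time" comes from.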
So, to answer your question:
Is there going to be a replacement for this functionality?
I can't promise anything yet but I can assure you the teams involved in these systems
are aware of the need for this functionality and will be looking into how to provide it to
make these reports/bots/queries viable.
We will send updates or new info to the cloud lists, and you can subscribe to these tasks
if you want to follow more closely:
Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than
the MediaWiki OLTP schema <https://phabricator.wikimedia.org/T215858>
Provide mechanism to detect name clashed media between Commons and a Local project,
without needing to join tables across wiki-db's
<https://phabricator.wikimedia.org/T267992>
Provide a mechanism for detecting duplicate files in commons and a local wiki
<https://phabricator.wikimedia.org/T268240>
Provide a mechanism for detecting duplicate files in enwiki and another wikipedia
<https://phabricator.wikimedia.org/T268242>
Provide a mechanism for accessing the names of image files on Commons when querying
another wiki <https://phabricator.wikimedia.org/T268244>
On Wed, Mar 31, 2021 at 10:57 AM Fastily <fastilywp(a)gmail.com> wrote:
A little late to the party, I just learned about this change today.
I maintain a number of bot tasks <https://en.wikipedia.org/wiki/User:FastilyBot>
and database <https://fastilybot-reports.toolforge.org/> reports
<https://en.wikipedia.org/wiki/Wikipedia:Database_reports> on enwp that rely on
cross-wiki joins (mostly page title joins between enwp and Commons) to function properly.
I didn't find the migration instructions
<https://wikitech.wikimedia.org/w/index.php?title=News/Wiki_Replicas_2020_Redesign&oldid=1905818#How_do_I_cross_reference_data_between_wikis_like_I_do_with_cross_joins_today?>
very helpful; I run FastilyBot on a Raspberry Pi, and needless to say it would be grossly
impractical for me to perform a "join" in the bot's code.
Is there going to be a replacement for this functionality?
Fastily
On Mon, Mar 15, 2021 at 3:09 PM Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
[4] was made to figure out common use cases and possibilities to enable them again.
...
[4] https://phabricator.wikimedia.org/T215858
I just want to highlight this ^ thing Joaquin said and mention that our team (Data
Engineering) is also participating in brainstorming ways to bring back not just cross-wiki
joins but better datasets to run these queries. We have some good ideas, so please do
participate in the task and give us more input so we can pick the best solution quickly.
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud(a)lists.wikimedia.org <mailto:Cloud@lists.wikimedia.org> (formerly
labs-l(a)lists.wikimedia.org <mailto:labs-l@lists.wikimedia.org>)
https://lists.wikimedia.org/mailman/listinfo/cloud
<https://lists.wikimedia.org/mailman/listinfo/cloud>
--
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation