Hi Amir & others,
I’m glad we are making changes to improve DB storage/query efficiency. I wanted to express my agreement with Tacsipacsi that dropping the data before the migration has completed is a really bad outcome. Now tool maintainers need to deal with multiple migration states depending on which wikis they query, or add more code complexity. And there is little time to make the changes for those of us who had planned to wait until the new data was available.
Commons has grown to 1.8TB already
That’s a big number, yes, but it doesn’t really answer the question: is the database actually about to fill up? How much time do you have until that happens and how much time until s1/s8 finish their migration? Is there a reason you can share why this work wasn’t started earlier if the timing is so close?
so you need to use the old way for the list of thirty-ish wikis (s1, s2, s6, s7, s8) and for any wiki not a member of that, you can just switch to the new system
IMO, a likely outcome is that some tools/bots will simply be broken on a subset of wikis until the migration is completed across all DBs.
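For what it's worth, tools could avoid hard-coding the list of migrated shards by checking each replica's schema at runtime instead. A rough sketch, assuming a DB-API connection (e.g. pymysql) to the wiki's replica and that the migration drops the legacy pl_title column; the helper name here is made up:

    def uses_legacy_pagelinks(conn, dbname):
        """Return True if this wiki replica still has the old pl_namespace/pl_title columns."""
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT COUNT(*)
                FROM information_schema.columns
                WHERE table_schema = %s
                  AND table_name = 'pagelinks'
                  AND column_name = 'pl_title'
                """,
                (dbname,),
            )
            (count,) = cur.fetchone()
        return count > 0

    # e.g. uses_legacy_pagelinks(conn, "commonswiki_p") would start returning
    # False once the old columns are gone on that shard.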
Thanks for all your work on this task so far.
Ben / Earwig
Hi!
On Wed, Jan 17, 2024 at 7:37 PM Ben Kurtovic <wikipedia.earwig@gmail.com> wrote:
Hi Amir & others,
I’m glad we are making changes to improve DB storage/query efficiency. I wanted to express my agreement with Tacsipacsi that dropping the data before the migration has completed is a really bad outcome. Now tool maintainers need to deal with multiple migration states depending on which wikis they query, or add more code complexity. And there is little time to make the changes for those of us who had planned to wait until the new data was available.
I totally understand the frustration. In my volunteer capacity, I also maintain numerous tools and they break every now and then because of changes.
Commons has grown to 1.8TB already
That’s a big number, yes, but it doesn’t really answer the question: is the database actually about to fill up?
It's a bit more nuanced. We are not hitting limits on storage. But the memory for the data cache on each replica is about 350GB, and we need to serve almost everything from memory since disk is 1000 times slower than memory. If we read too much from disk, reads start to pile up, leading to appserver requests piling up as well and a general outage (which has happened before with Wikidata's database). You can have a 3TB database with only 100GB of "hot" data and be fine, but Commons is both big and very heavily read across its tables and rows. Ratio-wise, the Commons database is already reading twice as much from disk as English Wikipedia.
How much time do you have until that happens and how much time until s1/s8 finish their migration?
The database is already in a "fragile" and "high risk" state. I can't give you an exact date when it'll go down, but for the reasons mentioned above I can tell you that even now, any noticeable increase in its traffic or any sudden shift in its read patterns will take it down and bring all wikis down with it. There are already user-facing parts of Commons that shouldn't be slow but are, due to excessive reads from disk.
Also, in the case of Wikidata, it might take a long time (possibly three more months) to finish, due to its unique pagelinks usage pattern caused by scholarly articles.
Is there a reason you can share why this work wasn’t started earlier if the timing is so close?
We have been constantly working to reduce its size over the past several years: the templatelinks migration, the externallinks redesign, and so on have been done back to back (starting in 2021; we even bumped the priority of the externallinks migration because of Commons alone), but at the same time the wiki has been growing way too fast. (To emphasize: the growth doesn't have much to do with images being uploaded. The image table is only 100GB; the problem is the overly large links tables, with templatelinks at 270GB, categorylinks at 200GB, pagelinks at 190GB, and so on.) This has put us into a Red Queen situation (https://en.wikipedia.org/wiki/Red_Queen%27s_race) with no easy way out.
so you need to use the old way for the list of thirty-ish wikis (s1, s2, s6, s7, s8) and for any wiki not a member of that, you can just switch to the new system
IMO, a likely outcome is that some tools/bots will simply be broken on a subset of wikis until the migration is completed across all DBs.
The most urgent one is Commons. What about only dropping it from Commons to reduce the risk of an outage, and leaving the rest until all are finished (or all except Wikidata)? You'd have to write something for the new schema regardless.
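For illustration, the two query shapes could look roughly like this (a sketch only: the column names come from the pagelinks/linktarget normalization, and the function and constant names are made up):

    LEGACY_BACKLINKS = """
        SELECT pl_from
        FROM pagelinks
        WHERE pl_namespace = %s AND pl_title = %s
    """

    # New schema: pagelinks points at a normalized linktarget row instead of
    # carrying the namespace/title itself.
    NEW_BACKLINKS = """
        SELECT pl_from
        FROM pagelinks
        JOIN linktarget ON pl_target_id = lt_id
        WHERE lt_namespace = %s AND lt_title = %s
    """

    def backlinks(conn, namespace, title, legacy):
        """Fetch IDs of pages linking to (namespace, title), picking the query per wiki."""
        query = LEGACY_BACKLINKS if legacy else NEW_BACKLINKS
        with conn.cursor() as cur:
            cur.execute(query, (namespace, title))
            return [row[0] for row in cur.fetchall()]

Once a wiki's old columns are dropped, only the second query works there; the "legacy" flag is the per-wiki switch that tools would otherwise have to hard-code per shard.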
Thanks for all your work on this task so far.
Thank you and sorry for the inconvenience.
Ben / Earwig
Changing queries to support a new database format is one thing. Writing migration code to deal with a situation that should not exist (columns being dropped before the migration is completed) is another. I suppose I am lucky in that the only tool I maintain that queries the pagelinks table is single-wiki.
AntiCompositeNumber (he/him)