Your last question is a non-issue for me: I do not care whether things live in the database or in configuration; that is not the issue I have been complaining about.
What I blocked is having 6000 million rows (x40 due to redundancy) with the same column value "gzip; version 3 (1-2-3-testing-testing. It seems to work)" when it can be summarized as a 1-byte (or smaller) id, with that id explained somewhere else. The difference between the two options is extremely cheap to code, and not only would it save thousands of dollars in server cost, it would also minimize maintenance cost and dramatically increase performance (or at least not decrease it) on one of the largest bottlenecks for large wikis, since the data could then fit fully into memory (yes, we have 515 GB servers now).
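To make the idea concrete, here is a minimal sketch of the kind of normalization I mean (the table and column names are hypothetical, not the actual mediawiki schema):

  -- Lookup table: one row per distinct model/format value.
  CREATE TABLE content_format (
    cf_id TINYINT UNSIGNED NOT NULL PRIMARY KEY, -- 1 byte instead of a long string
    cf_name VARBINARY(255) NOT NULL              -- the full human-readable value
  );

  INSERT INTO content_format VALUES
    (1, 'gzip; version 3 (1-2-3-testing-testing. It seems to work)');

  -- The 6000-million-row table then stores only the 1-byte id
  -- (e.g. rev_content_format TINYINT UNSIGNED) instead of repeating the string.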
To give you an idea of how bad things are currently: WMF's architecture technically does not store any data on the main database servers (a lot of asterisks here, allow me to be inexact for the sake of simplicity), only metadata, as the wiki content is stored on the "external storage" subsystem. I gave InnoDB compression a try [0] (which has a very low compression ratio and a very small block size, as it is meant for real-time purposes only), yet I was able to reduce the disk usage to less than half by compressing only the top 10 tables: [1]. If this is not an objective measurement of how inefficient the mediawiki schema is, I do not know how I can convince you otherwise.
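For reference, this kind of compression is a one-statement change per table (standard InnoDB/MariaDB syntax; the exact tables and block size used in [0] may differ, so treat this as illustrative):

  ALTER TABLE revision
    ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=8; -- compress the default 16 KB pages into 8 KB blocks

That even this crude, real-time-oriented compression halves the disk usage shows how much redundancy there is to remove.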
Of course there is a lot of history, and there are legacy and maintenance issues, but when the guy who would actually spend days of his life running schema changes so they do not affect production is the one begging for them to happen, you know there is an issue. And this is not a "mediawiki is bad" complaint- I think mediawiki is a very good piece of software- I only want to make it better with very, very small, maintenance-like changes.
> The disadvantage is of course that the model and format are not obvious when eyeballing the result of an SQL query.
Are you serious? Because this is super-clear already :-P:
MariaDB db1057 enwiki > SELECT * FROM revision LIMIT 1000,1\G
*************************** 1. row ***************************
       rev_text_id: 1161 -- what?
[...]
 rev_content_model: NULL -- what?
rev_content_format: NULL
1 row in set (0.00 sec)
MariaDB db1057 enwiki > SELECT * FROM text WHERE old_id=1161; -- WTF, old_id?
+--------+---------------------+----------------+
| old_id | old_text            | old_flags      |
+--------+---------------------+----------------+
|   1161 | DB://rc1/15474102/0 | external,utf-8 | -- WTF is this?
+--------+---------------------+----------------+
1 row in set (0.03 sec)
I am joking at this point, but emulating what someone who looks at the db would say. My point is that mediawiki is no longer simple.
More recommended reading (not for you, but for the many developers who are still afraid of them- and I have really found many such cases in the wild among otherwise good contributors): https://en.wikipedia.org/wiki/Join_(SQL)
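To tie this back to the earlier point, a single join is all it takes to get the human-readable value back from a normalized 1-byte id (again using the hypothetical content_format table sketched above):

  SELECT r.rev_id, cf.cf_name
  FROM revision r
  JOIN content_format cf
    ON cf.cf_id = r.rev_content_format -- tiny lookup table, trivially cached in memory
  LIMIT 10;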
[0] https://phabricator.wikimedia.org/T139055
[1] https://grafana.wikimedia.org/dashboard/db/server-board?panelId=17&fullscreen&from=1467294350779&to=1467687175941&var-server=db1073&var-network=eth0
On Tue, Jul 12, 2016 at 10:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
> Addendum, after sleeping over this:
>
> Do we really want to manage something that is essentially configuration, namely the set of available content models and formats, in a database table? How is it maintained?