Hi,
On Monday morning we will switch these clusters from the current server (hyacinth) to cassia due to previously announced problems with hyacinth. This will involve a couple of hours of read-only time while the user databases are copied. There should be no interruption to wiki database access.
This issue is being tracked in JIRA as MNT-423.
- river.
This maintenance is now completed.
- river.
River,
This maintenance is now completed.
did you change the upper limit for @@max_heap_table_size on cassia? Increasing its value above 134217728 (double the default) doesn't work there.
With such a limitation I can say I spent years of coding for nothing, because Golem's temporary table size requirements depend on the actual wiki size and sometimes go up to 4 GB. I do not think this can be improved.
Did you announce such a limitation on memory table size anywhere in the past?
mashiah
Mashiah Davidson:
did you change the upper limit for @@max_heap_table_size on cassia?
Hm... I knew I forgot to mention something. Yes, maximum_max_heap_table_size is now set to 128MB.
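For reference, a minimal sketch of what that ceiling looks like from a client session, assuming the usual behaviour of the maximum- option prefix (a request above the ceiling is clamped to it, normally with a warning):
SET SESSION max_heap_table_size = 4294967296;   -- ask for 4 GB
SHOW WARNINGS;                                  -- the server notes the value was adjusted
SELECT @@max_heap_table_size;                   -- 134217728 (128MB), not 4 GB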
With such a limitation I can say I spent years of coding for nothing, because Golem's temporary table size requirements depend on the actual wiki size and sometimes go up to 4 GB. I do not think this can be improved.
4GB is *way* too much; we don't have even close to that much free memory available on the MySQL servers. When we discussed this by email, I had the impression you had reduced its memory usage a lot -- if it's using 4GB *after* being reduced, I hate to think what it required previously...
- river.
Hello
Hm... I knew I forgot to mention something. Yes, maximum_max_heap_table_size is now set to 128MB.
This leaves my tool unable to function.
4GB is *way* too much; we don't have even close to that much free memory available on the MySQL servers.
As far as I've seen over the last two years, it was OK. To be more precise, I am not sure it ever really used 4 GB, but for code spots where a significant amount of data has to be cached, the bot sometimes raises the allowed memory size up to this limit. Of course, I did not allow it to grow without bound. The functionality for changing the memory limit can be seen here: https://fisheye.toolserver.org/browse/golem/isolated/memory.sql?r=HEAD
From now on I'll add something like this there:
SET @old = @@max_heap_table_size;
SET @@max_heap_table_size = 2*@old;
# crazy code, but inside a stored routine it catches the clamp:
IF @@max_heap_table_size != 2*@old THEN CALL error(); END IF;
When we discussed this by email, I had the impression you had reduced its memory usage a lot -- if it's using 4GB *after* being reduced, I hate to think what it required previously...
OK, here I need to say a few words about what Golem is and what it does. Golem, in general, is a tool for recognizing isolated article clusters and for generating suggestions on how to improve Wikipedia's connectivity. It also performs some supplementary functions, such as analysis of links to disambiguation pages and recognition of category-tree cycles. One can try it starting from this page: http://toolserver.org/~lvova/cgi-bin/go.sh?interface=en.
Golem's work is split into a number of successive stages, and each stage depends on data obtained in the previous ones. The very first phase is caching links from the language database into MEMORY tables: page links, category links, template links and, of course, zero-namespace pages, category pages and template pages. Memory requirements for this stage depend only on the size of the wiki. For the German Wikipedia, Golem asks to allow MEMORY tables of up to 4 GB; for the Russian Wikipedia 1 GB is enough, and smaller wikis fit into the default limits. For the English Wikipedia I do not run the analysis at all, as it is too heavy.
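Purely as an illustration of this caching step (the table name p0 and its layout are made up for this sketch, not Golem's actual schema), the first stage amounts to something like:
-- cache the zero-namespace articles of one wiki into a MEMORY table,
-- keyed by title so that links can later be resolved to page ids
CREATE TABLE p0 (
  id    INT UNSIGNED NOT NULL,
  title VARCHAR(255) BINARY NOT NULL,
  PRIMARY KEY (title),
  KEY (id)
) ENGINE=MEMORY;

INSERT INTO p0 (id, title)
SELECT page_id, page_title
FROM ruwiki_p.page
WHERE page_namespace=0 AND page_is_redirect=0;
The same is then done for pagelinks, categorylinks and templatelinks, which is where the table sizes grow with the size of the wiki.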
After the first stage is completed, the other stages use much less memory, with the exception of the very last stage, called iwiki spy. The purpose of iwiki spy is to analyze interwiki links from isolated articles to other languages and either spot possible linking suggestions there or find articles to be translated so that links can be set from them.
Iwiki spy requires a lot of memory for relatively small wikis (war was the worst case I've seen), because lots of suggestions for their isolated articles come in from everywhere. That's why the limit for iwiki spy is always set to 4 GB.
One more thing to take into account is that there are two people on the toolserver who regularly run Golem. One of them is lvova; she has the official copy, linked from Wikipedia templates and other pages. My copy is just for development.
As you remember, we've experienced memory issues during the last few weeks. All of those issues correlate with situations when lvova and I ran the bot at the same time (each requesting up to 4 GB for temporary data) and both worked on relatively small languages. In such a situation, two iwiki spies together requested too much memory.
For now I have just disabled the iwiki spy stage in my copy of the bot, in order to let lvova's copy provide the linking suggestions for isolated articles. After this change the bot successfully analyzed the whole list of available languages and neither hung nor used too much memory. Lvova's copy was running at the same time with iwiki spy on.
My intention was to enable iwiki spy again for both copies after a rewrite (which is possible but may take some time).
The current status of the tool is:
1. It does work for languages on s2/s5, because the limitation on allowed memory size is not in place there.
2. It does not work for relatively big languages on s3/s6, not even for the first stage, the caching of links and pages. I tried fr as an example, and the bot hung trying to increase the allowed heap table size to values around 512 MB for caching category pages.
3. Golem's web page works well, with small exceptions caused by a number of SQL procedures not yet moved to the new server; an example can be seen here: http://toolserver.org/~mashiah/cgi-bin/go.sh?language=ru&interface=en&am...
4. I am really in a difficult situation with the latest configuration change, because from now on the connectivity project running in the Russian Wikipedia has none of the data it has had for the last two years. The Ukrainian community has actively used Golem's data for approximately a year. The Polish community, which was just introduced to Golem's functions at their national wiki conference, will not be able to use them. The work other people are doing on translating Golem's interface into German is probably no longer required.
Sorry if something in this message looks too emotional; a great deal of effort went into implementing the tool, and I am really not sure the "way too much" statement is well enough proven to justify the restrictions just introduced.
mashiah
Mashiah Davidson:
Hm... I knew I forgot to mention something. Yes, maximum_max_heap_table_size is now set to 128MB.
This leaves my tool unable to function.
4GB is *way* too much; we don't have even close to that much free memory available on the MySQL servers.
As far as I've seen over the last two years, it was OK.
I don't think so. Over the last year or so, we've seen a lot of problems with MySQL using too much memory for no apparent reason, and sometimes running out of memory and crashing. (This is not something that's just been happening in the last couple of months.) I wasn't able to find anything that might cause this, but I did not consider that someone would create 4GB of MEMORY tables.
As you remember, we've experienced memory issues during the last few weeks.
We've been having this issue a lot longer than that. It was simply that in the last month or so, it caused more MySQL crashes than previously.
Sorry if something in this message looks too emotional; a great deal of effort went into implementing the tool, and I am really not sure the "way too much" statement is well enough proven to justify the restrictions just introduced.
I understand this seems like we're breaking something that previously worked. However, that is not the case. This has _never_ worked properly, we just weren't able to find the cause before. This excessive memory use is causing significant problems for all Toolserver users, in the form of unreliable MySQL servers and slow performance (= slow queries, higher replication lag).
The statement is trivial to prove: we do not have 4GB of free memory on the MySQL servers. The reason we buy servers with 32GB RAM is so we can *use* 32GB of memory. There is no free memory (beyond a little bit that the OS needs to function). If we use 32GB, and you use 4GB, we are using 36GB. The server only has 32GB. The only possible result of that is either it will start swapping (which completely kills performance), or it runs out of memory and crashes.
Still, since I have no absolutely certain proof that Golem causes this problem, I will monitor memory use on cassia over the next month or so. If we see the same issue again, even though MEMORY tables are limited to 128MB, I'll consider that something else might be the cause.
- river.
I don't think so. Over the last year or so, we've seen a lot of problems with MySQL using too much memory for no apparent reason, and sometimes running out of memory and crashing. (This is not something that's just been happening in the last couple of months.) I wasn't able to find anything that might cause this, but I did not consider that someone would create 4GB of MEMORY tables.
From what I remember, there were a lot of different reasons why ts crashed. It is only during the last month that I've started receiving "cannot allocate XXX" messages during processing.
Meanwhile, did you know it ran in a 24x7-like mode during the last year, and that sometimes two bots were working in parallel?
mashiah
Mashiah Davidson:
From what I remember, there were a lot of different reasons why ts crashed.
Yes, I'm quite aware of the various reasons for things crashing or running much slower than they should. One of these reasons is MySQL running out of memory.
Meanwhile, did you know it ran in a 24x7-like mode during the last year, and that sometimes two bots were working in parallel?
That would not surprise me, since as I said, we've been having this problem for quite a while.
- river.
That would not surprise me, since as I said, we've been having this problem for quite a while.
What I cannot understand is why you think it causes problems only sometimes, while most of the time it works OK. One more thing: didn't you see that after iwiki spy was disabled in one of the copies, the memory issues disappeared?
mashiah
On Mon, Mar 29, 2010 at 3:53 PM, Mashiah Davidson mashiah.davidson@gmail.com wrote:
This leaves my tool unable to function.
Why can't it just use MyISAM or InnoDB tables instead of MEMORY? That would slow it down, but it should still work. A lot of the reads will still be from memory anyway, although writes might be slower across the board.
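To make the suggestion concrete, the change would be in the engine clause of the working tables; a hypothetical example (the table and column names are made up, not Golem's actual schema):
-- same working table, but disk-backed, so no heap-table ceiling applies
CREATE TABLE work_links (
  l_from INT UNSIGNED NOT NULL,
  l_to   INT UNSIGNED NOT NULL,
  KEY (l_from),
  KEY (l_to)
) ENGINE=MyISAM;    -- or ENGINE=InnoDB, instead of ENGINE=MEMORY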
Aryeh Gregor:
although writes might be slower across the board.
Shouldn't be *much* slower, since all writes go to nvram cache anyway. Most of the overhead would be InnoDB/OS processing.
- river.
Why can't it just use MyISAM or InnoDB tables instead of MEMORY? That would slow it down, but it should still work. A lot of the reads will still be from memory anyway, although writes might be slower across the board.
It does use MyISAM for the results shown on the web pages. Using MyISAM for temporary data makes the bot several times slower. Currently Golem needs about 1.5 hours to process ruwiki, about 6 hours for such a disconnected wiki as pt, and 2 hours for de. All languages (106 with properly tuned disambiguation templates) take 3 days of processing with iwiki spy on and 1 day with it off.
mashiah
Mashiah Davidson wrote:
As you remember, we've experienced memory issues during the last few weeks. All of those issues correlate with situations when lvova and I ran the bot at the same time (each requesting up to 4 GB for temporary data) and both worked on relatively small languages. In such a situation, two iwiki spies together requested too much memory.
For now I have just disabled the iwiki spy stage in my copy of the bot, in order to let lvova's copy provide the linking suggestions for isolated articles. After this change the bot successfully analyzed the whole list of available languages and neither hung nor used too much memory. Lvova's copy was running at the same time with iwiki spy on.
My intention was to enable iwiki spy again for both copies after a rewrite (which is possible but may take some time).
If the problem is just one of launching two copies at the same time, make the code verify before running that the other copy is not running. That can range from an existence check of a file in /tmp (supposing both copies are launched from the same login server) to MySQL table locks or entries in a well-known table.
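For the MySQL variant, a named server-side lock would do; a minimal sketch (the lock name is arbitrary):
SET @got_lock = GET_LOCK('golem_running', 0);   -- 1 if acquired, 0 if the other copy holds it
-- run the analysis only if @got_lock = 1
SELECT RELEASE_LOCK('golem_running');           -- when done; also released if the session dies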
If the problem is just of launching two copies at the same time, make the code verify before running that the other copy is not running.
Unfortunately this is not the main problem at the moment. For one thing, after the most memory-consuming part of the processing was disabled in one of the copies, the bot became more stable. But the main problem is that with the limitation on MEMORY table size just introduced, it cannot perform even its very first steps, even when it runs alone.
The other problem is that had the limit been in place two years ago, I would never have gone ahead with creating this tool; but I did, and this means I've spent a huge part of my life for nothing, and a few other people did the same. One of my colleagues, who runs the bot and applies its results for ru and uk, has just attended the Polish wiki conference, describing the tool and inviting a third language community to apply the bot's results in their wiki.
mashiah
Mashiah Davidson:
The other problem is that had the limit been in place two years ago, I would never have gone ahead with creating this tool; but I did, and this means I've spent a huge part of my life for nothing, and a few other people did the same.
I'm sorry if you thought creating multi-gigabyte MEMORY tables was acceptable. It never even occurred to me that someone would do this and not see the problem with it.
You now have two options: you can continue to discuss this on the mailing list, which will not result in the limit being removed[0], or you can look at a way to perform the same result without needing such large MEMORY tables. I would suggest that the latter is a more productive use of your time.
- river.
[0] But I might consider a small increase; perhaps 256MB.
It never even occurred to me that someone would do this and not see the problem with it.
Exactly: I do not see any problem with it, because the tool itself works well (with the exception of iwiki spy, which can indeed be improved by using a different algorithm for collecting suggestions).
You now have two options: you can continue to discuss this on the mailing list, which will not result in the limit being removed[0], or you can look at a way to perform the same result without needing such large MEMORY tables. I would suggest that the latter is a more productive use of your time.
I guess it is not possible to reduce the memory the bot asks for and at the same time keep its performance at the same level. You know administration well; I know the application domain of connectivity analysis.
I really think that the only acceptable solution for Golem to continue working is for the limit to be removed, no matter whether this is achieved by installing additional memory in the servers or in some other way. This does not mean, of course, that I should not think about memory usage; I will, as I did in the past, if the bot is allowed to work. Just not now.
I believe you also have more than just two options: to keep the limit on cassia, or to remove it. You could also introduce a similar limit on thyme and daphna. Why not? In the list of options you gave me, I do not see any that would let you, as a toolserver administrator, help Golem run again.
mashiah
Hello, at Tuesday 30 March 2010 00:36:00 Mashiah Davidson wrote:
Exactly: I do not see any problem with it, because the tool itself works well
Why do you think that you can use 1/8 of the server's memory alone? We have a lot more than 8 users, so it makes me wonder how 1 user thinks he can use 1/8 while the other few dozen share the rest. I guess the solution to your problem is clear: rewrite your tool so that it doesn't use MEMORY tables but real db tables (InnoDB or MyISAM). That should keep your tool running and let the other users keep their fair share of memory.
Sincerely, DaB.
DaB,
Why do you think that you can use 1/8 of the server's memory alone? We have a lot more than 8 users, so it makes me wonder how 1 user thinks he can use 1/8 while the other few dozen share the rest.
I tested the solution you suggested during the initial implementation phase and found it too slow.
Anyway, the current limitation doesn't mean one cannot use 1/8 of the toolserver's memory for one's tools. It only limits the size of a single MEMORY table, so you can create many of them and still use that 1/8th.
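For what it's worth, a sketch of what staying under a per-table cap could look like (purely illustrative; whether Golem's real tables can be split this way is another question):
-- two MEMORY tables, each kept under the 128MB cap, holding halves of the edge set
CREATE TABLE links_a (l_from INT UNSIGNED, l_to INT UNSIGNED, KEY (l_from)) ENGINE=MEMORY;
CREATE TABLE links_b LIKE links_a;
-- a view stitches them back together for reads; writes still go to the halves
CREATE VIEW links_all AS
  SELECT l_from, l_to FROM links_a
  UNION ALL
  SELECT l_from, l_to FROM links_b;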
On the other hand, asking for tables of up to 4 GB doesn't mean 4 GB will actually be used. The request to allow 4 GB is based on an imprecise estimate of how much a table of a given structure may take on a server with a given hardware architecture.
I also think your calculation rests on a wrong premise: I am indeed just one toolserver user, but the tool is meant to be used by far more people, just like your tools and others'. I am OK with the conclusion that connectivity analysis is too complex a task for the toolserver to handle. But I do not think I've taken 1/8 of its memory for myself alone.
mashiah
On 2010-03-30 09:21, Mashiah Davidson wrote:
DaB,
Why do you think that you can use 1/8 of the server's memory alone? We have a lot more than 8 users, so it makes me wonder how 1 user thinks he can use 1/8 while the other few dozen share the rest.
I tested the solution you suggested during the initial implementation phase and found it too slow.
Hello,
I don't know the details of your algorithm and apologies for assuming ignorance on your part, but did you keep in mind the mantra of databases? It goes something like this (imagine Steve Ballmer saying this): Indexes! Indexes! Indexes! (Or is it "indices"?)
I too had a serious query complexity problem a while back - that was with disambiguation pages with links (and related queries such as pages with most DPLs, templates linking to DPLs etc.), which entails even 5- or 6-table joins. When I tried running it as one query, it would get killed all the time, except on very small samples. Then I started storing intermediate results in temporary (albeit "physical") tables, *heavily* indexed (even on fields that are not normally indexed in a MediaWiki database) - now several reports get generated in a matter of minutes (for pl.wiki, which is quite a biggie).
Of course, indexes take disk space, but that is quite plentiful on the toolserver, and with that space you should be able to buy speed for your algorithm.
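A sketch of that pattern with made-up names (the real DPL reports involve more joins, and the disambiguation-template name differs per wiki, so 'Disambig' here is hypothetical):
-- stage 1: materialize an intermediate result into a plain disk table
CREATE TABLE dab_links (
  from_id   INT UNSIGNED NOT NULL,
  dab_title VARCHAR(255) BINARY NOT NULL
) ENGINE=MyISAM;

INSERT INTO dab_links
SELECT pl.pl_from, p.page_title
FROM plwiki_p.pagelinks AS pl
JOIN plwiki_p.page AS p
  ON p.page_namespace = pl.pl_namespace AND p.page_title = pl.pl_title
JOIN plwiki_p.templatelinks AS tl
  ON tl.tl_from = p.page_id
WHERE tl.tl_namespace = 10 AND tl.tl_title = 'Disambig';

-- stage 2: index it heavily, then build the reports from the (much smaller) table
ALTER TABLE dab_links ADD INDEX (from_id), ADD INDEX (dab_title);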
Regards, Misza
On 10-03-30 03:01 PM, Misza wrote:
imagine Steve Ballmer saying this
I'd rather not, actually! :P
-Mike
I don't know the details of your algorithm and apologies for assuming ignorance on your part, but did you keep in mind the mantra of databases? It goes something like this (imagine Steve Ballmer saying this): Indexes! Indexes! Indexes!
The code is openly available at https://fisheye.toolserver.org/browse/golem. The short answer is definitely YES. Even with the MEMORY engine it would be too slow without indexes. Indexes are placed just where they are really needed, in order to reduce memory consumption.
mashiah
Mashiah Davidson:
I guess it is not possible to reduce the memory the bot asks for and at the same time keep its performance at the same level. You know administration well; I know the application domain of connectivity analysis.
I really think that the only acceptable solution for Golem to continue working is for the limit to be removed, no matter whether this is achieved by installing additional memory in the servers or in some other way.
Sorry, but I don't see how spending EUR2'500 on additional RAM to support a single tool is an effective use of WM-DE funds.
Perhaps you could consider a different implementation language? I don't think shell scripts and SQL stored procedures are particularly well-known for being memory-efficient.
You could also introduce a similar limit on thyme and daphna. Why not?
We will probably do this after a testing period on cassia (and hyacinth when it returns).
- river
Sorry, but I don't see how spending EUR2'500 on additional RAM to support a single tool is an effective use of WM-DE funds.
Thank you for putting the exact amount here. Do you think it would be an effective use if such a sum came from another source and the additional memory were available to all tools running on the toolserver? My view is that it is far less than the EUR8'000 mentioned last time as the price of a dedicated server.
Perhaps you could consider a different implementation language? I don't think shell scripts and SQL stored procedures are particularly well-known for being memory-efficient.
You are right about memory efficiency; the MySQL MEMORY engine is not about that. On the other hand, offline computation would create load on the inter-server links and occupy disk space. Downloading the German Wikipedia's links table to the login server would take more time than just copying it into a MEMORY table on the same server. My estimate is that it would take hours to get a copy of this table onto a disk drive, and this is just a small portion of the whole thing.
mashiah
On Tue, Mar 30, 2010 at 2:46 AM, River Tarnell river.tarnell@wikimedia.de wrote:
Sorry, but I don't see how spending EUR2'500 on additional RAM to support a single tool is an effective use of WM-DE funds.
WM-DE at some point received money from the Wikimedia Foundation. The Wikimedia Foundation received some of its money from, among others, our project's users. Some of our users live in Germany and are members of WM-DE. You must resolve problems with toolserver programs, yes, and memory is a problem, but the current solution to those problems does not allow these users to work with the tools they are accustomed to.
In any case, we constantly work on optimizing our script, and just a few days ago it became better in this regard. Has the toolserver had problems since then? I did not find an answer to that. And there are further improvements in our plans...
Besides, the project needs its data updated close to daily, or the connectivity information used in Wikipedia becomes wrong. This information is used in 4% of articles in ruwiki and in 10% of articles in ukwiki.
Mashiah responded to the previous letter about memory tables earlier; so please give him time to work on the code, with the opportunity to run the script in the meantime.
Mashiah Davidson wrote:
I guess it is not possible to reduce the memory the bot asks for and at the same time keep its performance at the same level. You know administration well; I know the application domain of connectivity analysis.
I really think that the only acceptable solution for Golem to continue working is for the limit to be removed, no matter whether this is achieved by installing additional memory in the servers or in some other way. This does not mean, of course, that I should not think about memory usage; I will, as I did in the past, if the bot is allowed to work. Just not now.
I believe you also have more than just two options: to keep the limit on cassia, or to remove it. You could also introduce a similar limit on thyme and daphna. Why not? In the list of options you gave me, I do not see any that would let you, as a toolserver administrator, help Golem run again.
mashiah
Mashiah, you are not being reasonable. Your "work" is to get the analysis done. River's work is to make sure that *everyone* accessing the toolserver can use it decently. Even if it means reducing what each one can individually do. That's what time-sharing is about.
You also have several options, like running the analysis on a dedicated machine you own, or donating compatible memory modules to WM-DE to upgrade cassia with an additional 4 GB.
Why do you need, e.g., the initial phase of copying the links from the language database into MEMORY tables (page links, category links, template links, zero-namespace pages, category pages, template pages)?
Platonides,
Why do you need, e.g., the initial phase of copying the links from the language database into MEMORY tables (page links, category links, template links, zero-namespace pages, category pages, template pages)?
Connectivity is a property of a graph as a whole; there is no way to analyze it with only a part of the nodes and edges. Using the original tables in the language database, or MyISAM tables, makes the analysis far too slow. The good thing about MEMORY tables is not only that they are located in memory (which is not always true, of course): the engine itself is optimized for speed and the storage format is designed to allow that.
mashiah
Hello Mashiah
Connectivity is a property of a graph as a whole; there is no way to analyze it with only a part of the nodes and edges. Using the original tables in the language database, or MyISAM tables, makes the analysis far too slow. The good thing about MEMORY tables is not only that they are located in memory (which is not always true, of course): the engine itself is optimized for speed and the storage format is designed to allow that.
If your project requires more resources than are available as your fair share on the toolserver, then either the need for resources has to be reduced, or the project has to run elsewhere. If there are good reasons and sufficient funding, setting aside a VM or even a full server for a special project can be considered. How individual projects and chapters can participate more in the governance (and funding) of the toolserver is one of the topics that will be discussed at the upcoming chapters' conference in April in Berlin. I suggest you contact someone who will attend the meeting, and discuss the issue with them.
Anyway, if using MySQL's memory tables consumes too much resources, perhaps consider alternatives? Have you looked at network analysis frameworks like JUNG (Java) or SNAP (C++)? Relational databases are not good at managing linked structures like trees and graphs anyway.
The memory requirements shouldn't be that huge anyway: two IDs per edge = 8 bytes. The German language Wikipedia for instance has about 13 million links in the main namespace; 8*|E| would need about 1GB even for a naive implementation. With a little more effort, it can be nearly halved to 4*|E|+4*|V|.
I have used the trivial edge store for analyzing the category structure before, and Neil Harris is currently working on a nice standalone implementation of this for Wikimedia Germany. This should allow recursive category lookup in microseconds.
In any case, something needs to change. You can't expect to be frequently using 1/8 of the toolserver's RAM. Even more so since this amount of memory can't be used by MySQL for caching while you are not using it (because of the way the innodb cache pool works).
Regards, Daniel
* Daniel Kinzler
Anyway, if using MySQL's memory tables consumes too much resources, perhaps consider alternatives? Have you looked at network analysis frameworks like JUNG (Java) or SNAP (C++)? Relational databases are not good at managing linked structures like trees and graphs anyway.
If Java is not a problem, there's also a (NoSQL) graph database available called Neo4J: http://neo4j.org/
Regards, Morten
I suggest you contact someone who will attend the meeting, and discuss the issue with them.
Thank you, I think I've already found such a person.
Anyway, if using MySQL's memory tables consumes too much resources, perhaps consider alternatives? Have you looked at network analysis frameworks like JUNG (Java) or SNAP (C++)? Relational databases are not good at managing linked structures like trees and graphs anyway.
My view of MySQL's capabilities was different. The first thought was that the task involves memory-intensive computation, i.e. it reads lots of data to produce a comparatively small amount of results. Memory operations make up the major part of the overall analysis complexity, which is why it is reasonable to use an engine specifically targeted at working with data efficiently. By efficiency here I mean mostly processing speed. Indeed, the idea was to mark isolated articles with templates to make authors aware of the issue. Practice has shown that the templates need to be set using up-to-date data, which means it is not good if the bot runs for many hours. The other lesson from practice is that templates need to be set nearly daily, otherwise authors lose attention to their creations.
Yes, it takes lots of memory, because the MEMORY engine stores varchar data in an inefficient way and spends a lot of memory on indexes; but on the other hand the processing takes just 1-2 hours for a wiki like ru or de. The estimates I made at the initial stage for an offline implementation gave much worse figures for how up to date the results would be, which is why I chose SQL.
The memory requirements shouldn't be that huge anyway: two IDs per edge = 8 bytes. The German language Wikipedia for instance has about 13 million links in the main namespace; 8*|E| would need about 1GB even for a naive implementation. With a little more effort, it can be nearly halved to 4*|E|+4*|V|.
My data for dewiki is different. The number of links between articles (excluding disambigs), after throwing out redirects, is around 33 million. The source is here: http://toolserver.org/~mashiah/isolated/de.log. One can find lots of other interesting statistics there.
I have used the trivial edge store for analyzing the category structure before, and Neil Harris is currently working on a nice standalone implementation of this for Wikimedia Germany. This should allow recursive category lookup in microseconds.
I think the category tree analysis (which is also in there) takes, in the worst case, minutes for a relatively large wiki (7 minutes for about 150 small Wikipedias). As output, the category-tree graph is split into strongly connected components. With an offline application, just downloading the data from the database could take longer than Golem's whole processing time.
mashiah
Mashiah Davidson wrote:
Yes, it takes lots of memory, because the MEMORY engine stores varchar data in an inefficient way and spends a lot of memory on indexes,
Why do you store varchar data at all? It would be much more efficient to use id-to-id maps, no?
but on the other hand the processing takes just 1-2 hours for a wiki like ru or de.
On a database server, all available memory is usually reserved for the InnoDB cache, where it greatly benefits query performance. If you want to use large chunks of memory for MEMORY tables, that memory cannot be reserved for InnoDB. So, it is unavailable to "normal" database operation. Sure, others could use it for their own MEMORY tables while you are not using it, but I doubt that is much help. Basically, memory available for MEMORY tables is not available for InnoDB.
This is the essential conflict: basically, we would have to reserve 1/8 of all resources for your use (well, for use by MEMORY tables - but I doubt anyone besides you uses big MEMORY tables).
The estimates I made at the initial stage for an offline implementation gave much worse figures for how up to date the results would be, which is why I chose SQL.
I can see that it would be much more effort to implement these things by hand, but I don't see why it would be less efficient.
My data for dewiki is different. The number of links between articles (excluding disambigs), after throwing out redirects, is around 33 million. The source is here: http://toolserver.org/~mashiah/isolated/de.log. One can find lots of other interesting statistics there.
You are right, I was looking at the wrong numbers.
I have used the trivial edge store for analyzing the category structure before, and Neil Harris is currently working on a nice standalone implementation of this for Wikimedia Germany. This should allow recursive category lookup in microseconds.
I think the category tree analysis (which is also in there) takes, in the worst case, minutes for a relatively large wiki (7 minutes for about 150 small Wikipedias). As output, the category-tree graph is split into strongly connected components. With an offline application, just downloading the data from the database could take longer than Golem's whole processing time.
Speed vs. memory is the usual tradeoff. We have found that Golem uses too much memory, and of course the easy way to solve the problem is to use a slower (offline) approach. I don't see an easy solution for this.
Anyway, my point is not about the category graph as such. I'm just saying that fast and memory-efficient network analysis is possible with this kind of architecture.
-- daniel
Why do you store varchar data at all? It would be much more efficient to use id-to-id maps, no?
Yes, sure. But in order to get this id-to-id map, one first has to cache a name-to-id map (the pages), because links are stored in id-to-name format. The table caching id-to-name sometimes takes more memory than the links themselves. But the good thing about it is that it lives for only a very short period of time.
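To make the shape of the problem concrete (the table names here are illustrative, not Golem's actual schema): pagelinks stores (pl_from, pl_namespace, pl_title), so building id-to-id edges needs the title-to-id cache first, and that cache can be dropped as soon as the join is done:
-- p0: id/title cache of namespace-0 articles; l: the resulting id-to-id edge table
CREATE TABLE l (
  l_from INT UNSIGNED NOT NULL,
  l_to   INT UNSIGNED NOT NULL,
  KEY (l_from),
  KEY (l_to)
) ENGINE=MEMORY;

INSERT INTO l (l_from, l_to)
SELECT pl.pl_from, p0.id
FROM ruwiki_p.pagelinks AS pl
JOIN p0 ON pl.pl_namespace = 0 AND p0.title = pl.pl_title;

DROP TABLE p0;   -- the expensive title-keyed cache lives only this long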
This is the essential conflict: basically, we would have to reserve 1/8 of all resources for your use (well, for use by MEMORY tables - but I doubt anyone besides you uses big MEMORY tables).
This figure of 1/8 of the resources is inaccurate. First, the limit applies to a single table, not to all tables a user creates. Second, when Golem supposes it will need 4 GB, it assumes the worst case. Third, Golem works with just one server at a time (when iwiki spy is switched off), so even if it takes 1/8 on one server, this means 1/24 of all the memory available on s1, s2 and s3.
I can see that it would be much more effort to implement these things by hand, but I don't see why it would be less efficient.
No hand operations at all. I was talking about the fact that the data has to be transferred from the SQL server to the client, and this transmission takes time. If I need to transmit id-to-id data, then as I said above I will in any case use a lot of memory to convert it into that format. On the other hand, transmitting id-to-name data (assuming it is converted in an application written in C) will take a lot of time.
Speed vs. memory is the usual tradeoff. We have found that Golem uses too much memory, and of course the easy way to solve the problem is to use a slower (offline) approach. I don't see an easy solution for this.
Me neither; that's why I think I did so much work for nothing.
Anyway, my point is not about the category graph as such. I'm just saying that fast and memory-efficient network analysis is possible with this kind of architecture.
I would agree, up to the point of data transmission from the SQL servers to the application performing the analysis. However, all of this covers only one function, the connectivity analysis itself; there are a lot of other things Golem does: statistics on the creators of isolated articles, suggestion generation, etc. All of this could require most of the language metadata to first be downloaded from the SQL server.
mashiah
On Wed, Mar 31, 2010 at 4:02 PM, Mashiah Davidson mashiah.davidson@gmail.com wrote:
This figure of 1/8 of the resources is inaccurate. First, the limit applies to a single table, not to all tables a user creates. Second, when Golem supposes it will need 4 GB, it assumes the worst case. Third, Golem works with just one server at a time (when iwiki spy is switched off), so even if it takes 1/8 on one server, this means 1/24 of all the memory available on s1, s2 and s3.
The question is what we set innodb_buffer_pool_size to. If no one needs to allocate large memory tables or such, we'd allocate almost all physical server memory to that variable. (>80% is standard for all-InnoDB database servers.) In general, MySQL will allocate the full size of innodb_buffer_pool_size fairly quickly. Memory allocated there is not available for anything else, including other MySQL things like creation of temporary tables. The variable cannot be changed without a reboot of mysqld, it's a fixed amount.
If you *ever* use 4 GB memory tables, even for one single second, we need to decrease innodb_buffer_pool_size by 4 GB *always*, to avoid the risk of swapping or memory allocations failing. This is, in a very real sense, 1/8 of the server's resources that we would have to reserve for your project. We cannot reduce the buffer pool size only when you need the extra memory; it's always or never.
If MySQL could automatically reduce buffer pool allocations when large temporary tables are needed, then there would be little to no issue, you're right. But that's not the case.
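To illustrate the fixed trade-off (the numbers here are hypothetical, not cassia's actual configuration): the choice is made once in my.cnf and only changes with a mysqld restart.
-- my.cnf, option A: permanently reserve headroom for big MEMORY tables
--   innodb_buffer_pool_size     = 24G
-- my.cnf, option B: give the memory to InnoDB and cap heap tables instead
--   innodb_buffer_pool_size     = 28G
--   maximum-max_heap_table_size = 128M
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';   -- the running value, fixed until restart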