Dear colleagues, dear Daniel,
thanks for the detailed explanation!
It looks like the Connectivity project has definitely outgrown single-user status on the Toolserver, and maybe also outgrown the Toolserver structure in general. There is a contradiction between treating it as a single-user tool and its real large-scale nature. (For example, it may be considered equal to all interwiki bots taken together, or even similar to the semantic wiki project.)
It looks like there is a REAL need to find a long-term solution: a complete (or at least major) refactoring of the bot, integrating support into MediaWiki, financing a dedicated server, or something along these lines.
OK. But then I would like to ask you to also consider a short-term solution (say, for two to three months) that would allow the project to function until a long-term solution is implemented.
Maybe Mashiah could also find a way to run Golem with limited functionality while still providing the analysis that is essential for the project?
Of course, I would be happy to discuss Toolserver support opportunities during the chapters' conference in Berlin!
Hello Vladimir
The problem is that Golem uses a very large amount of memory, about 4 GB. That is 1/8 of the total capacity, and that memory cannot be used for normal database operations if it is set aside for the memory tables Golem uses (even when they are not in use). This far exceeds the fair share of resources for each toolserver user.
It was only recently discovered that Golem uses so much memory (because it does so on the database server, not the normal user server), but it is suspected that this is at least one of the causes that triggered system failures in the past. We do not currently see the possibility of allowing individual users to use that much memory, especially not on the database server. Basically, as it is implemented now, Golem is unfit for the toolserver, because it consumes far too many resources.
Earlier today, I asked Mashiah to consider alternative ways to implement the network analysis. I think it would be possible to reduce the memory use by at least a factor of 8. Should this not be possible, Golem would have to run on a dedicated system.
If there are good reasons and sufficient funding, setting aside a VM or even a full server for a special project can be considered. How individual projects and chapters can participate more in the governance (and funding) of the toolserver is one of the topics that will be discussed at the upcoming chapters' conference in April in Berlin. I recommend you bring up the topic of Golem there.
Regards, Daniel
PS: Below I quote my reply to Mashiah.
> Hello Mashiah,
>
>> Connectivity is a property of a graph as a whole; there is no way to
>> analyze it having just a part of all nodes and edges. Using the original
>> tables in the language database, or using MyISAM tables, makes the analysis
>> far too slow. The good thing about memory tables is not only that they are
>> located in memory (which is not always true, of course); the engine itself is
>> optimized for speed and the format is designed to allow that.
>
> If your project requires more resources than are available as your fair share on
> the toolserver, then either the need for resources needs to be reduced, or the
> project has to run elsewhere. If there are good reasons and sufficient funding,
> setting aside a VM or even a full server for a special project can be
> considered. How individual projects and chapters can participate more in the
> governance (and funding) of the toolserver is one of the topics that will be
> discussed at the upcoming chapters' conference in April in Berlin. I suggest you
> contact someone who will attend the meeting, and discuss the issue with them.
>
> Anyway, if using MySQL's memory tables consumes too many resources, perhaps
> consider alternatives? Have you looked at network analysis frameworks like JUNG
> (Java) or SNAP (C++)? Relational databases are not good at managing linked
> structures like trees and graphs anyway.
>
> The memory requirements shouldn't be that huge anyway: two IDs per edge = 8
> bytes. The German language Wikipedia, for instance, has about 13 million links
> in the main namespace; 8*|E| would need about 1 GB even for a naive
> implementation. With a little more effort, it can be nearly halved to
> 4*|E|+4*|V|.
>
> I have used the trivial edge store for analyzing the category structure before,
> and Neil Harris is currently working on a nice standalone implementation of
> this for Wikimedia Germany. This should allow recursive category lookup in
> microseconds.
>
> In any case, something needs to change. You can't expect to be frequently using
> 1/8 of the toolserver's RAM. Even more so since this amount of memory can't be
> used by MySQL for caching while you are not using it (because of the way the
> InnoDB cache pool works).
>
> Regards,
> Daniel
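(Purely as an illustration of the 4*|E|+4*|V| figure quoted above, and not the standalone implementation Neil is working on: a compressed adjacency array in Java could look roughly like the sketch below, assuming page IDs have been remapped to a dense range 0..|V|-1. The class and method names are made up for the example.)

    // Illustrative sketch only (hypothetical names): a static adjacency
    // structure taking roughly 4*|E| + 4*|V| bytes, assuming vertices are
    // numbered densely 0..|V|-1 (e.g. remapped page IDs).
    import java.util.Arrays;
    import java.util.function.IntConsumer;

    public class CompactEdgeStore {
        private final int[] offset; // |V|+1 entries: offset[v]..offset[v+1]-1 index v's out-edges
        private final int[] target; // |E| entries: target vertex of each edge, grouped by source

        // Build from an unsorted edge list src[i] -> dst[i].
        public CompactEdgeStore(int vertexCount, int[] src, int[] dst) {
            offset = new int[vertexCount + 1];
            target = new int[src.length];
            for (int s : src) offset[s + 1]++;                                 // out-degree counts
            for (int v = 0; v < vertexCount; v++) offset[v + 1] += offset[v];  // prefix sums
            int[] cursor = Arrays.copyOf(offset, vertexCount);                 // per-vertex write positions
            for (int i = 0; i < src.length; i++) target[cursor[src[i]]++] = dst[i];
        }

        // Visit all out-neighbours of vertex v, e.g. during a traversal of the link graph.
        public void forEachNeighbour(int v, IntConsumer visit) {
            for (int i = offset[v]; i < offset[v + 1]; i++) visit.accept(target[i]);
        }
    }

Traversals (reachability, connected components, recursive category lookup) then run over two plain int arrays in memory, without involving MySQL at all.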
Vladimir Medeyko wrote:
> Dear colleagues,
>
> I've heard that the Golem bot, which is the heart of the connectivity
> project, stopped functioning due to the recent toolserver reconfiguration.
>
> Is it possible to adjust the configuration specifically for Golem, or to do
> something else to make it function again?
>
> It is especially a pity that the connectivity project has problems now,
> just two days after the project was reported at Konferencija Wikimedia
> Polska and received much interest from the listeners.
>
> What could be done to fix the situation? Thanks!