Moin,
it occurred to me recently that there is a file leak in any extension that creates external files, like graphviz, graph, or possibly even math (which creates PNGs).
It works like this:
Article A contains one <graph> object, called AA. Article B contains two <graph> objects, called BA and BB.
When you edit article A, the following happens:
* the new graph code is sent to the extension
* it is hashed
* the hash is the filename that will be used to generate the file; let's call it "ABCD" for now
* file AB/CD/ABCD is generated and included in the output
(The hashes are used for two reasons: to save a file if two articles contain the same text, and to conveniently generate short, unique file names.)
Likewise for article B, except that BCDE and BCDF are the hashes, so we get BC/DE/BCDE and BC/DF/BCDF as files.
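To make the scheme concrete, here is a rough Python sketch of how such a content-hash-to-path scheme works (not the actual extension code; MD5, the .png suffix and the two-level directory layout are just illustrative assumptions):

  import hashlib
  import os

  def path_for_content(graph_source, base_dir="images/graph"):
      # Only the content goes into the hash -- the article name does not --
      # so identical graphs in different articles share a single file.
      digest = hashlib.md5(graph_source.encode("utf-8")).hexdigest()
      # Spread the files over two levels of subdirectories,
      # e.g. ab/cd/abcdef...png
      return os.path.join(base_dir, digest[0:2], digest[2:4], digest + ".png")

Change the graph source and you get a new digest, hence a new file, while the old one stays behind.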
No problem so far, but now what happens if you edit article A again and change something: a new hash results, like ABXY, and the file AB/XY/ABXY is generated.
Note that the file ABCD is never cleaned up. In fact, it is impossible for the current scheme to clean it up, for the following reasons:
* ABCD could just as well be used by article B, since only the content goes into the hash, not the article name. The file should only be deleted if it is not used by any other article. (If the file ever vanishes, a null edit is necessary to re-generate it!)
* the extension never even sees the old text, or the filenames used on the page, so it cannot simply know which files it could potentially delete
The end effect is that the file cache gets bigger and bigger, and there is no easy way to clean unused files out of it.
Here are a few ideas on how to deal with that:
* periodically clean off all files until you are left with X files (there is at least one extension already doing this). This does not work, since the deletion cannot guarantee that the files left over are really used, nor that the deleted files are no longer used. It's an ugly hack and creates more problems than it solves.
* somehow keep track of all filenames used on all articles (see the rough sketch after this list). Just think of article B: the first edit creates two entries in the table under "B". On the second edit:
  * the first time the extension runs, it cleans table "B" and adds the new hash
  * the second run cleans the table again and adds a new hash
  The problem here is that the extension cannot decide which text to convert is the first one on the page (and thus when to clean the table).
* various other schemes that generate the hash based on the article name plus a per-article unique ID (potentially given by the user creating the text, a la <graph id="1">). These also require some really big table listing which files are in use.
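A rough sketch of that tracking idea (a Python stand-in, not real MediaWiki code; the table and the "parse start" hook are assumptions, and the latter is exactly the missing piece):

  # Hypothetical stand-in for a tracking table (in MediaWiki this would
  # be a real database table): article title -> set of content hashes
  # currently used on that page.
  used_hashes = {}

  def on_page_parse_start(title):
      # The hard part: the extension would need a reliable signal that a
      # fresh parse of the page has begun, so it clears the old entries
      # exactly once per edit and not once per <graph> tag.
      used_hashes[title] = set()

  def on_graph_rendered(title, digest):
      used_hashes.setdefault(title, set()).add(digest)

  def file_still_used(digest):
      # A file may only be deleted if *no* article references it,
      # because identical content in two articles shares one hash.
      return any(digest in hashes for hashes in used_hashes.values())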
The last idea I had is data URLs. These allow embedding the content inline, instead of linking to a file: http://en.wikipedia.org/wiki/Data:_URL
This would work beautifully, except for a few bits:
* we would lose the saving that if articles A and B contain the same text, it is only stored once; with data URLs it would be embedded twice
* the data ends up in MySQL, not on the file system
* it is not supported by IE at all (bummer :-(
* Opera apparently only supports these up to 4K, which is way too little to be practically useful :-(
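For completeness, building such a data URL is trivial; a rough Python sketch (the MIME type is whatever the output format happens to be):

  import base64

  def data_url(image_bytes, mime="image/png"):
      # Embed the generated image inline instead of linking to a file.
      encoded = base64.b64encode(image_bytes).decode("ascii")
      return "data:%s;base64,%s" % (mime, encoded)

  # Note: base64 inflates the data by about a third, so even a modest
  # SVG quickly exceeds Opera's reported 4K limit.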
Anyway, the problem needs to be solved; even my test wiki, which contains only 3 SVG graphs, has already accumulated a thousand little files in images/graph due to the many edits done on these three articles.
Best wishes,
Tels