Moin,
it occurred to me recently that there is a file leak in any extension that creates external files, like graphviz, graph, or possibly even math (which creates PNGs).
It works like this:
Article A contains one <graph> object, called AA. Article B contains two <graph> objects, called BA and BB.
When you edit article A, the following happens:
* the new graph code is sent to the extension
* it is hashed
* the hash is the filename that will be used to generate the file; let's call it "ABCD" for now
* file AB/CD/ABCD is generated and included in the output
(The hashes are used for two reasons: to save a file if two articles contain the same text, and to conveniently generate short, unique file names.)
Likewise for article B, except that BCDE and BCDF are the hashes, so we get BC/DE/BCDE and BC/DF/BCDF as files.
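To make the scheme concrete, here is a rough Python sketch of how such a content-hash-to-path scheme works (not the actual extension code; MD5, the .png suffix and the two-level directory layout are just illustrative assumptions):

  import hashlib
  import os

  def path_for_content(graph_source, base_dir="images/graph"):
      # Only the content goes into the hash -- the article name does not --
      # so identical graphs in different articles share a single file.
      digest = hashlib.md5(graph_source.encode("utf-8")).hexdigest()
      # Spread the files over two levels of subdirectories,
      # e.g. ab/cd/abcdef...png
      return os.path.join(base_dir, digest[0:2], digest[2:4], digest + ".png")

Change the graph source and you get a new digest, hence a new file, while the old one stays behind.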
No problem so far, but now what happens if you edit article A again and change something: a new hash results, like ABXY, and the file AB/XY/ABXY is generated.
Note that the file ABCD is never cleaned up. In fact, it is impossible for the current scheme to clean it up, for the following reasons:
* ABCD could just as well be used by article B, since only the content goes into the hash, not the article name. The file should only be deleted if it is not used by any other article. (If the file ever vanishes, a null edit is necessary to re-generate it!)
* the extension never even sees the old text, or the filenames used on the page, so it cannot simply know which files it could potentially delete
The end effect is that the file cache gets bigger and bigger, and there is no easy way to clean unused files out of it.
Here are a few ideas on how to deal with that:
* periodically clean off all files until you are left with X files (there is at least one extension already doing this). This does not work, since the deletion cannot guarantee that the files left over are really used, nor that the deleted files are no longer used. It's an ugly hack and creates more problems than it solves.
* somehow keep track of all filenames used on all articles (see the rough sketch after this list). Just think of article B: the first edit creates two entries in the table under "B". On the second edit:
  * the first time the extension runs, it cleans table "B" and adds the new hash
  * the second run cleans the table again and adds a new hash
  The problem here is that the extension cannot decide which text to convert is the first one on the page (and thus when to clean the table).
* various other schemes that generate the hash based on the article name plus a per-article unique ID (potentially given by the user creating the text, a la <graph id="1">). These also require some really big table listing which files are in use.
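A rough sketch of that tracking idea (a Python stand-in, not real MediaWiki code; the table and the "parse start" hook are assumptions, and the latter is exactly the missing piece):

  # Hypothetical stand-in for a tracking table (in MediaWiki this would
  # be a real database table): article title -> set of content hashes
  # currently used on that page.
  used_hashes = {}

  def on_page_parse_start(title):
      # The hard part: the extension would need a reliable signal that a
      # fresh parse of the page has begun, so it clears the old entries
      # exactly once per edit and not once per <graph> tag.
      used_hashes[title] = set()

  def on_graph_rendered(title, digest):
      used_hashes.setdefault(title, set()).add(digest)

  def file_still_used(digest):
      # A file may only be deleted if *no* article references it,
      # because identical content in two articles shares one hash.
      return any(digest in hashes for hashes in used_hashes.values())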
The last idea I had is data URLs. These allow embedding the content inline, instead of linking to a file: http://en.wikipedia.org/wiki/Data:_URL
This would work beautifully, except for a few bits:
* we would lose the saving that if articles A and B contain the same text, it is only stored once; with data URLs it would be embedded twice
* the data ends up in MySQL, not on the file system
* it is not supported by IE at all (bummer :-(
* Opera apparently only supports these up to 4K, which is way too little to be practically useful :-(
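For completeness, building such a data URL is trivial; a rough Python sketch (the MIME type is whatever the output format happens to be):

  import base64

  def data_url(image_bytes, mime="image/png"):
      # Embed the generated image inline instead of linking to a file.
      encoded = base64.b64encode(image_bytes).decode("ascii")
      return "data:%s;base64,%s" % (mime, encoded)

  # Note: base64 inflates the data by about a third, so even a modest
  # SVG quickly exceeds Opera's reported 4K limit.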
Anyway, the problem needs to be solved; even my test wiki, which contains only 3 SVG graphs, has already accumulated a thousand little files in images/graph due to the many edits done on these three articles.
Best wishes,
Tels