On Sun, 1 Jun 2003, Brion Vibber wrote:
On Thursday, 29 May 2003 at 02:15, Alfio Puglisi wrote:
http://www.arcetri.astro.it/~puglisi/wiki/dump/ma/main_page.html
Looks nice! Cleaner interface than we have, too. ;)
Since it's static, lots of the dynamic stuff and special pages didn't need to be on the topbar :-)
Letters aren't distributed evenly, alas... If you want to even that out, consider using a binary hash as the basis of the divisions. (We use the first one and two hex digits of the md5 hash of the title/filename for the uploads and the rendered page cache, for instance.) They're not pretty, though.
I'll save this for when/if it becomes a real problem.
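(For the record, a minimal sketch of that kind of hash bucketing, assuming Python and the md5 of the page title; the two directory levels mirror the first hex digits, the rest is arbitrary:)

    import hashlib

    def bucket_path(title):
        # Spread pages over directories by the first hex digits of the md5
        # of the title, instead of by the (unevenly distributed) first letter.
        h = hashlib.md5(title.encode('utf-8')).hexdigest()
        return "%s/%s/%s.html" % (h[0], h[:2], title.replace(' ', '_'))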
- Size: this dump is about 800MB. (tar.gz is just 110MB).
[...]
Single CD would be preferable, of course, though a static HTML dump can target mirror sites which don't have that limitation as well.
The new version of the script (not online yet) produces a dump that, according to Nero, can be written to a 650 MB CD. The main reasons are a smaller HTML template and the elimination of redirects (though they are still present for searches).
- Self-extracting JavaScript. :) I'm sure someone, somewhere has done this; if not, it's worth it for the evil factor: rewrite gunzip in JavaScript, and have the content of the HTML files be a <script> tag with a big string and a call to the gunzip() function pulled in from a common .js file. Downsides are likely crappy performance and an inability to function in non-JavaScript browsers.
So I wasn't the only one thinking about this :) A few days ago I did a little Google search but found nothing. I'm also *sure* that someone has already written this. Again, I'll postpone it to a future version.
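Just to sketch the generation side of it (a Python sketch; gunzip.js and its gunzip() function are the hypothetical client-side part that would still have to be written):

    import base64, gzip, json

    PAGE = ('<html><head><script src="gunzip.js"></script></head><body>\n'
            '<script>document.write(gunzip(%s));</script>\n'
            '</body></html>\n')

    def pack_page(html_text):
        # Gzip the article HTML and embed it as a base64 string; the shared
        # gunzip.js would inflate it and write it into the page at load time.
        blob = base64.b64encode(gzip.compress(html_text.encode('utf-8')))
        return PAGE % json.dumps(blob.decode('ascii'))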
I should point out that the main reason for size bloat is the proliferation of small files. Combined with the 2048-byte cluster size of CD-ROM filesystems, it means that each article uses at least 2K, and an average of 1K is wasted on bigger files. Just counting bytes, the HTML version is around 490 MB. So some way to bundle files together (maybe using frames and #anchors, or now-you-see-it-now-you-don't effects in JavaScript :) could pay off.
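To put a number on that waste, something like this (a Python sketch, assuming the dump is a plain directory tree and the usual 2048-byte ISO 9660 block) compares the raw byte count with the on-disc count:

    import os

    CLUSTER = 2048  # ISO 9660 logical block size

    def dump_sizes(top):
        # Return (raw byte total, total after rounding each file up
        # to a whole number of clusters).
        raw = packed = 0
        for root, dirs, files in os.walk(top):
            for name in files:
                size = os.path.getsize(os.path.join(root, name))
                raw += size
                packed += ((size + CLUSTER - 1) // CLUSTER) * CLUSTER
        return raw, packed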
It may be better to go with something similar to MySQL's fulltext search index: break the titles into words, and associate words with lists of pages that contain them rather than full titles with their page names. Instead of regexping a hundred thousand strings, you'd only need to break the query into words, fetch the lists of pages for _just those words_, and intersect or union the results as desired.
Hmm, this seems neat. Now, how many different words are there in the average Wikipedia dump? :)) And a frameset would be necessary, as in the next option, barring some black-magic communication between JavaScript pages.
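On the dump-generation side it would look something like this, I suppose (a rough Python sketch of the word -> pages index and an AND-style lookup):

    import re
    from collections import defaultdict

    def build_title_index(titles):
        # Map each lowercased word to the set of page titles containing it.
        index = defaultdict(set)
        for title in titles:
            for word in re.findall(r"\w+", title.lower()):
                index[word].add(title)
        return index

    def search(index, query):
        # Intersect the page sets of the query words (AND search).
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        result = set(index.get(words[0], ()))
        for word in words[1:]:
            result &= index.get(word, set())
        return result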
Also, you could break up the index into several smaller files so not all strings need to be loaded into memory. I don't recall JavaScript having an include() command, but in the worst case you could pull some kind of <frameset> or <iframe> thing and bring up the necessary sub-scripts in another frame.
This is what I was thinking of doing as the next step. The include() can be hacked up, but the memory use would still add up to the original size. Some neat frameset should do the trick.
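The splitting itself is easy at dump time; a rough Python sketch (the titleIndex name on the JavaScript side is made up) that writes one small .js file per first letter:

    import json
    from collections import defaultdict

    def write_split_index(index, out_dir):
        # Group words by first letter and write one small .js file per
        # group; the search page then only needs to pull in the shards
        # matching the query, e.g. with a dynamically written <script>
        # tag or a hidden frame.
        shards = defaultdict(dict)
        for word, pages in index.items():
            shards[word[0]][word] = sorted(pages)
        for letter, words in shards.items():
            with open("%s/index_%s.js" % (out_dir, letter), "w") as f:
                f.write("titleIndex['%s'] = %s;\n" % (letter, json.dumps(words)))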
Next version online soon :-)
Ciao, Alfio