Brion raised the question the other day about the possibility of storing the images directly in the database. Part or all of the motivation was to make it easy for multiple webservers to serve consistent images on the site in our new configuration.
I've been thinking of other ways to do this.
1. NFS -- all the webservers could have read/write access to an NFS partition, probably on the new machine (a sketch of the setup follows this list). This is easy to set up, but there are questions about the security and stability of NFS, and at least a few years ago it was considered by some to be bad mojo to serve web content off of NFS-mounted partitions -- the performance is bad.
2. AFS - Andrew File System, or DRBD - Distributed Replicated Block Device --- These sound to me like things that hold forth great promise, but I also regard them as fairly esoteric technologies. We're probably better off doing something more boring.
Am I wrong? Too conservative? Not up to date?
3. Apache reverse proxying -- this is a boring and good solution that I'm confident would work well, but it does have some drawbacks. Essentially, the way it works is this: for image uploads and downloads, we use mod_rewrite to transparently reverse proxy the requests to apache running on the backend (database) machine.
One possible drawback is the overhead of apache running on the backend, and the fact that it might become a bottleneck. However, since almost all the backend machine would be doing is serving static requests, and since those could be shunted to something faster than apache, it really wouldn't be all that hard to deal with that bottleneck.
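To make option 3 concrete, the rewrite rule on each frontend would look something like this ("dbmachine.internal" is a made-up name for the backend, and this assumes mod_rewrite and mod_proxy are available):

    # httpd.conf fragment on each frontend webserver
    RewriteEngine On
    # [P] hands the request to mod_proxy instead of serving it locally
    RewriteRule ^/images/(.*)$ http://dbmachine.internal/images/$1 [P]
    # fix up any redirects the backend issues so clients never see it
    ProxyPassReverse /images/ http://dbmachine.internal/images/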
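And for comparison, option 1 amounts to little more than this (machine names and paths are made up):

    # On the new machine, /etc/exports -- give the webservers
    # read/write access, then run "exportfs -ra" to publish it:
    /export/images  web1(rw,sync) web2(rw,sync)

    # On each webserver:
    mount -t nfs dbmachine:/export/images /var/www/images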
--------
Hybrid approaches are possible -- the webservers could NFS mount the /images/ directory from the db machine but only use the NFS mountpoints for writing -- for reads, we'd go through the reverse proxying mechanism.
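Roughly, I'm picturing something like this on each webserver (hostnames and paths are made up):

    # /etc/fstab: mount the db machine's image tree read/write; the
    # upload script writes new files here instead of to a local disk
    dbmachine:/export/images  /mnt/images  nfs  rw,hard,intr  0  0

    # httpd.conf: all /images/ reads are proxied to the backend
    # rather than served off the NFS mount
    RewriteEngine On
    RewriteRule ^/images/(.*)$ http://dbmachine.internal/images/$1 [P]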
--Jimbo
Why not just serve the images from images.wikipedia.org (which I imagine would resolve to the database machine for the moment)? Am I missing some key point that makes something fancier necessary?
Jason
Jason Richey wrote:
Why not just serve the images from images.wikipedia.org (which I imagine would resolve to the database machine for the moment)? Am I missing some key point that makes something fancier necessary?
My inherent love for complexity? ;-)
Hmmm, well, let me think. A fancy-enough reverse proxying system could cache images on the frontend machine, so that we only have to bother the database machine for writes and occasional reads, whereas serving all the images directly from the db machine is going to be a fair amount of traffic.
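With Apache 1.3's mod_proxy, for instance, turning on such a cache should be just a few directives (paths and sizes are guesses):

    # httpd.conf on the frontends: cache proxied responses on local
    # disk so repeat image reads never touch the backend
    CacheRoot "/var/cache/apache/proxy"
    CacheSize 512000          # disk to use for the cache, in KB
    CacheMaxExpire 24         # hold objects at most 24 hours
    CacheGcInterval 4         # garbage-collect every 4 hours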
On the other hand, since images are always going to be different from article data in some important ways, then *whatever* we do, a switch to images.wikipedia.org is probably a very good idea.
--Jimbo
On Mon, Nov 03, 2003 at 09:51:07AM -0800, Jimmy Wales wrote:
On the other hand, since images are always going to be different from article data in some important ways, then *whatever* we do, a switch to images.wikipedia.org is probably a very good idea.
Using images.wikipedia.org, I see one open issue: login cookies are currently set to the fully qualified hostname, e.g. en.wikipedia.org. The browser would not send the cookie to images.wiki[pm]edia.org.
Should we create a unified user database for all wikimedia projects and set the cookie to *.wikipedia.org?
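I.e., the login response would have to change from a host-scoped cookie to a domain-scoped one, something like this (name and value are illustrative):

    # today (host-scoped, so not sent to images.wikipedia.org):
    Set-Cookie: UserID=12345; path=/; domain=en.wikipedia.org
    # proposed (the leading dot covers every *.wikipedia.org host):
    Set-Cookie: UserID=12345; path=/; domain=.wikipedia.org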
Regards,
JeLuF
Why does this matter, though? We don't do anything with images based on somebody's user info, do we?
On Mon, Nov 03, 2003 at 04:48:20PM -0600, Nick Reinking wrote:
Why does this matter, though? We don't do anything with images based on somebody's user info, do we?
We don't allow users to upload an image unless they are logged in. And it's nice to know whom to ask regarding the copyright of an image.
Regards,
JeLuF
Jimmy Wales wrote:
Hybrid approaches are possible -- the webservers could NFS mount the /images/ directory from the db machine but only use the NFS mountpoints for writing -- for reads, we'd go through the reverse proxying mechanism.
--Jimbo
Why not keep the master copies NFS mounted on the database machines, _and_ local copies cached on the web server filesystems?
The same caching algorithm can be used for image files and rendered pages, except that the "rendering" process for image files is a simple copy from the master NFS server if the DB timestamp is more recent than the local file timestamp. In this way, no web serving would need to occur directly from the master.
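In shell terms, the per-file "render" step is just a timestamp check (paths are hypothetical, with the master NFS-mounted at /mnt/master):

    src=/mnt/master/images/foo.jpg
    dst=/var/www/images/foo.jpg
    # refresh the local copy only if the master's copy is newer
    if [ "$src" -nt "$dst" ]; then
        cp -p "$src" "$dst"    # -p keeps the master's timestamp
    fi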
In addition, running rsync periodically on the slaves to sync their image directories with those of the master would have the effect of keeping local copies up-to-date with very low overhead, and without disturbing the caching algorithm described above.
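A crontab entry like this on each slave would do it (hostname and paths are placeholders):

    # pull the image tree from the master every 15 minutes; -a keeps
    # timestamps so the copy-if-newer check still works, and --delete
    # drops local copies of images removed on the master
    */15 * * * *  rsync -a --delete dbmachine:/export/images/ /var/www/images/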
This would also make switching to "disconnected mode" easy, if the central DB / NFS server goes down.
-- Neil