Recently I have read several news reports regarding funding for the Wikimedia Foundation. Regardless of whether the foundation is in need of a little cash, I believe it would be possible to greatly reduce the costs of operation by divvying up some of the server load among volunteers.
This would be done in a way analogous to projects such as SETI@home, where anyone with access to a server could install a client and host data. For instance: as a student in college, I sometimes feel bad for not being able to contribute monetarily to the fundraising campaigns. I do, however, have access to a server that's using roughly 0.5% of its CPU and 1.5% of its allocated bandwidth, and I would be more than willing to contribute those resources if it were possible.
I'm raising this topic from the standpoint of the 'idea' and would be most interested in discussion based around the assumption that it would be trivial to implement such a system. Whether or not that is the case is something to be explored as well; however, I'd rather not get bogged down in implementation before discussing the concept.
If anyone has any suggestions or would be interested in helping please reply. I'm new around here so I'm not exactly sure what the next step should be :)
Hoi, The idea is not exactly new. It is also rather complicated. The idea is cool enough that some real brains are working on it, it will however take its time. In the meantime Wikipedia grows exponentially. This is good in one way; it means we are fulfilling our mission. It is problematic in others: the organisation has to scale in the same way. Thanks, GerardM
Andy Spencer wrote:
Recently I have read several news reports regarding funding for the Wikimedia Foundation. Regardless of whether the foundation is in need of a little cash, I believe it would be possible to greatly reduce the costs of operation by divvying up some of the server load among volunteers.
This would be done in a way analogous to projects such as SETI@home, where anyone with access to a server could install a client and host data. For instance: as a student in college, I sometimes feel bad for not being able to contribute monetarily to the fundraising campaigns. I do, however, have access to a server that's using roughly 0.5% of its CPU and 1.5% of its allocated bandwidth, and I would be more than willing to contribute those resources if it were possible.
I'm raising this topic from the standpoint of the 'idea' and would be most interested in discussion based around the assumption that it would be trivial to implement such a system. Whether or not that is the case is something to be explored as well; however, I'd rather not get bogged down in implementation before discussing the concept.
If anyone has any suggestions or would be interested in helping please reply. I'm new around here so I'm not exactly sure what the next step should be :)
On 2/17/07, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, The idea is not exactly new. It is also rather complicated. The idea is cool enough that some real brains are working on it, it will however take its time.
Who is working on this? Is any of the process public?
Anthony
Anthony wrote:
On 2/17/07, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, The idea is not exactly new. It is also rather complicated. The idea is cool enough that some real brains are working on it, it will however take its time.
Who is working on this? Is any of the process public?
Anthony
Hoi, It is a research project of the Vrije Universiteit Amsterdam. Andrew Tanenbaum is part of that department, and he and the people at the VU do qualify as real brains. The process is, as far as I understand, not public. Thanks, GerardM
Andy Spencer wrote:
This would be done in a way analogous to projects such as SETI@home, where anyone with access to a server could install a client and host data.
There is very little analogy between your suggestion and SETI@home (or Folding@home or distributed.net or any other distributed computing project). Those distribute only CPU usage (and possibly RAM), but not bandwidth usage.
Your idea necessitates that users (who are trying to read an article) would be redirected to some random volunteer computer that is running an HTTP daemon. But what do you do when it goes down? The central server that does the redirecting would take a while to determine that it is down, and until then would continue to redirect requests to it. Wikipedia would become very unreliable.
I do, however, have access to a server that's using roughly 0.5% of its CPU and 1.5% of its allocated bandwidth, and I would be more than willing to contribute those resources if it were possible.
You may consider donating the CPU to a distributed computing project of your choice. As for the bandwidth, I'm sure there are services on the net that are trying hard to find people to mirror their large files (download sites, for example).
Timwi
On 2/17/07, Timwi timwi@gmx.net wrote:
There is very little analogy between your suggestion and SETI@home (or Folding@home or distributed.net or any other distributed computing project). Those distribute only CPU usage (and possibly RAM), but not bandwidth usage.
You're right in that they operate differently, but I think a similar interface would be useful (e.g. you download a program that runs in the system tray or as a daemon).
Your idea necessitates that users (who are trying to read an article) would be redirected to some random volunteer computer that is running an HTTP daemon. But what do you do when it goes down? The central server that does the redirecting would take a while to determine that it is down, and until then would continue to redirect requests to it. Wikipedia would become very unreliable.
Even if the central server had to ping the volunteers after every single request to check status, that would still be a fraction of the bandwidth taken up by sending a wiki page.
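For concreteness, the ping-before-redirect idea can be sketched in a few lines of Python. Everything here is illustrative: the mirror host names are hypothetical, and a real deployment would need caching of health results, authentication of mirrors, and so on. The point is only that a liveness check costs a few hundred bytes, versus tens of kilobytes for a rendered article.

    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical volunteer mirrors; in reality the list would be large and dynamic.
    MIRRORS = ["http://mirror-a.example.org", "http://mirror-b.example.org"]

    def alive(base, path, timeout=0.5):
        # A HEAD request is a few hundred bytes, versus a full page body.
        try:
            req = urllib.request.Request(base + path, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout):
                return True
        except Exception:
            return False

    class Redirector(BaseHTTPRequestHandler):
        def do_GET(self):
            for base in MIRRORS:
                if alive(base, self.path):
                    self.send_response(302)
                    self.send_header("Location", base + self.path)
                    self.end_headers()
                    return
            # No mirror answered in time; a real system would serve the page
            # itself here -- this sketch just reports unavailability.
            self.send_response(503)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), Redirector).serve_forever()

Note that the reader now pays for an extra health check and a redirect before the page even starts loading, which is the latency cost Timwi points out further down the thread.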
Hoi, You assume that a central server will exist. Maybe, but maybe not. Leave that to the guys that implement it. Thanks, GerardM
On 2/18/07, Andy Spencer andy753421@gmail.com wrote:
On 2/17/07, Timwi timwi@gmx.net wrote:
There is very little analogy between your suggestion and SETI@home (or Folding@home or distributed.net or any other distributed computing project). Those distribute only CPU usage (and possibly RAM), but not bandwidth usage.
You're right in that they operate differently, but I think a similar interface would be useful (e.g. you download a program that runs in the system tray or as a daemon).
Your idea necessitates that users (who are trying to read an article) would be redirected to some random volunteer computer that is running an HTTP daemon. But what do you do when it goes down? The central server that does the redirecting would take a while to determine that it is down, and until then would continue to redirect requests to it. Wikipedia would become very unreliable.
Even if the central server had to ping the volunteers after every single request to check status, that would still be a fraction of the bandwidth taken up by sending a wiki page.
On 2/18/07, GerardM gerard.meijssen@gmail.com wrote:
Hoi, You assume that a central server will exist. Maybe, but maybe not. Leave that to the guys that implement it. Thanks, GerardM
It seems to me that the biggest reason this hasn't yet been done is that people are under the mistaken impression that it has to be decentralized.
A completely decentralized peer-to-peer wiki is theoretically possible, but it's kind of one of the holy grails of computing (a decentralized peer-to-peer RDBMS). A highly centralized peer-to-peer wiki which offloads much of the bandwidth and hosting costs away from the central database, well, that's already in place. The cache servers perform that function; they just happen to be run by Wikimedia also.
A decentralized system is great for university research, but a centralized system could be implemented tomorrow. Just stop blocking the live mirrors. Then start working with the mirrors to come up with a way to be more efficient for both parties.
Anthony
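A live mirror in this sense is essentially an external cache: serve a local copy when you have a fresh one, and fetch from the central site only on a miss. A minimal sketch in Python, with a hypothetical upstream host, cache directory and expiry policy, purely to make the idea concrete:

    import hashlib, os, time, urllib.request

    UPSTREAM = "https://en.wikipedia.org"   # the central site
    CACHE_DIR = "/var/cache/wikimirror"     # hypothetical local cache directory
    MAX_AGE = 3600                          # seconds before a copy is considered stale

    def fetch(path):
        key = hashlib.sha1(path.encode()).hexdigest()
        cached = os.path.join(CACHE_DIR, key)
        if os.path.exists(cached) and time.time() - os.path.getmtime(cached) < MAX_AGE:
            with open(cached, "rb") as f:   # cache hit: no traffic to the central site
                return f.read()
        with urllib.request.urlopen(UPSTREAM + path) as resp:  # cache miss: one upstream fetch
            body = resp.read()
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cached, "wb") as f:
            f.write(body)
        return body

Wikimedia's own Squid caches do exactly this job, just from inside the cluster, which is Anthony's point above.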
On 2/18/07, Anthony wikitech@inbox.org wrote:
Just stop blocking the live mirrors. Then start working with the mirrors to come up with a way to be more efficient for both parties.
In an exchange with Gregory Maxwell it came up that I haven't actually asked for permission to do this. So just to be sure I'm posting this request here:
I'd like to run a live mirror of Wikipedia without any advertising. I'd put the live pages into the robots.txt file so search engines would only be allowed to access my local cache, and thus wouldn't cause any extra traffic for Wikimedia. So can I have permission to do this?
Anthony
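The robots.txt part of the request is the easy bit. Assuming, purely for illustration, that the mirror serves live-proxied pages under a /live/ prefix and its own cached copies elsewhere, a couple of lines would suffice to keep crawlers off the pages that generate upstream traffic:

    User-agent: *
    Disallow: /live/

Well-behaved crawlers would then index only the cached copies.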
Anthony wrote:
I'd like to run a live mirror of Wikipedia without any advertising. I'd put the live pages into the robots.txt file so search engines would only be allowed to access my local cache, and thus wouldn't cause any extra traffic for Wikimedia. So can I have permission to do this?
Anthony
As long as you don't hit the Wikipedia servers when crawling to populate your cache. Get the dumps from http://download.wikimedia.org/
Remember to state that the content is under GFDL and to credit the authors.
Hoi, A live mirror is not based on using dumps. A live mirror wants to be updated either by crawling in real time or by using the RSS feed to keep the system up to date. Thanks, GerardM
On 2/18/07, Platonides Platonides@gmail.com wrote:
Anthony wrote:
I'd like to run a live mirror of Wikipedia without any advertising. I'd put the live pages into the robots.txt file so search engines would only be allowed to access my local cache, and thus wouldn't cause any extra traffic for Wikimedia. So can I have permission to do this?
Anthony
As long as you don't hit the Wikipedia servers when crawling to populate your cache. Get the dumps from http://download.wikimedia.org/
Remember to state that the content is under GFDL and to credit the authors.
GerardM wrote:
Hoi, A live mirror is not based on using dumps. A live mirror wants to be updated either by crawling in real time or by using the RSS feed to keep the system up to date. Thanks, GerardM
I know what a 'live mirror' is. What I don't know is how live it needs to be for Anthony. And not everyone is able to set sensible update intervals. Probably only the server admins know what is acceptable and what is not at any given moment.
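For what a feed-driven update loop might look like: MediaWiki exposes recent changes as an RSS/Atom feed, and a mirror could poll it and refresh only the pages that actually changed. A rough sketch in Python; the feed URL and the one-minute polling interval are assumptions for illustration, and a polite mirror would agree the interval with the server admins, as Platonides says:

    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed feed URL; Special:RecentChanges can be requested as an Atom feed.
    FEED_URL = "https://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=atom"
    ATOM = "{http://www.w3.org/2005/Atom}"

    def changed_titles():
        with urllib.request.urlopen(FEED_URL) as resp:
            tree = ET.parse(resp)
        return {entry.find(ATOM + "title").text for entry in tree.iter(ATOM + "entry")}

    seen = set()
    while True:
        current = changed_titles()
        for title in current - seen:
            print("would refresh local copy of:", title)  # re-fetch / re-render here
        seen |= current
        time.sleep(60)  # polling interval: an assumption, to be agreed with the admins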
Andy Spencer wrote:
Even if the central server had to ping the volunteers after every single request to check status, that would still be a fraction of the bandwidth taken up by sending a wiki page.
It would use less bandwidth, yes, but it would take much longer to serve a page.
Timwi
Timwi wrote:
Andy Spencer wrote:
Even if the central server had to ping the volunteers after every single request to check status, that would still be a fraction of the bandwidth taken up by sending a wiki page.
It would use less bandwidth, yes, but it would take much longer to serve a page.
Timwi
Hoi, More relevantly, it costs a fraction as much. Given that we grow exponentially and given that we do not have a healthy balance sheet, this is extremely relevant. When the nodes that serve user requests are close to the user, it will also mean that the Internet as a whole will have less traffic. This is particularly relevant in countries where access to the international backbone is a limiting factor.
With a decentralised infrastructure it is a sound strategy to have trusted nodes in many countries. This will alleviate the problem even more.
Thanks, GerardM
Gerard Meijssen wrote:
Hoi, More relevantly, it costs a fraction as much. Given that we grow exponentially and given that we do not have a healthy balance sheet, this is extremely relevant. When the nodes that serve user requests are close to the user, it will also mean that the Internet as a whole will have less traffic. This is particularly relevant in countries where access to the international backbone is a limiting factor.
With a decentralised infrastructure it is a sound strategy to have trusted nodes in many countries. This will alleviate the problem even more.
http://meta.wikimedia.org/wiki/Reducing_transit_requirements
-- Tim Starling
Gerard Meijssen wrote:
Timwi wrote:
It would use less bandwidth, yes, but it would take much longer to serve a page.
Hoi, More relevantly, it costs a fraction as much.
Which is not something an end-user will see. If end-users see it become noticeably slower, they'll complain, full stop. "It's cheaper for us!" is not a good excuse in the minds of most consumers.
When the nodes that serve user requests are close to the user,
What good is a node that is close to the user if it's dead?
Timwi
Hoi, Have you been paying attention lately; our costs and our traffic are growing exponentially. We do not have a rosy balance sheet as it is. Users will be extremely unhappy when we are not able to continue to provide our service. They are used to our service not being the fastest around.
Why would a node that is close by be dead ??
NB when our customers find the argument "it's cheaper for us" not a good one, they have to realise that they are not paying customers. TANSTAAFL
Thanks, GerardM
Timwi wrote:
Gerard Meijssen wrote:
Timwi wrote:
It would use less bandwidth, yes, but it would take much longer to serve a page.
Hoi, More relevantly, it costs a fraction as much.
Which is not something an end-user will see. If end-users see it become noticeably slower, they'll complain, full stop. "It's cheaper for us!" is not a good excuse in the minds of most consumers.
When the nodes that serve user requests are close to the user,
What good is a node that is close to the user if it's dead?
Timwi
On 2/18/07, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, Have you been paying attention lately; our costs and our traffic are growing exponentially. We do not have a rosy balance sheet as it is. Users will be extremely unhappy when we are not able to continue to provide our service.
Costs are growing exponentially, but so is income. Wikipedia has enough money to continue operations. It's not going to disappear due to hardware costs, as a board member has recently stated: http://lists.wikimedia.org/pipermail/foundation-l/2007-February/027762.html. Notice the emphasis on needing to get more money, not needing to use cheaper hardware.
If you think that a SETI@home-style computing model for a high-load webpage-serving application is practicable, could you point to a single example of anyone pulling this off successfully? Specifically, concerns over trustworthiness (how do we stop agents from putting ads or other content on their copies?) and latency (needs to be routed through extra servers) appear insuperable.
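On the trustworthiness point, the mitigation that usually comes up in these discussions (whether the VU project does it this way is unknown to me) is to keep the central site authoritative for a digest of each rendered page, so that anyone can detect a volunteer copy that has had ads or other content injected. A minimal sketch, with hypothetical URLs and a hypothetical central digest service:

    import hashlib
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def mirror_is_honest(article):
        # Hypothetical central service returning the expected SHA-256 as plain text.
        expected = fetch("https://digests.example.org/" + article).decode().strip()
        # Fetch the same article from a hypothetical volunteer mirror and hash it.
        actual = hashlib.sha256(fetch("https://mirror-a.example.org/wiki/" + article)).hexdigest()
        return actual == expected  # any injected ad or altered text changes the hash

It does nothing about the latency objection, which is the harder of the two.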
Hoi, A research project is a research project because it helps to learn something new. I support the VU project because it may bring us an alternate way of providing information in a reliable way. Your argument is that there is no example yet. My argument is that this research may prove how to do exactly what has never been done before.
When you read the VU paper, you will find that security is addressed. You will also find how the distribution of content is modelled. I know that the VU will use a GRID to do the simulation of traffic. These guys have the tools to do a decent job !!
When you state that we do not need to use cheaper hardware, it does not at all mean that we have a healthy balance sheet. Our auditors indicate that we should have a reserve of a specific size; we do not have it. We expect that our growth will continue unabated; our efforts to get more money will have to be in line with these expectations. No, your assessment that our income is satisfactory is wrong. There are costs other than hardware. I think we are asking too much from people like Anthere and the other board members; remember they are volunteers and I would not be surprised if it is like a full-time job for them.
Thanks, GerardM
Simetrical wrote:
On 2/18/07, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, Have you been paying attention lately; our costs and our traffic are growing exponentially. We do not have a rosy balance sheet as it is. Users will be extremely unhappy when we are not able to continue to provide our service.
Costs are growing exponentially, but so is income. Wikipedia has enough money to continue operations. It's not going to disappear due to hardware costs, as a board member has recently stated: http://lists.wikimedia.org/pipermail/foundation-l/2007-February/027762.html. Notice the emphasis on needing to get more money, not needing to use cheaper hardware.
If you think that a SETI@home-style computing model for a high-load webpage-serving application is practicable, could you point to a single example of anyone pulling this off successfully? Specifically, concerns over trustworthiness (how do we stop agents from putting ads or other content on their copies?) and latency (needs to be routed through extra servers) appear insuperable.
Hi!
Have you been paying attention lately; our costs and our traffic are growing exponentially.
Not forever.
Users will be extremely unhappy when we are not able to continue to provide our service.
We can continue the service at our current service levels.
They are used to our service not being the fastest around.
Of course, and Pompeii is still visited by tourists too.
Why would a node that is close by be dead ??
It is HTTP. It is designed to throw error messages ASAP.
NB when our customers find the argument "it's cheaper for us" not a good one, they have to realise that they are not paying customers. TANSTAAFL
They pay attention.
Have you been paying attention lately; our costs and our traffic are growing exponentially.
Not forever.
I'm kind of assuming that at some point there will be YouTube-style videos embedded in many pages - e.g. for [[Hummingbird]], there might be a video which when you click play, will show you a hummingbird flying, and then again slowed down to show the mechanics of its hovering flight; And for [[Windsor Castle]] you might get a virtual tour of the highlights of the castle; for [[Vladimir Putin]] you might get a subtitled excerpt of a speech he's made; and so forth.
That'd probably impact on someone's bandwidth bill in a big way, although it's in keeping with the "media" in "MediaWiki", and certainly in keeping with an encyclopedia for the 21st century.
All the best, Nick.
On 19/02/07, Nick Jenkins nickpj@gmail.com wrote:
I'm kind of assuming that at some point there will be YouTube-style videos embedded in many pages - e.g. for [[Hummingbird]], there might be a video which when you click play, will show you a hummingbird flying, and then again slowed down to show the mechanics of its hovering flight; And for [[Windsor Castle]] you might get a virtual tour of the highlights of the castle; for [[Vladimir Putin]] you might get a subtitled excerpt of a speech he's made; and so forth.
Shame the work on the plugins that would have made this a reality sooner has been temporarily suspended.
Rob Church
On 2/18/07, Domas Mituzas midom.lists@gmail.com wrote:
Hi!
Have you been paying attention lately; our costs and our traffic are growing exponentially.
Not forever.
Of course not. But will they stop growing due to lack of funding, or due to the fact that the goals have been met and everyone in the world has access to the sum of all knowledge?
What percentage of the world is served by Wikipedia today? What percentage of all knowledge is in the encyclopedia? Multiply by the reciprocals, and how much would the yearly costs be?
I see from Alexa the reach is 5% of Internet users. 16.6% of the world is on the Internet. I'm going to guess Wikipedia covers 1% of what it should. That's a major lowball estimate, though.
So 1/.05/.166/.01=12,048. Will Wikimedia ever be able to raise $12 billion a year? If not, then cutting costs is mandatory in order to reach the goals. My guess is no. $12 billion a year is way too much money to be passing through a non-governmental organization like Wikimedia.
And from your own comments we're talking about a problem that's going to be very difficult to solve. I'm not saying that everyone who is doing anything related to Wikimedia needs to drop everything else and work on this. But some people need to be considering it. In other words, I think we do "benefit from the flamefests on this list every few months", though obviously not literally from those parts of the discussion which are simply flames.
Anthony
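Spelled out, Anthony's back-of-envelope figure is just a product of reciprocals; the roughly one-million-dollar current yearly cost is an assumption implied by his twelve-billion-dollar conclusion, not a published number:

    reach = 0.05        # fraction of Internet users reached (Alexa figure above)
    online = 0.166      # fraction of the world that is on the Internet
    coverage = 0.01     # guessed fraction of "all knowledge" currently covered
    current_cost = 1e6  # assumed current yearly cost in dollars (implied, not stated)

    scale = 1 / reach / online / coverage      # about 12,048
    print(round(scale), scale * current_cost)  # ~12048 and ~1.2e10, i.e. ~$12 billion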
Hi!
Not forever.
Of course not. But will they stop growing due to lack of funding, or due to the fact that the goals have been met and everyone in the world has access to the sum of all knowledge?
I wasn't saying "Wikipedia will stop growing". It was more about the word "exponentially" that everybody is happy to attach to the debate. There're limiting factors, such as "everyone in the world", "all knowledge", "enough funding", etc., which are quite philosophical and outside the scope of this list.
Anyway, as for distributed content hosting ideas, we haven't been ignoring them all that time. Though, we had to test and reject quite a few of them.
A very interesting example is Joost (http://www.wired.com/wired/archive/15.02/trouble.html) - the article explains distributed media storage in popular terms, but in summary, they still have to handle all the long tail content on their storage environment, and p2p wins are mostly for the very fresh and very popular (and huge) content. Their system is being built on the most mature p2p platform out there, and still does not seem to solve everything. And yes, they need a special client.
As for image (or any other) hosting - WMF would have no power at all over privacy policy then - unless the whole world agreed to turn off Referer [sic] request headers. Of course, that aside, there're quite some other issues with efficiency, and it is mostly about 'reducing costs' rather than 'improving user experience'.
It would be much easier just to have someone donate a few gigabits of IP transit ;-)
On 2/19/07, Domas Mituzas midom.lists@gmail.com wrote:
As for image (or any other) hosting - WMF would have no power at all over privacy policy then - unless the whole world agreed to turn off Referer [sic] request headers.
Actually, there are plenty of ways to refer people to images without leaking the article title. I'm not sure what the point would be, though. Which pages you're viewing is pretty obvious based on what images you're loading.
For my part, I don't think the WMF *should* have any power over privacy policy. The end users should have that power. If they want to browse anonymously, there are plenty of tools out there to do that. If they just want to stop giving referer information, there are plenty of tools for that too.
Of course, that aside, there're quite some other issues with efficiency, and it is mostly about 'reducing costs' rather than 'improving user experience'.
Reducing costs and improving user experience are fairly synonymous. If you can cut costs, then you can either spend the extra money improving user experience or you can have fewer or less obnoxious fundraising drives.
It would be much easier just to have someone donate a few gigabits of IP transit ;-)
Yup.
Anthony
Hi!
For my part, I don't think the WMF *should* have any power over privacy policy. The end users should have that power. If they want to browse anonymously, there are plenty of tools out there to do that. If they just want to stop giving referer information, there are plenty of tools for that too.
Yay, let's just give away all logs to the public, everyone will be happy, and the ones concerned about privacy will be able to use Tor. Or some anonymous proxy. Why should WMF have power over that? Because the potential target for the privacy violation attacks is the one who doesn't know about the possibility.
Anyway, this is the wrong place to discuss privacy policy. I just mention that there're technical issues where we'd fail to comply with it.
Reducing costs and improving user experience are fairly synonymous. If you can cut costs, then you can either spend the extra money improving user experience or you can have fewer or less obnoxious fundraising drives.
Of course, instead of buying a luxury car, you can buy two cheaper ones and drive both at the same time ;-) Now there're bits of experience which are not completely synonymous with reduced costs. That means being more up than down, getting higher-quality images, faster response times, etc. Every decision like that has a cost. I guess we should hire some consultants to do a cost/benefit analysis for us, then we could give that to the board to decide on. ;-)
On 2/19/07, Domas Mituzas midom.lists@gmail.com wrote:
Hi!
For my part, I don't think the WMF *should* have any power over privacy policy. The end users should have that power. If they want to browse anonymously, there are plenty of tools out there to do that. If they just want to stop giving referer information, there are plenty of tools for that too.
Yay, let's just give away all logs to the public, everyone will be happy, and the ones concerned about privacy will be able to use Tor. Or some anonymous proxy. Why should WMF have power over that? Because the potential target for the privacy violation attacks is the one who doesn't know about the possibility.
Anyway, this is the wrong place to discuss privacy policy. I just mention that there're technical issues where we'd fail to comply with it.
Well, I think you're completely oversimplifying things and either missing or completely ignoring solutions to the problems you bring up.
Your response is a strawman. But since you don't want to discuss it, I'll leave it at that.
Anthony
On 2/19/07, Anthony wikitech@inbox.org wrote:
On 2/19/07, Domas Mituzas midom.lists@gmail.com wrote:
As for image (or any other) hosting - WMF would have no power at all over privacy policy then - unless the whole world agreed to turn off Referer [sic] request headers.
Actually, there are plenty of ways to refer people to images without leaking the article title. I'm not sure what the point would be, though. Which pages you're viewing is pretty obvious based on what images you're loading.
And how do you propose we hide what pages an image is used on? Switch them up randomly?
On 19/02/07, Gregory Maxwell gmaxwell@gmail.com wrote:
And how do you propose we hide what pages an image is used on? Switch them up randomly?
Kid of five views [[Teletubbies]].
Tinky winky, Dipsy, La-la, OH MY GOD, IT'S GOATSE!
Rob Church
On 2/19/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 2/19/07, Anthony wikitech@inbox.org wrote:
On 2/19/07, Domas Mituzas midom.lists@gmail.com wrote:
As for image (or any other) hosting - WMF would have no power at all over privacy policy then - unless the whole world agreed to turn off Referer [sic] request headers.
Actually, there are plenty of ways to refer people to images without leaking the article title. I'm not sure what the point would be, though. Which pages you're viewing is pretty obvious based on what images you're loading.
And how do you propose we hide what pages an image is used on? Switch them up randomly?
No, what I said was precisely that you *can't* hide what pages an image is used on. You could hide the referrer using various tricks (frames, redirects, javascript, etc), but doing so wouldn't accomplish anything.
Anthony
Anthony wrote:
On 2/19/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 2/19/07, Anthony wikitech@inbox.org wrote:
On 2/19/07, Domas Mituzas midom.lists@gmail.com wrote:
As for image (or any other) hosting - WMF would have no power at all over privacy policy then - unless the whole world agreed to turn off Referer [sic] request headers.
Actually, there are plenty of ways to refer people to images without leaking the article title. I'm not sure what the point would be, though. Which pages you're viewing is pretty obvious based on what images you're loading.
And how do you propose we hide what pages an image is used on? Switch them up randomly?
No, what I said was precisely that you *can't* hide what pages an image is used on. You could hide the referrer using various tricks (frames, redirects, javascript, etc), but doing so wouldn't accomplish anything.
Or if you're desperate you can just download the archive/dump and search it for that image. ;)
Boris
Domas Mituzas wrote:
As for image (or any other) hosting - WMF would have no power at all over privacy policy then - unless the whole world agreed to turn off Referer [sic] request headers. Of course, that aside, there're quite some other issues with efficiency, and it is mostly about 'reducing costs' rather than 'improving user experience'.
You could get them by FTP. Still, there's the leak of the IP address and which image you're seeing (most have an associated area). And some clients may reveal the email. And there's still the problem of faking the content: 'I asked for the POTD, and got the "goatse" image'. Let's use the ed2k protocol for images :D
It would be much easier just to have someone donate a few gigabits of IP transit ;-)
Not everybody can give a few gigas of IP transit ;)
Gerard Meijssen wrote:
Users will be extremely unhappy when we are not able to continue to provide our service.
Users will be somewhat unhappy if all of Wikipedia goes down and never comes back. Users will be *much more* unhappy if Wikipedia continues to operate but is extremely slow, unreliable, and often down.
Why would a node that is close by be dead ??
Because people tend to turn off their own computers whenever they want to. And because computers sometimes crash, or Internet connections go down.
NB when our customers find the argument "it's cheaper for us" not a good one, they have to realise that they are not paying customers.
(Note that I didn't use the word "customer".)
Timwi
Users will be somewhat unhappy if all of Wikipedia goes down and never comes back. Users will be *much more* unhappy if Wikipedia continues to operate but is extremely slow, unreliable, and often down.
"Somewhat unhappy"? Come on dude. I mean, seriously.
On 2/18/07, Timwi timwi@gmx.net wrote:
Gerard Meijssen wrote:
Users will be extremely unhappy when we are not able to continue to provide our service.
Users will be somewhat unhappy if all of Wikipedia goes down and never comes back. Users will be *much more* unhappy if Wikipedia continues to operate but is extremely slow, unreliable, and often down.
Why would a node that is close by be dead ??
Because people tend to turn off their own computers whenever they want to. And because computers sometimes crash, or Internet connections go down.
NB when our customers find the argument "it's cheaper for us" not a good one, they have to realise that they are not paying customers.
(Note that I didn't use the word "customer".)
Timwi
2007/2/18, Jim Wilson wilson.jim.r@gmail.com:
Users will be somewhat unhappy if all of Wikipedia goes down and never comes back. Users will be *much more* unhappy if Wikipedia continues to operate but is extremely slow, unreliable, and often down.
"Somewhat unhappy"? Come on dude. I mean, seriously.
Yeah, just somewhat unhappy. It's a problem, but also a chance. The great thing about having this project copyleft as it is, is that if Wikipedia were to go down, anyone could take the content and start it again. And I'm quite convinced that if it were now announced that Wikimedia might have to close down in three months, before those three months were over there would be quite a number of people and groups doing exactly that. And where we now have one big Wikipedia, we would then have a multitude of small ones. Which has both advantages and disadvantages.
Timwi wrote:
Why would a node that is close by be dead ??
Because people tend to turn off their own computers whenever they want to. And because computers sometimes crash, or Internet connections go down.
The redirection could be done within a frame. If the mirror answers the page in less than x seconds, the page will load and a frame-breaker will destroy it. If not, the second frame can switch to the next server. Worst case for the server: it is busy answering requests for which, by the time it has the answer, the client is already asking another server. :/
The redirection could be done within a frame. If the mirror answers the page in less than x seconds, the page will load and a frame-breaker will destroy it. If not, the second frame can switch to the next server.
HTTP?
Hi!
I'm new around here so I'm not exactly sure what the next step should be :)
You are forgiven. :-) Anyway, I'll repeat an excerpt from my old email message:
Serving a wiki isn't hosting an .iso file, where of course bandwidth is the main cost, and it is easy to offload. ISO files don't change, and people don't care about how fast they start getting the ISO file, because the transfer is long enough to forget all startup costs.

Serving a wiki isn't looking for aliens. If someone turns off the computer, or the DSL goes down, the aliens won't disappear, but the request will. Nobody really cares about an individual packet containing alien information, because it is sent to multiple nodes. Some will reply, some won't.

Serving a wiki isn't serving a personal website. It is not a single person editing; there's a great deal of conflict resolution, possible race conditions, versioning and metadata information.

Serving a wiki isn't serving a conventional media website, because it is far more organic in terms of load pattern evolution, or accidental surges. Content formats also come bottom->up, requiring agile development of systems.

Serving a wiki means delivering user-contributed content that thousands collaborated on in a few tens of milliseconds. We do succeed in this mission, and every time we increased the responsiveness of the site, we had more users coming.
On 18/02/07, Domas Mituzas midom.lists@gmail.com wrote:
Serving a wiki isn't hosting an .iso file, where of course bandwidth is the main cost, and it is easy to offload. ISO files don't change, and people don't care about how fast they start getting the ISO file, because the transfer is long enough to forget all startup costs.

Serving a wiki isn't looking for aliens. If someone turns off the computer, or the DSL goes down, the aliens won't disappear, but the request will. Nobody really cares about an individual packet containing alien information, because it is sent to multiple nodes. Some will reply, some won't.

Serving a wiki isn't serving a personal website. It is not a single person editing; there's a great deal of conflict resolution, possible race conditions, versioning and metadata information.

Serving a wiki isn't serving a conventional media website, because it is far more organic in terms of load pattern evolution, or accidental surges. Content formats also come bottom->up, requiring agile development of systems.

Serving a wiki means delivering user-contributed content that thousands collaborated on in a few tens of milliseconds. We do succeed in this mission, and every time we increased the responsiveness of the site, we had more users coming.
Fucking *excellent* post, Domas.
Rob Church
Hoi, Domas describes the status quo. He does describe it well. It does however not detract one iota from the usefulness of doing this research. Mechanisms are being developed that may work at a fraction of our current (i.e. WMF) cost; for the WMF it is irresponsible to be against such a research project. It does not matter if you think the VU will succeed or not; what matters is that serious effort is put into this endeavour. Just wait and watch what will transpire when it does.
It is not as if there is no need to maintain and improve our current code. It is not as if this project will be finished at the end of the year. Domas is right in that nothing changes for now.
It is however a really relevant project and I believe we should cheer them on for trying this in the first place. Thanks, GerardM
Rob Church wrote:
On 18/02/07, Domas Mituzas midom.lists@gmail.com wrote:
Serving a wiki isn't hosting an .iso file, where of course bandwidth is the main cost, and it is easy to offload. ISO files don't change, and people don't care about how fast they start getting the ISO file, because the transfer is long enough to forget all startup costs.

Serving a wiki isn't looking for aliens. If someone turns off the computer, or the DSL goes down, the aliens won't disappear, but the request will. Nobody really cares about an individual packet containing alien information, because it is sent to multiple nodes. Some will reply, some won't.

Serving a wiki isn't serving a personal website. It is not a single person editing; there's a great deal of conflict resolution, possible race conditions, versioning and metadata information.

Serving a wiki isn't serving a conventional media website, because it is far more organic in terms of load pattern evolution, or accidental surges. Content formats also come bottom->up, requiring agile development of systems.

Serving a wiki means delivering user-contributed content that thousands collaborated on in a few tens of milliseconds. We do succeed in this mission, and every time we increased the responsiveness of the site, we had more users coming.
Fucking *excellent* post, Domas.
Rob Church
Gerard Meijssen wrote:
Domas describes the status quo. He does describe it well. It does however not detract one iota from the usefulness of doing this research. Mechanisms are being developed that may work at a fraction of our current (i.e. WMF) cost; for the WMF it is irresponsible to be against such a research project. It does not matter if you think the VU will succeed or not; what matters is that serious effort is put into this endeavour. Just wait and watch what will transpire when it does.
Research projects are great; if it goes somewhere eventually that's super, and if it doesn't that's fine too. :)
Our own resources have to be invested in managing what we know works; we don't benefit from the flamefests on this list every few months when someone hears about SETI@home or BitTorrent and thinks it'd be easy to apply the principle to a wiki so why aren't we doing it we must be incompetent or wasting money OMG! ;)
Good distributed hosting for large numbers of small, fast-changing objects like a wiki is not what we might call a "solved problem". If it's feasible at all, that's something we should leave to researchers better versed in the field for now.
In the *foreseeable* future, we expect the primary web site to continue to work much as it does now, with central servers and some limited distribution through centrally-administered proxy caching systems.
We can be more aggressive on other parts of the system, though.
Bulk downloads like data dumps could be done over BitTorrent, but they're not really a significant overall resource drain.
Media files are half our bandwidth, so that is an area from which we can see gains.
Once media storage has been rearranged for better versioning stability (as already specced out) it can be much more aggressively cached, perhaps through content distribution networks such as Coral in addition to more traditional centrally-administered proxy caches.
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Gerard Meijssen wrote:
Domas describes the status quo. He does describe it well. It does however not detract one iota from the usefulness of doing this research.
It is a great piece of research. However, is it Wikipedia's or WMF's thing to do this? It seems like a generic web component, almost like Apache, the PHP programming language or the Squid proxy server. If the fully distributed web server architecture was a really good idea, many kinds of websites could find use for it and someone else might already have implemented it. Even if this technology existed, it isn't clear that it should be deployed at WMF in Florida, but perhaps instead at Kennisnet in Amsterdam, as a way to offload the Squid servers.
Rather than discussing this on wikitech-l, perhaps you should take the idea to people who develop Apache and Squid? Then when a working prototype exists, perhaps it can be tried and evaluated for some part of Wikipedia. Since I don't know of any existing technology today, it seems to be at least several years into the future before it can help to reduce WMF's bandwidth costs.
On 2/19/07, Lars Aronsson lars@aronsson.se wrote:
It is a great piece of research. However, is it Wikipedia's or WMF's thing to do this? It seems like a generic web component, almost like Apache, the PHP programming language or the Squid proxy server. If the fully distributed web server architecture was a really good idea, many kinds of websites could find use for it and someone else might already have implemented it. Even if
WMF is unique in that its bandwidth requirements are astronomical compared to its income. There are plenty of high-bandwidth sites. And plenty of low-budget sites. But there's nothing that comes close to Wikipedia in the proportion of the two. So if the (probably not that critical) problem is "how to host a huge amount of highly-requested content on a shoestring budget", it's not surprising that no one has attempted to solve it before.
Steve
On 19/02/07, Steve Bennett stevagewp@gmail.com wrote:
On 2/19/07, Lars Aronsson lars@aronsson.se wrote:
It is a great piece of research. However, is it Wikipedia's or WMF's thing to do this? It seems like a generic web component, almost like Apache, the PHP programming language or the Squid proxy server. If the fully distributed web server architecture was a really good idea, many kinds of websites could find use for it and someone else might already have implemented it. Even if
WMF is unique in that its bandwidth requirements are astronomical compared to its income. There are plenty of high-bandwidth sites. And plenty of low-budget sites. But there's nothing that comes close to Wikipedia in the proportion of the two. So if the (probably not that critical) problem is "how to host a huge amount of highly-requested content on a shoestring budget", it's not surprising that no one has attempted to solve it before.
LiveJournal is a slightly comparable site - a commercial site, but not terribly rich before the buyout. They developed useful toys like memcached because of their unique circumstances.
If it's going to go into Apache or whatever, it'll probably have to be us or people that love us that do it. Scratching that itch.
- d.