On Sun, Nov 29, 2015 at 8:18 PM, Yeongjin Jang <yeongjinjanggrad@gmail.com> wrote:
I recall seeing a WMF financial statement that lists around $2.3M spent on Internet hosting. I am not sure whether that includes the management cost of computing resources (server clusters such as eqiad) or not.
That's the cost for datacenters, hardware, bandwidth, etc.
I'm not sure whether the following simple calculation works: 117 TB per day, over 365 days, at $0.05 per GB, comes to around $2.2M. Maybe it would be more accurate if I contacted the analytics team directly.
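(For reference, the arithmetic I have in mind is roughly the following; the 117 TB/day and $0.05/GB figures are my own rough inputs, not official numbers.)

    # Back-of-envelope estimate using the figures mentioned above
    # (117 TB/day of traffic and a flat $0.05/GB rate); these inputs
    # are rough assumptions, not numbers from WMF's actual contracts.
    TB_PER_DAY = 117
    DAYS_PER_YEAR = 365
    USD_PER_GB = 0.05

    gb_per_year = TB_PER_DAY * 1000 * DAYS_PER_YEAR   # ~42.7 million GB
    annual_cost = gb_per_year * USD_PER_GB
    print(f"~${annual_cost / 1e6:.1f}M per year")      # ~$2.1M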
That calculation doesn't work because it doesn't take into account peering agreements, or donated (or heavily discounted) transit contracts. Bandwidth is one of the cheaper overall costs.
Something your design doesn't take into account for bandwidth costs is that the world is trending toward mobile, and mobile bandwidth costs are generally very high. It's likely this p2p approach will be many orders of magnitude more expensive than the current approach.
A decentralized approach doesn't benefit from economies of scale. Instead of being able to negotiate transit pricing and eliminate cost through peering, you're externalizing the cost at the consumer rate, which is the highest possible rate.
The other scalability concern would be obscure articles. I haven't really looked at your code, so maybe you cover it, but Wikipedia has over 5 million articles (and a lot more when you count non-content pages). The group of peers is presumably going to have high churn (since they go away when you browse somewhere else). I'd worry that the overhead of keeping track of which peer knows what, especially given how fast the peers change, would be substantial. I also expect that for lots of articles, only a very small number of peers will know them.
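To make the bookkeeping concrete, a naive central index would look something like the sketch below (purely illustrative; I haven't looked at your code, so all the names and structure here are made up):

    from collections import defaultdict

    # Hypothetical naive tracker: maps each article to the set of peers that
    # currently claim to hold a cached copy. Every registration, page view,
    # and peer departure has to touch this structure, so peer churn turns
    # directly into server-side work and memory.
    class NaivePeerIndex:
        def __init__(self):
            self.holders = defaultdict(set)      # article title -> set of peer ids

        def register(self, peer_id, titles):
            for title in titles:
                self.holders[title].add(peer_id)

        def unregister(self, peer_id, titles):
            for title in titles:
                self.holders[title].discard(peer_id)
                if not self.holders[title]:
                    del self.holders[title]      # obscure article: no peer left to serve it

        def lookup(self, title):
            return self.holders.get(title, set())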
That's true. Dynamically registering / un-registering entries in the lookup table imposes high overhead on the servers (in both computation and memory usage). Distributed solutions like a DHT exist, but we think there is a trade-off in lookup time between a centralized architecture (managing the lookup table on the server) and a fully distributed one (DHT).
Our prior "naive" implementation was costly: if each user caches 5K pages (with around 50K images), then with 10K concurrent users it consumes around 35 GB of memory, and each registration incurs about 500 KB of network traffic.
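Working backward from those figures gives a rough sense of the per-entry cost (just a sketch; the per-entry sizes below are implied by the numbers above, not separately measured):

    # Rough check of what the reported figures imply per lookup-table entry.
    # The inputs are the numbers stated above, not new measurements.
    users = 10_000
    pages_per_user = 5_000
    images_per_user = 50_000
    entries_per_user = pages_per_user + images_per_user   # 55,000 cached objects per user
    total_entries = users * entries_per_user               # 550 million entries

    memory_bytes = 35 * 1024**3                            # ~35 GB of server memory reported
    print(memory_bytes / total_entries)                    # ~68 bytes per entry

    register_bytes = 500_000                               # ~500 KB per registration reported
    print(register_bytes / entries_per_user)               # ~9 bytes of traffic per cached object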
We decided that was not practical, so we are now trying to come up with a more lightweight implementation. We hope to have practically meaningful micro-benchmark results for the new implementation.
Just the metadata for articles, images, and revisions is going to be massive. That data itself will need to be distributed too. The network cost associated with lookups alone is going to be quite expensive for peers.
It seems your project assumes that bandwidth is unlimited and unmetered, which for many parts of the world isn't true.
I don't mean to dissuade you. The idea of a p2p Wikipedia is an interesting project, and at some point in the future, if bandwidth is free and unmetered everywhere, this may be a reasonable way to provide access in case of a major disaster affecting Wikipedia itself. This idea has been brought up numerous times in the past, though, and in general the potential gains have never outweighed the latency, cost, and complexity involved.
- Ryan