On Sun, Nov 29, 2015 at 8:18 PM, Yeongjin Jang <yeongjinjanggrad(a)gmail.com>
wrote:
I recall seeing a WMF financial statement stating that around $2.3M
was spent on Internet Hosting. I am not sure whether it includes the
management cost for computing resources
(server clusters such as eqiad) or not.
That's the cost for datacenters, hardware, bandwidth, etc.
I'm not sure whether this simple calculation works:
117 TB per day, for 365 days, at $0.05 per GB, comes to around $2.1M.
Maybe it would be more accurate if I contacted the analytics team directly.
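A quick sketch of that back-of-envelope arithmetic (the $0.05/GB transit price is the thread's assumption, not a known WMF rate):

```python
# Back-of-envelope check of the bandwidth cost estimate above.
TB_PER_DAY = 117
GB_PER_TB = 1000          # decimal units, as bandwidth is typically billed
DAYS = 365
PRICE_PER_GB = 0.05       # USD; assumed list price, not an actual contract rate

annual_gb = TB_PER_DAY * GB_PER_TB * DAYS
annual_cost = annual_gb * PRICE_PER_GB
print(f"~${annual_cost / 1e6:.2f}M per year")  # ~$2.14M
```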
That calculation doesn't work because it doesn't take into account peering
agreements, or donated (or heavily discounted) transit contracts. Bandwidth
is one of the cheaper overall costs.
Something your design doesn't take into account for bandwidth costs is that
the world is trending toward mobile, and mobile bandwidth costs are generally
very high. It's likely this p2p approach will be many orders of magnitude
more expensive than the current approach.
A decentralized approach doesn't benefit from economies of scale.
Instead of being able to negotiate transit pricing and eliminate cost
through peering, you're externalizing the cost at the consumer rate, which
is the highest possible rate.
The other scalability concern would be obscure articles. I haven't
really looked at your code, so maybe you cover it, but Wikipedia has over
5 million articles (and a lot more when you count non-content pages). The
group of peers is presumably going to have high churn (since peers go away
when you browse somewhere else). I'd worry that the overhead of keeping
track of which peer knows what will be substantial, especially given how
fast the set of peers changes. I also expect that for lots of articles,
only a very small number of peers will have them.
That's true. Dynamically registering and un-registering entries in the
lookup table imposes high overhead on the servers (in both computation and
memory usage). Distributed solutions such as a DHT exist, but we think
there could be a trade-off in lookup time between a centralized design
(managing the lookup table on the server) and a fully distributed
architecture (a DHT).
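To make the fully distributed side of that trade-off concrete, here is a toy consistent-hash ring, a minimal sketch of how resource URLs could be mapped to peers with no server-side table at all. The peer names and hashing scheme are illustrative, not the project's actual implementation:

```python
import hashlib
from bisect import bisect_right

def h(key: str) -> int:
    # Stable hash of a key onto the ring (SHA-1 as an arbitrary choice).
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, peers):
        # Place each peer on the ring at the hash of its name.
        self.ring = sorted((h(p), p) for p in peers)
        self.hashes = [hv for hv, _ in self.ring]

    def lookup(self, key: str) -> str:
        # The first peer clockwise from the key's hash owns the key.
        i = bisect_right(self.hashes, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring([f"peer{i}" for i in range(50)])
owner = ring.lookup("https://en.wikipedia.org/wiki/Alan_Turing")
print(owner)
```

The upside is that no central table must be updated on churn (only neighbors on the ring are affected); the downside, as noted above, is that a real DHT pays for this with multi-hop lookups instead of one round-trip to a server.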
In our prior "naive" implementation, if each user has
5K pages cached (with around 50K images),
then with 10K concurrent users it consumes around 35GB
of memory, and each registration incurs about 500KB of network traffic.
We thought that was not very useful, so now we are trying to come up with
a more lightweight implementation. We hope to have practically meaningful
micro-benchmark results on the new implementation.
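A rough model that reproduces those numbers (the per-entry and per-ID sizes are assumptions chosen to check plausibility, not measured values from the implementation):

```python
# Rough model of the naive lookup-table numbers quoted above.
USERS = 10_000
RESOURCES_PER_USER = 5_000 + 50_000      # pages + images per user
BYTES_PER_ENTRY = 64                     # assumed (peer, resource) record size

entries = USERS * RESOURCES_PER_USER     # 550M table entries
memory_gb = entries * BYTES_PER_ENTRY / 2**30
print(f"table: ~{memory_gb:.0f} GiB")    # ~33 GiB, near the reported 35GB

BYTES_PER_ID = 9                         # assumed compact resource identifier
reg_bytes = RESOURCES_PER_USER * BYTES_PER_ID
print(f"one registration: ~{reg_bytes // 1000} KB")  # ~495 KB, near 500KB
```

This suggests the reported figures are consistent with a table that stores one record per cached resource per user, which is exactly the design the lightweight implementation would need to avoid.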
Just the metadata for articles, images, and revisions is going to be
massive. That data itself will need to be distributed too. The network
costs associated with just lookups are going to be quite expensive for peers.
It seems your project assumes that bandwidth is unlimited and unmetered,
which for many parts of the world isn't true.
I don't mean to dissuade you. The idea of a p2p Wikipedia is an interesting
project, and at some point in the future, if bandwidth is free and unmetered
everywhere, this may be a reasonable way to provide a method of access in
case of a major disaster affecting Wikipedia itself. This idea has been
brought up numerous times in the past, though, and in general the potential
gains never outweigh the latency, cost, and complexity associated with it.
- Ryan