Re: [Wikitech-l] Peer-to-peer sharing of the content of Wikipedia through WebRTC

28 Nov 2015

Thank you for your comments!
On Sat, Nov 28, 2015 at 2:33 PM, Brian Wolff bawolff@gmail.com wrote:
...
On 11/28/15, Yeongjin Jang yeongjinjanggrad@gmail.com wrote:
...
Hi,
I am Yeongjin Jang, a Ph.D. Student at Georgia Tech.
In our lab (SSLab, https://sslab.gtisc.gatech.edu/),
we are working on a project called B2BWiki,
which enables users to share the contents of Wikipedia through WebRTC
(peer-to-peer sharing).
The website is here: http://b2bwiki.cc.gatech.edu/
The project aims to help Wikipedia by donating computing resources
from the community; users can donate their traffic (by P2P communication)
and storage (indexedDB) to reduce the load of Wikipedia servers.
Larger organizations, e.g. schools or companies that
have many local users, can donate a mirror server
similar to GNU FTP servers, which can bootstrap peer sharing.
The potential benefits that we see are the following.

* Users can easily donate their resources to the community.
  Just visit the website.
* Users can get a performance benefit if a page is loaded from
  multiple local peers / a local mirror (page load time gets faster!).
* Wikipedia can reduce its server workload, network traffic, etc.
* Local network operators can reduce network transit traffic
  (e.g. the cost caused by delivering traffic to the outside).

While we are working on enhancing the implementation,
we would like to ask for opinions from actual developers of Wikipedia.
For example, we want to know whether our direction is correct or not
(will it actually reduce the load?), or if there are some other concerns
that we missed, that can potentially prevent this system from
working as intended. We really want to do somewhat meaningful work
that actually helps run Wikipedia!
Please feel free to give us any suggestions, comments, etc.
If you want to express your opinion privately,
please contact sslab@cc.gatech.edu.
Thanks,
--- Appendix ---
Below is some more detailed information about B2BWiki.
# Accessing data
When accessing a page on B2BWiki, the browser will query peers first.
1) If there exist peers that hold the content, a peer-to-peer download
happens.
2) Otherwise, if there is no peer, the client will download the content
from the mirror server.
3) If the mirror server does not have the content, it downloads it from
the Wikipedia server (1 access per first download, and update).
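
The three steps above can be sketched roughly as follows. This is only an illustration of the fallback order, not the B2BWiki code: the names `Mirror`, `load_page`, `query_peers` and `fetch_origin` are hypothetical, and the real client runs in the browser over WebRTC rather than in Python.

```python
from typing import Callable, Dict, Optional, Tuple


class Mirror:
    """Hypothetical mirror server: caches pages, hits the origin only on a miss."""

    def __init__(self, fetch_origin: Callable[[str], bytes]) -> None:
        self._fetch_origin = fetch_origin   # stands in for a request to Wikipedia
        self._cache: Dict[str, bytes] = {}
        self.origin_hits = 0                # how often Wikipedia was contacted

    def get(self, title: str) -> bytes:
        if title not in self._cache:
            # Step 3: the mirror does not have the page, so it asks the origin once.
            self._cache[title] = self._fetch_origin(title)
            self.origin_hits += 1
        return self._cache[title]


def load_page(
    title: str,
    query_peers: Callable[[str], Optional[bytes]],
    mirror: Mirror,
) -> Tuple[bytes, str]:
    """Return (content, source), trying peers first and the mirror second."""
    content = query_peers(title)            # Step 1: peer-to-peer download
    if content is not None:
        return content, "peer"
    return mirror.get(title), "mirror"      # Step 2 (and step 3 inside the mirror)
```

With no peers available, two loads of the same title reach the origin only once; every later load is served by the mirror or by a peer.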
# Peer lookup
To enable content lookup for peers,
we manage a lookup server that holds a page_name-to-peer map.
A client (a user's browser) can query the list of peers that
currently hold the content, and select a peer by its freshness
(each entry has the hash/timestamp of the content),
by the top 2 octets of its IP address
(for figuring out whether it is a local peer or not), etc.
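
A minimal sketch of such a selection, assuming each lookup entry carries the fields listed above. The `PeerEntry` layout and the ordering (local peers first, then the most recent timestamp) are assumptions made for illustration; the description above does not say how the criteria are weighted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PeerEntry:
    peer_id: str
    ip_prefix: str     # top two octets of the peer's IPv4 address, e.g. "130.207"
    timestamp: float   # when the peer's copy was stamped (seconds since epoch)
    sha1: str          # checksum of the peer's copy


def ip_prefix(ip: str) -> str:
    """Return the top two octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:2])


def select_peer(entries: List[PeerEntry], client_ip: str) -> Optional[PeerEntry]:
    """Pick a peer: prefer the same /16 as the client, then the freshest copy."""
    if not entries:
        return None
    local = ip_prefix(client_ip)
    return max(entries, key=lambda e: (e.ip_prefix == local, e.timestamp))
```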
# Update, and integrity check
The mirror server updates its content once per day
(this can be configured, e.g. to update once per hour).
The update check is done by sending an If-Modified-Since header to the
Wikipedia server.
On retrieving the content from Wikipedia, the mirror server stamps it
with a timestamp and a sha1 checksum, to ensure the freshness of the
data and its integrity.
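
As a rough sketch of the mirror side, assuming the stamp is simply a record kept next to the content: `conditional_headers` and `stamp` are hypothetical helper names, and no network request is made here.

```python
import hashlib
from email.utils import formatdate
from typing import Dict, Union


def conditional_headers(last_fetch: float) -> Dict[str, str]:
    """Build the request header for an update check.

    last_fetch is a Unix timestamp; the header value is an HTTP date in GMT.
    """
    return {"If-Modified-Since": formatdate(last_fetch, usegmt=True)}


def stamp(content: bytes, fetched_at: float) -> Dict[str, Union[float, str]]:
    """Record when the content was fetched and its sha1 checksum."""
    return {
        "timestamp": fetched_at,
        "sha1": hashlib.sha1(content).hexdigest(),
    }
```

The headers can be passed to any HTTP client. A 304 Not Modified response means the cached copy is still current; a 200 response carries new content, which would then be stamped again.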
When clients look up and download the content from peers,
the client compares the sha1 checksum of the data
with the checksum from the lookup server.
In this setting, users can get older data
(they can configure how much staleness to tolerate,
e.g. 1 day, 3 days, 1 week, etc.), and
integrity is guaranteed by the mirror/lookup server.
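
The client-side check could look roughly like this. The function name and parameters are made up for illustration; the only parts taken from the description above are the sha1 comparison against the lookup server's checksum and a user-configurable staleness tolerance.

```python
import hashlib

ONE_DAY = 24 * 60 * 60  # seconds


def accept_from_peer(
    data: bytes,
    expected_sha1: str,     # checksum reported by the lookup server
    stamped_at: float,      # timestamp the mirror put on this version
    now: float,
    max_age_seconds: float = ONE_DAY,
) -> bool:
    """Accept peer data only if it is fresh enough and matches the checksum."""
    if now - stamped_at > max_age_seconds:
        return False        # older than the user is willing to tolerate
    return hashlib.sha1(data).hexdigest() == expected_sha1
```

A client that rejects the data would fall back to another peer or to the mirror server.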
More detailed information can be obtained from the following website.
http://goo.gl/pSNrjR
(URL redirects to SSLab@gatech website)
Please feel free to give us any suggestions, comments, etc.
Thanks,
Yeongjin Jang

Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi,
This is some interesting stuff, and I think research along these lines
(That is, leveraging webrtc to deliver content in a P2P manner over
web browsers) will really change the face of the internet in the years
to come.
As for Wikipedia specifically (This is all just my personal opinion.
Others may disagree with me):
*Wikipedia is a fairly stable/mature site. I think we're past the
point where it's a good idea to experiment with experimental
technologies (Although mirrors of Wikipedia are a good place to do
that). We need stability and proven technology.
That's true. Our current prototype works well in testing,
but we are not yet sure about its robustness in the wild.
We want to develop this to be more stable.
...
*Bandwidth makes up a very small portion of WMF's expenses (I'm not
sure how small. Someone once told me that donation processing costs
take up more money than raw bandwidth costs. Don't know if that's
true, but bandwidth is certainly not the biggest expense).
Your scheme primarily serves to offload bandwidth of cached content to
other people. But serving cached content (by which I mean, anonymous
users getting results from varnish) is probably the cheapest (in terms
of computational resources) part of our setup. The hard part is things
like parsing wikitext->html, and otherwise generating pages.
Yes, it can be.
But for network bandwidth, we think that it could benefit both
sides (an organization that runs B2BWiki, and Wikipedia), because
it would reduce not only the traffic that hits Wikipedia, but also
the egress traffic from the LAN (or ISP) to Wikipedia.
We know that this is just a hypothesis, so we want to analyze
the potential reduction in traffic/cost using network data stats.
Is there anywhere I can get stats for the bandwidth,
such as the daily traffic served by Wikipedia servers, etc?
(Please let me know if you know of any.)
I visited several stats pages, such as
https://stats.wikimedia.org/EN/ChartsWikipediaEN.htm
https://en.wikipedia.org/wiki/Wikipedia:Statistics
Those sites told me how many page accesses and
edits happened, but not the traffic,
while the ganglia site gave me stats that were too fine-grained.
What we got from our local network data at Georgia Tech
shows 50GB of downloads per day
from Wikipedia
(including wikipedia.org and upload.wikimedia.org, based on source
IP address).
...
From a simple calculation, the cost of 18TB / year would be
around $600, for serving a 30K-person organization.
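
The arithmetic behind these figures can be checked quickly. The sketch below uses only the numbers quoted in this message (50GB per day, about $600 per year, 30K people); the per-GB price is merely what those numbers imply, not a quoted bandwidth rate, and decimal units (1TB = 1000GB) are an assumption.

```python
GB_PER_TB = 1000          # decimal units, an assumption
DAYS_PER_YEAR = 365


def yearly_tb(gb_per_day: float) -> float:
    """Convert a daily volume in GB to a yearly volume in TB."""
    return gb_per_day * DAYS_PER_YEAR / GB_PER_TB


def implied_usd_per_gb(yearly_cost_usd: float, tb_per_year: float) -> float:
    """Unit price implied by a yearly cost and a yearly volume."""
    return yearly_cost_usd / (tb_per_year * GB_PER_TB)


volume = yearly_tb(50)                   # 18.25 TB/year, rounded to 18TB above
price = implied_usd_per_gb(600, 18)      # about 0.033 USD per GB
per_person_mb = 50 * 1000 / 30_000       # about 1.7 MB per person per day
```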
...
*24 hours to page update is generally considered much too slow
(Tolerance for anons is probably a bit higher than logged in users,
but still).  People expect their changes to appear for everyone,
immediately. We want the delay to be in seconds, not days. I think it's
unlikely that any sort of expire scheme would be acceptable. We need
active cache invalidation upon edit.
Yes, we have thought about cache expiry/update/invalidation,
for example: whenever a peer detects an update, it propagates the update
to its neighbors so that they update/invalidate their caches.
We did not concentrate on that part much,
as we had similar thoughts that non-editors can be more tolerant.
The scheme is not in there yet, but it is worth implementing if
editors care about it a lot.
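
To make the idea concrete, here is one way neighbor-to-neighbor invalidation could work. This is purely a sketch of a simple flooding scheme and not something B2BWiki implements; the neighbor graph, the cache layout and the function name are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def propagate_invalidation(
    origin: str,
    neighbors: Dict[str, List[str]],   # peer id -> ids of directly connected peers
    caches: Dict[str, Set[str]],       # peer id -> titles the peer has cached
    title: str,
) -> Set[str]:
    """Flood an invalidation for `title` from `origin`; return the peers reached."""
    seen = {origin}
    queue = deque([origin])
    while queue:
        peer = queue.popleft()
        caches.get(peer, set()).discard(title)   # drop the stale copy, if any
        for nxt in neighbors.get(peer, []):
            if nxt not in seen:                  # forward each message only once
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Peers that are not connected to the origin through any chain of neighbors never see the message, so a scheme like this would still need an expiry time as a backstop.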
...
*Lack of analysis of scalability (I just briefly skimmed the google
cache version of the page you linked [connections to your webserver
kept being reset], so it's possible I missed this). I didn't
see any analysis of how your system scales with load. Perhaps that's
because you're still in the development stage, and the design isn't
finalized(?) Anyways, scalability analysis is important when we're
talking about wikipedia. Does this design still work if you have
100,000 (or even more) peers?
We think scalability is very important for B2BWiki.
While we were doing micro-benchmarks on the servers of B2BWiki,
we realized that the current one cannot support more than 10K concurrent
peers. We've been updating the internal structure since September,
and now we are targeting support for up to 100K peers
(we think this would be enough for
supporting a metropolitan local area).
An implementation with the new data structure will be available very soon;
I hope we can show good results that demonstrate its scalability.
...
*Privacy concerns - Would a malicious person be able to force
themselves to be someone's preferred peer, and spy on everything they
read, etc.?
*DOS concerns - Would a malicious peer or peers be able to prevent an
honest user from connecting to the website? (I didn't look in detail
at how you select peers and handle peer failure, so I'm not sure if
this applies)
Nice points! For privacy, we want to implement a k-anonymity scheme for
page access. However, it incurs more bandwidth consumption and
potential performance overhead on the system.
Malicious peers can act as if they hold legitimate content
(while actually not), or make null requests to the peers.
We are currently thinking about blacklisting such malicious peers,
and live migration of mirror/peer servers if they fail,
but a more fundamental remedy is required.
--
...
Anyways, I think this sort of thing is interesting, but I think your
system would be more suited to people running a small static website,
that want to scale to very high numbers, rather than Wikipedia's use
case.
--
-bawolff

-- 
Yeongjin Jang
