B2BWiki: Collaboratively Reducing Network and Server Cost of Wikipedia Using WebRTC

Overview

B2BWiki (browser-to-browser served Wikipedia) is a project that aims to reduce Wikipedia's network burden by delivering and sharing page contents among users through in-browser P2P communication (WebRTC). While reading a page, each user contributes network capacity and local browser storage (e.g., IndexedDB). A larger organization can also contribute to the community by deploying its own servers (no out-of-pocket money!), similar to the mirror servers of the good old days (e.g. GNU FTP Mirrors).

The project consists of three pieces: 1) a signaling server that connects browsers to each other and locates the peers that hold a requested Wikipedia page, 2) a mirror server that transforms Wikipedia pages into a peer-to-peer deliverable format and bootstraps the initial data, and 3) the client (the browser).

B2BWiki works as follows. When a user visits a page, 1) the browser first checks whether its local IndexedDB has the requested content, without sending any packet (no If-Modified-Since or ETag revalidation requests!). 2) If the page is not in the local DB, the browser gets a list of peers that have the page content and tries to download it from them, thereby reducing network traffic from Wikipedia. 3) If no peer is available for the content, the browser downloads it from the mirror server and stores it in the local DB for sharing later.
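
The lookup order described above can be sketched as follows. This is a minimal Python illustration, not B2BWiki's actual browser code: `load_page`, `find_peers`, `fetch_from_peer`, and `fetch_from_mirror` are hypothetical names, a plain dict stands in for IndexedDB, and storing peer-delivered content locally is an assumption.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def load_page(
    title: str,
    local_db: Dict[str, bytes],
    find_peers: Callable[[str], Iterable[str]],
    fetch_from_peer: Callable[[str, str], Optional[bytes]],
    fetch_from_mirror: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (content, source); source is 'local', 'peer:<id>' or 'server'."""
    # 1) Local hit: nothing is sent over the network.
    if title in local_db:
        return local_db[title], "local"

    # 2) Ask the peers that are reported to hold the page, one by one.
    for peer_id in find_peers(title):
        content = fetch_from_peer(peer_id, title)
        if content is not None:
            local_db[title] = content  # assumption: cache peer content too
            return content, f"peer:{peer_id}"

    # 3) No peer could deliver: fall back to the mirror server and cache.
    content = fetch_from_mirror(title)
    local_db[title] = content
    return content, "server"
```

The returned source label mirrors the "peer ID, server, local" data sources that the statistics view reports.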

Demo

How to use?

Visit the B2BWiki page at http://b2bwiki.cc.gatech.edu/, or open a specific page by name, e.g. http://b2bwiki.cc.gatech.edu/wiki#United_States.

You can also install our extension for Google Chrome or Mozilla Firefox for convenience; the extension automatically redirects your connections to en.wikipedia.org to B2BWiki.

Statistics

Looking at the stats:

Data sources: at the top of the page (above the Wikipedia page title), a combo box shows where the page came from (peer ID, server, or local), along with the downloaded size.

Socket messages, server download, peer received, and peer sent data: shown in the left sidebar.

Benefits

For Wikipedia: reduce server load!

Users on B2BWiki download page contents from the peer-to-peer network whenever possible, which directly reduces the network load on Wikipedia's servers. If no peers are available, B2BWiki downloads the contents from the mirror server. The mirror server still has to fetch page data from Wikipedia, once when the page is first requested and again whenever the page is updated (i.e. edited), but this amounts to 1 page access per mirror server per update, compared with N accesses when N users each get the page from Wikipedia. Therefore, the overall load on the servers (Wikipedia and the mirror servers together) is reduced.

For local network operators (e.g. ISPs): reduce network transit

For local network operators (e.g. LAN, WAN, AS, or ISP operators), installing a mirror server directly reduces egress traffic to outside networks (transit costs money!): instead of hitting the Wikipedia servers, local users hit the mirror server. In addition, operators can configure the peer selection algorithm so that sharing happens only among local peers. In both cases the content is served locally, which reduces the cost of network transit.

For the user: contribute to the community; faster access!

Users can easily contribute their resources to the community. Instead of donating a small amount of money (which takes several steps, such as typing in credit card information), on B2BWiki simply enabling the browser extension or browsing through http://b2bwiki.cc.gatech.edu/ automatically donates the user's resources.

Users can also gain performance benefits. Since B2BWiki loads pages from either the local mirror or local peers, pages can load faster than from the Wikipedia website.

For example, when loading the entire 'United_States' page with enough peers available (e.g. 3 neighbor peers holding the content), B2BWiki took 1,454 ms while Wikipedia took 1,817 ms (about 20% less time). For cached content (e.g. accessing the same page again later), loading the United_States page in B2BWiki took 608 ms while refreshing it on Wikipedia took 881 ms (about 31% less time).

Technical Details (FAQ style)

Oh, there is no peer download. What happened?

There are not many users yet, so there may be no peer that holds the page content you requested.

I want to see a peer download. How can I?

An easy way to see a peer download is to open a private window of your browser alongside a normal session. For example, in a regular browser tab, visit a page on B2BWiki, say "Georgia Tech". Then open a private window (such as Incognito mode in Google Chrome) and visit the same page (i.e. Georgia Tech). The page will be downloaded from a peer (in most cases, the peer will be the tab you opened in your regular session).

How can I be sure that the other peers are not modifying the content?

On peer delivery, we check the SHA1 sum of the content. The SHA1 hash of each page is generated by the mirror server, and we compare the hash of the delivered content against the hash obtained from the mirror server, so peers cannot lie about the hash (we have full control of the hash!). For now we rely on a single mirror server model, but more scalable structures or schemes for content integrity in distributed settings are possible.
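
The integrity check can be illustrated with a short sketch. It assumes the mirror server hands out the hash as a hex string; `sha1_hex` and `verify_peer_content` are hypothetical helper names, and Python's `hashlib` stands in for whatever SHA1 implementation the browser client uses.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    """SHA1 digest of the content as a lowercase hex string."""
    return hashlib.sha1(content).hexdigest()


def verify_peer_content(content: bytes, trusted_hash: str) -> bool:
    """Accept peer-delivered content only if it matches the mirror's hash.

    trusted_hash must come from the mirror server, never from the peer.
    """
    return sha1_hex(content) == trusted_hash.strip().lower()
```

A peer that alters even one byte produces a different digest, so the content is rejected and the client can try another source.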

How frequently do the pages get updated?

We schedule an update on our mirror server once a day. Since all page content is originally distributed by the mirror server, the freshest content you can get may be up to one day old. The data update model also has a 'time tolerance' setting, which makes the client accept data up to N days old (1 day by default); once that threshold has passed, hitting the same page again updates it with fresher content.

My page is fully loaded from the local cache. How much traffic am I generating?

If you access the page within the 'time tolerance' period (say, on the same day), the local access causes no traffic at all (0 bytes!). Once the tolerance period has passed, however, the client browser asks the server for a list of hashes of the fresh page contents, which is usually just a few hundred bytes. If the cached content matches one of the hashes in the list, no further traffic is generated. Otherwise, the browser tries a peer/server download and reloads the page (it is no longer loaded locally).
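
The freshness decision described in the two answers above can be sketched as a small function. All names here (`decide_action`, `fetch_fresh_hashes`, the returned labels) are hypothetical, timestamps are plain seconds, and treating a tolerance of 0 as "always check" is an assumption.

```python
from typing import Callable, Iterable

DAY_SECONDS = 24 * 60 * 60  # the default tolerance of 1 day


def decide_action(
    cached_hash: str,
    hash_timestamp: float,
    now: float,
    tolerance_seconds: float,
    fetch_fresh_hashes: Callable[[], Iterable[str]],
) -> str:
    """Decide how to serve a locally cached page."""
    # Within the tolerance period: serve locally, zero network traffic.
    if now - hash_timestamp < tolerance_seconds:
        return "use_local"
    # Past the tolerance: fetch the small list of fresh hashes.
    if cached_hash in set(fetch_fresh_hashes()):
        return "keep_local"  # cached copy is still current
    return "redownload"  # fetch again from a peer or the mirror server
```

Only the second and third outcomes contact the server, and the second costs just the hash list.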

Compatibility: Which browsers does B2BWiki support?

Currently, B2BWiki is fully supported in Firefox, Google Chrome, and Opera. It does not work in Firefox's private mode, because that mode disables IndexedDB. In the future, we plan to create a client mode that works without IndexedDB (maybe with no local cache, or using localStorage).

It is not supported in Microsoft Internet Explorer or Apple Safari.

On mobile platforms, it works well in Google Chrome on Android (but has no mobile view yet). However, it does not work on iOS at all (no WebRTC). I tried Bowser (a browser with WebRTC) on iOS, but it did not work either; I cannot see the log messages, so I cannot tell why.

I haven't checked Microsoft Edge; could anybody tell me whether it works?

Play with Settings

Users can change several settings such as:

Update Tolerance

Users can change the time limit for update checks in the settings. When checking the integrity of content, B2BWiki also checks the timestamp of the hash, and accepts the content only if the hash matches and its timestamp is within the tolerable time range.

The value can be set to a preset such as “Always Fresh Content”, which makes B2BWiki load the freshest content every time it accesses a page, or to a time period. The default is “1 day”, which keeps the cache valid for a day.

Peer Selection Algorithm

For peer-to-peer communication, choosing “fast” peers is essential to receiving page content quickly. Currently, we provide 4 algorithms: 1) prioritize established connections, then order by IP subnet matching (peers whose IP address shares at least the first 2 octets come first), 2) prioritize established connections only, 3) use IP subnet matching (/16) only, and 4) random.
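
The four orderings can be sketched as below. This is an illustration under assumptions: the `Peer` record, the strategy names, and `order_peers` are hypothetical, addresses are IPv4 only, and "subnet matching" is read as equality of the first two octets (/16).

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Peer:
    peer_id: str
    ip: str
    connected: bool  # True if a connection is already established


def same_slash16(ip_a: str, ip_b: str) -> bool:
    """True if two dotted IPv4 addresses share their first two octets."""
    return ip_a.split(".")[:2] == ip_b.split(".")[:2]


def order_peers(
    peers: List[Peer],
    my_ip: str,
    strategy: str,
    rng: Optional[random.Random] = None,
) -> List[Peer]:
    """Return the peers in the order they should be tried."""
    if strategy == "random":
        shuffled = list(peers)
        (rng or random).shuffle(shuffled)
        return shuffled
    # False sorts before True, so "not preferred" puts preferred peers first.
    keys = {
        "connected_then_subnet": lambda p: (
            not p.connected,
            not same_slash16(p.ip, my_ip),
        ),
        "connected_only": lambda p: (not p.connected,),
        "subnet_only": lambda p: (not same_slash16(p.ip, my_ip),),
    }
    if strategy not in keys:
        raise ValueError(f"unknown strategy: {strategy}")
    # sorted() is stable, so ties keep the order given by the signaling server.
    return sorted(peers, key=keys[strategy])
```

An operator who wants sharing to stay inside the local network would pick a subnet-based strategy.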

Peer Failure Fallback Time

On B2BWiki, a connection to another client can fail, for example when 1) a peer closes its browsing session while another peer is downloading from it, 2) delivering the content takes too long, or 3) a network error, power-off, etc. occurs. To tolerate such errors, we set a timeout on each peer connection. If the peer does not respond within the time set here, B2BWiki chooses another available peer for the content and tries to download from it.
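
A minimal sketch of this fallback loop follows. `download_with_fallback` is a hypothetical name, the transport is abstracted as a `fetch` callable that raises `TimeoutError` or `ConnectionError` on failure, and `fake_fetch` is a stand-in used only for the demonstration.

```python
from typing import Callable, Iterable, Optional, Tuple


def download_with_fallback(
    peers: Iterable[str],
    fetch: Callable[[str, float], bytes],
    timeout_seconds: float,
) -> Optional[Tuple[str, bytes]]:
    """Try peers in order; skip any that time out or drop the connection."""
    for peer_id in peers:
        try:
            return peer_id, fetch(peer_id, timeout_seconds)
        except (TimeoutError, ConnectionError):
            continue  # this peer failed: move on to the next one
    return None  # no peer delivered; the caller can use the mirror server


def fake_fetch(peer_id: str, timeout_seconds: float) -> bytes:
    """Demo transport: only 'peer-c' answers."""
    if peer_id == "peer-a":
        raise TimeoutError("no response within the fallback time")
    if peer_id == "peer-b":
        raise ConnectionError("peer closed its browsing session")
    return b"page content"


result = download_with_fallback(["peer-a", "peer-b", "peer-c"], fake_fetch, 3.0)
```

A shorter fallback time abandons slow peers sooner, at the cost of giving up on peers that would have answered a little later.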

Chunk Size

When sending page contents over an RTCDataChannel, which uses SCTP (Stream Control Transmission Protocol), it is better to slice large data into several chunks to reduce the complexity of managing the stream. This setting controls the chunk size. A larger chunk size gives faster transmission (if there is no packet loss or reordering), while smaller chunks generally take more time. We tested various client setups (connected with an external IP, behind NAT, over VPN, etc.) and chose 4KB as the default. If you are not comfortable with this chunk size, you can change it to a larger or smaller value.
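
Slicing and reassembly can be sketched in a few lines. The helper names are hypothetical, and the 4KB default is taken to mean 4096 bytes, which is an assumption.

```python
from typing import Iterable, List

DEFAULT_CHUNK_SIZE = 4096  # assumed meaning of the 4KB default


def split_into_chunks(data: bytes, chunk_size: int = DEFAULT_CHUNK_SIZE) -> List[bytes]:
    """Slice data into consecutive chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble(chunks: Iterable[bytes]) -> bytes:
    """Join chunks that were received in order back into the original data."""
    return b"".join(chunks)
```

A 10,000-byte page would be sent as two full chunks plus a final partial one; a larger chunk size means fewer messages per page.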

Unzip Workers

B2BWiki uses ‘zip’ to compress the data and SHA1 to check the integrity of page content. Because a web page's JavaScript runs on a single thread, compression and hash calculation, which require a nontrivial amount of computation, would slow down page loads. To avoid this, B2BWiki pushes these computations into Web Workers, which run in separate threads. With this setting, users choose how many workers to ‘fork’ for compression and hash calculation. Usually, it is good to set this to the number of CPU cores on your machine (for a dual-core machine, set it to 2).
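
The idea of moving decompression and hashing off the main thread can be illustrated with a worker pool. This is only an analogy in Python, not the browser implementation: a `ThreadPoolExecutor` stands in for Web Workers, `zlib` stands in for ‘zip’, and `unpack_and_hash` and `process_blobs` are hypothetical names.

```python
import hashlib
import os
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple


def unpack_and_hash(compressed: bytes) -> Tuple[bytes, str]:
    """Decompress one blob and compute the SHA1 of the result."""
    raw = zlib.decompress(compressed)
    return raw, hashlib.sha1(raw).hexdigest()


def process_blobs(
    blobs: List[bytes], workers: Optional[int] = None
) -> List[Tuple[bytes, str]]:
    """Process blobs on a pool of workers; results keep the input order."""
    # Default to one worker per CPU, as the setting's guidance suggests.
    worker_count = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        return list(pool.map(unpack_and_hash, blobs))
```

Using more workers than CPU cores generally adds scheduling overhead without speeding up this kind of compute-bound work.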

People

Yeongjin Jang
Yang Ji
Meng Xu
Insu Yun
Taesoo Kim

Contacts

Feel free to contact Yeongjin <yeongjin.jang@gatech.edu> or SSLab <sslab@cc.gatech.edu> for any suggestions, questions, and comments.