grack.com

For a while I’ve been wondering if a hybrid Bittorrent/Reed-Solomon system would be a more effective swarming download system. 

Just to make this clear ahead of time, I don’t believe the benefits of the Bittorrent/Reed-Solomon system would kick in until the torrent had been running for a longer amount of time.  The reason for this is that Bittorrent has a built-in “rarest first” algorithm (described in this Bittorrent paper).  The obvious goal the “rarest first” algorithm to replicate the rarest pieces of the system first, while leaving the most common pieces - the ones most likely to be around longer - until the end. 

In a Bittorrent network where a seed node exists for the entire period of the torrent’s existence, this parity system would provide no benefit.  This system really shines, however, by increasing the probability that a client node can retrieve enough data from a seedless network to become a seed itself.

So, on to the idea.  I’m proposing that instead of seeding pieces of the original file at the start, the seed distributes a large number of Reed-Solomon-encoded “parity files” (perhaps twice the number of blocks as the original number of Bittorrent blocks - maybe more).  These blocks would be “virtual” - computed in real-time as they are requested by the clients.  Reed-Solomon encoding is a fast process.  I’ve seen throughput of ~360MB/s on my Athlon 2100+.  This is far more than enough for real-time encoding and distribution (the blocks could always be pre-calculated and cached for a less powerful system). 

These parity blocks would be distributed using the original Bittorrent algorithms (including rarest-first).  For simplicity, I’ll use the situation of providing 100% redundancy - twice the number of blocks as necessary to reconstruct the file.  Once a Bittorrent client had received enough blocks to restore the original file (50% of the blocks), it runs the Reed-Solomon decoding process to create the original file.  From the original file, it can then continues as a seed - and can calculate the extra parity blocks as necessary.

The goal of the system is to ensure that each Bittorrent client has as diverse a set of blocks as possible.  Imagine a situation with an infinite number of Reed-Solomon-encoded blocks, where n blocks are required to reconstruct the target file.  Each downloading client can get the next block in the infinite sequence of blocks from the server.  Once a client has received n of these blocks, it can then generate any of the other possible encoded blocks. 

If this hypothetical ad-hoc network suddenly lost each of its seed nodes, the remaining clients would only need to share n different blocks between them to reconstruct the file.  I believe that as the number of possible blocks goes up, the probability of this occurring is greater than if only n possible blocks existed.

Perhaps an example network might show the advantages:

In this system, there are 200 parts to the file.  In the original Bittorrent system, this equates to 200 blocks.  In the Reed-Solomon-encoded system with 100% redundancy, this equates to 400 blocks.  There are 30 clients trying to download this file.  When the seed node gives out, each client has 10 blocks of the file (5% in the Bittorrent case, 2.5% in the Reed-Solomon case).

First, the Bittorrent case:

So, the probability that a client has any block is 5%.  This means that there is an 95% chance that a client does not have a particular block.  The probability that none of the 30 clients has this particular block is then 95%^30 = 21.46%.  The probability that one of the clients has this block is (1-21.46%)=78.54%.  The probability that each block is possessed by at least one client is then 78.54%^200 = 1x10^-19%.

Now, in the Reed-Solomon case:

The probability that a client has any block is 2.5%.  This means that there is an 97.5% chance that a client does not have a particular block.  The probability that none of the 30 clients has this particular block is then 97.5%^30 = 46.79%.  The probability that one of the clients has this block is (1-46.79%)=53.21%.  The probability that enough blocks are contained in the system is calculated using the Binomial function with:

  • Probability of success, p = 53.21%
  • Number of trials, n = 400

We take the probability that the number of successes will be greater than or equal to 200 in this case.  The probability that not enough blocks (ie: 0-199 blocks were available, 200 were required) exist in the system is now:

k=0
Σ B(k,n=400,p=53.21%) = 9.08%. 
199

There is now a 90.92% (the inverse of the above probability) chance that the system can recover enough to provide another valid seed in the future.  This is significatly better than the 1x10^-19% percent chance from before.

Thoughts?

Read full post