For a while I’ve been wondering if a hybrid Bittorrent/Reed-Solomon system would be a more effective swarming download system.
Just to make this clear ahead of time, I don’t believe the benefits of the Bittorrent/Reed-Solomon system would kick in until the torrent had been running for a longer amount of time. The reason for this is that Bittorrent has a built-in “rarest first” algorithm (described in this Bittorrent paper). The obvious goal the “rarest first” algorithm to replicate the rarest pieces of the system first, while leaving the most common pieces - the ones most likely to be around longer - until the end.
In a Bittorrent network where a seed node exists for the entire period of the torrent’s existence, this parity system would provide no benefit. This system really shines, however, by increasing the probability that a client node can retrieve enough data from a seedless network to become a seed itself.
So, on to the idea. I’m proposing that instead of seeding pieces of the original file at the start, the seed distributes a large number of Reed-Solomon-encoded “parity files” (perhaps twice the number of blocks as the original number of Bittorrent blocks - maybe more). These blocks would be “virtual” - computed in real-time as they are requested by the clients. Reed-Solomon encoding is a fast process. I’ve seen throughput of ~360MB/s on my Athlon 2100+. This is far more than enough for real-time encoding and distribution (the blocks could always be pre-calculated and cached for a less powerful system).
These parity blocks would be distributed using the original Bittorrent algorithms (including rarest-first). For simplicity, I’ll use the situation of providing 100% redundancy - twice the number of blocks as necessary to reconstruct the file. Once a Bittorrent client had received enough blocks to restore the original file (50% of the blocks), it runs the Reed-Solomon decoding process to create the original file. From the original file, it can then continues as a seed - and can calculate the extra parity blocks as necessary.
The goal of the system is to ensure that each Bittorrent client has as diverse a set of blocks as possible. Imagine a situation with an infinite number of Reed-Solomon-encoded blocks, where n blocks are required to reconstruct the target file. Each downloading client can get the next block in the infinite sequence of blocks from the server. Once a client has received n of these blocks, it can then generate any of the other possible encoded blocks.
If this hypothetical ad-hoc network suddenly lost each of its seed nodes, the remaining clients would only need to share n different blocks between them to reconstruct the file. I believe that as the number of possible blocks goes up, the probability of this occurring is greater than if only n possible blocks existed.
Perhaps an example network might show the advantages:
In this system, there are 200 parts to the file. In the original Bittorrent system, this equates to 200 blocks. In the Reed-Solomon-encoded system with 100% redundancy, this equates to 400 blocks. There are 30 clients trying to download this file. When the seed node gives out, each client has 10 blocks of the file (5% in the Bittorrent case, 2.5% in the Reed-Solomon case).
First, the Bittorrent case:
So, the probability that a client has any block is 5%. This means that there is an 95% chance that a client does not have a particular block. The probability that none of the 30 clients has this particular block is then 95%^30 = 21.46%. The probability that one of the clients has this block is (1-21.46%)=78.54%. The probability that each block is possessed by at least one client is then 78.54%^200 = 1x10^-19%.
Now, in the Reed-Solomon case:
The probability that a client has any block is 2.5%. This means that there is an 97.5% chance that a client does not have a particular block. The probability that none of the 30 clients has this particular block is then 97.5%^30 = 46.79%. The probability that one of the clients has this block is (1-46.79%)=53.21%. The probability that enough blocks are contained in the system is calculated using the Binomial function with:
- Probability of success, p = 53.21%
- Number of trials, n = 400
We take the probability that the number of successes will be greater than or equal to 200 in this case. The probability that not enough blocks (ie: 0-199 blocks were available, 200 were required) exist in the system is now:
|B(k,n=400,p=53.21%) = 9.08%.
There is now a 90.92% (the inverse of the above probability) chance that the system can recover enough to provide another valid seed in the future. This is significatly better than the 1x10^-19% percent chance from before.
Thoughts?Read full post