The following diagram shows Use Cases and Actors of the system:
Search. This is one of the most important parts of the system.
An open question: how does this compare with Kademlia, dynamic DNS, and similar distributed lookup schemes?
I try to solve the problem of moving from tracker to tracker in the most natural way possible. A client looking for a file uses an Internet browser, a newsletter, or any other medium. The information contains an IP address (or the range of IP addresses used by the ISP if the address is dynamic), a file name, and a description. Optionally the information can point to a file containing an encryption/decryption key, hash keys for the file, a registration form, etc. The client sends a LOOK request with the file name as an argument to the publisher and waits for one or more replies; the procedure repeats until a response arrives. The publisher can drop the LOOK request if busy, or respond with a LOOK ACK containing NOT FOUND or FOUND. If the file is found, the publisher adds to the LOOK ACK the file hash and the IP address/IP port pairs of the leechers currently downloading the data and/or the existing seeds. More than one LOOK ACK can be sent for a single LOOK request. Next the client sends LOOK to all seeds from the table and keeps statistics such as round-trip delay, number of retransmissions, etc. The client attempts to find as many seeds holding the required data as possible. The search process can consume significant bandwidth and time.
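The LOOK exchange above can be sketched as follows. This is a minimal illustration, not a wire format: the message field names (`type`, `status`, `seeds`) and the per-seed statistics record are assumptions made for the example.

```python
# Hypothetical message shapes for the LOOK exchange described above.
# Field names are illustrative assumptions, not a fixed wire format.
def make_look(filename):
    """Build a LOOK request for the given file name."""
    return {"type": "LOOK", "filename": filename}

def handle_look_ack(ack, known_seeds):
    """Merge seed addresses from a LOOK ACK into the client's seed table,
    keeping a statistics slot (round-trip delay, retransmissions) per seed."""
    if ack.get("status") != "FOUND":
        return known_seeds
    merged = dict(known_seeds)
    for addr in ack.get("seeds", []):            # (ip, port) pairs
        merged.setdefault(addr, {"rtt": None, "retransmissions": 0})
    return merged

# Example: two FOUND acks for a single LOOK, from a publisher and a seed.
seeds = {}
seeds = handle_look_ack(
    {"status": "FOUND", "hash": "abc123", "seeds": [("10.0.0.1", 6881)]}, seeds)
seeds = handle_look_ack(
    {"status": "FOUND", "hash": "abc123",
     "seeds": [("10.0.0.1", 6881), ("10.0.0.2", 6881)]}, seeds)
```

The client would then send LOOK to every address in `seeds` and fill in the statistics slots as replies arrive.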
Two different problems exist:
The following is the list of search parameters provided by the application:
LOOK ACK. When a publisher/seed/client receives a LOOK request it can drop the packet if busy, answer NOT FOUND, or make a best effort to find the file description or a file with the specified hash. If one or more files are found, the seed sends one or more LOOK ACK responses containing the exact file name, hash, and IP addresses and IP ports of other seeds (a seed keeps a record of the clients that downloaded the data). A node can use different strategies to improve network performance and to fight nodes that send LOOK requests to everybody in an attempt to break the network. One approach is to accept LOOKs only from well-known IP addresses or IP ranges and drop all others. This scheme can be implemented using a firewall and requires no additional effort from the CPU on which the network client runs (I assume here that the firewall runs on a separate machine). An interest group can use dedicated nodes to serve LOOK requests, with all other nodes in the group responding to LOOKs only from these dedicated nodes (whose IPs are well known). An interest group also provides an opportunity to generate positive cash flow through advertisement or an annual member fee. Configuration parameters of the LOOK ACK subsystem:
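The "accept LOOKs only from well-known IP ranges" policy could look like this. The allowed range, the record store, and the response fields are made-up examples; in the deployment described above this filtering would live in the firewall, not in the node itself.

```python
import ipaddress

# Illustrative allowlist; a real deployment would load this from configuration
# (or enforce it in a firewall on a separate machine, as the text suggests).
ALLOWED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

def handle_look(src_ip, filename, records):
    """Return a LOOK ACK dict, or None to silently drop the request."""
    ip = ipaddress.ip_address(src_ip)
    if not any(ip in net for net in ALLOWED_RANGES):
        return None                 # drop: sender is not in a well-known range
    rec = records.get(filename)
    if rec is None:
        return {"type": "LOOK ACK", "status": "NOT FOUND"}
    return {"type": "LOOK ACK", "status": "FOUND",
            "hash": rec["hash"], "seeds": rec["seeds"]}

records = {"song.ogg": {"hash": "deadbeef", "seeds": [("192.0.2.7", 6881)]}}
```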
Assuming that the file-sharing network keeps 1 million files spread among a total of 100 million nodes, with on average 10 nodes storing and seeding every file, the crawler should keep about 10 million records in its database. Assuming a single record size of 1 KB, the database size is about 10 GB.
The crawler can put limits on the total number of files, the total number of unique nodes, and the number of seeds stored for every file. A typical database record can contain the file name, file description, file type, file hash, and the IP address/IP port pairs of publishers, seeds, and clients. Assuming that all such limits are on, the database can work with fixed-size records.
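The back-of-envelope sizing above, written as a small calculation so the assumptions can be varied:

```python
# Crawler database sizing from the text: records = files * seeds per file,
# total size = records * fixed record size.
def crawler_db_size(files, seeds_per_file, record_bytes):
    records = files * seeds_per_file
    return records, records * record_bytes

# 1 million files, 10 seeding nodes per file, 1 KB per record.
records, size_bytes = crawler_db_size(files=1_000_000, seeds_per_file=10,
                                      record_bytes=1024)
# 10 million records, about 10 GB in total.
```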
Crawler is NOT a tracker. Crawler answers only one request of the client - first LOOK, the rest will be done by publisher/seeds. Crawler can function as a tracker, for example, keep information about upload/download ratios, response to GET DATA request with response containing list of seeding nodes, etc. but all this stuff is optional.
Download. The client makes an attempt to find as many nodes storing the desired data as possible. The behavior of the client is similar to a BitTorrent client: it always tries to optimize download speed, using faster connections with lower round-trip delay and higher download bandwidth. From time to time the client randomly tries another node, etc. The client runs bookkeeping of all connections it establishes, with statistics such as the number of unacknowledged packets, round-trip delay, average rate, etc. Records from this database are used both for optimizing download speed and when handling LOOK requests. The client sends a GET DATA request containing the numbers of the block(s) it lacks. The client can optionally specify maximum bandwidth, maximum packet size, maximum burst size, and inter-burst delay; if such constraints are specified, the seeding node should behave accordingly. The seeding node expects the network to be reliable and the need for retransmission to be close to zero. The client runs a timer for every request sent and resends the request if the timer expires. Download parameters:
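The per-connection bookkeeping and the per-request timer could be sketched like this. Class and field names are assumptions; a real client would key these records by (IP address, IP port) and persist them for use when handling LOOK requests.

```python
# Per-connection statistics sketch for the download bookkeeping described
# above. Only the counters the text mentions are modeled.
class PeerStats:
    def __init__(self):
        self.unacked = 0          # requests sent but not yet acknowledged
        self.rtt_sum = 0.0        # for the average round-trip delay
        self.rtt_samples = 0

    def request_sent(self):
        self.unacked += 1

    def ack_received(self, rtt_seconds):
        self.unacked = max(0, self.unacked - 1)
        self.rtt_sum += rtt_seconds
        self.rtt_samples += 1

    def avg_rtt(self):
        return self.rtt_sum / self.rtt_samples if self.rtt_samples else None

def needs_resend(sent_at, timeout, now):
    """True when the per-request timer described in the text has expired."""
    return now - sent_at > timeout
```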
Snubbing. Similar to the BitTorrent network, a snubbing procedure is introduced and can be used by all network nodes - publishers, seeds, and leechers. Different reasons for snubbing exist. A seed can wish to serve as many clients as possible and can snub clients with a higher than average amount of uploaded data (meaning the node drops all requests from such a client until it is unsnubbed). A leecher (seeding node) can snub a client if it decides that the upload/download ratio of that client is too low.
The difference between a seed and a seeding client is only the algorithm used for snubbing. Seeding clients and seeds calculate a rank for every connected client. The rank is based on the upload/download ratio, the percentage of the data stored by the client, and the rank of the file. Two approaches can be used: in the first case the network targets faster delivery, in the second efficient bandwidth usage. A seed can attach a higher rank either to clients holding a larger part of the data or to clients holding a smaller part. In the first case a client with 90% of the data downloaded will be served before a client with 10% done. In the second case the seed will upload to the 10% client and make the 90% client stay online longer. It is not immediately clear how this parameter influences the overall time of data distribution and bandwidth usage (research TBD). A seeding node can also snub nodes with a high rate of retransmission requests, higher than average round-trip delay, etc. The system supports a basic snubbing scheme for seeds and seeding clients; more advanced schemes can be added in the future.
The rank of a file is based on the ranks of its blocks and the seeds/leechers ratio. Files with a higher rank are uploaded first.
The seeding node periodically (every 5 minutes, for example) recalculates the ranks of clients and files and makes snubbing decisions. Both files and clients can be snubbed. TBD
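One possible shape for the client rank and the snubbing decision is sketched below. The text deliberately leaves the exact formula open (research TBD), so the equal weights, the `prefer_nearly_done` switch between the two approaches, and the snub threshold are all assumptions made for illustration.

```python
# Hypothetical client rank combining the three inputs named in the text:
# upload/download ratio, fraction of the data the client already stores,
# and the rank of the file being exchanged.
def client_rank(uploaded, downloaded, fraction_stored, file_rank,
                prefer_nearly_done=True):
    ratio = uploaded / downloaded if downloaded else 1.0
    # "Faster delivery": favour clients that already hold most of the data.
    # Flipping the switch favours clients with little data instead
    # ("efficient bandwidth usage" in the text).
    completeness = fraction_stored if prefer_nearly_done else 1.0 - fraction_stored
    return ratio + completeness + file_rank

def should_snub(rank, threshold=1.0):
    """Snub clients whose rank falls below an (assumed) threshold."""
    return rank < threshold
```

A seeding node would rerun this over all connected clients every few minutes, as described above.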
Upload. Upload behaves similarly to the scheme defined by BitTorrent. The seed first of all attempts to send the blocks that are requested most often. To facilitate this, clients send requests for all blocks they need, so the seed can keep statistics of block availability and calculate a rank for every block. If a request arrives for a block with a lower than average rank, the seed can drop the packet and free upstream bandwidth for higher-rank blocks.
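The block-rank bookkeeping above can be sketched as follows; using the raw request count as the block rank and the mean count as the drop threshold are assumptions for the example.

```python
from collections import Counter

# The seed counts how often each block is requested and serves the
# most-demanded blocks first, dropping requests for blocks whose demand
# is below average.
class BlockStats:
    def __init__(self):
        self.requests = Counter()      # block id -> times requested

    def record_request(self, block_ids):
        """Clients request all blocks they need; count each one."""
        self.requests.update(block_ids)

    def should_serve(self, block_id):
        if not self.requests:
            return True                # no statistics yet: serve everything
        avg = sum(self.requests.values()) / len(self.requests)
        return self.requests[block_id] >= avg
```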
The seed starts uploading when a GET DATA request arrives. Upon arrival of the request the seed can do one of the following:
Publishing. This protocol is dedicated to communication between the publisher and the crawler. I assume that the publisher has received all encryption keys and signatures needed to access the crawler's database; for example, the publisher can receive an email with all the required information. To keep the protocol simple I add only one message - the PUBLISH request. The request includes the file name, file hash, file type, file size, file date, and the IP address (or range of IP addresses) and IP port through which the publisher can be reached. After sending PUBLISH, the publisher issues a regular LOOK to the crawler to make sure that the file is published. The crawler can optionally try to retrieve a number of blocks from the publisher to make sure that the publisher is alive and responding to GET DATA requests. The crawler can repeat this procedure, or use a LOOK request, from time to time - for example, once every 24 hours - to make sure that the publisher is still online.
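The publish-then-verify handshake can be sketched like this. A plain dict stands in for the crawler database, and the message field names follow the text but their exact spelling is an assumption.

```python
# Crawler side of the PUBLISH request and the follow-up LOOK the publisher
# uses to confirm the file is listed.
def publish(crawler_db, request):
    """Store the publisher's record under the file name."""
    crawler_db[request["file_name"]] = {
        "hash": request["file_hash"],
        "size": request["file_size"],
        "publisher": (request["ip"], request["port"]),
    }

def look(crawler_db, file_name):
    rec = crawler_db.get(file_name)
    if rec is None:
        return {"status": "NOT FOUND"}
    return {"status": "FOUND", **rec}

db = {}
publish(db, {"file_name": "paper.pdf", "file_hash": "cafe01",
             "file_size": 123456, "ip": "198.51.100.9", "port": 6881})
```

After the `publish` call the publisher would issue `look(db, "paper.pdf")` and expect FOUND.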
Distribution of statistics. Seeding nodes provide access to file and client statistics. Any seeding node can download this information and recalculate its own data. The statistics can include accumulation time, upload/download counters, etc. Assume that leecher L asks seed S to send data. S requests statistics about L from one of the active seeds - for example, one chosen randomly and not belonging to the same subnet as L. If the upload/download ratio for L is low, S can decide to ignore the last request of L or to snub L for some time. Interestingly, if a mighty adversary decides to download the data, it cannot do so without actively uploading or without installing multiple nodes in different subnets.
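The decision S makes after consulting a third-party seed could look like this; the 0.2 ratio cutoff and the statistics record shape are illustrative assumptions.

```python
# S's serve/snub decision for leecher L, based on counters fetched from a
# randomly chosen active seed outside L's subnet.
def serve_decision(stats_for_leecher, min_ratio=0.2):
    up = stats_for_leecher["uploaded"]
    down = stats_for_leecher["downloaded"]
    ratio = up / down if down else 1.0     # nothing downloaded yet: benefit of the doubt
    return "serve" if ratio >= min_ratio else "snub"
```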
Yet another method can be suggested to fight the adversary from the previous example. Let's say the data is encrypted with strong encryption and the key is sent together with the last block of the data. The adversary cannot (or can it?) build a case against the publisher having only the data description, such as the file name; it needs the file itself. The adversary has to participate in data exchange with the publisher, downloading the whole file. In networks like eMule or BitTorrent this means that the adversary actively participates in the data distribution, which is time and bandwidth consuming. Interestingly, this scheme can be used in ANY existing P2P network without any changes to the protocol. Here I assume that the encryption algorithm requires the data to be downloaded in full before it can be decrypted.
Torrent file/single block hash. The node is optimistic and starts to download the data having only the hash of the whole file. After the download is completed and the hash is found to be OK, everything is fine. If the hash check fails, the node should find out which block is corrupted. A problem can arise if, for example, an "evil" node uploads corrupted blocks. Because all peers store statistics, such a node can easily be detected and removed from the list of peers distributed by the seeds. Here, and only here, a torrent-style hash table comes into play. Any peer can ask the publisher or any other trusted node to calculate the hash of any block of any size. In roughly O(log(file size)) steps the client can learn which block is damaged. The publisher can drop the request if busy, respond with "block size is too small", or send a table of previously calculated hash keys. The exact behavior of the publisher depends on the application and CPU availability.
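The O(log n) search for the damaged region can be sketched as a hash bisection. For the example the trusted node is simulated locally by hashing the known-good copy; in the real protocol each `region_hash(good, ...)` call would be a hash request to the publisher or another trusted node. SHA-256 is an arbitrary choice here, and the sketch narrows down to a single byte offset rather than a protocol block.

```python
import hashlib

def region_hash(data, lo, hi):
    """Hash of the byte range [lo, hi); stands in for a hash request
    to the publisher or another trusted node."""
    return hashlib.sha256(data[lo:hi]).hexdigest()

def find_corrupt_offset(good, received):
    """Bisect to an offset where `received` differs from `good`.
    Each loop iteration costs one remote hash, so the search takes
    O(log(file size)) requests."""
    lo, hi = 0, len(good)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if region_hash(good, lo, mid) == region_hash(received, lo, mid):
            lo = mid      # first half is clean; corruption is in the second
        else:
            hi = mid      # first half already mismatches; recurse into it
    return lo
```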