|
Rodi CoreWe have two separate binaries - Core and UI. Core supports all services from data search to crawling. UI provides ways to manage the Core. UI can be as simple as CLI or extensive graphic rich application. The following diagram shows Core subsystems:
Security Downstream. Is achieved similar to the client tracker, every packet is signed with MD5 calculated for the whole data set including tracker ID and session ID, but the sent packet only contains the session ID and payload. Upon receiving the packet client is expected to recover the initial packet using the stored tracker ID, calculate MD5 sum and compare with received sum. When communicating with the host the client expects that all packets arriving from the host are signed with MD5. Host's calculate MD5 using stored tracker ID and session ID, but sends only payload and MD5 sum. Here is how different packets look for "tracker to client" and "host to client" communication
Data encryption and how to fight P-Cube, Allot Communication, Expand Networks, Lancope Inc., Ellacoya Networks, Packeteer, Audiblemagic and similar solutions. One of the possible scenario is that the payload can be encrypted by the host before sending with a private key generated by the tracker. Client receives public key generated by the tracker as part of initial handshake. In a different approach the whole data packet can be encrypted in the data exchange between host and client and between client and tracker. Let's talk more about this. There are three separate problems the system can try to answer.
Apparently analyzers use some simple rules based on IP address and port number to collect the statistics or even drop the packets if ISP decides that the traffic is illegal or parasitic. In the more advanced analyzers "deep inspection of packets, including the identification of layer-7 patterns and sequences" is supported. P2P network can use some simple encoding algorithm, for example, XOR with long key. The strength of the scheme is regulated by the length of the key, frequent renewing and total number of keys. Let's assume that length of the key is 1M characters, there are 1M different keys - hosts generate different keys for the published files. At this point a reliable analyzer is expected to store and actively use about 1T characters of keys. Let's also suggest that keys are made accessible for registered clients using different protocols, like e-mail, FTP, HTTP, etc. Because normal high speed analyzer's are real-time embedded devices they can't reach the goal of collecting 1Tbytes of keys. See also Firewalls below. It can also be argued that when an ISP sells a contract including 24/7 unlimited access at a specified bandwidth the customer's expectations are that network performance is application independent. An ISP can alternately issue a clear statement as part of the contract that, for example, P2P or proprietary VoIP applications have a low priority and services provided by ISP's mail server has higher priority. Let's say that IP address and port number can not be recovered by an intercepting system, for example, as all nodes are behind PROXY servers supporting packet encryption in both directions. This approach brings additional costs related to the maintenance of PROXY servers. However, a P2P protocol can be defined in a way that helps ISP to recognize replicas and cache the data locally. For example, GetData request is a unique URL string and GetData response is a binary packet. Such anISP friendly approach could decrease network load significantly. There is a copyright issue here, because the cache is effectively part (or even all) of probably copyrighted material therefore the ISP's cache participates in the illegal sharing of copyrighted material. However, the liability issue is far from clear, because the ISP do not make any usage of the information in the cache besides facilitating the data transfer, these are the hosts who store and publish the material (see http://www.joltid.com/ http://zdnet.com.com/2100-1104_2-1027508.html http://assembler.law.cornell.edu/uscode title 17 for more on caching). A typical Internet browser stores information in a temporary files cache and it is not considered infringing on copyright laws. Weak packet encryptionGoal of the weak payload encryption is to work around intelligent traffic analyzers looking for protocol signatures. Two schemes are suggested for packet encryption - torrent file oriented and seed oriented In both schemes only payload - bytes following 8 bytes request ID are encrypted. It does not create a problem because 64 bits request ID is generated locally by peer and essentially floating arbitrary random number. In both cases weak encryption is XOR - symmetrical, fast encryption algorithm. In the torrent file oriented scheme weak encryption key - XOR, is part of the torrent file containing hashes of the data. Using of XOR key for both requests and acks is mandatory for all peers willing to participate in the data distribution. Rodi client opens torrent file and fetches encryption key. Rodi client encrypt all outgoing packets with this key. There is no way to distinguish between requests encrypted by different keys - key is not part of the payload and there is no any initial handshake-key exchange between the peers. This scheme assumes that all peers participate in distribution of only one file. In the seed oriented scheme encryption key - XOR, is part of the host's (publisher) profile. Publisher advertises IP range/port/public key
combination on the identification server. Publisher posts XOR key of arbitrary length as an ASCII string.
Torrent oriented scheme is more flexible and will be implemented first. One of other options that encryption key can be published on the message board as part of the data advertisement or it can be the same permanent or randomly changed key used by this specific group of content publishers - publishers belonging to the same Rodi House. Firewalls. Using any and only one IP port (for example, HTTP - 80) for the communication can help to fight some of the firewall configurations. Still in some cases stronger scheme should be implemented. The system can fake look&feel of real RTP or NFS packet - IP port number, Protocol type, etc. and destination node can still process the packet as regular. The system does not make use of TCP flow control features and essentially implement it's own packet delivery protocol (it can be, for example, similar to the Frame Relay protocol). In the first phase Rodi will fake RTP packets. It makes sense, because
Among other convenient UDP based protocols application can use is DNS (port 53). UDP packet payload for DNS queries starts with arbitrary 16 bits transaction ID followed by flags or so called opcode (0x0100 - Standard query), then 16 bits number of questions, three 16 bits fields (resource records) which are typically zero and in the end list of queries. DNS opcode of the query should be preserved to go through intelligent firewall. Size of UDP packet is limited by 512 bytes or less. DNS uses TCP for responses larger than 512 bytes. See RFC1035 for details. IP network topology awareness, internal rounting protocol, looking for shortest pathTBD
Packet structure. I want to
use min possible bytes and I want to make resolving of backward compatibility
issues easy. One of the approaches to the problem is XML formatted messages. For
example, typical LOOK request could look like this
Packet ID is 64 bits integer generated by the sender. Rodi client sends encrypted request lead by this number and expects to find this number in the ack. Encryption algorithm can be different for different hosts and packet ID is the only way for Rodi client to open the ack. Rodi client keeps list of pending requests. Request is removed from the list if ack arrives with correct packet ID or timeout expires. Packet ID usually is a random number and can not be used by the traffic analyzer to distinguish the traffic. Payload of the requests sent to different destinations are encrypted with different keys using different algorithms. Packets can also be signed. Least significant byte is first - little endian (Intel) byte order.
Request Manager
requestManager signs the packet with MD5 (if needed) and place unique 32bits identifier - requestId. requestManager expects that the remote peer will response with the packet containing exactly the same identifier. requestManager use this identifier to find pending request in the pendingRequests array and forward the response to the relevant subsystem.
Message parsing, verification, building
Different approaches exist to the problem of message parsing. What I am looking for is very RAM and CPU efficient scheme, which can be easily ported
for any OO environment.
Incoming. Message object contains reference to the header and reference to the raw data packet. After packet verification and parsing object contains list of IEs. IE among other property fields keeps offset in the packet and reference to the array. Outgoing. Packet builder creates required message and adds IEs one after another to the list of IEs of the message. Every time packet builder adds a new IE it calculates required space and updates message size field. When the application makes a final call to release the raw data packet packet builder allocates array from the data pool and calls message header and IE methods to construct raw data. The following is an attempt (in the spirit of best effort communication) to provide an overall picture of class hierarchy. Message is just a place holder for references to the message header and list of IEs. IE class itself has an interface supporting linked list. Message header and IE inherit from the same parent - MessageAtom and implement interfaces required by Object Factory and Packet Builder. Among the methods required
Object Factory and Packet Builder. Number of different concepts are used when approaching the problem of handling of
big chunks of data.
System incoming flow looks like
Now let's see what happens in the packet builder - upstream link. Application has a message object from which data packet should be generated preferably without costly memcpy's. Currently application makes one memcpy when reading data from file and placing into the packet. Packet builder sets packet id in the first 8 bytes of the packet. Packet builder calls message setup method to build binary packet. Message setup calls all information elements one after another. Packet builder then runs encryption for the data. Packet ID is left unencrypted. The packet is sent out by the application - synchronous call to Socket Tx. Packet builder add packet ID and and correspondent decryption object to the list of packet IDs. Packet processor is expected to look on the list for reference to the decryption object when ack arrives. Message object calls signature information element last. Signature information element setup itself in the packet using zeros for the packet hash field. Signature information element then calculates hash of the whole packet including packet ID (packet ID is set by packet builder), encrypts the resulting hash and copy the result to the packet hash field. Management. Rodi core provides access to the internal data bases and management functions. Core opens management socket and waits for requests from the management. Management can be just a simple CLI or graphic and color rich GUI application and can be written using any language and can run on any system. This project develops two interfaces - CLI and light GUI. CLI provides menus to print current statistics, configure system parameters, start/stop upload/download, publish data, etc. GUI resembles CLI in the look&feel. GUI is table based, JDK 1.2 compliant (no Swing elements), light weight, distributed as JAR file and source (click example). Binary code limitations - GUI is not expected to be larger than the core. Because the management interface is provided through socket, browser based interfaces can be easily developed. User's fill the address line of the browser with something like http:\\localhost:6969. Browser downloads the GUI code from the engine and starts it. Any language can be used for GUI - HTML, JavaScript, Java, Flash. Core manager is implemented as a single high priority thread in the infinite loop blocked by call to the mailbox. Core Manager or Server waits for requests. Core Manager serves one request after another. Core Manager assumes that arriving packets are checked by packet processor against existig data base of authorized users - pairs of nickname and public key. Packet processor drops unauthorized packets. Deliberate attacks against Core Manager can be prevented by snubbing engine. Core Manager generates 64 bits session ID and sends current session ID with each and every request ack. Core Manager expects to find in the next request this exact session ID. Core Manager can and will change session ID from time to time for the running sessions challenging Remote Manager to recalculate security information element. On the other side of the system is Remote Manager or Client. Client uses it's private key and nickname to generate security information element and adds this information elemenet to each and every request. Session ID generated by the Core Manager is part of each and every request sent by Remote Manager. It prevents the scenario when adversary sends the same request to other Rodi clients. Similar to the Core Manager - Remote Manager generates 64 bits session ID and sends it with each and every request. Remote Manager expects to find it's session ID in all request acks arrived from the Remote Manager. Security information element is mandatory in all packest and contains nickname of the peer and MD5 of the packet encrypted with peer's private key. Remote Manager initiates session by sending login request with it's own session ID. The packet is signed with security information element. Core Manager
receives the packet from the packet processor, registrates new session, generates session ID and akcs the request with 'wrong session' status and local
session ID. Remote Manager issues new login request containing both remote and local session IDs and signed with security information element.
Management protocol on the side of the Remote Manager works as asynchronous RPC. Remote Manager is free to issue multpile requests to the Remote Manager. Remote Manager keeps list of pending requests. Remote Manager does not assume that Core Manager has any history of the requests. Core Manager always sends ack to to the served request. Ack can optionally contain data, status, etc. Exact structure of the management request is not known to the transport protocol - Rodi. Management packet contains
In the proposed implemenation 32 bits request ID is a mandatory field in the payload. Remote Manager generates unique 32 bits request ID to distinguish between requests. Core Manaqer adds copy of the request ID to the sent ack. When ack from Core Manager arrives Remote Manager looks for the request ID on the list of pending request IDs. List of request IDs is implemented as a hash table. If request ID is found Core Manager removes the request from the list and forwards the ack to the application level. In the backgorun Core Manager runs thread checking requests timeout and notifying the application. Management paylod contains at least one field - 16 bits event. All requests are acknowledged by the Core Manager. If it's required Core Manager sends data in the acknowledge. Implementation details Naturally everything starts in the class Start file Start.java. This class extends Applet btu also contains method main() so JAR can be used both as applet (run in the context of internet browser) and standalone application. It is going to remain this way. When Rodi runs as a standalone application the performance is somewhat better because there is no overhead of Java security layer. All initialization calls are done in the method init(). I make extensive use of global variables. All global variables are placed in the class Resources file Resources.java. Specifically reference to the Look thread, Log, CLI, etc. CLI use the Log for output which can be changed in the future. One of the parameters of CLI is
output stream which is any class providing print(String) method. CLI does not care where input comes from.
Rodi chat
Chat room host forwards all messages to the participating chatters. No IPs are sent, only nicknames. Chat room host can choose between two models.
In one chat room engine always attempts to send acknowledge to the chatters.
In the second model chat room is only a collector of messages.
Chat rooms can be public and private To join public chat room peer sends signed with it's private key string "join+nickname+public key". Public chat room checks uniqueness of the nickname and adds public key to the list of participating peers. To join private chat room peer sends the same message, but this time chat room only logs request. It will require manual intervention to add the peer to the list of participating chatters. Packets can be encrypted. Chat room owner should advertise enctyption algorithm and send keys to the participating peers. Participating peer can choose between two models. In the first peer only sends packets with spoofed IP. In the other peer sends packets with correct IP source in the IP header or in the encrypted payload. Participating peer or chatter can use two PCs - one to send messages and the other monitoring the chat room.
Packet rate control Rodi does not use TCP, still it needs some algorithm to prevent congestion on the receiving side. Packet rate control can be viewed as a black box with limited number of inputs and outputs.
Application running data transfer asks permission of Rate Limiter (RL) every time it should send a packet or enter receive. Currently I see two levels of RL's - one RL for every session and one global or system level RL. Session RL returns logical AND of it's own state and response of the system level RL.
SocketRx and SocketTx threads feed the System level RL and thread handling data session rx/tx operations feeds the correspondent session RL. RL contains all logic related to the dynamic fine tuning of the upstream and downstream. Every RL maintains packet shceduler - array where every entry contains burst size. Size of the array is defined in number of system ticks. For example, array
means that burst size (number of packets/bytes sent in a single system tick) is one and interburst delay is 2 system ticks. Naturaly accuracy
of the scheduler depends on the size of the array. Scheduler can contain both number of packets AND number of bytes to limit both
packet rate and byte rate if required. The same scheduler can work both for tx and rx operations.
Download&Upload
Downloader starts with empty local blocks map - all bits are zero. Downloader is expected to collect maps received from the peers and slowly build the full map of blocks presenting in the nework. There are two sources of the maps. The first source is responses to the LOOK request. All paritcipating peers are expecting to send MAP IE containing local map together with LOOK ACK if LOOK ACK contains STATUS IE FILE_FOUND. The other source of the maps is GET DATA requests and GET DATA ACKs. Downloader sends GET DATA request containing local blocks map - shows the blocks with correct hash. Downloader constantly updates the database of the maps calculating availability of every block. Availabilty of a block is summ of "1s" in the collected maps. Rarest block is the one with lowest summ. Availability of data is calculated as availability of rarest block of the data. Code of downloader can be and hopefully will be customized by any netowrk participant in the way to promise best possible transfer rates. There is little doubt that in some algorithms peers will try to spam the uploaders with GET DATA requests in attempt to get upstream slot the moment uploader closes previous session. This is not going to work, though. Uploader collect all statistics related to the peer including number of GET DATA requests and will ban spammers for some period of time. Uploader just delivers requested blocks and provide list of peers in the GET DATA ACK status BUSY and does not contain any logic related to the rarest block first, etc. Optionally uploader can implement super seed mode, but from the tests it appears that there is no reason to do it - good downloading algorithms will tend to send requests for rarest blocks first, because this is a smart thing to do. Good uploading algorithm should prioritize 'well' behaving peers. Good peer, for example, is a peer with high upload/download ratio. Uploader can upload one or two blocks and then ban the peer until it uploads similar number of blocks. The only reliable information for these decisions is the information collected locally. Global data base of all known peers keeps statistics between download/upload sessions. Uploader has two limits for the upstream - maximum number of simultaneously existing sessions and total bandwidth. Similarly downloader limits the downstream by number of sessions and total bandwidth. Objects of class Download contain a thread and a single shared instance of RequestManager. Class Download has a static method notify (similar to class Look). Class PacketProcessor calls Download.notify every time a packet arrives carring GET DATA ACK message header. Method notify takes the message ID and goes to the RequestManager to find the exact GET DATA request this GET DATA ACK for. If there is no pending request with such request ID method notify updates statistics and drops the packet. If corresponding request ID found method notify forwards message to mailbox of the download threads which issued the original request. Download thread does not expect that there is only one GET DATA ACK for every GET DATA. If GET DATA ACK contains only part of the requested data Download will add request to the RequestManager with the same request ID. GET DATA request contains information elements describing offset and size of the requested block, packet size, burst size, interburst and interpacket delay. Download assumes that GET DATA ACK packets will arrive according to the specified parameters or slower. Download can issue mulitple GET DATA requests, but not to the same peer. Download will spawn LOOK procedure for the downloading file periodically in attempt to find new peers and update blocks map. Pay attention that LOOK packets should not be issued tooo frequently because peers can start to drop the packets (all packets) from the spaming node. 5-10 min period is considered reasonable depending on the file size and results of the previous LOOK requests. When GET DATA ACK arrives download thread updates internal blocks map. Internal blocks map use much smaller block size than Map IE - in the range 64-256 Kbytes. Retransmission requests are to be issued in this resolutuon. Initial GET DATA request can be issued for the whole 4MB block. Download thread fetches the data from the arrived packet and writes data to the disk. If download thread discovers from the internal map that whole 4Mbytes block received it calculates the hash and compares with the hash found in the hash table of blocks (if available). Download thread discards 4MB block if hash check fails. Download thread updates statistics for the relevant peer to avoid sending GET DATA requests (and any other requests) to this peer in the future. The peer remains banned for some period of time or forever depending on the application. Menu download in the CLI provides following commands
Snubbing Algortithm Every packet and every request has price because processing of every request carry some costs in terms of upstream, downstream and CPU consumption. There are two kinds of packets - packets which are usefull for peer and packets which are not. Every peer collects and keeps statistics for all other peers with how many usefull packets it recieved and how many usefull packets it sent. Before processing next request from a remote peer Rodi client checks the mark of the remote peer and if it's too low drops the packet. Mark is a function of number of usefull packets and not usefull packets. There are different kinds of packets - packets which consume bandwidth, packets which are costly for CPU, etc. Different kinds of packets have different weight in the resulting mark NAT traversal TBD read STUN RFC There are four major types of the NATs (source - STUN RFC and wikipedia.com)
The suggested procedure assumes that symmetric NAT chooses external ports one after another. For example, NAT can
allocate ports starting from 1024 and continue 1025, ...
Upon initialization client attempts to discover what type of NAT is in use. It requires help of the external server - NAT
traversal server.
Client sends NAT penetration packet to all server ports. In the payload client puts unique key, local IP address and IP port.
Client expects that the server in the reply (in three replys from three different ports) will copy the unique key and put client's
external IP address and IP port. When client recieves the replys client will analyse them.
Client has to distinguish between full cone NAT and port restricted NAT. In both cases NAT maps internal port to the same external port for all connections. Client asks via NAT traversal server some other NAT traversal servers to send a packet. If the packet goes through the client is behind full cone NAT. If external peer fails to deliver the packet client assumes port restricted NAT. With every Look request/Data request client advertise discovered external IP address and port, NAT type, IP address of the NAT traversal server. This is optional for clients behind full cone NAT or clients without NAT or clients with forwarded port. If client does not advertise NAT type and external IP address remote peers assume that this client can be contacted directly from any IP address. Clients behind symmetric NAT and behind restricted NAT keep one connection with the server alive by sending periodically NAT penetration message. To reduce load of the NAT traversal server only clients behind symmetric NATs and restricted NATs are allowed to keep connecitons. NAT traversal server can enforce this rule by blocking the IP address. Client behind non-symmetric NAT is allowed to repeat the procedure of NAT discovery from time to time. Let's say that peer A learned IP address of peer B and type of the NAT of peer B is a symmetric NAT. Peer A sends message to the NAT traversal server that A attempts to contact B. NAT traversal server should respond with reply Ok if client B exists or Fail if the server does not know client B. If NAT traversal server knows client B the server forwards packet from A to B. B receives the packet, sends NAT penetration packet to server, discovers external port and sends NAT penetration packet to A - B punches the NAT. The server sends external port of B to A in Ok reply. A uses external port of B as a starting point for penetration port scan of B. NAT traversal takes time and is not always possible. Subsystem (for example, Look) willing to send a request to the peer behind symmetric NAT will use asynchronous interface. Status of the NAT penetration will be delivered by the API asynchronously using provided by the subsystem callback. The subsystem is responsible for closing the NAT penetration when there is no need to keep connection with the peer. Port scan consumes upstream of the peer running the scan and downstream of the destination peer. Client will choose peers behind symmetric NAT as a last option when there are no other peers or all other peers are busy and upstream of the client is not in use. Using of the NAT traversal server is optional. List of the Rodi NAT traversal servers is available as an XML file from regular HTTP server. Let's say that A attempts to contact B behind restricted or symmteric NAT. B connected to the NAT traversal server S. A sends Forward request to S. Payload of the Forward request contains command Wake Up and IP address and port of B. S adds IP address and port of A and forwards Wake Up message to B. B sends Forward request to S with payload containing Wake Up Ack and IP address and port of A. S forwards the request to A. B can send Wake Up Ack directly to A if A is not behind restricted NAT. When A gets Wake Up Ack A starts port scan of B. When B gets Wake Up request B starts port scan of A.
Crystalization Let's say that peer P1 has block m and does not have block n (m, n), let's say also that peer P2 can be described by pair (n, m) - has block n and does not have block m. It appears that if two such peers find each other quickly and establish connection that would reduce overhead and improve overall performance of the network. We call the process of establishing of such pairs - crystalization. In the beginning we see liquid - chaotic movement of molecules. With time peers learn maps of each other and connect to each other - crystalization phase. Good cooperation algorithm tends to crystalize the system in full - create one crystall where all peers are locked and satisfied and do not attempt to establish new connections. One of the proposed solutons of the problem that in the GET DATA ACK status BUSY peer can add estimated time when the upstream is getting free. Let's say that P1 sends GET DATA request for block n to P2. P2 can not open upstream immediately because it's busy serving other peers, but P2 wants to help P1, because P1 has a block which P2 does not - block m. P2 wants to be cooperative. P2 estimates - best effort guess - for how long time P1 should wait before it gets data from P2, sends the result in the GET DATA ACK message and puts P1 onto the waiting list. P1 is connected to the network for some time already and can estimate typical time which it takes to find free uploader with requierd blocks. P1 compares the timeout specified by P2 with the estimated and if the deal is good starts timer. When the timer expires P1 trys to send new GET DATA request to P2. P2 keeps free upstream for the P1 for some limited and relatively short time. If P1 fails to send packet P2 will use the upstream to serve other peers. Bouncer Bouncer is a thread waiting on socket rx. When a packet arrives bouncer sends the packet to other IP address. Bouncer is controled by rules. Rule contains following fields
we need to install open source security server to generate unique IDs, private and public keys and signatures. Rodi network needs some centralized system of exchange of keys. To support downloaders spoofing IP source bouncer will ignore IP source and look only for the destination IP port. If packet contains PEER IE in the payload bouncer will use this IP port and IP address to find forwarding rule. Bouncer will forward only Rodi control packets. Bouncer optionally can enforce this by analyzing the packet payload. Usefull bouncer can be able to handle at least 1-2K packets/s. Such bouncer can serve group of may be 100-200 downloaders. Data transfer
Naive implementation of data transfer plugin. Will be used for debug purposes and until
UDT like plugin is not available
If for 30s (timer T2) there is no new packets arrive downloader assumes that uploader closed the data transfer session. When uploader receives first GET it assumes that downloader has infinite bandwidth and streams the data to the receiver. Uploader sends all chunks as quick as possible. In the best case scenario only one GET (first) arrives to the uploader. After completion uploader starts T2 and if T2 expires close the session. If GET from the downloader arrives uploader restarts T2 and serves retransmission request. Uploader is free to choose the chunk size, but chunk size will not be smaller than 1024 bytes and will always be a power of two - 1024, 2048, 4096, ... Information elements Search (Look)
In the first phase only one look can be started. If LOOK request contans IP address/IP port publisher will use this address as a destination
instead source IP from the IP packet header. There is an optional periodic 'keep alive' look request a peer sends to a host to prevent
timeout of the entry in the host's table of active peers.
Look Ack
In the first phase only one look can be started. If IP address/IP port is present in the payload client (receiver) shall use this address instead of
regular IP address from IP packet header. Source IP in the IP header is not reliable and can be intentionally spoofed by the publisher.
IP addres/IP port in the payload can be address of the publisher or one of the traffic bouncers chosen by a publishers. Publishers can implement
different algorithms when choosing bouncers. In the most simple scenario a publisher chooses arbitrary IP from the pool of public bouncers. Pay
attention that bouncer usually will NOT proxy data, but only control messages
Data Request (Get Data)
If GET DATA request contans IP address/IP port publisher will use this address
as a destination instead source IP from the IP packet header.
Data Request Ack (Get Data acknowledge)
Data Transfer
Management request
NAT traversal
Crawler Download Upload Database
|