Actually, I was asked to write something about my work and my field by Shoubham – a linuxwarrior :P – but due to the hectic schedule of the last two weeks, I didn't find sufficient time :)
In today's world, where 1Gbit/sec Ethernet has become very common, the demand for higher network speeds is rapidly increasing. Data centers, compute clusters and many R&D-oriented organizations need more speed out of Ethernet. The problem with conventional NICs is that they depend heavily on the TCP/IP stack above them. Even if the NIC is capable of high-speed data transfer, the TCP/IP overhead keeps the host CPU (the one that comes on the motherboard) so busy that the machine cannot pump enough data to the NIC, so the NIC's capability to transfer at higher speed goes unused. RDMA technology is one of the solutions to this problem.
Problems with conventional TCP/IP model
Let's take the example of an application that transfers a file (using the TCP/IP stack implementation of Linux). This application has two modes: 1. server, 2. client.
The server reads the file from disk; the client waits for the file to be received and writes it back to disk. Usually what happens is: the server reads from the file and copies the data into an application buffer, hands the buffer to the stack, the stack hands it to the NIC, the NIC does the transfer, the buffer is received by the peer NIC, which passes it back up through the TCP/IP stack to the client application, and finally the buffer is written to the local file. The following steps come into the picture while doing data transfer this way:
1. Read from the file and copy it into the app's buffer (not a CPU-hungry operation).
2. Call the send() function of the socket API.
Linux's TCP/IP stack lives inside the kernel. The buffer passed to send() is in user space, so to hand it to the TCP/IP stack (i.e. to the kernel) it must be copied from user space to kernel space. This copy is one of the most CPU-hungry operations and can hog the CPU up to 80-85%. Once the buffer has been copied into the stack's kernel buffer, the stack does its own header processing (calculating checksums, fragmenting buffers, maintaining sequence numbers and all that – again somewhat CPU-hungry operations). After all this processing, the buffer(s) are handed to the NIC, and the NIC does the transfer.
3. Call the recv() function of the socket API on the peer.
The recv operation does the reverse of what I described for step 2. Again a CPU-hungry copy takes place, from the kernel buffer back to user space, plus all the TCP/IP processing.
4. The buffer is written to disk in the local file.
Bottom line: even if the NIC is capable of transferring data at higher speed, the TCP/IP stack plus the host CPU will restrict the NIC to a certain speed. This is where RDMA comes into the picture.
RDMA stands for Remote Direct Memory Access. A NIC built on this technology is called an RNIC. As the name suggests, an RNIC has the capability to access the user memory of its own machine as well as (virtually) that of the peer machine. RDMA technology also requires some changes in the underlying hardware, so the RNIC has to have additional functionality: a TOE (TCP Offload Engine – i.e. the whole TCP/IP stack implemented in a chip), a 10Gig Ethernet standard output port, an on-board DMA controller and various other peripherals (I don't know much detail about the hardware peripherals). The protocol on which RDMA technology works is called iWARP (internet Wide Area RDMA Protocol). The APIs which iWARP exposes are called iWARP "verbs". These verbs cannot be mapped to the socket API directly; iWARP verbs need some special care with the buffers that the user allocates. So an application designed using iWARP verbs cannot communicate with an application that is not using iWARP. To make these verbs compatible with ordinary socket applications, the vendor needs to provide support for Upper Layer Protocols (like SDP, WSD), which is beyond the scope of this article. RDMA has basically two main modes for data transfer: RDMA Read and RDMA Write.
RDMA Read/Write :
In RDMA Read, the machine issuing the operation reads memory from the peer machine. In RDMA Write, the machine issuing the operation writes its own buffer into the peer machine's memory.
How RNIC works
Let's take the same example of an application doing a file transfer, but this time using an RNIC.
1. Allocate a user-level buffer to read the file into.
2. Register this buffer with the RNIC using iWARP verbs (not a CPU-hungry operation).
Registering a buffer with the RNIC returns a unique identifier, called an "STag", for that registered buffer. These STags (i.e. the ids of registered buffers) have to be exchanged before doing any RDMA operation (i.e. RDMA Read/Write). This registration of the buffer is really nothing but mapping the user buffer into the kernel: instead of copying it from user space, we map the buffer straight into kernel space. This concept is called ZCopy, or Zero Copy.
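For the curious, on Linux the verbs interface is commonly exposed through the libibverbs API (part of rdma-core), where the STag shows up as the memory region's rkey. A minimal, hedged sketch of the registration step – this assumes RNIC hardware and an already-created protection domain `pd`, and is not runnable as-is:

```c
/* Sketch only: requires an RNIC and libibverbs; "pd" is a protection
   domain obtained earlier via ibv_alloc_pd(). */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);        /* ordinary user-space buffer */
    if (!buf)
        return NULL;

    /* Pin and map the buffer so the RNIC can DMA to/from it directly
       (the Zero Copy mapping described above).  The returned memory
       region carries an lkey (local id) and an rkey - the STag that
       the peer must learn before it can RDMA Read/Write this buffer. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```

Note that the access flags explicitly grant the peer remote read/write rights; without them, an incoming RDMA operation against this buffer would be rejected.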
3. Do an RDMA Write (just for example) from the server side. (Since the server wants to write the file to the other machine, we can do an RDMA Write from the server; otherwise we would have to do an RDMA Read from the client side.) To do an RDMA Write, the machine must have the STag of the peer machine's buffer where the data should be written. This initial exchange of STags has to be taken care of by the application. The process looks very simple, but it is very complex in the internal layers. RDMA technology actually uses TCP/IP underneath, and that TCP/IP has to be in hardware (i.e. TOE) to get the advantage of this technology.
4. Once the RDMA Write is done, the buffer we wanted to transfer has been written to the peer machine.
RDMA Read is almost the reverse of RDMA Write. The host doing the RDMA Read has to know the STag of the peer machine's buffer (i.e. where to read on the peer machine).
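Step 3 can be sketched in the same libibverbs style. Again this is only an illustrative, non-runnable sketch: it assumes a connected queue pair `qp`, a registered local buffer `mr`, and that `remote_addr` and `remote_stag` were already exchanged with the peer by the application, as described above.

```c
/* Sketch only: issue an RDMA Write from the server side.  The peer's
   CPU is not involved at all - the peer RNIC writes straight into the
   registered remote buffer identified by the STag. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t remote_stag)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr,  /* local registered buffer */
        .length = (uint32_t)mr->length,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;   /* where to write on the peer */
    wr.wr.rdma.rkey        = remote_stag;   /* the peer's STag            */

    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}
```

An RDMA Read looks almost identical: swap the opcode for IBV_WR_RDMA_READ and the data flows the other way, from the peer's registered buffer into the local one.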
Basically, RDMA Read/Write performs better when the data size is large. For small data sizes it adds some overhead of exchanging STags and all that; for smaller transfers, we can just use TOE.
This is how high performance with lower CPU usage can be obtained. And it works over Ethernet, so there is no need to change the underlying infrastructure. The only problem with this technology is that both communicating parties have to be RDMA-enabled. If one of the parties is not using RDMA, then both have to talk over the native TCP/IP stack.
That's it!!!!! Please let me know in case of any queries.