Remote Direct Memory Access (RDMA)
Remote Direct Memory Access (RDMA) allows data to move directly from the memory of one computer into that of another without involving either computer's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. RDMA relies on a particular approach to using direct memory access (DMA), applied across the network.
RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work by the CPUs and involve no caches or context switches, and they continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.
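As an illustration of the zero-copy path, the following function-level C sketch (written against the libibverbs verbs API) posts a single one-sided RDMA Write that moves data straight from a registered application buffer into remote memory, with no intermediate kernel copy. It assumes an already-connected queue pair, a memory region covering the buffer, and a remote address and rkey exchanged out of band; the function name and parameters are illustrative, not drawn from the text above.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Hypothetical helper: posts one RDMA Write that moves len bytes from a
 * locally registered application buffer directly into remote memory.
 * Assumes qp is already connected, mr covers buf, and remote_addr/rkey
 * were exchanged out of band.                                            */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                    uint32_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* data leaves application memory as-is */
        .length = len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr;
    struct ibv_send_wr *bad_wr = NULL;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write          */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion     */
    wr.wr.rdma.remote_addr = remote_addr;        /* where the data lands     */
    wr.wr.rdma.rkey        = rkey;               /* remote registration key  */

    return ibv_post_send(qp, &wr, &bad_wr);      /* no kernel copy involved  */
}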
This strategy presents several problems related to the fact that the target node is not notified of the completion of the request (one-sided communication). A common way to notify it is to change a memory byte when the data has been delivered, but this requires the target to poll on that byte. Not only does the polling consume CPU cycles, but the memory footprint and the latency increase linearly with the number of possible peer nodes, which limits the use of RDMA in High-Performance Computing (HPC) in favor of MPI.
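A minimal C sketch of the polling pattern described above, with all names hypothetical: a second thread stands in for the remote node whose final RDMA Write flips a flag byte, while the main thread busy-waits on that byte. No RDMA hardware is involved; the point is only to show the CPU cost of detecting one-sided completion.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static _Atomic unsigned char done_flag = 0;  /* the byte a remote RDMA Write would flip */
static char payload[64];                     /* the buffer the write would land in      */

/* Stand-in for the remote node: after a short delay it "delivers" the data
 * and then sets the completion byte, just as a final RDMA Write would.     */
static void *remote_writer(void *arg)
{
    (void)arg;
    usleep(1000);
    snprintf(payload, sizeof payload, "data delivered by RDMA Write");
    atomic_store(&done_flag, 1);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, remote_writer, NULL);

    /* The target receives no completion event for a one-sided operation,
     * so it burns CPU cycles busy-waiting on the flag byte.                */
    while (atomic_load(&done_flag) == 0)
        ;  /* poll */

    printf("target saw: %s\n", payload);
    pthread_join(t, NULL);
    return 0;
}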
The Send/Recv model used by other zero-copy HPC interconnects, such as Myrinet or Quadrics, does not have any of these problems and offers equally good performance, since their native programming interfaces are very similar to MPI.
RDMA reduces protocol overhead, which would otherwise eat into the capacity to move data across a network, reducing performance, limiting how fast an application can get the data it needs, and restricting the size and scalability of a cluster.
However, some overhead remains because of the need for memory registration. Zero-copy protocols usually require that the memory areas involved in a communication stay in main memory, at least for the duration of the transfer; in particular, this memory must not be swapped out, otherwise the DMA engine might operate on out-of-date data and corrupt memory. The usual approach is to pin the memory down so that it is kept in main memory, but this registration is expensive and increases latency linearly with the size of the data. Several approaches have been adopted to address this issue (a registration sketch follows the list below):
- deferring memory registration out of the critical path, thereby hiding part of the latency increase;
- using caching techniques to keep data pinned as long as possible, so that the overhead is reduced for applications that communicate repeatedly from the same memory area;
- pipelining memory registration and data transfer, as done on InfiniBand or Myrinet, for instance;
- avoiding the need for registration altogether, as Quadrics high-speed networks do.
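As a concrete illustration of the registration cost discussed above, here is a minimal sketch using the libibverbs API: it opens the first RDMA device found, pins and registers a 4 MiB buffer with ibv_reg_mr, and deregisters it again. The buffer size and the error handling are illustrative only, and the program assumes an RDMA-capable adapter and the libibverbs development headers (link with -libverbs).

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (4 * 1024 * 1024)   /* 4 MiB buffer; size chosen for illustration */

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    void *buf = malloc(BUF_SIZE);
    if (!pd || !buf) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }

    /* ibv_reg_mr pins the pages backing buf so the adapter can DMA to and
     * from them.  This is the expensive registration step discussed above;
     * its cost grows with the size of the region being registered.         */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %d bytes, lkey=0x%x rkey=0x%x\n",
           BUF_SIZE, (unsigned)mr->lkey, (unsigned)mr->rkey);

    /* Deregistration unpins the memory; registration caches keep the MR
     * alive across transfers to amortize this cost (second bullet above). */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}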
RDMA’s acceptance is also limited by the need to install a different networking infrastructure. Newer standards enable RDMA implementations that use Ethernet at the physical layer and TCP/IP as the transport, combining the performance and latency advantages of RDMA with a low-cost, standards-based solution. The RDMA Consortium and the DAT Collaborative[1] have played key roles in the development of RDMA protocols and APIs for consideration by standards groups such as the Internet Engineering Task Force and the Interconnect Software Consortium.[2] Software vendors such as Oracle Corporation support these APIs in their latest products, and network adapters that implement RDMA over Ethernet are being developed.
SCSI RDMA Protocol
The SCSI RDMA Protocol (SRP), also known as the SCSI Remote Protocol, is a protocol that allows access to a networked SCSI device through RDMA. The use of RDMA makes higher throughput and lower latency possible than, for example, the TCP/IP communication protocol. RDMA is only possible with network adapters that support RDMA in hardware, such as InfiniBand HCAs and 10 GbE network adapters with iWARP support.
As with iSER, there is the notion of a target (a system that stores the data) and an initiator (a client accessing the target) with the target performing the actual data movement. In other words, when a user writes to a target, the target actually executes a read from the initiator and when a user issues a read, the target executes a write to the initiator.
While the SRP protocol is easier to implement than the iSER protocol, iSER offers more management functionality, e.g. the target discovery infrastructure enabled by the iSCSI protocol. Furthermore, the SRP protocol never made it into an official standard. The latest draft of the SRP protocol, revision 16a, dates from July 3, 2002.
Using the SRP protocol requires an RDMA-capable network, an SRP initiator implementation, and an SRP target implementation. The following software SRP initiator implementations exist:
- Linux SRP initiator, available since November 2005 (kernel version 2.6.15).
- Windows SRP initiator, available through the WinOF InfiniBand stack.
- VMware SRP initiator, available since January 2008 through Mellanox' OFED Drivers for VMware Infrastructure 3.
- Solaris 10 SRP initiator, available through Sun's download page.
Only one software SRP target implementation is available. It is open source and can be downloaded either through the SCST project or from the OFED website.
Bandwidth and latency of storage targets supporting the SRP or the iSER protocol should be similar. On Linux there is an SRP storage target implementation available that runs inside the kernel (SCST) and an iSER storage target implementation that runs in user space (STGT). Measurements have shown that the first implementation has a lower latency and a higher bandwidth than the second. This is probably because the RDMA communication overhead is lower for a component implemented in the Linux kernel than for a user space Linux process, and not because of protocol differences.
The iSCSI Extensions for RDMA (iSER) protocol maps the iSCSI protocol over a network that provides RDMA services (such as TCP with RDMA services (iWARP) or InfiniBand). This permits data to be transferred directly into SCSI I/O buffers without intermediate data copies. The Datamover Architecture (DA) defines an abstract model in which the movement of data between iSCSI end nodes is logically separated from the rest of the iSCSI protocol; iSER is one Datamover protocol. The interface between the iSCSI layer and a Datamover protocol, in this case iSER, is called the Datamover Interface (DI).
The motivation for iSER is to use RDMA operations more efficiently in order to avoid unnecessary data copying and buffering requirements on the target and initiator. iSER allows data to be transferred between initiator and target over RDMA services without any data copy at either end. The main difference between standard iSCSI and iSCSI over iSER in the execution of SCSI read/write commands is that with iSER the target drives all data transfer (with the exception of iSCSI unsolicited data) by issuing RDMA Write/Read operations, respectively. When the iSCSI layer issues an iSCSI command PDU, it calls the Send_Control primitive, which is part of the DI; Send_Control carries the STag along with the PDU. The iSER layer on the target side notifies the target's iSCSI layer that the PDU has been received, using the Control_Notify primitive (also part of the DI). The target then calls the Put_Data or Get_Data primitive (both part of the DI) to perform an RDMA Write or Read operation, respectively. Finally, the target calls Send_Control to send a response to the initiator.
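To make the ordering of these calls easier to follow, here is a hypothetical C sketch of the target-side sequence for a single SCSI read carried over iSER. The functions are illustrative stand-ins for the abstract DI primitives named above (Control_Notify, Put_Data, Send_Control); they are not a real API, and the STag value and transfer length are made up.

#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint32_t stag;    /* STag advertising the initiator's buffer */
    uint32_t length;  /* number of bytes the initiator asked for */
} scsi_read_cmd;

/* Stand-in for Control_Notify: iSER hands the received command PDU up to
 * the target's iSCSI layer.                                               */
static void control_notify(const scsi_read_cmd *cmd)
{
    printf("Control_Notify: READ of %u bytes, STag 0x%x\n",
           (unsigned)cmd->length, (unsigned)cmd->stag);
}

/* Stand-in for Put_Data: the target issues an RDMA Write that places the
 * read data directly into the initiator's buffer.                         */
static void put_data(const scsi_read_cmd *cmd, const void *data)
{
    (void)data;
    printf("Put_Data: RDMA Write of %u bytes to STag 0x%x\n",
           (unsigned)cmd->length, (unsigned)cmd->stag);
}

/* Stand-in for Send_Control: the target returns the SCSI response PDU to
 * the initiator once the data movement is complete.                       */
static void send_control_response(const scsi_read_cmd *cmd)
{
    printf("Send_Control: SCSI response for STag 0x%x\n", (unsigned)cmd->stag);
}

int main(void)
{
    /* Command PDU as delivered by the initiator's Send_Control call.       */
    scsi_read_cmd cmd = { .stag = 0xCAFE, .length = 4096 };
    char block[4096] = { 0 };            /* data read from storage          */

    control_notify(&cmd);                /* 1. command PDU arrives at target */
    put_data(&cmd, block);               /* 2. target drives the RDMA Write  */
    send_control_response(&cmd);         /* 3. target sends the response     */
    return 0;
}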