A Brief about RDMA, RoCE v1 vs RoCE v2 vs iWARP

Hello folks,

Today we'll look at what RDMA is and how different networking solutions implement it over Ethernet. If these words are new to you: this is a post about networking, and the short version is how to speed up your network without adding new servers or deploying IB. IB stands for InfiniBand, a networking communication standard used mainly in HPC that features very high throughput and very low latency. For what HPC is, refer to my previous blog.


This is what a Mellanox IB cable looks like

Rather than moving to IB, speeding up an existing Ethernet-based network is in many cases the better and more cost-effective option. This is done using RDMA, which saves admins both the cost and the overhead of a technology shift.

So let's see what RDMA is. RDMA stands for Remote Direct Memory Access: direct memory access from the memory of one computer into that of another without involving either one's operating system. In the non-RDMA case, all data sent passes through the CPU and the operating system's network stack before it reaches memory, which eats CPU cycles and adds overhead and latency. RDMA overcomes this problem. With RDMA, the network adapters set up a channel over which data moves directly between the memory of the two machines with little or no CPU involvement, saving a lot of overhead and making communication far more efficient.

To implement RDMA in our Ethernet-based solution we can use either RoCE or iWARP, so let's first see what RoCE is and how RoCE v1 differs from RoCE v2.

RoCE (RDMA over Converged Ethernet) (pronounced “rocky”)

As the name suggests, RoCE is a network protocol that allows remote direct memory access (RDMA) over a converged Ethernet network. There are two versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol, so it allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 runs on top of UDP/IP, which means that RoCE v2 packets can be routed across subnets.
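To make the v1 versus v2 difference concrete, here is a minimal Python sketch that looks at a raw Ethernet frame and tells the two apart. It relies on two well-known identifiers: RoCE v1 uses its own Ethertype (0x8915), while RoCE v2 travels inside UDP with destination port 4791. The function name `classify_frame` is my own, and the sketch assumes untagged frames and IPv4 only (RoCE v2 can also run over IPv6), so treat it as an illustration rather than a real packet parser.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915   # Ethertype used by RoCE v1
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ROCE_V2_UDP_PORT = 4791      # UDP destination port used by RoCE v2


def classify_frame(frame: bytes) -> str:
    """Return 'RoCE v1', 'RoCE v2', or 'other' for an untagged Ethernet frame."""
    if len(frame) < 14:
        return "other"
    # Ethernet header: 6-byte destination MAC, 6-byte source MAC, 2-byte Ethertype.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"          # sits directly on the link layer, not routable
    if ethertype != ETHERTYPE_IPV4 or len(frame) < 14 + 20:
        return "other"
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4      # IPv4 header length in bytes
    if ihl < 20 or ip[9] != IPPROTO_UDP or len(ip) < ihl + 8:
        return "other"
    # UDP header starts with the source port and the destination port.
    _src_port, dst_port = struct.unpack("!HH", ip[ihl:ihl + 4])
    return "RoCE v2" if dst_port == ROCE_V2_UDP_PORT else "other"
```

Because the v2 packet is ordinary UDP/IP as far as the network is concerned, any IP router can forward it, which is exactly why v2 is routable and v1 is not.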

That is RoCE in a nutshell, but there are a few terms you must know to understand and implement it:

  • DCB
    • Data-Center Bridging (DCB) is an extension to the Ethernet protocol that makes dedicated traffic flows possible in a converged network scenario. DCB distinguishes traffic flows by tagging the traffic with a specific value (0-7) called a “CoS” value which stands for Class of Service. CoS values can also be referred to as “Priority” or “Tag”. Note that every node in the network (switches and servers) needs to have DCB enabled and configured consistently for DCB to work.
  • PFC
    • In standard Ethernet, we have the ability to pause traffic when the receive buffers are getting full. The downside of Ethernet pause is that it pauses all traffic on the link. As the name suggests, Priority-based Flow Control (PFC) can pause traffic per priority. In other words, PFC creates pause frames based on a traffic CoS value. This way we can manage flow control selectively for the traffic that requires it, such as storage traffic, without impacting other traffic on the link.
  • ETS
    • With DCB in place, the traffic flows are nicely separated from each other and can be paused independently thanks to PFC, but PFC does not provide any Quality of Service (QoS). If your servers can fill the whole network pipe with storage traffic alone, other traffic such as cluster heartbeat or tenant traffic may be put in jeopardy.
      The purpose of Enhanced Transmission Selection (ETS) is to allocate bandwidth based on the priority settings of the traffic flows. This way the network components share the same physical pipe, but ETS makes sure that everyone gets their specified share of the pipe and prevents the “noisy neighbor” effect.
  • Data Center Bridging Exchange Protocol
    • This is the best thing in all of RoCE. This protocol is better known as DCBX and is also an extension to the DCB protocol, where the “X” stands for eXchange.
      DCBX can be used to share information about the DCB settings between peers (switches and servers) to ensure you have a consistent configuration on your network. So basically DCBX does all the hard work for you, and you just sit back and watch.
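If you like to see these ideas as bytes and numbers, here is a small Python sketch of the three building blocks above: the CoS value carried in the 3-bit priority field of an 802.1Q VLAN tag, a PFC pause frame that pauses only selected priorities, and ETS shares turned into guaranteed bandwidth. The function names, the source MAC, the priority numbers and the percentage split in the usage note are all hypothetical values I picked for illustration. Real deployments configure this on the NIC and the switch rather than crafting frames by hand.

```python
import struct
from typing import Dict

VLAN_TPID = 0x8100                             # 802.1Q tag identifier
PFC_DEST_MAC = bytes.fromhex("0180c2000001")   # reserved MAC control address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def vlan_tag(priority: int, vlan_id: int) -> bytes:
    """Build a 4-byte 802.1Q tag; the 3-bit priority field carries the CoS value."""
    if not 0 <= priority <= 7:
        raise ValueError("CoS priority must be 0-7")
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    tci = (priority << 13) | vlan_id           # drop-eligible bit left at 0
    return struct.pack("!HH", VLAN_TPID, tci)


def pfc_pause_frame(src_mac: bytes, pause_quanta: Dict[int, int]) -> bytes:
    """Build a PFC frame pausing only the priorities listed in pause_quanta.

    pause_quanta maps a priority (0-7) to a pause time in quanta (0-65535).
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for prio, quanta in pause_quanta.items():
        if not 0 <= prio <= 7 or not 0 <= quanta <= 0xFFFF:
            raise ValueError("priority must be 0-7 and quanta 0-65535")
        enable_vector |= 1 << prio             # one enable bit per priority
        timers[prio] = quanta                  # one timer per priority
    header = PFC_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    return (header + payload).ljust(60, b"\x00")   # pad to minimum size, FCS excluded


def ets_guarantees(link_gbps: float, shares: Dict[str, int]) -> Dict[str, float]:
    """Turn ETS percentage shares into the minimum bandwidth each class is guaranteed."""
    if sum(shares.values()) != 100:
        raise ValueError("ETS shares must add up to 100 percent")
    return {name: link_gbps * pct / 100 for name, pct in shares.items()}
```

For example, `pfc_pause_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})` pauses only priority 3 and leaves the other seven priorities flowing, and `ets_guarantees(10, {"storage": 50, "cluster": 10, "tenant": 40})` gives storage a guaranteed 5 Gbps on a 10 Gbps link while still leaving room for heartbeat and tenant traffic.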
With all this you can easily understand and implement RoCE v1 or v2 based on your network configuration. Now let's see what iWARP is.


iWARP (Internet Wide-area RDMA Protocol)
iWARP is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. iWARP and RoCE are quite similar, so you already know most of the concepts and I'll keep iWARP to just this overview.
Because iWARP is layered on IETF-standard congestion-aware protocols such as TCP and SCTP, whereas RoCE v2 runs on UDP, iWARP makes few requirements on the network and can be successfully deployed in a broad range of environments.
Implementing iWARP is also simple since it is a TCP-based RDMA technology, so we won't dive into that.
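As a quick recap of how the three options stack up, here is a tiny Python sketch that writes each protocol's layering as a list, from the wire upward, and derives two properties from it. The layer lists are simplified (for iWARP I show the usual TCP-based stack of MPA, DDP and RDMAP), and the helper names are made up for this post.

```python
# Simplified protocol layering, listed from the wire upward.
STACKS = {
    "RoCE v1": ["Ethernet", "InfiniBand transport"],
    "RoCE v2": ["Ethernet", "IP", "UDP", "InfiniBand transport"],
    "iWARP": ["Ethernet", "IP", "TCP", "MPA", "DDP", "RDMAP"],
}


def is_routable(protocol: str) -> bool:
    """A protocol can cross subnets only if it has an IP layer."""
    return "IP" in STACKS[protocol]


def rides_on_tcp(protocol: str) -> bool:
    """True when the protocol gets TCP's congestion control and retransmission."""
    return "TCP" in STACKS[protocol]


for name in STACKS:
    print(f"{name}: routable={is_routable(name)}, tcp={rides_on_tcp(name)}")
```

RoCE v1 is the only one stuck inside a single broadcast domain, and iWARP is the only one that rides on TCP, which is where its relaxed network requirements come from.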

That's all for this blog. For a more visual take on RoCE, I found this great video by the RoCE Initiative, and I strongly suggest you go through it to get a clear picture of what all this is.

Hope this blog was of some help.

Next Blog: What is Neural Network and what is the logic behind the prediction models


Till then,

Happy Blogging
