0. Table of Contents
1. Introduction
In this work, we present our long-term operation experience on a real-world production RDMA-based container network (RCN). We notice that the performance of RCN is significantly affected by the scale. Through the careful telemetry analysis on our production RCN for a year, we find that the scalability walls are caused by the RCMA NICs (RNICs), whose hardware limits or defects siginificantly affect the performance when the RCN scales up. However, we are confronted with a key challenge that we have very limited visbility into the internals of today’s RNICs.
To address the problem, we propose a novel approach, i.e., combinatorial causal testing, to infer the most likely causes of the performance issues in RNICs based on the their common components and functionalities. The resulting system, dubbed ScalaCN, is being gradually deployed in our production RCN, and has helped us to infer and resolve 82% causes of the RNICs’ inferior performance. Our evaluation on real-world workloads show that the end-to-end network bandwidth increases by 1.4× and the packet forwarding latency decreases by 31% after deploying ScalaCN.
2. Code
We have released the code for measurement, testing, and analysis in ScalaCN. The code is available at scala-cn.github.io/scala-cn.
The code is organized as follows:
Directory | Description | Source Code |
common | common dependencies such as firmware tools, OVS tools, and tc tools | mlx_resdump |
mlx_resdump_cpp | ||
data_path.py | ||
matcher.py | ||
tc.py | ||
interpolation | performance fitting for RNICs | interpolation_bw.py |
interpolation_bw.py | ||
lib | static libs | libibverbs.so.1.14.35.0 |
libmlx5.so.1.19.35.0 | ||
testing | example code for combintorial testing and analysis | data_parser.py |
profiling_cx7_ce_20_de_20.py | ||
tracer | performance/state measurement and monitoring | agent_tracer.py |
capture-tc.py | ||
table_flow_monitor.py | ||
tracer_config.py |
3. Data
The data samples in this work are available at scala-cn.github.io/data.
Each single data file regarding the RNIC performance on different component/functionality configurations is the output of the perftest
tool.
An example of the data file is as follows:
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : [DEV]
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x06b8 PSN 0x74cf8c
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
local address: LID 0000 QPN 0x06b9 PSN 0xb9086
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
local address: LID 0000 QPN 0x06ba PSN 0xbe6e58
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
local address: LID 0000 QPN 0x06bb PSN 0x7eaac7
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
remote address: LID 0000 QPN 0x05d0 PSN 0x277fa8
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
remote address: LID 0000 QPN 0x05d1 PSN 0x29b472
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
remote address: LID 0000 QPN 0x05d2 PSN 0x69c1d4
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
remote address: LID 0000 QPN 0x05d3 PSN 0xc29e93
GID: 00:00:00:00:00:00:00:00:00:00:[IP]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
8388608 20000 9360.24 9360.24 0.001170
---------------------------------------------------------------------------------------
4. Defects Found
We also report the RNICs’ possible defects found during our measurement study.
No. | Symptom | Layer | Ratio | Most Likely Cause |
---|---|---|---|---|
S1 | Repetitive flow re-offloading | Virtual Switch | 17.1% | Flow entries in the RNIC are deleted although they are not aged. |
S2 | Kernel stagnation | RNIC driver | 5.9% | The driver cannot handle the timeout of the RNIC’s executing an operation. |
S3 | Kernel crash on new flows | RNIC driver | 5.2% | The driver frees a null pointer when the RNIC fails to create a flow entry. |
S4 | Slow flow state maintenance | RNIC hardware | 11.4% | Flow deletion in the RNIC takes much longer (9×) time than expected. |
S5 | Intermittent software forwarding | RNIC hardware | 15.3% | Flow counters are not updated timely. The virtual switch reads a “dirty” value. |
S6 | Poor performance of specific flows | RNIC hardware | 29.9% | Flow table query in the RNIC is performed sequentially in the on-chip memory. |
S7 | PCIe link down when unbinding VFs | RNIC hardware | 8.4% | Race condition emerges when the RNIC cleans up allocated resources. |
S8 | RNIC unresponsiveness | RNIC hardware | 6.8% | VXLAN encapsulation contexts exceed the RNIC’s buffer capacity. |