# Handling 100 Gb/s RDMA, and NVMe in Apache Crail and Pocket

Patrick Stuedi

## Hardware Changes since 2010

~ 2014: starting Crail project

|         | 2010             | 2015              | 2020              |     |
|---------|------------------|-------------------|-------------------|-----|
| Storage | 50 MB/s<br>(HDD) | 500 MB/s<br>(SSD) | 16 GB/s<br>(NVMe) | 10x |
| Network | 1 Gb/s           | 10 Gb/s           | 100 Gb/s          | 10x |
| CPU     | ~3 GHz           | ~3 GHz            | ~GHz              |     |

Reynold keynote, https://databricks.com/session\_na20/wednesday-morning-keynotes

## Challenges

- Difficult to leverage modern networking and storage hardware
- Example (2016): sorting 12 TB on a 128 node cluster, all data in DRAM, 100 Gb/s full bisection network



### **Software Overheads**

|            | 1 Gbps   | HDD      | 100 Gbps  | Flash    |
|------------|----------|----------|-----------|----------|
| Bandwidth  | 117 MB/s | 140 MB/s | 12.5 GB/s | 3.1 GB/s |
| cycle/unit | 38,400   | 10,957   | 360       | 495      |

software overhead are spread over the entire stack



HotNets'16

#### How do Supercomputers solve this?



#### System Software Environment

- Linux OS enabling storage + embedded compute
- OFED RDMA & TCP/IP over BG/Q Torus ailure resilient
- Standard middleware GPFS, DB2, MapReduce, Streams

#### Active Storage Target Applications

- Parallel File and Object Storage Systems
- Graph, Join, Sort, order-by, group-by, MR, aggregation
- Application specific storage interface

#### PCIe Flash Board



| Flash Storage | 2012<br>Targets |
|---------------|-----------------|
| Capacity      | 2 TB            |
| I/O Bandwidth | 2 GB/s          |
| IOPS          | 200 K           |

#### **BGAS Rack Targets**

| -               |             |
|-----------------|-------------|
| Nodes           | 512         |
| Storage Cap     | 1 PB        |
| I/O Bandwidth   | 1 TB/s      |
| Random IOPS     | 100 Million |
| Compute Power   | 104 TF      |
| Network Bisect. | 512 GB/s    |
| External 10GbE  | 512         |



IBM BlueGene Active Store Project (2012)

Key architectural balance point: All-to-all throughput roughly equivalent to Flash throughput

#### **RDMA on Azure**



|   |            | General Purpose                                                                              | Compute<br>Optimized                                                                      | Memory<br>Optimized                                                        | Storage<br>Optimized                 | GPU                                                                   |  | High<br>Performance<br>Compute                                                                              |  |
|---|------------|----------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------|-----------------------------------------------------------------------|--|-------------------------------------------------------------------------------------------------------------|--|
|   | Туре       | Av2, B, DCsv2, Dv2,<br>Dsv2, Dv3, Dsv3,<br>Dav4, Dasv4, Ddv4,<br>Ddsv4,Dv4, Dsv4             | Fsv2                                                                                      | M, Mv2, Dv2,<br>DSv2, Ev3, Esv3,<br>Eav4, Easv4, Ev4,<br>Esv4, Edv4, Edsv4 | Lsv2                                 | NC, NCv2, NCv3, I<br>NDv2, NV, NVv3, N                                |  | H, HBv2, HC, HB                                                                                             |  |
| D | escription | Balanced CPU and<br>memory                                                                   | High ratio of<br>compute to memory                                                        | High ratio of<br>memory to<br>compute                                      | High disk<br>throughput<br>and IO    | Specialized with single<br>or multiple NVIDIA<br>GPUs                 |  | High memory and<br>compute power –<br>fastest and most<br>powerful                                          |  |
|   | Uses       | Testing and<br>development, small-<br>medium databases,<br>low-medium traffic<br>web servers | Medium traffic web<br>servers, network<br>appliances, batch<br>processing, app<br>servers | Relational<br>database services,<br>analytics, larger<br>caches            | Big Data,<br>SQL, NoSQL<br>databases | Compute intension<br>graphics-intension<br>visualization<br>workloads |  | Batch processing,<br>analytics, molecular<br>modeling, fluid<br>dynamics, low<br>latency RDMA<br>networking |  |





## **RDMA Networking**

- User-level network architecture
- Kernel bypass
  - NIC queues accessible from user-space
- Transport stack offloading
  - Infiniband, RoCE, iWARP

## **RDMA Networking: Benefits**

- User-level network architecture
- Kernel bypass
  - NIC queues accessible from user-space
- Transport stack offloading
  - Infiniband, RoCE, iWARP

Low latency no sycalls Zero-copy directly DMA from/to userspace buffers Low CPU usage transport offloading High bandwidth

## **RDMA Networking: Benefits**

- User-level network architecture
- Kernel bypass
  - NIC queues accessible from user-space
- Transport stack offloading
  - Infiniband, RoCE, iWARP

Low latency no sycalls Zero-copy directly DMA from/to userspace buffers Low CPU usage transport offloading High bandwidth High bandwidth / core

### **RDMA Two-Sided Operations**



Source does not need to know buffer address at target. Receiver is notified.

## **RDMA One-Sided Operations**



# NVM Express (NVMe)

- Host-controller interface for PCI attached SSDs
- Enables user-level access for storage
  - Map device queues into user-space
  - SPDK, NVMe-over-Fabrics

#### Integrating User-level I/O with Data Processing Systems









Can't implement every operation for all the different hardware, framework and deployment options

#### Integrating User-level I/O with Data Processing Systems (2)



#### **Crail Architecture**



#### **Performance Challenges**

## **Performance Challenges**

- 1. Must handle millions of storage operations per second on a large number files with a wide range of data sizes
  - Example: the Spark shuffle engine built on top of Crail creates #partition files per core in the cluster. With 128 machines each running 3 executors with 5 cores each that is 11M files (!)
  - Files have a wide range of data sizes



## **Performance Challenges**

- 1. Must handle millions of storage operations per second on a large number files with a wide range of data sizes
  - Example: the Spark shuffle engine built on top of Crail creates #partition files per core in the cluster. With 128 machines each running 3 executors with 5 cores each that is 11M files (!)
  - Files have a wide range of data sizes
- 2. Should be able to read/write at line speed (e.g., 100 Gb/s) using a single core (for a reasonable I/O size)
- 3. Must support reading/writing of tiny files in a few microseconds
- 4. Overall CPU consumption of the storage system should be kept low
- 5. Must be able to store data volumes > cluster DRAM

.

#### 1. Fast data path using one-sided RDMA

- Metadata needs to include target address for read/write (not shown)
- Have the NIC DMA to/from actual application buffers



#### 1. Fast data path using one-sided RDMA

- Metadata needs to include target address for read/write (not shown)
- Have the NIC DMA to/from actual application buffers



very close to

Reading 256B

**HW** limits

- 1. Fast data path using one-sided RDMA
  - Metadata needs to include target address for read/write
  - Have the NIC DMA to/from actual application buffers

#### 2. Scale metadata RPC using two-sided RDMA

- Keep RPC request/response messages small (< 128 bytes)</li>
- Process RPC in-place on receiving core
- Avoid NUMA remote memory access



- 1. Fast data path using one-sided RDMA
  - Metadata needs to include target address for read/write —
  - Have the NIC DMA to/from actual application buffers -

#### 2. Scale metadata RPC using two-sided RDMA

- Keep RPC request/response messages small (< 128 bytes) —
- Process RPC in-place on receiving core -
- Avoid NUMA remote memory access

1 metadata server can serve ~10 million requests per second



increasing load

15

- 1. Make data path fast using one-sided RDMA
  - Metadata needs to include target address for read/write
  - Have the NIC DMA to/from actual application buffers
- 2. Scale metadata RPC using two-sided RDMA
  - Keep RPC request/response messages small (< 128 bytes)
  - Process RPC in-place on receiving core
  - Avoid NUMA remote memory access

3. No threads, execute all I/O in the process context of the app

4. Avoid interrupts for small data transfers and RPCs

5. Per NUMA node pre-pinned buffer pool for application memory

#### 6. Horizontal tiering

Store data in Flash **iff** all DRAM in the cluster is exhausted





#### **Example: Spark Shuffle using Crail**



#### Sorting 12TB on 128 Node Cluster



|                     | Spark/Crail | Winner 2014 | Winner 2016 |
|---------------------|-------------|-------------|-------------|
| Size (TB)           | 12.8        | 100         | 100         |
| Time (sec)          | 98          | 1406        | 134         |
| Total cores         | 2560        | 6592        | 10240       |
| Network HW (Gbit/s) | 100         | 10          | 100         |
| Rate/core (GB/min)  | 3.13        | 0.66        | 4.4         |

www.sortingbenchmark.org

### **Pocket: Ephemeral Storage for Serverless Analytics**

#### **Serverless Analytics**

• Serverless frameworks are increasingly being used for interactive analytics

PyWren (SoCC'17)

ExCamera (NSDI'17)



gg: The Stanford Builder

Amazon Aurora Serverless

### **Serverless Analytics**

- Serverless frameworks are increasingly being used for interactive analytics
  - Exploit massive parallelism with large number of serverless tasks



# **Challenge: Data Sharing**

- Serverless analytics involve multiple stages of execution
  - Serverless tasks need an efficient way to communicate intermediate data between different stages
- Today: such data sharing is implemented using remote storage
  - Enables fast and fine-grained scaling
- Problem: existing storage platforms not suitable
  - Slow (e.g., S3)
  - No dynamic scaling (e.g. Redis)
  - Designed for either small or large data sets
- Can we use Crail?

#### **Crail Deployment Modes**



#### **Crail Deployment Modes**



#### **Pocket Overview**



#### **Pocket: Resource Utilization**



Pocket cost-effectively allocates resources based on user/framework hints

#### **Pocket: Autoscaling**



#### References

- crail.apache.org
- github.com/apache/incubator-crail (Java)
- github.com/patrickstuedi/crailnative (C++)
- Pocket: Elastic ephemeral storage for serverless analytics, OSDI'2019
- github.com/stanford-mast/pocket
- Wimpy nodes with 10 GbE: Leveraging One-sided RDMA operations to boost Memcached, USENIX ATC'12

#### **Contributers**

Crail: Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, Bernard Metzler, Adrian Schuepbach, Ana Klimovic, Yuval Degani

Pocket: Ana Klimovic, Yawen Wang, Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, Christos Kozyrakis