

# OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs

Mikhail Khalilov, Marcin Chrapek, Siyuan Shen, Alessandro Vezzu, Thomas Benz, Salvatore Di Girolamo, and Timo Schneider, *ETH Zürich;* Daniele De Sensi, *ETH Zürich and Sapienza University of Rome;* Luca Benini and Torsten Hoefler, *ETH Zürich* 

https://www.usenix.org/conference/atc24/presentation/khalilov

# This paper is included in the Proceedings of the 2024 USENIX Annual Technical Conference.

July 10–12, 2024 • Santa Clara, CA, USA

978-1-939133-41-0

Open access to the Proceedings of the 2024 USENIX Annual Technical Conference is sponsored by



جامعة الملك عبدالله للعلوم والتقنية King Abdullah University of Science and Technology




# **OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs**

Mikhail Khalilov<sup>1</sup>, Marcin Chrapek<sup>1</sup>, Siyuan Shen<sup>1</sup>, Alessandro Vezzu<sup>1</sup>, Thomas Benz<sup>2</sup>, Salvatore Di Girolamo<sup>1</sup>, Timo Schneider<sup>1</sup>, Daniele De Sensi<sup>1,3</sup>, Luca Benini<sup>2</sup>, and Torsten Hoefler<sup>1</sup>

<sup>1</sup>Department of Computer Science, ETH Zürich <sup>2</sup>Department of Information Technology and Electrical Engineering, ETH Zürich <sup>3</sup>Department of Computer Science, Sapienza University of Rome

#### Abstract

Multi-tenancy is essential for unleashing SmartNICs' potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to standard NIC data paths with a well-defined set of offloaded functions, unpredictable execution times of SmartNIC kernels make conventional approaches for multi-tenancy and QoS insufficient. We fill this gap with OSMOSIS, a SmartNIC resource manager co-design. OSMOSIS extends existing OS mechanisms to enable dynamic hardware resource multiplexing of the on-path packet processing data plane. We integrate OSMOSIS within an open-source RISC-V-based 400Gbit/s SmartNIC. Our performance results demonstrate that OSMOSIS fully supports multi-tenancy and enables broader adoption of SmartNICs in datacenters with low overhead.



Figure 1: A predictable NIC data path versus the unpredictable sNIC kernel execution.

#### 1 Introduction

Network data plane design has undergone two decades of exciting research, leading to the achievement of sub-microsecond packet processing host latency [8, 25, 27, 38, 41, 47–49, 63, 75, 79, 86]. SmartNICs (sNICs) have further improved processing times by enabling direct in-network packet processing, thereby reducing data movement [45]. sNICs started a trend in datacenter networking acceleration [50, 96] similar to the GPU trend in high-performance computing [98].

sNICs enable running *kernels* on programmable, energy-efficient cores tailored for packet processing and integrated within the host network interface card (NIC) System-on-Chip (SoC). These cores are attached directly (i.e., *on-path*) to the datacenter Ethernet or InfiniBand link [5, 57]. Such a design reduces the latency of some applications since the sNIC can process the packets in the network [61] and reply directly without moving the packets to/from the host OS networking stack [1, 34]. This enables the offload and acceleration of several workloads such as gradient aggregation for distributed learning [93, 98], disaggregation and storage [29, 33, 52, 65, 66], Key-Value Stores (KVS) [78, 92, 104], Remote Procedure Calls (RPCs) [14, 56, 60, 82, 102], and network protocols and telemetry [14, 16, 23, 42, 67, 89, 102, 103].

Network resources in a datacenter are multiplexed between tenants through a virtualization layer [12, 18, 54, 69, 106]. However, processing user code on sNICs brings a set of considerable resource management issues. As Figure 1 shows, NICs have three resources that must be multiplexed: compute, Direct Memory Access (DMA) bandwidth, and egress bandwidth. The traditional NIC data path only forwards packets to host memory and executes simple operations with a predictable and bounded complexity. Typically, the number of incoming bytes equals the number of outgoing bytes, and NICs do not run any elaborate processing on them. In contrast, sNICs can execute unpredictably complex stateful offloads [77]. For example, Allreduce, which is heavily used in machine learning [9], operates on the payload and is compute-bound, while storage offloading predominantly accesses host memory and is DMA/IO-bound. sNICs need to operate on uncoordinated, non-deterministic, and concurrent data streams while meeting Service Level Objective (SLO) policies set by the administrator.

Achieving fair resource multiplexing for sNICs is challenging. sNICs combine characteristics of an accelerator, such as a GPU, and a traditional NIC. While this provides the aforementioned benefits, the resource management of neither is directly applicable due to the unique sNIC requirements (Section 3). Conventional RDMA NICs (rNICs) have bounded and predictable workloads (e.g., atomics, scatter-gather RDMA reads/writes) and often use link bandwidth allocation as a "just enough" mechanism for resource isolation and Quality-of-Service (QoS) between tenants. Although rNICs exhibit bounded and foreseeable behavior, achieving fairness is challenging [100] even within their simpler context. In contrast, accelerators fall entirely under the governance of the host OS, which oversees all active kernels [51, 53]. These accelerators neither generate nor receive events beyond instructions from accelerated applications, setting them apart from sNICs, which can execute arbitrary kernels independently of the host's involvement.

Furthermore, for sNICs to sustain the sub-*nano*second packet arrival intervals on a fully utilized 400Gbit/s link (Section 3, [28]), resource multiplexing must be performed quickly. On-path sNICs have much stricter compute and buffering constraints than traditional NICs and accelerators due to the packet rate and the three multiplexed resources (compute, DMA, and egress). This issue is even more critical as network rates constantly increase and are expected to exceed a terabit per second by 2025 [15, 24, 36, 95].

A common approach to effectively manage processing at high packet arrival rates involves implementing resource management in hardware [2, 4, 28]. This is usually accomplished through scheduling policies such as Weighted Round Robin (WRR), which divide link bandwidth among tenants [20, 21, 100]. However, because sNICs have varying application kernel requirements, incorporating WRR for compute resource allocation can lead to unfairness. For example, as we show in Section 3, if one application (e.g., Allreduce) is compute-bound and takes twice as much compute time as a non-compute-bound application (e.g., KVS), the former will be able to process twice as many bytes. Other recently proposed methods for compute isolation in sNICs are not optimal for all scenarios as they are either non-work conserving [32] or rely on the host CPU as a fallback path [60].

We tackle these issues by introducing OSMOSIS (<u>Operating System Support for Streaming In-Network Processing</u>) (Section 4). OSMOSIS is a lightweight sNIC management layer that supports performance-critical dataplane management in hardware and non-critical management tasks in a flexible software runtime. OSMOSIS is a fair, work-conserving sNIC resource manager that requires a minimal hardware footprint and employs expressive yet simple Service Level Objective (SLO) semantics. In OSMOSIS, the sNIC is exposed to the tenant as a Single-Root Input/Output Virtualization (SR-IOV) Virtual Function (VF). This allows the administrator to allocate proportionally more *compute processing units, egress bandwidth*, and *DMA bandwidth* to VFs associated with high-priority tenants.

We implement (Section 5) and evaluate (Section 6) OSMOSIS on top of one of the available open-source on-path sNIC architectures, PsPIN [19, 35]. PsPIN is based on energy-efficient silicon-proven RISC-V cores. In our setup, PsPIN is the hardware backbone for packet processing using kernels written in C. Our performance evaluation focuses on typical datacenter workloads such as storage IO and in-network Allreduce, and shows that OSMOSIS provides comprehensive support for multi-tenancy without sacrificing performance.

In summary, we make the following contributions:

1. *sNIC multi-tenancy:* We show typical multi-tenancy sNIC problems and define a set of requirements for high-performance sNICs. These requirements serve as a guideline for developing sNICs that can meet the needs of diverse workloads and tenant environments (Section 3).
2. *OSMOSIS:* We introduce OSMOSIS, a lightweight open-source sNIC resource manager based on fair and work-conserving scheduling policies. OSMOSIS is a minimal hardware footprint solution to the problem of fair and efficient resource sharing in multi-tenant sNICs with diverse application needs (Section 4).
3. *Evaluation:* We implement OSMOSIS in an open-source on-path 400Gbit/s sNIC by extending it with schedulers and a control path prototype (Section 5). We use this implementation to verify and evaluate OSMOSIS. We demonstrate how it solves the defined sNIC problems and handles multi-tenant applications fairly with varying resource requirements while minimizing tail latency (Section 6).

# 2 Background

From the system's perspective, we abstract out the sNIC as a packet processing accelerator between the network fabric and the host CPU, GPU, or FPGA. Existing sNICs can be classified broadly into two categories: *off-path* and *on-path* [60].

Off-path sNICs add an entire CPU complex to the network card, often running a full operating system (e.g., Linux). This design enables a management plane based on receive side scaling (RSS) to be conveniently implemented [8, 64, 79]. However, they often suffer from lower performance in terms of latency, bandwidth, and packet processing rates due to their system design, which closely resembles the CPU-centered host architecture (e.g., Broadcom Stingray and Nvidia Bluefield data processing units (DPUs) both feature ARM SoCs with PCIe and DRAM).

On-path sNICs share packet input buffers with *processing units* (PUs) tailored for highly-parallel packet processing (e.g., LiquidIO [62], Netronome [71], PsPIN [19], Data Path Accelerator (DPA) introduced in Bluefield 3 DPU [17, 72, 73]). On-path sNICs typically provide a programming API for writing *kernels* that process traffic on PUs at per-packet (PsPIN [19]) and/or per-message granularity (Bluefield-3 FlexIO API [72]). PUs typically feature three layers of the memory hierarchy, e.g., L1 single-cycle access scratchpad, L2 memory with access latency of 15-50 cycles, and host side memory (either off-path SoC or host CPU memory). L1 and L2 memories could be organized as multi-level caches (e.g., LiquidIO) or be explicitly managed by the user (e.g., PsPIN).

Figure 2: Schematic overview of on-path sNIC architectures. Red arrows indicate the data path and blue arrows correspond to the control/management path.

To our knowledge, OSMOSIS is the first solution to achieve fair resource multiplexing for on-path sNICs in a multi-tenant context. We selected one of the possible synthesizable open-source on-path sNIC implementations available in the literature, namely, PsPIN. PsPIN is open-source, based on energy-efficient silicon-proven RISC-V cores, and allows users to write packet processing kernels in C and explicitly manage sNIC memory [19]. OSMOSIS could have been equivalently implemented in any other on-path framework [62, 71, 72]. For example, we discuss how OSMOSIS can be supported with BlueField-3 DPA in Section 5.3.

#### 2.1 Challenges of Resource Isolation

We generalize on-path sNIC architecture in Figure 2. Packets decoded from the sNIC physical layer (e.g., Ethernet MAC) arrive at the sNIC inbound engine (1) and are initially stored in the L2 packet buffer, organized as a set of per-application first-in-first-out (FIFO) queues. Next (2), packets are scheduled for processing on available PUs, where kernel execution is initiated (3). Kernels execute using three resources: PUs, DMA, and Egress bandwidth. Each application uses these resources differently (e.g., compute- or IO-bound) depending on its needs. In general, these resources can be used as follows:

- (3) PUs: computing (e.g., hashing the packet header or summing values in an Allreduce reduction);
- (4) DMA engine: transferring data to read or write sNIC memory (e.g., KVS cache in sNIC L2 memory) or host memory (e.g., KVS cold storage);
- (5) Egress engine: sending packet replies (e.g., reply to a read request with a value from the KVS cache).

Metrics to measure the quality of resource multiplexing by datacenter tenants, known as Service-Level Objectives (SLOs), are typically tied to the conventional NIC path displayed in Figure 2 by considering tail latency [18] and throughput [70, 88]. However, these SLOs do not consider the sNIC data path with its unique resource multiplexing discussed in Section 3, such as PU time, tail latency of DMA over host interconnect, and buffer space. Existing proposals have only partially addressed this issue by introducing performance isolation mechanisms, such as multi-level packet scheduling [28, 60, 91] and static resource allocation [32] of shared resources (see Section 7). Yet, due to the kernels' dynamic and unpredictable nature, static assignments do not solve the problem. OSMOSIS fills this gap by providing bounded guarantees for the sNIC resource availability to tenants using dynamic resource multiplexing.

#### 3 Multi-Tenant sNICs

Datacenter applications differ in their resource requirements, which leads to different resource multiplexing bottlenecks. Our quantitative analysis highlights these issues in multi-tenant setups of existing sNIC stacks [19, 72], yielding sNIC multi-tenancy requirements. These insights directly led to the microarchitectural and software choices for OSMOSIS. We use a 400 Gbit/s link for all experiments (more details on the setup in Section 6).

**Per-packet time budget (PPB):** While studies of datacenter traffic show that only a fraction of the established connections actively exchange data at any given time [10, 84, 101], they can still saturate the link bandwidth. To analyze the implications of this for sNICs, we define the per-packet time budget (PPB) using PU count $N$, packet size $P$, and link bandwidth $B$ as $PPB(N, P, B) = N \times (P/B)$. In this case, we model the sNIC as an M/M/m queue where PPB defines the condition that needs to be satisfied for the queue to be stable [13]<sup>1</sup>. To be more specific, PPB represents how long the sNIC can process a packet until the next one arrives, assuming a fully utilized link. If PPB is exceeded, the per-application ingress queue will eventually fill up during transient traffic bursts, leading to packet drops or falling back to link flow control (e.g., PFC [107]) and a possible violation of per-VF SLO policy.
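To make the PPB definition concrete, the following minimal sketch evaluates $N \times (P/B)$ for a few packet sizes. The PU count of 8 is an illustrative assumption borrowed from the Figure 4 scenario, not a statement about the evaluated hardware, and the function name `ppb_seconds` is hypothetical.

```python
def ppb_seconds(num_pus: int, packet_bytes: int, link_bits_per_s: float) -> float:
    """PPB(N, P, B) = N * (P / B), with P converted from bytes to bits."""
    return num_pus * (packet_bytes * 8) / link_bits_per_s


LINK = 400e9  # 400 Gbit/s link, as used in the experiments
NUM_PUS = 8   # assumed PU count, for illustration only

for size in (64, 256, 1024, 4096):
    budget_ns = ppb_seconds(NUM_PUS, size, LINK) * 1e9
    print(f"{size:5d} B packet: budget of {budget_ns:.2f} ns per packet")
```

A kernel whose per-packet service time exceeds the printed budget cannot keep up with a fully utilized link, so its ingress queue grows.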

Figure 3 compares service times of IO- and compute-bound workloads with the theoretical PPB, assuming that tenant workloads fit in one packet and that the sNIC has only one tenant. We observe that all workloads with packet size $\leq 64$ Bytes fail to fit in the PPB. Compute-bound workloads (i.e., Aggregate, Reduce, Histogram), whose execution time scales linearly with packet payload length, exceed the PPB for all packet sizes, bottlenecking the PUs. Notably, IO-bound kernels above 256 Bytes (i.e., DMA writes/reads, Egress packet sends) fit the PPB as they avoid PU congestion but are bottlenecked by the link bandwidth. However, as we will demonstrate, *IO-bound* workloads are sensitive to DMA transfer contention on the host interconnect.

<sup>1</sup> $1/\lambda = P/B$, $m = N$; to achieve $\rho = \lambda/(m\mu) < 1$, the mean service time must satisfy $1/\mu < N \cdot P/B$, i.e., $1/\mu$ must stay below the PPB.

Figure 3: sNIC core (PU) processing time needed to serve 1 packet for common sNIC kernels. Workloads with triangle markers are compute-bound, and circular markers are IO-bound. All workloads with $\leq$ 64B packet size (including 28 bytes IPv4/UDP-header) exceed PPB, showing congestion at PUs when link bandwidth is fully utilized. Note that our setup supports Ethernet payload sizes below 64B to accommodate custom interconnects [44].

Figure 4: *Congestor* and *Victim* tenants' flows with equal priorities are mapped to two different SR-IOV VFs with equal shares of Ingress bandwidth. With the round-robin scheduling of per-flow queues, the *Congestor* tenant with $2\times$ higher compute cost per packet occupies a proportionally larger number of cores than the *Victim* tenant.

**PU contention:** While a single tenant can cause pressure on the ingress queue and contention of PUs, multiple tenants can lead to unfairness. For example, consider two compute-bound tenants with different requirements. One of them, the *Congestor*, has twice the compute cost per packet of the other, the *Victim*, and therefore needs twice as many PU cycles to finish its kernel. During the burst, *Congestor* and *Victim* push packets to the corresponding per-application (per-VF) queues at the same ingress rate. As Figure 4 shows, using the conventional round robin (RR) scheduling of per-application queues across 8 sNIC PUs, the *Congestor* uses $2\times$ the PUs used by the *Victim*.
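The effect can be reproduced with a small model, sketched here under simplifying assumptions: both queues are always backlogged, packets are dispatched in strict round-robin order to whichever PU frees up first, and per-packet costs are fixed at 2 and 1 time units. The function name and the parameters are hypothetical and only illustrate the reasoning above.

```python
import heapq


def rr_pu_occupancy(costs, num_pus, num_packets):
    """Average number of PUs each queue occupies under round-robin dispatch."""
    free_at = [0] * num_pus          # time at which each PU becomes free
    heapq.heapify(free_at)
    busy = [0] * len(costs)          # accumulated PU time per queue
    for i in range(num_packets):
        queue = i % len(costs)       # round-robin over the per-tenant queues
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + costs[queue])
        busy[queue] += costs[queue]
    makespan = max(free_at)
    return [b / makespan for b in busy]


congestor, victim = rr_pu_occupancy(costs=[2, 1], num_pus=8, num_packets=1000)
print(f"Congestor: {congestor:.2f} PUs, Victim: {victim:.2f} PUs")
```

Because round-robin equalizes the number of dispatched packets rather than the PU time, the queue with the higher per-packet cost ends up holding twice as many PUs on average.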

**R1** *sNIC manager should fairly allocate compute components (e.g., PUs, cryptographic accelerators) while serving tenants with different compute costs per packet.* 

**Egress and DMA engines contention:** Just as compute-bound kernels cause contention on PUs, IO-bound kernels can lead to contention on the corresponding DMA or egress engines. IO-bound kernels running on different PUs can simultaneously initiate IO requests through the same sNIC engines, e.g., DMA requests from a KVS application. If the underlying interconnect (e.g., PCIe or AXI [81]) is blocking and lacks support for QoS provisioning, *issuing multiple concurrent requests may result in Head-of-Line (HoL) blocking* [1].

Figure 5: Slow-down of various IO operations (e.g., DMA and sending packets to Egress) initiated by the tenant's kernel results in HoL-blocking small requests due to underlying IO path contention.

For example, consider two IO-bound tenants with different IO requirements. The *Victim* has constant 64B packets, while the *Congestor* increases its packet size from 64B to 4096B. As Figure 5 shows, the contention on the IO engine leads to an order of magnitude higher latency of the *Victim*'s messages without considerably affecting the *Congestor*'s flow. This unfairly increases the latency of one of the tenants by $4$–$15\times$.

**R2** *sNIC manager should fairly allocate DMA and egress bandwidth (e.g., using AXI and PCIe) between running kernels and be resilient to HoL-blocking.*

**Memory management:** Applications have diverse memory runtime needs, with dynamic memory allocation causing an unknown *a priori* memory consumption. In extreme cases, a tenant could monopolize all sNIC memory, e.g., L1 packet buffers, resulting in HoL-blocking for others. Introducing virtual memory (paging) semantics could lead to substantial memory access overheads, as each page fault significantly amplifies memory access latency [40].

**R3** sNIC manager should fairly allocate memory using lightweight allocation strategies defined in the control plane.

**Scheduling overhead:** Existing *software* packet processing data paths [8, 25, 79] were designed for off-path sNICs or conventional host processing. Recent studies show [47] that, on off-path sNICs running an OS such as Linux, the *effectiveness* of kernel execution scheduling, measured as the maximum achieved utilization, is driven by the latency of context switching [27, 47]. PU cycles are wasted during context switching to transition between the kernel states. We benchmark context switching of Linux running on the host and on an off-path sNIC (Bluefield-2 ARM SoC). We compare these to the state-of-the-art Caladan scheduler, which we ported to the ARM ISA [27]. For reference, we also show the context switching latency of PULP cores as implemented in PsPIN, which we use to evaluate OSMOSIS. Notably, we observe that the context switching latencies we report in Table 1 are higher than or of the same order of magnitude as the PPB from the analysis presented in Figure 3.

| PU                             | Frequency | ISA    | Linux | Caladan | RTOS |
|--------------------------------|-----------|--------|-------|---------|------|
| Host Ryzen 7 5700              | 3.8GHz    | x86    | 28576 | 211     | -    |
| BF-2 DPU A72                   | 2.5GHz    | ARMv8  | 13250 | 192     | -    |
| PULP cores [6] (used in PsPIN) | 1GHz      | RISC-V | -     | -       | 121  |

Table 1: Average latency of context switching between 2 processes. Measurements shown in PU cycles scaled to 1 GHz (i.e., 1 ns/cycle).

**R4** Data path performance should not be impacted by overheads stemming from software scheduling policies, providing low-latency scheduling of kernel execution.

**Control path priority:** If a tenant on the sNIC exceeds compute or time budgets, an immediate response is needed from the host's control plane for *control traffic*, e.g., it needs to be handled within the error path of the application running on the host CPU or off-path sNIC cores. However, communication between sNIC and host uses the system interconnect (e.g., PCIe), typically adding an overhead of 0.5–3 µs per read/write request. Congestion in the interconnect (Figure 5, [1]) can lead to HoL-blocking of control traffic and unpredictable packet processing.

**R5** *sNIC* accelerated packet processing should prioritize *control-path traffic.* 

**QoS API:** NIC capabilities are exposed to tenants through a virtualization layer (OS hypervisor) that provides an illusion of full resource ownership. SR-IOV is a standardized extension for the PCIe interconnect and a conventional way to implement NIC virtualization. It is utilized in many conventional industry-standard NICs, e.g., ConnectX and BlueField NICs. In SR-IOV, each NIC physical function (PF) (such as TX and RX capabilities) is multiplexed between several virtual functions (VFs). Each VF is exposed to the tenant through an OS hypervisor as a stand-alone PCIe NIC. To our knowledge, existing production rNICs and sNICs support only Ingress and Egress bandwidth allocation on the basis of VFs and not compute or DMA resources.

**R6** sNIC management plane should support conventional QoS provisioning mechanisms for all types of resources.



Figure 6: Abstract model of OSMOSIS-enabled sNIC. Packets are mapped by Matching Engine to FMQs and dispatched for execution by the scheduler.

# 4 OSMOSIS

We present OSMOSIS in Figure 6. We begin with a high-level overview of how OSMOSIS manages the three competing sNIC resources and satisfies the multi-tenancy requirements outlined in the previous section. We then demonstrate how this is achieved by dividing the system into two components. The first is a non-critical, flexible software control plane that handles management tasks and runs on the host CPU or off-path sNIC cores. The second is a performance-critical data plane scheduler designed specifically to support SLO policy enforcement and integrate within the on-path sNIC SoC. Within this section, each part is explained in depth.

|                        | PUs                          | DMA         | Egress   | Memory          |
|------------------------|------------------------------|-------------|----------|-----------------|
| Scheduler              | WLBVT                        | WRR         | WRR      | Static          |
| SLO knob               | Priority, kernel cycle limit | Priority    | Priority | Allocation size |
| Fulfilled requirements | R1 R4 R6                     | R2 R4 R5 R6 | R2 R4 R6 | R3 R4 R6        |

Table 2: OSMOSIS resource management principles with all six fulfilled multi-tenancy requirements.

#### 4.1 High-level Overview

**(1) Flow execution context creation:** To utilize sNIC packet processing, tenants create a flow *execution context* (ECTX). ECTX encapsulates the flow processing state, such as the SLO policy and the packet processing *kernel*, a piece of code compiled for the target PU architecture and describing the actions for each packet destined for the flow.

**(2) ECTX initialization:** After the tenant provides the basic elements of an ECTX, OSMOSIS instantiates it. It allocates a virtualized sNIC interface through the host OS hypervisor and associates it with a tenant IP address and SLO policy. It also sets up the IOMMU to allow kernel access to specific host pages, *statically* allocates on-sNIC memory, and loads the kernel binary into sNIC memory.

**(3) Matching packets to flow management queue:** The sNIC matching engine filters packets that require sNIC processing. All incoming packets are matched against the three-tuple (in case of UDP) or five-tuple (in case of TCP) of active sNIC ECTXs. Once matched, *packet descriptors* (e.g., pointer to packets in sNIC memory) are stored at one of the *flow management queues* (FMQs). FMQs store all information regarding an active flow ECTX on the sNIC hardware. FMQs are organized as FIFO queues of packet descriptors with an additional memory state to store running execution information (e.g., BVT metric).

**(4) PU scheduling:** Once a PU becomes available, OSMOSIS schedules the packet at the head of one of the FMQs. To achieve fair PU allocation, OSMOSIS implements a centralized, non-preemptive scheduler inspired by the Borrowed Virtual Time (BVT) policy [22, 47]. BVT aims to allow each tenant to obtain the same amount of access time to the scheduled resource by keeping track of their past usage. OSMOSIS FMQ scheduler *allocates sNIC PUs to FMQs with the smallest priority-adjusted past PU usage measured in cycles* while maintaining the SLO policy specified by the sNIC administrator, such as the upper per-FMQ PU cycle limit.

**(5) Kernel execution and IO management:** Upon loading the packet into local PU memory, the PU can process it using the relevant kernel. As seen in Section 3, parallel kernel executions on different PUs can lead to head-of-line blocking (HoL-blocking) and uncertain tail latency for DMA to sNIC/host memory and egress data transfers. For example, kernels can pipeline large storage reads by overlapping asynchronous DMA reads of packet-sized payloads with egress packet sending. OSMOSIS mitigates this by fairly arbitrating IO paths, breaking sizable DMA requests into smaller transactions, and scheduling them with a near-perfect-fairness weighted round-robin (WRR) policy. FMQs supply DMA and egress engines with tenant IO priorities for initiated IO requests. This ensures that each tenant obtains a priority-based fair bandwidth chunk when moving data within L2 or host memory using DMA reads/writes.
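The benefit of fragmenting requests before arbitration can be illustrated with a minimal sketch. It assumes a 512-byte fragment size and equal tenant weights, both chosen only for illustration, and the helper names (`fragment`, `wrr`, `bytes_before_done`) are hypothetical rather than part of OSMOSIS.

```python
from collections import deque


def fragment(size, chunk):
    """Split one request of `size` bytes into transactions of at most `chunk` bytes."""
    return deque(min(chunk, size - off) for off in range(0, size, chunk))


def wrr(queues, weights):
    """Return the service order [(tenant, bytes), ...] under weighted round-robin."""
    order = []
    while any(queues.values()):
        for tenant, queue in queues.items():
            for _ in range(weights[tenant]):  # up to `weight` transactions per round
                if not queue:
                    break
                order.append((tenant, queue.popleft()))
    return order


def bytes_before_done(order, tenant):
    """Bytes of other tenants' traffic served before `tenant`'s last transaction."""
    last = max(i for i, (t, _) in enumerate(order) if t == tenant)
    return sum(b for t, b in order[:last] if t != tenant)


CHUNK = 512  # hypothetical fragment size
weights = {"congestor": 1, "victim": 1}
whole = wrr({"congestor": deque([4096]), "victim": deque([64])}, weights)
split = wrr({"congestor": fragment(4096, CHUNK), "victim": fragment(64, CHUNK)}, weights)
print(bytes_before_done(whole, "victim"), bytes_before_done(split, "victim"))
```

In this model the 64B request waits behind the full 4096B transfer when requests are served whole, but only behind a single fragment once the large request is split, which is the HoL-blocking reduction the paragraph describes.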

# 4.2 Flexible software control plane

OSMOSIS offers a host OS API for sNIC packet processing management, encompassing ECTX creation and offloading specific flow handling to the sNIC. Tenant-initiated offloading involves the creation of a flow ECTX. ECTX facilitates tenant control using the following components.

**SLO policy:** The SLO policy sets compute, DMA, and egress priorities, kernel cycle budget, packet buffer size, and on-sNIC memory. OSMOSIS offers transparent SLO management via the SLO knobs indicated in Table 2. By default, all tenants' FMQs share equal priority. To achieve perfect fairness in such a scenario, all flows should get the same portion of PUs and IO bandwidth at any time. Increasing the priority of the ECTX leads to *proportionally* more resources (PUs, bandwidth) allocated to the ECTX. A per-kernel cycle limit is adjustable for total or individual kernel execution times and curbs excessive PU usage. The cycle limit also guards against ill-behaved tenant code (e.g., an infinite while(true) loop). We assess the SLO's impact on resource fairness in Section 6.

**Kernel binary:** The kernel binary, cross-compiled by the tenant, is loaded into sNIC memory by the control plane and is later executed on the flow packets. The kernel binary can compute and schedule DMA and egress requests according to the tenant requirements.

**A virtualized sNIC device:** A virtualized device is allocated for the tenant, e.g., SR-IOV Virtual Function (VF). OSMOSIS associates an IP address with the VF and uses it later for matching, i.e., the VF is 1:1 associated with a single FMQ. Similarly, FMQ-based management can be exposed through any other sNIC virtualization interface, e.g., [54], [106].

**A matching rule:** The matching rule matches packets from the sNIC inbound stream to the ECTX and manages their processing within the same FMQ. A matching rule allows the tenants to open multiple ports on the same virtualized device. The matching engine can match packets based on their UDP/TCP header contents. For example, it can match the IP address and the destination port of the application.

**sNIC memory segments:** The sNIC memory segments are allocated statically to each kernel depending on the requested memory size. The kernels can store the application state in sNIC local memory, e.g., KVS-cache or packet filter table. An error is returned if the tenant uses too much memory or the kernel binary is larger than the SLO policy limits.

**Host memory pages:** The ECTX specifies which host pages can be accessed from the specific kernel via DMA. The DMA engine on the sNIC interfaces the host memory with an IOMMU, translating host virtual addresses to physical addresses. The IOMMU also checks whether the sNIC is accessing an allowed memory region. The control plane initializes the IOMMU with appropriate page tables during execution context creation.

**Event queue (EQ):** An event queue allows the user application to track events like kernel execution errors. When an error occurs (e.g., illegal memory access or exceeding execution time), OSMOSIS informs the host via an event in the kernel's ECTX EQ. A host OSMOSIS API call from the application checks this queue for error messages. EQ can be realized as contiguous sNIC memory mapped to the host virtual address space, akin to RDMA Verbs API EQ [44]. EQ control path traffic shares the sNIC DMA data path (e.g., PCIe or CXL) with regular kernel execution (e.g., DMA initiated within the kernel) but gets the highest IO priority due to tenants' immediate action needs.
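The ECTX components listed above can be summarized as a data structure. The sketch below is only an illustration of how they fit together: every class, field, and method name is hypothetical, the default limits are made-up values, and the matching rule is simplified to a UDP-style three-tuple lookup.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SloPolicy:
    compute_prio: int = 1
    dma_prio: int = 1
    egress_prio: int = 1
    kernel_cycle_limit: int = 100_000  # hypothetical default
    memory_bytes: int = 64 * 1024      # hypothetical default


@dataclass
class Ectx:
    vf_ip: str                  # IP address associated with the SR-IOV VF
    match_rule: tuple           # (protocol, dst_ip, dst_port)
    kernel_binary: bytes        # cross-compiled by the tenant
    slo: SloPolicy = field(default_factory=SloPolicy)
    host_pages: list = field(default_factory=list)   # pages exposed via the IOMMU
    event_queue: list = field(default_factory=list)  # errors reported to the host


class ControlPlane:
    def __init__(self):
        self.rules = {}  # matching rule -> FMQ index
        self.fmqs = []   # FMQ index -> ECTX

    def create_ectx(self, ectx: Ectx) -> int:
        """Instantiate an ECTX and return the index of its FMQ."""
        if len(ectx.kernel_binary) > ectx.slo.memory_bytes:
            raise ValueError("kernel binary exceeds the SLO memory limit")
        self.fmqs.append(ectx)
        self.rules[ectx.match_rule] = len(self.fmqs) - 1
        return self.rules[ectx.match_rule]

    def match(self, protocol: str, dst_ip: str, dst_port: int) -> Optional[int]:
        """Return the FMQ index for a packet, or None if no ECTX matches."""
        return self.rules.get((protocol, dst_ip, dst_port))


cp = ControlPlane()
kvs = cp.create_ectx(Ectx("10.0.0.2", ("udp", "10.0.0.2", 9000), b"\x00" * 128))
print(cp.match("udp", "10.0.0.2", 9000), cp.match("udp", "10.0.0.9", 1))
```

Packets for which `match` finds no rule would not be handled by any FMQ in this model; how such traffic is treated is outside the scope of the sketch.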

# 4.3 Hardware data plane

OSMOSIS provides low management overhead with a minimal hardware footprint. We present two key mechanisms that help us to achieve this goal: a hardware flow abstraction (FMQs) and scheduling algorithms suitable for hardware implementation (WLBVT and DWRR).

**Flow management queues** (FMQs) generalize a packet flow similarly to how a hardware thread generalizes a process. If the tenant needs to offload multiple workloads, each workload kernel (binary) must be associated with its own FMQ. FMQs store matched packet descriptors in a FIFO queue and monitor the flow processing performance. The scheduler then uses these measures to allocate compute resources fairly and enforce per-flow priorities. Processing the FIFO queue triggers kernel executions on sNIC PUs, resembling program instruction execution flow in traditional OS processes.

FMQs also store part of the ECTX state, such as the matching rule, pointers to the kernel binary, and the SLO policy definition. The host-side control plane manages and initializes FMQs that appear as MMIO registers in SR-IOV VF address space. FMQs are highly extensible. For example, the OSMOSIS priority model is compatible with datacenter Ethernet [43]. In case of congestion on the FMQ FIFO queue, the packets can be marked with the appropriate Ethernet ECN congestion flag or can supply the per-FMQ telemetry information [2, 3, 26, 44, 58, 107].

```
def pu_limit(ActiveFMQs, fmq):
    prio_sum = 0
    for q in FMQs:  # distinct loop name: do not shadow the fmq argument
        if not q.empty:
            prio_sum += q.prio
    return ceil(len(FMQs) * fmq.prio / prio_sum)

def update_tput(FMQs):  # called at each clock cycle
    for fmq in FMQs:
        fmq.total_pu_occup += fmq.cur_pu_occup
        if not fmq.empty or fmq.cur_pu_occup > 0:
            fmq.bvt += 1  # update only in active state
        fmq.tput = fmq.total_pu_occup / fmq.bvt

def get_fmq_idx():  # called once PU core is free
    min_tput = MAX_INT
    for fmq in ActiveFMQs:
        if fmq.pu_occup < pu_limit(ActiveFMQs, fmq):
            if fmq.tput / fmq.prio < min_tput:
                min_tput = fmq.tput / fmq.prio
                fmq_idx = fmq.idx
    return fmq_idx
```

Listing 1: WLBVT FMQ scheduler procedural pseudocode.

**FMQ Scheduler** allocates PUs across flows with different compute, DMA, and egress costs per packet that are not known *a priori*. Thus, to achieve fair compute utilization, the FMQ arbitration policy needs to be *invariant to the cost-per-byte of the packet* (see Figure 4). OSMOSIS implements a hardware scheduler as simple and scalable as deficit-weighted round-robin (DWRR) but with a minimal additional area footprint (see Section 5).

OSMOSIS introduces a greedy *Weight Limited Borrowed Virtual Time* (WLBVT) policy, a hybrid of the Weighted Fair Queuing (WFQ) model of FMQ weights and the Borrowed Virtual Time (BVT) scheduler. We adapt the BVT algorithm to suit sNIC hardware implementation constraints [22, 47] and present our scheduler as pseudocode in Listing 1. Intuitively, our scheduler aims to allocate each tenant the same amount of PU processing time normalized by priority while ensuring that each tenant is served fairly during PU contention.

An FMQ is in an active state if it contains packet descriptors in the FIFO queue or if its packets are currently being processed on any PU. Flow throughput is updated (update\_tput) at each sNIC clock cycle only if the corresponding FMQ is active. The scheduler (get\_fmq\_idx) returns the index of the non-empty FMQ that fits the upper limit of weighted PU occupation (the pu\_limit check) and has the lowest current throughput normalized by FMQ priority (the tput / prio comparison).

The weighted PU occupation's upper limit guarantees fair QoS for tenants based on their priority. pu\_limit is calculated with a *ceil* function to ensure fairness when there are more active FMQs than PUs or when the division is non-integer. Selecting the lowest priority-normalized throughput equalizes access to oversubscribed PUs over time, favoring users utilizing fewer resources. Our approach can also accommodate total virtual time per tenant (see Listing 1), which could be useful for billing purposes, thus expanding policy flexibility.
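To make the interplay of the PU limit and the normalized throughput concrete, the following is a minimal cycle-level simulation sketch of a WLBVT-style scheduler. It is not the OSMOSIS hardware implementation. It rests on several assumptions: each kernel run has a fixed `cost` in cycles, `pu_limit` scales the number of PUs (`num_pus`) by the FMQ's priority share among FMQs with queued packets, each FMQ's `idx` equals its position in the list, and the `simulate` helper is hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from math import ceil


@dataclass
class FMQ:
    idx: int
    prio: int
    cost: int                       # cycles per kernel run (assumed constant)
    fifo: deque = field(default_factory=deque)
    cur_pu_occup: int = 0           # PUs currently running this flow
    total_pu_occup: int = 0         # PU-cycles consumed so far
    bvt: int = 0                    # cycles spent in the active state
    tput: float = 0.0

    @property
    def empty(self):
        return not self.fifo


def pu_limit(fmqs, fmq, num_pus):
    # Assumption: the limit is the FMQ's priority share of the PU count.
    prio_sum = sum(q.prio for q in fmqs if not q.empty)
    return ceil(num_pus * fmq.prio / prio_sum)


def update_tput(fmqs):
    for q in fmqs:
        q.total_pu_occup += q.cur_pu_occup
        if not q.empty or q.cur_pu_occup > 0:   # only active FMQs advance
            q.bvt += 1
            q.tput = q.total_pu_occup / q.bvt


def get_fmq_idx(fmqs, num_pus):
    best, min_tput = None, float("inf")
    for q in fmqs:
        if q.empty:
            continue  # nothing to dispatch from this flow
        if q.cur_pu_occup < pu_limit(fmqs, q, num_pus):
            if q.tput / q.prio < min_tput:
                best, min_tput = q.idx, q.tput / q.prio
    return best


def simulate(fmqs, num_pus, cycles):
    """Return the PU-cycles each FMQ received after `cycles` clock cycles."""
    running = []  # one [remaining_cycles, fmq] entry per busy PU
    for _ in range(cycles):
        for job in running:
            job[0] -= 1
            if job[0] == 0:
                job[1].cur_pu_occup -= 1  # kernel ran to completion
        running = [job for job in running if job[0] > 0]
        while len(running) < num_pus:  # a PU is free: ask the scheduler
            idx = get_fmq_idx(fmqs, num_pus)
            if idx is None:
                break
            q = fmqs[idx]
            q.fifo.popleft()
            q.cur_pu_occup += 1
            running.append([q.cost, q])
        update_tput(fmqs)
    return [q.total_pu_occup for q in fmqs]


victim = FMQ(idx=0, prio=1, cost=100, fifo=deque(range(1000)))
congestor = FMQ(idx=1, prio=1, cost=200, fifo=deque(range(1000)))
shares = simulate([victim, congestor], num_pus=8, cycles=10_000)
```

In this toy configuration both tenants hold four of the eight PUs for the whole run, even though the Congestor kernel takes twice as long per packet. If one FMQ has no queued packets, the limit of the other grows to all eight PUs, which illustrates the work-conserving behavior.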

**Kernel execution** is a *short-lived* event as each execution only processes one packet. In OSMOSIS, we run kernels to completion [8, 79]. We avoid context-switching for several reasons. As shown in Table 1, context switching can introduce significant overhead. It also increases the complexity of the hardware data path and requires an additional state per each active kernel.

# 4.4 Discussion

**Unified front-end API:** We envision that the internal front-end OSMOSIS API we use for kernel offloading and specifying SLO policy (Figure 6 and Table 2) can be exposed through conventional networking APIs. For example, socket offloading (e.g., autonomous NIC offloads [77]) can be mapped to the FMQ. OSMOSIS can be exposed as an extension to the ioctl or setsockopt system call API and the socket operations error handling path. For native RDMA deployments, OSMOSIS can be exposed as part of the completion queue and queue pair configuration space (see Section 5.3 for further discussion).

**Run-to-completion model:** If a kernel exceeds a set time limit (e.g., a per-FMQ watchdog timer), it is terminated with a hardware interrupt, and the host application receives a notification via the corresponding EQ. In this light, we believe that run-to-completion semantics underpin the sNIC programming model, which, together with OSMOSIS's fair, priority-adjusted schedulers, ensures predictable packet processing tail latency and also excludes compute-intensive tasks better suited for GPUs or FPGAs. Assuming that the datacenter operator does not know the details of the tenant's code to be executed on the PU, the run-to-completion model also prevents the execution of ill-behaved code (e.g., a kernel that contains infinite loops).
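The watchdog behavior described above can be sketched in a few lines of Python. This is only a software analogy: the kernel is modeled as an iterable that yields once per cycle, and the `run_kernel` helper and the `("timeout", limit)` event format are assumptions made for illustration.

```python
from collections import deque
from itertools import count


def run_kernel(kernel_steps, cycle_limit, event_queue):
    """Run a kernel modelled as an iterable with one item per cycle.

    Returns the cycles used, or None if the watchdog terminated the kernel.
    """
    cycles = 0
    for _ in kernel_steps:
        if cycles == cycle_limit:
            # Budget exhausted with work remaining: kill and notify the host.
            event_queue.append(("timeout", cycle_limit))  # assumed event format
            return None
        cycles += 1
    return cycles


eq = deque()
ok = run_kernel(range(50), cycle_limit=100, event_queue=eq)    # well-behaved kernel
killed = run_kernel(count(), cycle_limit=100, event_queue=eq)  # infinite loop
```

The well-behaved kernel finishes within its budget and leaves the event queue untouched, while the infinite loop is cut off after 100 cycles and leaves one event for the host to read.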

**Virtual memory:** In principle, OSMOSIS could give each flow the illusion of infinite virtual memory using paging. However, this has two problems. First, translating the page's virtual address to a physical address is a combinational logic operation that will increase the latency of each memory access by at least one cycle. In PsPIN, the on-path backend of OSMOSIS that we discuss in Section 5, accesses to the L1 scratchpad memory require only 1 cycle. Second, when using demand paging, swapping in and out memory pages (e.g., between the NIC and the host) also introduces some latency. Because kernels are not context-switched, the PU would actively wait for the page to be swapped in, thereby wasting a large part of the cycle budget.

**Congestion management:** We assume that OSMOSIS is deployed within a lossless network (e.g., InfiniBand, RoCEv2), and FMQs never drop packets. By design, OSMOSIS is compatible with conventional congestion signaling (e.g., ECN) and flow control mechanisms (e.g., Ethernet DCB) supported by existing lossless fabrics. It can also be deployed with DCQCN [107] and DCTCP [3]. From the transport protocol perspective, the packet queueing delay within the FMQs and the corresponding execution of the packet kernel is just another source of latency. For example, the FMQ abstraction deployed with Ethernet can support RED/ECN marking [26, 44]. Another mechanism that FMQs can easily support is supplying the P4 INT-MD telemetry information [2] to enable the HPCC protocol [58].

**Encrypted traffic and compute accelerators:** The sNIC handles data movement and may also require accessing the packet contents. Hence, it should be able to decrypt packets (e.g., QUIC [103]). sNICs can support either per-PU cryptographic accelerators (e.g., Intel AES-NI [37]) or a shared accelerator for efficiency (e.g., like in Marvell LiquidIO [62]) exposed via ISA extensions. In the latter case, the accelerator arbitration resembles PUs, making WLBVT scheduling suitable for compute resource management.

**IO security:** Host memory is protected against unauthorized DMA transfers using an IOMMU set up by OSMOSIS when the host creates the flow context. Similarly, local sNIC memory accesses need to be protected. This can be achieved, for example, by a *Physical Memory Protection* unit (PMP) [99] as shown in Section 5.1.

# **5** Implementation

We implement OSMOSIS atop PsPIN [19, 35], an open-source on-path sNIC<sup>2</sup>. We adopt PsPIN as a backend for performance-critical operations within OSMOSIS by extending its host-side API to support multiple ECTXs and specify tenant SLOs using 335 lines of code (LOCs) in C.

We integrated functional blocks of OSMOSIS (i.e., matching engine, WLBVT scheduler, and DMA request fragmentation) written in 1216 LOCs of C++ with the cycle-accurate PsPIN SystemVerilog simulation backend. In addition, we implemented these components as synthesizable SystemVerilog IP blocks for hardware cost estimations. These open-source blocks can serve as a future prototype for an ASIC or FPGA-based implementation of OSMOSIS.

# 5.1 Implementing OSMOSIS on top of PsPIN

**Packet processing units:** OSMOSIS PsPIN architecture is based on scalable silicon-proven RISC-V PULP SoC [19, 55, 83]. The PUs are RI5CY 32-bit cores organized in clusters. Each PsPIN cluster contains 8 PUs clocked at 1GHz and coupled with a 1 cycle, multi-banked local scratchpad memory (referred to as *L1*). For our experiments, we use the default configuration of the PsPIN PU cluster with 1 MiB L1 data, and 4 KiB L1 instruction caches. Clusters share a global 4 MiB L2 packet buffer and a 4 MiB L2 kernel buffer, which can be used for local data storage.

**Portable programming API:** OSMOSIS utilizes PsPIN infrastructure to offload the packet processing to the PUs. The user writes a C kernel cross-compiled on the host for the RISC-V ISA. The kernels are then loaded and executed on the flow packets according to the sPIN API [35]. PsPIN has a low-latency kernel invocation mechanism ($\leq 10$ cycles), i.e., each PU executes a loop polling for a function pointer with the address of the kernel and flow context.

**Kernel IO:** The PsPIN API enables blocking and non-blocking IO calls within kernel code. The PsPIN cluster scratchpad memory is interconnected with the sNIC L2 kernel buffer, host DMA engine buffer, and sNIC egress engine buffer through the 512-bit AXI DMA link. This setup enables read and write transfers between these buffers, with PUs accessing other cluster memories and shared L2 kernel memory in 10 to 30 cycles. This design also transparently supports sNIC egress packet send: a DMA write from kernel scratchpad memory to the NIC egress engine buffer. The PU core L1 scratchpad interfaces with an Ethernet egress pipeline over the AXI protocol. PsPIN IO calls configure a DMA command with addresses, length, and a completion handle pointer. The cluster command FIFO queues outstanding IO commands, and a WRR policy arbitrates per-cluster queues for DMA engine access.

<sup>2</sup>https://spclgitlab.ethz.ch/mkhalilov/pspin-osmosis

**Memory management:** PsPIN allows specifying the size of the contiguous L1 and L2 memory regions allocatable to tenants' kernels and supports memory isolation using the Physical Memory Protection (PMP) unit. When the kernel accesses L1 and L2 memories, the virtual memory addresses are translated to physical addresses with relocation registers. The PMP then checks that the addresses are within the valid segment range. Like the relocation registers, the PMP unit does not increase the memory access latency [19].
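The relocation and range check described above can be illustrated with a short Python sketch. It is a simplified software model, not the PsPIN hardware logic: the `Segment` record, the `translate` helper, and the example base address and size are all hypothetical.

```python
from dataclasses import dataclass


class MemoryFault(Exception):
    """Raised when an access leaves the tenant's segment."""


@dataclass(frozen=True)
class Segment:
    base: int  # relocation register: physical start of the region
    size: int  # region length in bytes


def translate(seg, vaddr, length=1):
    """Relocate a kernel address, then bounds-check the physical range."""
    paddr = seg.base + vaddr                      # relocation step
    in_range = seg.base <= paddr and paddr + length <= seg.base + seg.size
    if length <= 0 or not in_range:               # PMP-style range check
        raise MemoryFault(f"access {vaddr:#x}+{length} outside segment")
    return paddr


l1 = Segment(base=0x1000_0000, size=0x4000)  # hypothetical 16 KiB L1 slice
paddr = translate(l1, 0x20, length=4)
```

Because both steps are a single addition and two comparisons, it is plausible that they can be evaluated without adding cycles to the access path, which is consistent with the latency claim above.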

# 5.2 OSMOSIS Schedulers

**FMQ scheduling implementation:** FMQ encompasses a FIFO queue, ECTX (detailed in Section 4), and scheduling state. The FIFO queue holds packet descriptors, each containing a 32-bit pointer to the packet. The scheduling state includes a BVT counter tracking tenant resource use and a priority. We implemented the counter as a 64-bit register to avoid overflow<sup>3</sup>. A 16-bit register stores the FMQ priority. Our SystemVerilog WLBVT implementation with 128 FMQs synthesizes at 1 GHz, making a scheduling decision in five cycles. Most latency stems from the weight-limiting requiring integer division, which is challenging for fast hardware implementation. We hide this latency using pipelining, overlapping FMQ arbitration with packet DMA from the L2 packet buffer to the cluster scratchpad (at least 13 cycles for a 64-byte packet).
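The choice of a 64-bit BVT counter can be checked with simple arithmetic. The sketch below assumes one increment per cycle at 1 GHz, as stated above; the 32-bit comparison is a hypothetical alternative included only for contrast.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25


def seconds_until_overflow(bits, clock_hz):
    """Time until a counter incremented once per cycle wraps around."""
    return 2 ** bits / clock_hz


years_64 = seconds_until_overflow(64, 1e9) / SECONDS_PER_YEAR
seconds_32 = seconds_until_overflow(32, 1e9)
```

With these numbers a 64-bit counter lasts roughly 584 years, whereas a 32-bit counter would wrap after about 4.3 seconds.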

**Enhanced DMA engine:** To prevent HoL-blocking, OSMOSIS applies transfer fragmentation on both the host-interfacing DMA engine and the egress engine. We implement two modes of fragmentation: a software fragmentation implemented within the kernel call for a DMA transfer and a hardware fragmentation within the DMA engine. The software approach wraps pspin\_dma\_read/write and pspin\_send\_packet with a function that divides larger requests into smaller chunks. We issue multiple non-blocking DMA requests of smaller sizes while internally maintaining the state for each transfer. While this optimization mitigates HoL-blocking (as shown in Section 6), it also hinders the throughput of large DMA requests. To minimize this, we expand the functional model of AXI to enable hardware DMA fragmentation offloading. This involves managing the state for multiple outstanding AXI write requests and arbitrating them with the WRR scheduler.
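The software fragmentation mode can be sketched as a thin wrapper around a non-blocking DMA call. The Python model below is an assumption-laden illustration: `fragment`, `fragmented_write`, and `fake_dma_write` are hypothetical names, and the stand-in callback merely records requests instead of calling the real PsPIN API.

```python
def fragment(addr, length, frag_size):
    """Split [addr, addr + length) into chunks of at most frag_size bytes."""
    if frag_size <= 0:
        raise ValueError("frag_size must be positive")
    return [(addr + off, min(frag_size, length - off))
            for off in range(0, length, frag_size)]


def fragmented_write(issue_nb_write, addr, length, frag_size):
    """Issue one non-blocking write per fragment and collect the handles."""
    return [issue_nb_write(a, n) for a, n in fragment(addr, length, frag_size)]


issued = []


def fake_dma_write(addr, nbytes):   # stand-in for the real non-blocking call
    issued.append((addr, nbytes))
    return len(issued) - 1          # pretend completion handle


handles = fragmented_write(fake_dma_write, addr=0x8000, length=4096, frag_size=512)
```

A 4096-byte request with 512-byte fragments turns into eight independent requests, so a short transfer from another tenant can be interleaved between them instead of waiting for the whole 4096 bytes.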

# 5.3 Integration with other on-path SmartNICs

OSMOSIS could be applied to on-path sNIC designs besides PsPIN. The PsPIN datapath architecture and the general-purpose C programming model share many similarities with the commercially available NVIDIA BlueField data path accelerator (DPA) introduced in the BlueField-3 DPU [17, 72, 73]. Similarly to PsPIN, DPA could be extended with OSMOSIS to enable kernel execution QoS.

**Compute management:** DPA invokes user-defined kernels upon completion of RDMA operations. The hardware mechanics of packet scheduling and kernel activation are close in both architectures. In DPA, after scheduling network completion queues, the generated completion event activates kernel execution on the DPA hardware thread. This is equivalent to the OSMOSIS kernel execution request generated after scheduling flow management queues (FMQs). Thus, WLBVT FMQ scheduling could be 1:1 mapped to DPA-managed RDMA Completion Queues (CQs) scheduling.

**IO management:** From the kernel IO perspective, like in PsPIN, DPA cores interface a dedicated NIC DMA engine for non-blocking kernel-initiated data movement towards egress and the host. The BlueField DMA engine can also support DMA request fragmentation to avoid HoL-blocking. IO operations initiated from DPA cores during kernel execution, i.e., RDMA Work Requests (WRs), could be assigned with a desired Service Level (SL) mapped to the underlying RDMA Virtual Lane (VL), i.e., SL2VL mapping mechanism [44].

**Software API:** The DPA kernel offloading API (DOCA FlexIO API [72]) can be extended to support OSMOSIS SLO enforcement. Specifically, the CQ and QP attributes can include OSMOSIS-related knobs (e.g., compute/IO priorities, memory size, etc.) passed by the tenant in flexio\_cq\_create(..) and flexio\_qp\_create(..).

# 6 Evaluation

We study how OSMOSIS allocates sNIC resources under different traffic conditions and workload requirements. We investigate the following research questions:

1. How does the area of an OSMOSIS-enabled sNIC chip scale with the ingress link rates and the number of tenants?
2. What are the overheads of OSMOSIS compared to the reference PsPIN implementation?
3. What is the maximum load that OSMOSIS can sustain?
4. How fair are OSMOSIS resource allocations?

# 6.1 Hardware Scaling

We synthesized OSMOSIS and PsPIN SystemVerilog IP blocks at 1GHz in GlobalFoundries 22nm node process to estimate hardware area costs using Synopsys Design Compiler NXT in topographic mode.

<sup>3</sup>A 64-bit counter updated every cycle at 1 GHz overflows after $2^{64}\,\text{op} \times 10^{-9}\,\text{s/op} \div (60 \cdot 60 \cdot 24 \cdot 365.25)\,\text{s/yr} \approx 584$ yrs.



Figure 7: The cost model of sNIC SoC area synthesized in 22nm GF process, compared to the theoretical per packet budget (averaged for different packet sizes at 64 - 4096 B interval) achieved with 400/800/1600 Gbit/s ingress link rates.

**sNIC area scaling with compute capacity:** PsPIN clusters utilize a hierarchical SoC interconnect similar to the Manticore scale-out study [105]. We group four clusters into a *quadrant* sharing a local interconnect. Each quadrant connects to L2 memory, allowing all cores to access the shared packet buffer. Synthesis studies [19, 55] indicate negligible area increases and timing overheads when adding ports to L2. In Figure 7, PsPIN demonstrates linear compute capacity scaling relative to the core area. For instance, 4 PU clusters offer an adequate per-packet budget (PPB) (Section 3) to sustain the compute-bound Reduce workload with up to 512-byte packets.
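For a back-of-the-envelope feel for the per-packet budget, the sketch below assumes that PPB is the number of PU cycles available per packet when the ingress link is saturated, and it ignores Ethernet framing overhead. The exact definition in Section 3 may differ, and `per_packet_budget` is a hypothetical helper.

```python
def per_packet_budget(num_pus, clock_hz, link_bps, packet_bytes):
    """PU cycles available per packet when the ingress link is saturated."""
    packets_per_second = link_bps / (8 * packet_bytes)
    return num_pus * clock_hz / packets_per_second


for size in (64, 512, 4096):
    ppb = per_packet_budget(num_pus=32, clock_hz=1e9, link_bps=400e9, packet_bytes=size)
    print(f"{size:5d} B -> {ppb:8.2f} cycles per packet")
```

Under these assumptions, 4 clusters of 8 PUs at 1 GHz on a 400 Gbit/s link leave about 41 cycles per 64-byte packet and about 328 cycles per 512-byte packet, and the budget grows linearly with both packet size and PU count.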

**OSMOSIS Schedulers Scaling:** Figure 8 shows the hardware area consumption of OSMOSIS schedulers. We observe a linear scaling of the FMQ and DMA engine schedulers with the number of inputs. Assigned with a custom packet matching rule, one FMQ scheduler input can serve millions of requests, such as independent IO reads/writes (see Figure 11). Compared to RR, WLBVT needs  $7 \times$  more gates, yet with 128 FMQs, WLBVT area consumption takes only 1% of PsPIN cluster and L2 memory area. With a reasonable hardware footprint, OSMOSIS enables the hardware scheduling of up to 128 tenants subscribed to the same SmartNIC.

**FPGA prototype scalability:** We also integrated the PsPIN sNIC with the Xilinx UltraScale+ VCU1525 FPGA and Corundum NIC for our experimental setup [85]. The preliminary experiments with an FPGA-based client-server prototype showed that the peak throughput is only around 6 Gbit/s. This limitation arises because the PsPIN IP is designed for ASIC production. Due to space and timing constraints, PsPIN cannot be synthesized on our FPGA at more than 40 MHz, with only 16 cores and one quarter of the target 400 Gbit/s design's L1/L2 memories. This makes the FPGA-based PsPIN testbed unsuitable for full-system evaluation. Instead, we use cycle-accurate simulation for full-scale system evaluation with packet arrival, scheduling, and processing deadlines observable at next-generation sNIC link speeds and SoC core counts.

Figure 8: WLBVT and WRR exhibit linear area scaling in the GF 22nm process. Bar captions indicate gate count and relative area compared to 4 PU clusters with 4 MiB L2.

#### 6.2 Experimental Methodology

We evaluate OSMOSIS runtime performance using the cycle-accurate Verilator v4.228 SystemVerilog simulator [90]. Our experimental testbed features two setups: a *Reference (baseline) PsPIN implementation*, i.e., a conventional on-path sNIC without a multi-tenant OS, and a *PsPIN implementation enhanced with OSMOSIS management*.

Both setups feature 4 PsPIN clusters of eight 1 GHz cores, achieving 400 Gbit/s ingress/egress bandwidth. L2 and host memories can be accessed through a 512 Gbit/s AXI link. We used randomly pre-generated packet traces that fully saturate ingress link bandwidth. Packet arrival sequences follow a uniform distribution, and packet sizes are sampled from a lognormal distribution [10, 84, 101]. For fairness measurements, we use Jain's fairness metric [39]. It ranges from 1 divided by the number of tenants (least fair) to 1 (most fair): a metric of *y*, expressed as a percentage, implies fair treatment of y% of tenants, leaving (100 - y)% starved. Fair treatment ensures equal priority-adjusted resource access for each tenant.
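For reference, Jain's index for allocations $x_1, \dots, x_n$ is $(\sum_i x_i)^2 / (n \sum_i x_i^2)$. The short Python sketch below computes it. Dividing each allocation by the tenant's priority before computing the index is an assumption about how "priority-adjusted" access could be measured, and `priority_adjusted` is a hypothetical helper.

```python
def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(allocations)
    squares = sum(x * x for x in allocations)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero allocation")
    return sum(allocations) ** 2 / (n * squares)


def priority_adjusted(allocations, priorities):
    """Normalise each tenant's allocation by its priority before comparing."""
    return [a / p for a, p in zip(allocations, priorities)]


equal = jain_index([4, 4])                                # both tenants get 4 PUs
skewed = jain_index([8, 0])                               # one tenant starves
weighted = jain_index(priority_adjusted([6, 2], [3, 1]))  # 3:1 priorities honoured
```

Here the equal split and the priority-honouring split both score 1, while the allocation that starves one of two tenants scores 0.5, the lower bound for two tenants.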

An RR scheduler is available in the reference PsPIN implementation; thus, we consider it as a baseline. To our knowledge, production on-path designs (e.g., BlueField-3 DPA) use static compute management. We choose a dynamic scheduler over static allocation for the baseline since work conservation is an essential requirement for datacenter energy efficiency. We discuss how OSMOSIS differs from existing NIC management solutions in Section 7.



Figure 9: The fairness of WLBVT and RR with two tenants of different compute cost per byte. The orange line in the PU occupation subplots represents the *Congestor* tenant, whose workload consumes  $2 \times$  more cycles per packet than the *Victim* tenant depicted by the blue line.

#### 6.3 Synthetic Benchmarks

We evaluate OSMOSIS on synthetic benchmarks to assess its overheads in a low-complexity environment.

**R1 R5 Fair HPU allocation:** We run two applications, one with a larger *compute cost per byte*, the *Congestor*, and the other with a smaller one, the *Victim*. Both spin in a for loop to simulate a compute-bound task. Figure 9 shows how RR overallocates PUs to the *Congestor*, leading to lower fairness, as shown by Jain's metric. WLBVT consistently splits all the resources equally between tenants. When the *Victim* has no outstanding packets, WLBVT allows the *Congestor* to take over more PUs. WLBVT enables fair compute resource allocation within OSMOSIS and does not cause slowdowns within the benchmarks.

**R2 R5 Resolving HoL-blocking:** We evaluate how the throughput of the *Congestor* and the kernel completion time of the *Victim* scale while conducting only egress transfers that involve AXI writes. Figure 10 presents how OSMOSIS resolves HoL-blocking. Depending on the fragmentation method, the *Victim*'s kernel completion time can be reduced by an order of magnitude while keeping the relative slowdown of the *Congestor* to only around $2\times$. The throughput reduction stems from control traffic overhead related to fragmentation, i.e., splitting one large transfer into N smaller transfers introduces N additional protocol handshakes between sender and receiver. When accessing local sNIC memories (i.e., remote scratchpads and L2), it can be mitigated through a custom SystemVerilog implementation of the PsPIN AXI protocol, allowing for parallel transfer states as proposed in other works [11, 46, 80]. Addressing this issue for host-side traffic that crosses AXI bus boundaries would require a fine-grained QoS protocol for PCIe and CXL interconnects [1].

We also observed two bottlenecks: ingress and egress. In the ingress bottleneck, the incoming link bandwidth is the limit, while in the egress one, the AXI bus congestion causes slowdowns. While the overheads come from the interconnect, OSMOSIS scheduling does not introduce overheads, as evident for low *Congestor* sizes.

Figure 10: The impact on the *Congestor* throughput and the *Victim* kernel completion time as a function of the *Congestor* size and various fragment sizes. State transition between ingress and egress bottleneck depicts where the line rate of the egress path became saturated.

#### 6.4 Datacenter Workloads

Additionally, we evaluate a set of real datacenter workloads supplied with the PsPIN benchmarking package [19]. We study the *Aggregation* [74], *Reduction* [9] and *Histogram* [7] benchmarks as examples of compute-bound workloads with incrementally increasing inter-kernel memory synchronization requirements, i.e., from local on-PU computation with one atomic operation in *Aggregation*, to random memory accesses, each with an atomic summation in *Histogram*.

We also evaluate an IO-bound benchmarking set. Our goal is to exercise NIC DMA read/write data paths towards the host memory, the pattern typical for data path offloading of storage RPCs and TCP segment delivery [65,68,77,87]. In *IO read/write* workloads, a target memory location is stored directly in the packet application header. The multiple clients make concurrent IO requests to the same storage node, and we serialize all requests through 1 FMQ that serves either read or write requests.

In the *Filtering* benchmark, to look up the destination DMA memory address (e.g., KVS-cache location or packet forwarding table context address), the kernel needs to compute the hash of the L7 header, which is used as an index into a lookup table stored in the sNIC LLC.

**Management overheads:** To assess the influence of OSMOSIS management on applications' performance, we start by running them in isolation. Figure 11 shows that OSMOSIS does not introduce considerable overheads for compute-bound workloads. Their throughput oscillates within $\pm 3\%$ of the baseline PsPIN implementation and reaches a maximum of 310 Mpps for the *Aggregation* workload. For IO-bound workloads, OSMOSIS introduces overheads stemming from the fragmentation, which have been discussed in Section 6.3. These can be resolved by extending the AXI bus protocol [46, 80]. While the overheads range from 2% to 23% and represent the cost of introducing fair and efficient multi-tenancy, the workloads still achieve 332 Mpps in the *IO write* case.

Figure 11: The relative packet throughput of common datacenter workloads run in a standalone mode as a function of packet size with their raw performance in million packets per second (Mpps) at the top of the bars. Up to a 3% throughput increase with OSMOSIS compared to the PsPIN baseline stems from a kernel completion time variability introduced by the compute/IO schedulers.

**Application mixtures:** Evaluating applications in isolation is not representative of real workloads in the multi-tenant datacenters for which OSMOSIS was designed, where multiple users contend for resources. We consider two application sets: *compute* and *IO*, each resulting in tenant resource contention.

The compute-bound set comprises the *Reduce* and *Histogram* workloads. Each is introduced as a *Victim* (64-byte packets for *Reduce* and 64-128 byte packets for *Histogram*) and a *Congestor* (4 KB packets for *Reduce* and 3072-4096 byte packets for *Histogram*). As Figure 12a shows, these workloads saturate the PUs of the sNIC within the first couple thousand cycles and introduce compute congestion. Using OSMOSIS WLBVT scheduling, each tenant obtains an allocation that is on average 47% fairer than with the typical RR implementation, as measured using Jain's metric. Such allocations ensure SLO fulfillment and result in up to 39% faster *flow completion times* (FCT) because of lower average contention, while only sacrificing 3% for the *Histogram Congestor*. OSMOSIS thus achieves a fair and efficient resource allocation.

The multi-tenancy system must efficiently manage *all* resources in coordination [30, 31]. We illustrate this scenario in Figure 12b, where the IO set includes 4 kernels of varying complexity so that the code executed on the PUs produces various data movement patterns. The set consists of IO *read* and *write* flows, introduced again as both a *Victim* and *Congestor*. While *reads* and *writes* share the NIC ingress, the utilized DMA paths are opposite to each other.

The *write* packets have a variable-length packet size (up to 128 B and 4 KB for *Victim* and *Congestor*, respectively) proportional to the payload size. The payload of the *read* flow has a fixed size and contains two 64-bit values (the read location in memory and its size, varied for *Victim* and *Congestor*). While each *read* packet will spend fewer cycles in the NIC ingress, it will induce up to $2\times$ more data movement work compared to *write*, i.e., a DMA read from the host memory followed by sending towards egress. This results in seemingly continuous distributions for *read* requests that are processed slower than bursty *writes*.

Figure 12b shows that, similarly to the compute case, OSMOSIS obtains a consistently fairer allocation than an RR scheduler (up to 83%) as measured by the average Jain's fairness metric. We notice that the *writes* are processed much faster than the *reads*. OSMOSIS also manages to reduce FCT for all tenants by up to 63%. Such a large improvement comes from addressing the HoL-blocking problem, leading to a more efficient allocation. The *IO read Congestor* is initially suppressed to let other tenants fairly finish their workloads and then obtains full exclusive utilization, eliminating contention and allowing it to regain the lost performance. On the other hand, the other tenants are fairly allocated and, as Figure 13 shows, they do not suffer from HoL-blocking.

Figure 13 also displays the true cost of the aforementioned gains. While the overall FCT is reduced for all tenants, the single kernel completion time shows a different story. The HoL-blocking is resolved for the *Victim* tenants, for which the kernel completion time is reduced more than fivefold. However, the other *Congestor* tenants display an up to  $8 \times$  increased median kernel completion time. While OSMOSIS increases the median per packet processing time, it also achieves overall FCT gains for the IO set by allocating the resources fairly, and by parallelizing the packets appropriately.

#### 7 Related Work

In this section, we summarize recent milestones in NIC resource management research and provide a qualitative comparison of existing solutions with OSMOSIS.

Justitia [106] and PicNIC [54] are rNIC virtualization layers lacking on-NIC compute management. They function as software controllers between the NIC and host application, handling RDMA read/write operations atop the RDMA API. Lynx [94] focuses on sNIC GPU data movement offloading but similarly manages traffic at a *per-message* granularity and lacks detailed analysis of multi-tenancy issues.

Floem [76], FairNIC [32], and iPipe [60] specifically target on-path sNIC programmability. All three solutions lack an implementation of flow priorities. FairNIC aims for multi-tenant use cases by statically allocating compute and IO bandwidth to flows. This approach can potentially cause under-utilization or unfairness [47, 79, 86]. iPipe [60] proposes to move the execution of packet processing to the host CPU in case of congested sNIC resources. We design OSMOSIS for scenarios where the on-path sNIC *fully* offloads the packet processing, and the host CPU runs a server-local non-networking path on the data processed by the sNIC, e.g., computation on the results of in-network reduction or host-local distributed file system management.

Figure 12: The evolution of tenant performance and average fairness against the simulated time. The upper sub-plots show the total Jain's fairness score computed over all flows at once. The percentages indicate the reduction in FCT for each tenant.

Figure 13: The completion time distribution for IO-bound applications for two fragment sizes.

Per-flow priority management is present in PANIC [59] and Menshen [97]. Both solutions specialize in Reconfigurable Match Tables (RMT) pipeline architectures, e.g., PANIC is tailored for FPGA-based sNICs. The applicability scope of OSMOSIS is different, focusing on programmable on-path designs such as BlueField-3 DPA and PsPIN, which exploit a different type of parallelism. In on-path sNICs, the packets of the same flow are processed in parallel with user-defined C kernels. These kernels run on tens to hundreds of energy-efficient cores integrated within the SoC. To efficiently distribute packets across a large core count and sustain the line rate, on-path sNICs are constrained to low-latency hardware packet schedulers lacking reconfigurability.

To our knowledge, OSMOSIS is the first solution that can support fair work-conserving SLO-based traffic management integrated within the on-path sNIC hardware data path.

#### 8 Conclusions

Enabling user-level on-NIC processing in modern multi-tenant datacenters brings resource multiplexing and hardware/software co-design challenges. OSMOSIS solves sNIC multi-tenancy by distributing sNIC resources (the egress and DMA bandwidth and the processing units) across flows with different priorities, input bandwidth, and computational requirements. To achieve a fair distribution of resources, OSMOSIS relies on sNIC-specific principles, such as work-conserving allocation of compute and IO resources. The evaluation shows that OSMOSIS efficiently redistributes resources, enabling QoS, performance isolation, and prioritization between various mixtures of flows. OSMOSIS improves FCT by up to 60% and is fairer by up to 83% than typical schedulers. We believe that OSMOSIS could enable wider adoption of on-path sNICs in cloud datacenters with low overhead.

#### Acknowledgments

We thank anonymous reviewers and our shepherd Chenxi Wang for insightful comments to improve the paper. This project received funding from EuroHPC-JU under the grant agreements RED-SEA, No. 055776 and DEEP-SEA, No. 95560, the EuroHPC-JU "The European Pilot" project under the grant agreement No. 101034126 as part of the EU Horizon 2020 research and innovation programme, and a donation from Intel.

#### References

- [1] AGARWAL, S., AGARWAL, R., MONTAZERI, B., MOSHREF, M., ELMELEEGY, K., RIZZO, L., DE KRUIJF, M. A., KUMAR, G., RATNASAMY, S., CULLER, D., ET AL. Understanding host interconnect congestion. In *Proceedings of the 21st ACM Workshop on Hot Topics in Networks* (2022), pp. 198–204.
- [2] AGRAWAL, A., AND KIM, C. Intel tofino2–a 12.9 tbps p4-programmable ethernet switch. In *2020 IEEE Hot Chips 32 Symposium (HCS)* (2020), IEEE Computer Society, pp. 1–32.
- [3] ALIZADEH, M., GREENBERG, A., MALTZ, D. A., PADHYE, J., PATEL, P., PRABHAKAR, B., SENGUPTA, S., AND SRIDHARAN, M. Data center tcp (dctcp). In *Proceedings of the ACM SIGCOMM 2010 Conference* (2010), pp. 63–74.
- [4] ANDERSON, T. E., OWICKI, S. S., SAXE, J. B., AND THACKER, C. P. High-speed switch scheduling for local-area networks. ACM Transactions on Computer Systems (TOCS) 11, 4 (1993), 319–352.
- [5] ATTIG, M., AND BREBNER, G. 400 gb/s programmable packet parsing on a single fpga. In 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems (2011), IEEE, pp. 12–23.
- [6] BALAS, R., AND BENINI, L. Risc-v for real-time mcus software optimization and microarchitectural gap analysis. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2021), pp. 874–877.
- [7] BARTHELS, C., MÜLLER, I., SCHNEIDER, T., ALONSO, G., AND HOEFLER, T. Distributed join algorithms on thousands of cores. *Proceedings of the VLDB Endowment 10*, 5 (2017), 517–528.
- [8] BELAY, A., PREKAS, G., KLIMOVIC, A., GROSSMAN, S., KOZYRAKIS, C., AND BUGNION, E. {IX}: a protected dataplane operating system for high throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (2014), pp. 49–65.
- [9] BEN-NUN, T., AND HOEFLER, T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys (CSUR) 52, 4 (2019), 1–43.
- [10] BENSON, T., AKELLA, A., AND MALTZ, D. A. Network traffic characteristics of data centers in the wild. In *Proceedings of the 10th* ACM SIGCOMM Conference on Internet Measurement (New York, NY, USA, 2010), IMC '10, Association for Computing Machinery, p. 267–280.
- [11] BENZ, T., ROGENMOSER, M., SCHEFFLER, P., RIEDEL, S., OTTA-VIANO, A., KURTH, A., HOEFLER, T., AND BENINI, L. A highperformance, energy-efficient modular dma engine architecture. arXiv preprint arXiv:2305.05240 (2023).
- [12] BLÖCHER, M., WANG, L., EUGSTER, P., AND SCHMIDT, M. Switches for hire: Resource scheduling for data center in-network computing. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2021), pp. 268–285.
- [13] BOLCH, G., GREINER, S., DE MEER, H., AND TRIVEDI, K. S. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. Wiley-Interscience, USA, 1998.
- [14] BORROMEO, J. C., KONDEPU, K., ANDRIOLLI, N., AND VAL-CARENGHI, L. Fpga-accelerated smartnic for supporting 5g virtualized radio access network. *Computer Networks 210* (2022), 108931.
- [15] CAI, Q., VUPPALAPATI, M., HWANG, J., KOZYRAKIS, C., AND AGARWAL, R. Towards μ s tail latency and terabit ethernet: disaggregating the host network stack. In *Proceedings of the ACM SIGCOMM* 2022 Conference (2022), pp. 767–779.

- [16] CHANG, B., AKELLA, A., D'ANTONI, L., AND SUBRAMANIAN, K. Learned load balancing. In *Proceedings of the 24th International Conference on Distributed Computing and Networking* (2023), pp. 177–187.
- [17] CHEN, X., ZHANG, J., FU, T., SHEN, Y., MA, S., QIAN, K., ZHU, L., SHI, C., LIU, M., AND WANG, Z. Demystifying datapath accelerator enhanced off-path smartnic. arXiv preprint arXiv:2402.03041 (2024).
- [18] DEAN, J., AND BARROSO, L. A. The tail at scale. Communications of the ACM 56, 2 (2013), 74–80.
- [19] DI GIROLAMO, S., KURTH, A., CALOTOIU, A., BENZ, T., SCHNEI-DER, T., BERANEK, J., BENINI, L., AND HOEFLER, T. A risc-v innetwork accelerator for flexible high-performance low-power packet processing. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (2021), IEEE, pp. 958–971.
- [20] DONG, Y., YANG, X., LI, J., LIAO, G., TIAN, K., AND GUAN, H. High performance network virtualization with sr-iov. *Journal of Parallel and Distributed Computing* 72, 11 (2012), 1471–1480.
- [21] DONG, Y., YU, Z., AND ROSE, G. Sr-iov networking in xen: Architecture, design and implementation. In *Workshop on I/O virtualization* (2008), vol. 2.
- [22] DUDA, K. J., AND CHERITON, D. R. Borrowed-virtual-time (bvt) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. ACM SIGOPS Operating Systems Review 33, 5 (1999), 261–276.
- [23] ERAN, H., FUDIM, M., MALKA, G., SHALOM, G., COHEN, N., HERMONY, A., LEVI, D., LISS, L., AND SILBERSTEIN, M. Flexdriver: A network driver for your accelerator. In *Proceedings of the* 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2022), pp. 1115–1129.
- [24] ETHERNET ALLIANCE. Ethernet Roadmap 2022. https:// ethernetalliance.org/technology/ethernet-roadmap/.
- [25] FARSHIN, A., BARBETTE, T., ROOZBEH, A., MAGUIRE JR, G. Q., AND KOSTIĆ, D. Packetmill: toward per-core 100-gbps networking. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2021), pp. 1–17.
- [26] FLOYD, S., AND JACOBSON, V. Random early detection gateways for congestion avoidance. *IEEE/ACM Transactions on networking 1*, 4 (1993), 397–413.
- [27] FRIED, J., RUAN, Z., OUSTERHOUT, A., AND BELAY, A. Caladan: Mitigating interference at microsecond timescales. In *Proceedings* of the 14th USENIX Conference on Operating Systems Design and Implementation (2020), pp. 281–297.
- [28] GAO, P., DALLEGGIO, A., XU, Y., AND CHAO, H. J. Gearbox: A hierarchical packet scheduler for approximate weighted fair queuing. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (2022), pp. 551–565.
- [29] GAO, P. X., NARAYAN, A., KARANDIKAR, S., CARREIRA, J., HAN, S., AGARWAL, R., RATNASAMY, S., AND SHENKER, S. Network requirements for resource disaggregation. In *12th USENIX symposium* on operating systems design and implementation (OSDI 16) (2016), pp. 249–264.
- [30] GHODSI, A., SEKAR, V., ZAHARIA, M., AND STOICA, I. Multiresource fair queueing for packet processing. In *Proceedings of the* ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication (2012), pp. 1– 12.
- [31] GHODSI, A., ZAHARIA, M., HINDMAN, B., KONWINSKI, A., SHENKER, S., AND STOICA, I. Dominant resource fairness: Fair allocation of multiple resource types. In 8th USENIX symposium on networked systems design and implementation (NSDI 11) (2011).

- [32] GRANT, S., YELAM, A., BLAND, M., AND SNOEREN, A. C. Smartnic performance isolation with fairnic: Programmable networking for the cloud. In *Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication* (2020), pp. 681–693.
- [33] GUO, Z., SHAN, Y., LUO, X., HUANG, Y., AND ZHANG, Y. Clio: A hardware-software co-designed disaggregated memory system. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2022), pp. 417–433.
- [34] HAECKI, R., MYSORE, R. N., SURESH, L., ZELLWEGER, G., GAN, B., MERRIFIELD, T., BANERJEE, S., AND ROSCOE, T. How to diagnose nanosecond network latencies in rich end-host stacks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (2022), pp. 861–877.
- [35] HOEFLER, T., DI GIROLAMO, S., TARANOV, K., GRANT, R. E., AND BRIGHTWELL, R. spin: High-performance streaming processing in the network. In *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis* (2017), pp. 1–16.
- [36] HOEFLER, T., ROWETH, D., UNDERWOOD, K., ALVERSON, R., GRISWOLD, M., TABATABAEE, V., KALKUNTE, M., ANUBOLU, S., SHEN, S., MCLAREN, M., ET AL. Data center ethernet and remote direct memory access: Issues at hyperscale. *Computer 56*, 7 (2023), 67–77.
- [37] HOFEMEIER, G., AND CHESEBROUGH, R. Introduction to intel aes-ni and intel secure key instructions. *Intel, White Paper 62* (2012).
- [38] HØILAND-JØRGENSEN, T., BROUER, J. D., BORKMANN, D., FASTABEND, J., HERBERT, T., AHERN, D., AND MILLER, D. The express data path: Fast programmable packet processing in the operating system kernel. In *Proceedings of the 14th international conference* on emerging networking experiments and technologies (2018), pp. 54– 66.
- [39] HOSSFELD, T., SKORIN-KAPOV, L., HEEGAARD, P. E., AND VARELA, M. Definition of qoe fairness in shared systems. *IEEE Communications Letters* 21, 1 (2016), 184–187.
- [40] HUNTER, A., KENNELLY, C., TURNER, P., GOVE, D., MOSELEY, T., AND RANGANATHAN, P. Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) (July 2021), USENIX Association, pp. 257–273.
- [41] IBANEZ, S., MALLERY, A., ARSLAN, S., JEPSEN, T., SHAHBAZ, M., KIM, C., AND MCKEOWN, N. The nanopu: A nanosecond network stack for datacenters. In 15th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 21) (2021), pp. 239–256.
- [42] IBANEZ, S., MALLERY, A., ARSLAN, S., JEPSEN, T., SHAHBAZ, M., KIM, C., AND MCKEOWN, N. Enabling the reflex plane with the nanopu. arXiv preprint arXiv:2212.06658 (2022).
- [43] IEEE. 802.1Qbb: IEEE Standard for Local and Metropolitan Area Networks—Virtual Bridged Local Area Networks – Amendment: Priority-based Flow Control. https://l.ieee802.org/dcb/ 802-1qbb/.
- [44] INFINIBAND TRADE ASSOCIATION. InfiniBand Specification. https://www.infinibandta.org.
- [45] IVANOV, A., DRYDEN, N., BEN-NUN, T., LI, S., AND HOEFLER, T. Data movement is all you need: A case study on optimizing transformers. *Proceedings of Machine Learning and Systems 3* (2021), 711–732.
- [46] JIANG, Z., YANG, K., FISHER, N., GRAY, I., AUDSLEY, N. C., AND DONG, Z. Axi-ic rt: Towards a real-time axi-interconnect for highly integrated socs. *IEEE Transactions on Computers* 72, 3 (2022), 786– 799.

- [47] KAFFES, K., CHONG, T., HUMPHRIES, J. T., BELAY, A., MAZ-IÈRES, D., AND KOZYRAKIS, C. Shinjuku: Preemptive scheduling for {μsecond-scale} tail latency. In *16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)* (2019), pp. 345–360.
- [48] KALIA, A., KAMINSKY, M., AND ANDERSEN, D. G. Design guidelines for high performance RDMA systems. In 2016 USENIX Annual Technical Conference (USENIX ATC 16) (Denver, CO, June 2016), USENIX Association, pp. 437–450.
- [49] KALIA, A., KAMINSKY, M., AND ANDERSEN, D. G. Datacenter rpcs can be general and fast. arXiv preprint arXiv:1806.00680 (2018).
- [50] KATSIKAS, G. P., BARBETTE, T., CHIESA, M., KOSTIĆ, D., AND MAGUIRE JR, G. Q. What you need to know about (smart) network interface cards. In *International Conference on Passive and Active Network Measurement* (2021), Springer, pp. 319–336.
- [51] KHAWAJA, A., LANDGRAF, J., PRAKASH, R., WEI, M., SCHKUFZA, E., AND ROSSBACH, C. J. Sharing, protection, and compatibility for reconfigurable fabric with amorphos. In 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18) (2018), pp. 107–127.
- [52] KIM, J., JANG, I., REDA, W., IM, J., CANINI, M., KOSTIĆ, D., KWON, Y., PETER, S., AND WITCHEL, E. Linefs: Efficient smartnic offload of a distributed file system with pipeline parallelism. In *Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles* (2021), pp. 756–771.
- [53] KOROLIJA, D., ROSCOE, T., AND ALONSO, G. Do os abstractions make sense on fpgas? In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (2020), pp. 991– 1010.
- [54] KUMAR, P., DUKKIPATI, N., LEWIS, N., CUI, Y., WANG, Y., LI, C., VALANCIUS, V., ADRIAENS, J., GRIBBLE, S., FOSTER, N., ET AL. Picnic: predictable virtualized nic. In *Proceedings of the ACM Special Interest Group on Data Communication*. 2019, pp. 351–366.
- [55] KURTH, A., RÖNNINGER, W., BENZ, T., CAVALCANTE, M., SCHUIKI, F., ZARUBA, F., AND BENINI, L. An open-source platform for high-performance non-coherent on-chip communication. *IEEE Transactions on Computers* 71, 8 (2021), 1794–1809.
- [56] LAZAREV, N., XIANG, S., ADIT, N., ZHANG, Z., AND DELIM-ITROU, C. Dagger: efficient and fast rpcs in cloud microservices with near-memory reconfigurable nics. In *Proceedings of the 26th* ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2021), pp. 36–51.
- [57] LE, Y., CHANG, H., MUKHERJEE, S., WANG, L., AKELLA, A., SWIFT, M. M., AND LAKSHMAN, T. Uno: Uniflying host and smart nic offload for flexible packet processing. In *Proceedings of the 2017 Symposium on Cloud Computing* (2017), pp. 506–519.
- [58] LI, Y., MIAO, R., LIU, H. H., ZHUANG, Y., FENG, F., TANG, L., CAO, Z., ZHANG, M., KELLY, F., ALIZADEH, M., ET AL. Hpcc: High precision congestion control. In *Proceedings of the ACM Special Interest Group on Data Communication*. 2019, pp. 44–58.
- [59] LIN, J., PATEL, K., STEPHENS, B. E., SIVARAMAN, A., AND AKELLA, A. PANIC: A High-Performance programmable NIC for multi-tenant networks. In *14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)* (Nov. 2020), USENIX Association, pp. 243–259.
- [60] LIU, M., CUI, T., SCHUH, H., KRISHNAMURTHY, A., PETER, S., AND GUPTA, K. Offloading distributed applications onto smartnics using ipipe. In *Proceedings of the ACM Special Interest Group on Data Communication*. 2019, pp. 318–333.
- [61] LIU, M., PETER, S., KRISHNAMURTHY, A., AND PHOTHILIMTHANA, P. M. E3: Energy-efficient microservices on smartnic-accelerated servers. In USENIX annual technical conference (2019), pp. 363–378.

- [62] MARVELL. LiquidIO-III. https://www.marvell.com/content/ dam/marvell/en/public-collateral/embedded-processors/ marvell-liquidio-III-solutions-brief.pdf.
- [63] MIANO, S., SANAEE, A., RISSO, F., RÉTVÁRI, G., AND ANTICHI, G. Domain specific run time optimization for software data planes. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2022), pp. 1148–1164.
- [64] MICROSOFT. Introduction to Receive Side Scaling. https://learn.microsoft.com/en-us/windows-hardware/ drivers/network/introduction-to-receive-side-scaling.
- [65] MIN, J., LIU, M., CHUGH, T., ZHAO, C., WEI, A., DOH, I. H., AND KRISHNAMURTHY, A. Gimbal: enabling multi-tenant storage disaggregation on smartnic jbofs. In *Proceedings of the 2021 ACM SIGCOMM 2021 Conference* (2021), pp. 106–122.
- [66] MINTURN, D. Nvm express over fabrics. In 11th Annual OpenFabrics International OFS Developers' Workshop (2015).
- [67] MOGUL, J. C. Tcp offload is a dumb idea whose time has come. In *HotOS* (2003), pp. 25–30.
- [68] MOON, Y., LEE, S., JAMSHED, M. A., AND PARK, K. {AccelTCP}: Accelerating network applications with stateful {TCP} offloading. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20) (2020), pp. 77–92.
- [69] MUDIGONDA, J., YALAGANDULA, P., MOGUL, J., STIEKES, B., AND POUFFARY, Y. Netlord: a scalable multi-tenant network architecture for virtualized datacenters. ACM SIGCOMM Computer Communication Review 41, 4 (2011), 62–73.
- [70] NAMYAR, P., SUPITTAYAPORNPONG, S., ZHANG, M., YU, M., AND GOVINDAN, R. A throughput-centric view of the performance of datacenter topologies. In *Proceedings of the 2021 ACM SIGCOMM* 2021 Conference (2021), pp. 349–369.
- [71] NETRONOME. Agilio SmartNICs. https://www.netronome.com/ products/smartnic/overview/.
- [72] NVIDIA. Bluefield-3 DPA FlexIO. https://docs.nvidia.com/ doca/sdk/dpa-subsystem-programming-guide/index.html.
- [73] NVIDIA. Bluefield-3 DPU. https://www.nvidia.com/ content/dam/en-zz/Solutions/Data-Center/documents/ datasheet-nvidia-bluefield-3-dpu.pdf.
- [74] ORDONEZ, C., AND CHEN, Z. Horizontal aggregations in sql to prepare data sets for data mining analysis. *IEEE transactions on knowledge and data engineering 24*, 4 (2011), 678–691.
- [75] PETER, S., LI, J., ZHANG, I., PORTS, D. R., WOOS, D., KRISH-NAMURTHY, A., ANDERSON, T., AND ROSCOE, T. Arrakis: The operating system is the control plane. ACM Transactions on Computer Systems (TOCS) 33, 4 (2015), 1–30.
- [76] PHOTHILIMTHANA, P. M., LIU, M., KAUFMANN, A., PETER, S., BODIK, R., AND ANDERSON, T. Floem: A programming system for NIC-Accelerated network applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (Carlsbad, CA, Oct. 2018), USENIX Association, pp. 663–679.
- [77] PISMENNY, B., ERAN, H., YEHEZKEL, A., LISS, L., MORRISON, A., AND TSAFRIR, D. Autonomous nic offloads. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2021), pp. 18–35.
- [78] POURHABIBI, A., SUTHERLAND, M., DAGLIS, A., AND FALSAFI, B. Cerebros: Evading the rpc tax in datacenters. In *MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture* (2021), pp. 407–420.
- [79] PREKAS, G., KOGIAS, M., AND BUGNION, E. Zygos: Achieving low tail latency for microsecond-scale networked tasks. In *Proceedings of* the 26th Symposium on Operating Systems Principles (2017), pp. 325– 341.

- [80] RESTUCCIA, F., BIONDI, A., MARINONI, M., CICERO, G., AND BUTTAZZO, G. Axi hyperconnect: A predictable, hypervisor-level interconnect for hardware accelerators in fpga soc. In 2020 57th ACM/IEEE Design Automation Conference (DAC) (2020), IEEE, pp. 1– 6.
- [81] RESTUCCIA, F., PAGANI, M., BIONDI, A., MARINONI, M., AND BUTTAZZO, G. Is your bus arbiter really fair? restoring fairness in axi interconnects for fpga socs. ACM Transactions on Embedded Computing Systems (TECS) 18, 5s (2019), 1–22.
- [82] RIVITTI, A., BIFULCO, R., TULUMELLO, A., BONOLA, M., AND PONTARELLI, S. ehdl: Turning ebpf/xdp programs into hardware designs for the nic. In *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages* and Operating Systems, Volume 3 (2023), pp. 208–223.
- [83] ROSSI, D., CONTI, F., MARONGIU, A., PULLINI, A., LOI, I., GAUTSCHI, M., TAGLIAVINI, G., CAPOTONDI, A., FLATRESSE, P., AND BENINI, L. Pulp: A parallel ultra low power platform for next generation iot applications. In 2015 IEEE Hot Chips 27 Symposium (HCS) (2015), IEEE Computer Society, pp. 1–39.
- [84] ROY, A., ZENG, H., BAGGA, J., PORTER, G., AND SNOEREN, A. C. Inside the social network's (datacenter) network. *SIGCOMM Comput. Commun. Rev.* 45, 4 (Aug. 2015), 123–137.
- [85] SCHNEIDER, T., XU, P., AND HOEFLER, T. Fpspin: An fpga-based open-hardware research platform for processing in the network. arXiv preprint arXiv:2405.16378 (2024).
- [86] SEYEDROUDBARI, H., VANAVASAM, S., AND DAGLIS, A. Turbo: Smartnic-enabled dynamic load balancing of µs-scale rpcs. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (2023), IEEE, pp. 1045–1058.
- [87] SHASHIDHARA, R., STAMLER, T., KAUFMANN, A., AND PETER, S. {FlexTOE}: Flexible {TCP} offload with {Fine-Grained} parallelism. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (2022), pp. 87–102.
- [88] SINGLA, A., GODFREY, P. B., AND KOLLA, A. High throughput data center topology design. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (2014), pp. 29–41.
- [89] SIRACUSANO, G., GALEA, S., SANVITO, D., MALEKZADEH, M., ANTICHI, G., COSTA, P., HADDADI, H., AND BIFULCO, R. Rearchitecting traffic analysis with neural network interface cards. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (Renton, WA, Apr. 2022), USENIX Association, pp. 513–533.
- [90] SNYDER, W. Verilator: Open simulation-growing up. DVClub Bristol (2013).
- [91] STEPHENS, B. E., AKELLA, A., AND SWIFT, M. M. Loom: Flexible and efficient nic packet scheduling. In NSDI (2019), vol. 19, pp. 33– 46.
- [92] SUN, S., ZHANG, R., YAN, M., AND WU, J. Skv: A smartnicoffloaded distributed key-value store. In 2022 IEEE International Conference on Cluster Computing (CLUSTER) (2022), IEEE, pp. 1– 11.
- [93] SWAMY, T., RUCKER, A., SHAHBAZ, M., GAUR, I., AND OLUKO-TUN, K. Taurus: a data plane architecture for per-packet ml. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2022), pp. 1099–1114.
- [94] TORK, M., MAUDLEJ, L., AND SILBERSTEIN, M. Lynx: A smartnicdriven accelerator-centric architecture for network servers. In *Proceed*ings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (2020), pp. 117–131.
- [95] ULTRAETHERNET CONSORTIUM. The New Era Needs a New Network. https://ultraethernet.org.

- [96] VAHDAT, A., AND MILOJICIC, D. The next wave in cloud systems architecture. *Computer* 54, 10 (2021), 116–120.
- [97] WANG, T., YANG, X., ANTICHI, G., SIVARAMAN, A., AND PANDA, A. Isolation mechanisms for High-Speed Packet-Processing pipelines. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (Renton, WA, Apr. 2022), USENIX Association, pp. 1289–1305.
- [98] WANG, Z., HUANG, H., ZHANG, J., WU, F., AND ALONSO, G. {FpgaNIC}: An {FPGA-based} versatile 100gb {SmartNIC} for {GPUs}. In 2022 USENIX Annual Technical Conference (USENIX ATC 22) (2022), pp. 967–986.
- [99] WATERMAN, A., LEE, Y., AVIZIENIS, R., PATTERSON, D. A., AND ASANOVIĆ, K. The risc-v instruction set manual volume ii: Privileged architecture version 1.9. Tech. Rep. UCB/EECS-2016-129, EECS Department, University of California, Berkeley, Jul 2016.
- [100] WEIBAI, X. J., XU, Y., ELHADDAD, M., RAINDEL, S., PADHYE, J., AND ZHUO, A. R. L. D. Understanding rdma microarchitecture resources for performance isolation.
- [101] WOODRUFF, J., MOORE, A. W., AND ZILBERMAN, N. Measuring burstiness in data center applications. In *Proceedings of the 2019 Workshop on Buffer Sizing* (New York, NY, USA, 2019), BS '19, Association for Computing Machinery.
- [102] YAN, Y., BELDACHI, A. F., NEJABATI, R., AND SIMEONIDOU, D. P4-enabled smart nic: Enabling sliceable and service-driven optical data centres. *Journal of Lightwave Technology* 38, 9 (2020), 2688– 2694.
- [103] YANG, X., EGGERT, L., OTT, J., UHLIG, S., SUN, Z., AND AN-TICHI, G. Making quic quicker with nic offload. In *Proceedings of* the Workshop on the Evolution, Performance, and Interoperability of QUIC (2020), pp. 21–27.
- [104] YUAN, Y., HUANG, J., SUN, Y., WANG, T., NELSON, J., PORTS, D. R., WANG, Y., WANG, R., TAI, C., AND KIM, N. S. Rambda: Rdma-driven acceleration framework for memory-intensive μs-scale datacenter applications. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (2023), IEEE, pp. 499–515.
- [105] ZARUBA, F., SCHUIKI, F., AND BENINI, L. Manticore: A 4096-core risc-v chiplet architecture for ultraefficient floating-point computing. *IEEE Micro* 41, 2 (2020), 36–42.
- [106] ZHANG, Y., TAN, Y., STEPHENS, B., AND CHOWDHURY, M. Justitia: Software Multi-Tenancy in hardware Kernel-Bypass networks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22) (Renton, WA, Apr. 2022), USENIX Association, pp. 1307–1326.
- [107] ZHU, Y., ERAN, H., FIRESTONE, D., GUO, C., LIPSHTEYN, M., LIRON, Y., PADHYE, J., RAINDEL, S., YAHIA, M. H., AND ZHANG, M. Congestion control for large-scale rdma deployments. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 523–536.