Coupling ECN Marking Thresholds with Dynamic Buffer Allocation

Introduction

Explicit Congestion Notification (ECN) enables network devices to signal incipient congestion to endpoints without resorting to packet drops. A device marks a packet's IP header with the Congestion Experienced (CE) codepoint when a queue metric exceeds a configured Active Queue Management (AQM) threshold. The sender, upon learning of the CE mark through transport-layer feedback, reduces its sending rate.

Conventionally, the ECN marking threshold is a static value chosen by the network operator. This approach functions adequately when the maximum buffer available to a given queue is also static and predictable. However, modern data center switches rely heavily on dynamic buffer allocation. In such architectures, the maximum buffer a queue is permitted to consume (Buf_Thrd) fluctuates significantly with the total available shared buffer and the instantaneous number of active queues drawing from it. Dynamic buffer allocation schemes, such as those using the alpha parameter model, are widely deployed in commodity switching silicon to maximize memory utilization.

When Buf_Thrd shrinks (e.g., due to an incast event activating many queues), a static ECN threshold originally positioned well below the nominal buffer limit may suddenly be equal to or greater than the current Buf_Thrd. In this scenario, the device is forced into tail drop before the queue occupancy ever reaches the ECN threshold. The ECN mechanism effectively fails, yielding packet loss and higher tail latency rather than graceful rate reduction. Conversely, when the network load decreases and Buf_Thrd expands, the static threshold may sit far below the actual buffer capacity. This underutilizes available buffering and generates premature congestion signals that trigger unnecessary rate reduction and diminish link utilization.

Unlike sojourn-time based AQM algorithms (such as CoDel or PIE), which inherently adapt to buffer size variations by measuring delay rather than bytes, queue-depth based marking mechanisms (e.g., standard step-marking in DCTCP or RoCEv2 environments) are highly vulnerable to dynamic buffer fluctuations. This document specifies an operational mechanism that continually derives the ECN marking threshold (ECN_Thrd) from the instantaneous value of Buf_Thrd. The computation introduces two operator-configurable parameters, Offset and ECN_Floor, to maintain predictable headroom. The approach offers a deterministic, hardware-friendly way to maintain a consistent relationship between ECN marking and buffer availability.

Terminology

In the context of this document, a "queue" typically refers to a per-port, per-traffic-class transmission queue within a forwarding device.

Buf_Thrd (Buffer Threshold):: The dynamic buffer allocation limit for a specific queue. This represents the maximum amount of shared buffer memory that the queue is currently authorized to occupy. Buf_Thrd is recomputed by the device's buffer management subsystem, either periodically or in response to events.
ECN_Thrd (ECN Threshold):: The active ECN marking threshold for a queue. When the instantaneous or averaged queue occupancy meets or exceeds ECN_Thrd, the device applies the CE codepoint to arriving ECN-capable packets.
Offset:: A configurable parameter dictating the desired buffer headroom (typically measured in bytes or cells) maintained between Buf_Thrd and ECN_Thrd. The Offset acts as a shock absorber for packets already in flight during the control loop feedback delay.
ECN_Floor:: A configurable parameter establishing the minimum permissible boundary for ECN_Thrd. It acts as a safeguard against ECN_Thrd collapsing to excessively low values (e.g., below a single MTU), which would cause catastrophic throughput degradation via aggressive continuous marking.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

Applicability Statement

This document applies to network forwarding devices that use queue-depth based ECN marking mechanisms in conjunction with a dynamic buffer allocation scheme. It is primarily targeted at Data Center Networks (DCN) and high-speed interconnects where instantaneous or average queue length is evaluated against a byte-based or cell-based threshold.

This specification does not target devices employing sojourn-time based AQMs (e.g., CoDel or PIE), as time-based algorithms naturally abstract away the physical buffer size and are generally immune to the dynamic shared buffer problem described herein.

The operational logic defined here is strictly internal to the network device. It does not alter the ECN wire protocol, IP-layer ECN codepoint semantics, or the standardized transport-layer ECN negotiation. The method is compatible with Classic ECN marking as well as modern scalable congestion controls (e.g., the L4S architecture and its ECN protocol). In a DualQ Coupled AQM architecture, the dynamically computed ECN_Thrd may serve as the target threshold for the Classic queue, leaving the L4S queue's specialized marking behavior independent.

Problem Statement

To formalize the problem context, consider a Top-of-Rack (ToR) switch equipped with a 12 MB shared buffer pool and 48 egress ports. Under light traffic conditions with only 4 queues active, the dynamic buffer management may assign a Buf_Thrd of 3 MB to each active queue. If the network operator statically configures an ECN threshold of 200 KB, the system operates with 2.8 MB of effective headroom, providing ample shock absorption.

However, during a coordinated incast event where all 48 ports become heavily congested, the shared buffer is divided among all of them, and the dynamic Buf_Thrd for each queue drops to 250 KB. The statically configured 200 KB ECN threshold now leaves only 50 KB of headroom. In high-speed environments (e.g., 100 Gbps and above), 50 KB is significantly smaller than the Bandwidth-Delay Product (BDP) of the control loop. Consequently, the queue will hit the tail drop limit (Buf_Thrd) before the transport sender has time to react to the CE marks, inducing retransmission timeouts and latency spikes.

Conversely, if the operator statically configures the ECN threshold to 2 MB to optimize for high throughput under light load, the ECN mechanism fails completely during the incast event because the static ECN threshold (2 MB) far exceeds the active Buf_Thrd (250 KB).

A deterministic, dynamic coupling between Buf_Thrd and ECN_Thrd is necessary to resolve these dual failure modes without relying on static compromises.
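The arithmetic in this example can be reproduced with a short sketch. It assumes, purely for illustration, that the buffer manager splits the pool evenly among active queues (which matches the 3 MB and 250 KB figures above but is not how an alpha-based scheme necessarily behaves) and that sizes use decimal units. The function names are hypothetical.

```python
POOL_BYTES = 12_000_000  # 12 MB shared pool, decimal units (assumption)


def even_share(pool_bytes: int, active_queues: int) -> int:
    """Hypothetical stand-in for the buffer manager: split the pool evenly."""
    return pool_bytes // active_queues


def static_headroom(buf_thrd: int, static_ecn_thrd: int) -> int:
    """Bytes between the static marking point and the tail drop limit.

    A negative result means tail drop is reached before any CE mark.
    """
    return buf_thrd - static_ecn_thrd


for active_queues in (4, 48):
    buf_thrd = even_share(POOL_BYTES, active_queues)
    for static_ecn in (200_000, 2_000_000):
        headroom = static_headroom(buf_thrd, static_ecn)
        status = "marks before drop" if headroom > 0 else "tail drop precedes marking"
        print(f"queues={active_queues:2d} Buf_Thrd={buf_thrd:>9,d} "
              f"static ECN={static_ecn:>9,d} headroom={headroom:>10,d} ({status})")
```

With 48 active queues, the 200 KB threshold leaves 50,000 bytes of headroom and the 2 MB threshold leaves a negative value, which corresponds to the two failure modes described above.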

Dynamic Coupling Architecture

Prerequisite: Buffer State Awareness

This architecture requires the device's forwarding plane to expose the current Buf_Thrd value to the AQM/ECN marking engine. The specific memory management algorithm (e.g., alpha-based proportional allocation) that calculates Buf_Thrd is outside the scope of this document. The sole prerequisite is that Buf_Thrd is continuously updated and accessible with low latency.

Reference Algorithm for ECN Threshold

Network devices SHOULD compute ECN_Thrd continuously based on Buf_Thrd, Offset, and ECN_Floor. To ensure stability across all load extremes, the logic is segmented into three distinct operational regions:

Region A -- Sufficient Buffer (Nominal State):
   Condition: (Buf_Thrd - Offset) > ECN_Floor. The buffer allocation is generous enough to accommodate the full requested headroom (Offset). Here, ECN_Thrd = Buf_Thrd - Offset. The ECN threshold tracks the dynamic buffer limit, guaranteeing precisely the configured absorption capacity.

Region B -- Constrained Buffer (Congested State):
   Condition: (Buf_Thrd - Offset) <= ECN_Floor AND Buf_Thrd > ECN_Floor. The shared buffer is highly constrained. Enforcing the full Offset would depress ECN_Thrd below the critical ECN_Floor, risking excessive marking and severe throughput collapse. To mitigate this, the threshold is clamped: ECN_Thrd = ECN_Floor. The available headroom compresses to (Buf_Thrd - ECN_Floor), prioritizing reasonable throughput over optimal packet absorption.

Region C -- Critical Buffer (Exhaustion State):
   Condition: Buf_Thrd <= ECN_Floor. The queue's buffer allocation has collapsed to or below the minimum floor. In this critical state, clamping ECN_Thrd to ECN_Floor would result in ECN_Thrd >= Buf_Thrd, rendering ECN useless (tail drops would occur silently). Thus, ECN_Thrd = Buf_Thrd. While zero headroom remains, the device marks packets exactly at the tail drop boundary, ensuring the network still transmits explicit congestion signals.

The reference logic is expressed as follows:

   IF (Buf_Thrd - Offset) > ECN_Floor:
      RETURN Buf_Thrd - Offset   // Region A: Optimal tracking
   ELSE IF Buf_Thrd > ECN_Floor:
      RETURN ECN_Floor           // Region B: Floor clamped
   ELSE:
      RETURN Buf_Thrd            // Region C: Drop boundary

State Transition of Dynamic ECN Threshold

       +---------------------------+
       |  (Buf_Thrd - Offset) >    |
       |        ECN_Floor?         |
       +-------------+-------------+
          | YES                | NO
          v                    v
     ECN_Thrd =       +------------------+
     Buf_Thrd -       |   Buf_Thrd >     |
     Offset           |   ECN_Floor?     |
     [Region A]       +--------+---------+
                        | YES       | NO
                        v           v
                   ECN_Thrd =   ECN_Thrd =
                   ECN_Floor    Buf_Thrd
                   [Region B]   [Region C]

This algorithm requires minimal logic (two comparators and one subtractor), so it can be evaluated in standard Application-Specific Integrated Circuit (ASIC) pipelines with nanosecond-scale latency.
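As a software illustration of the three-region logic above, the following minimal sketch mirrors the reference algorithm. The function name and the example parameter values (Offset = 150,000 bytes, ECN_Floor = 100,000 bytes) are illustrative assumptions, not values taken from this document; all quantities are non-negative integers in bytes.

```python
def compute_ecn_thrd(buf_thrd: int, offset: int, ecn_floor: int) -> int:
    """Derive ECN_Thrd from Buf_Thrd using the three-region logic.

    Assumes offset >= 0 and ecn_floor >= 0 (illustrative precondition).
    """
    if buf_thrd - offset > ecn_floor:
        return buf_thrd - offset  # Region A: track the buffer limit
    if buf_thrd > ecn_floor:
        return ecn_floor          # Region B: clamp to the floor
    return buf_thrd               # Region C: mark at the drop boundary


# Illustrative parameters (assumptions, not recommendations).
OFFSET = 150_000
ECN_FLOOR = 100_000

for buf_thrd in (3_000_000, 200_000, 80_000):
    ecn_thrd = compute_ecn_thrd(buf_thrd, OFFSET, ECN_FLOOR)
    print(f"Buf_Thrd={buf_thrd:>9,d} -> ECN_Thrd={ecn_thrd:>9,d}")
```

The three sample inputs fall into Regions A, B, and C respectively.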

Architectural Invariants

Implementations conforming to this framework SHOULD validate the following invariants to prevent anomalous traffic handling:

1. ECN_Thrd MUST NOT exceed Buf_Thrd (ECN_Thrd <= Buf_Thrd). This guarantees that ECN marking is always attempted prior to or simultaneously with queue tail drop.

2. ECN_Thrd MUST NOT fall below ECN_Floor, unless the dynamic buffer limit (Buf_Thrd) has itself fallen below ECN_Floor.
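One way to gain confidence in these invariants is to sweep the reference logic over a grid of inputs. The sketch below is a hypothetical software check, not a hardware verification method; it restates the reference algorithm so that it is self-contained and assumes a non-negative Offset (a negative Offset would violate the first invariant in Region A).

```python
def compute_ecn_thrd(buf_thrd: int, offset: int, ecn_floor: int) -> int:
    """Three-region reference logic (restated for self-containment)."""
    if buf_thrd - offset > ecn_floor:
        return buf_thrd - offset
    if buf_thrd > ecn_floor:
        return ecn_floor
    return buf_thrd


def invariants_hold(buf_thrd: int, offset: int, ecn_floor: int) -> bool:
    """Check both invariants for one parameter combination."""
    ecn_thrd = compute_ecn_thrd(buf_thrd, offset, ecn_floor)
    never_above_buffer = ecn_thrd <= buf_thrd
    # Below the floor is only allowed when Buf_Thrd itself is below it.
    never_below_floor = ecn_thrd >= min(ecn_floor, buf_thrd)
    return never_above_buffer and never_below_floor


# Sweep grid: values are illustrative assumptions.
violations = [
    (buf, off, floor)
    for buf in range(0, 400_001, 1_000)
    for off in (0, 50_000, 150_000, 500_000)
    for floor in (9_000, 30_000, 100_000)
    if not invariants_hold(buf, off, floor)
]
print(f"violations found: {len(violations)}")
```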

Operational Considerations

Update Synchronization

ECN_Thrd MUST be recomputed whenever Buf_Thrd changes. Event-driven synchronization is RECOMMENDED over periodic polling. Polling introduces phase delay, leaving ECN_Thrd stale during the microsecond-scale inflection points of transient congestion, when an accurate threshold matters most. If atomic hardware updates are impossible, implementations SHOULD resolve the resulting race so that any transient inconsistency favors a lower ECN_Thrd (causing a premature mark) over a higher ECN_Thrd (causing an unnotified drop).
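One possible way to realize the bias toward a lower ECN_Thrd, when the two values cannot be written atomically, is to choose the write order based on the direction of the change. The sketch below is an assumption-laden illustration rather than a description of any actual hardware: the class, the function names, and the two-step write model are hypothetical. It relies on the observation that the reference logic never decreases ECN_Thrd when Buf_Thrd increases (for a fixed, non-negative Offset).

```python
from dataclasses import dataclass
from typing import List, Tuple


def compute_ecn_thrd(buf_thrd: int, offset: int, ecn_floor: int) -> int:
    """Three-region reference logic (restated for self-containment)."""
    if buf_thrd - offset > ecn_floor:
        return buf_thrd - offset
    if buf_thrd > ecn_floor:
        return ecn_floor
    return buf_thrd


@dataclass
class QueueThresholds:
    """Hypothetical per-queue register pair, written one field at a time."""
    buf_thrd: int
    ecn_thrd: int


def apply_update(q: QueueThresholds, new_buf_thrd: int,
                 offset: int, ecn_floor: int) -> List[Tuple[int, int]]:
    """Apply two non-atomic writes; return (buf_thrd, ecn_thrd) after each."""
    new_ecn_thrd = compute_ecn_thrd(new_buf_thrd, offset, ecn_floor)
    states = []
    if new_buf_thrd < q.buf_thrd:
        # Shrinking: lower the marking threshold first.
        q.ecn_thrd = new_ecn_thrd
        states.append((q.buf_thrd, q.ecn_thrd))
        q.buf_thrd = new_buf_thrd
    else:
        # Growing: raise the buffer limit first, keep the old (lower) mark.
        q.buf_thrd = new_buf_thrd
        states.append((q.buf_thrd, q.ecn_thrd))
        q.ecn_thrd = new_ecn_thrd
    states.append((q.buf_thrd, q.ecn_thrd))
    return states


q = QueueThresholds(buf_thrd=3_000_000, ecn_thrd=2_850_000)
shrink_states = apply_update(q, 250_000, offset=150_000, ecn_floor=100_000)
grow_states = apply_update(q, 3_000_000, offset=150_000, ecn_floor=100_000)
print(shrink_states, grow_states)
```

In both directions the intermediate state carries the lower of the old and new ECN_Thrd values, so ECN_Thrd never exceeds Buf_Thrd between the two writes, provided the starting state was consistent.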

Tuning the Offset Parameter

The Offset represents the network's required "shock absorber." Operators SHOULD calibrate the Offset to slightly exceed the expected Bandwidth-Delay Product (BDP) of the typical congestion control feedback loop:

   Offset ≈ Link_Rate * RTT

where Link_Rate is expressed in bytes per second. In contemporary intra-data-center fabrics (RTT of roughly 20-50 microseconds, 400 Gbps links), this corresponds to Offset values ranging from 1 MB to 2.5 MB. Oversizing the Offset prematurely throttles flows; undersizing it invites high tail-drop rates despite ECN capability.
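The BDP figures quoted above can be checked with a few lines of integer arithmetic. This is a minimal sketch: the function names are hypothetical, and the 10 percent margin used to model "slightly exceed" is an illustrative assumption, not a recommendation from this document.

```python
def bdp_bytes(link_rate_bps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes for a rate in bits/s and an RTT in microseconds."""
    return link_rate_bps * rtt_us // (8 * 1_000_000)


def offset_with_margin(bdp: int, margin_percent: int = 10) -> int:
    """Offset slightly above the BDP; the default margin is an assumption."""
    return bdp * (100 + margin_percent) // 100


LINK_RATE_BPS = 400_000_000_000  # 400 Gbps

for rtt_us in (20, 50):
    bdp = bdp_bytes(LINK_RATE_BPS, rtt_us)
    print(f"RTT {rtt_us} us: BDP = {bdp:,d} bytes, "
          f"Offset with margin = {offset_with_margin(bdp):,d} bytes")
```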

Tuning the ECN_Floor Parameter

ECN_Floor bounds how aggressively the device can throttle senders. It MUST NOT be configured smaller than the Maximum Transmission Unit (MTU) of the link (e.g., 9000 bytes). For environments running Data Center TCP (DCTCP), ECN_Floor SHOULD typically mirror the static thresholds recommended for shallow buffering (e.g., 30 KB to 100 KB), preventing the queue from emptying completely while maintaining low queuing delay.

Implementation Status

[RFC Editor: Please remove this section before publication.]

This section records the status of known implementations of the protocol defined by this specification at the time of posting of this Internet-Draft, and is based on a proposal described in RFC 7942. The description of implementations in this section is intended to assist the IETF in its decision processes in progressing drafts to RFCs.

The dynamic ECN threshold coupling mechanism described in this document has been implemented and validated in the data plane of Centec Networks' switching silicon, specifically designed to mitigate micro-bursts and incast congestion in large-scale RDMA over Converged Ethernet (RoCEv2) deployments by China Mobile.

Related Work

Existing general AQM recommendations outline the complexities of parameter tuning. While this document aligns with the intent of those recommendations, it specifically isolates and addresses the intersection of AQM and dynamic shared buffering, a domain not fully explored in earlier AQM guidelines.

A previously proposed AI-based ECN approach targets similar parameter adaptation via machine learning. The framework in this document, conversely, advocates a deterministic data-path calculation that requires no training data, no external control-plane telemetry loop, and no inference latency.

TCP Alternative Backoff with ECN (ABE) optimizes how endpoints react to CE marks. ABE is strictly complementary: it refines the sender response, whereas this architecture ensures the network device generates those marks at an occupancy level consistent with the buffer actually available.

Security Considerations

This specification introduces an automated internal parameter coupling within the network forwarding plane. It does not exchange new protocol messages across the wire and therefore introduces no new cryptographic or protocol-level attack surfaces.

Operational Degradation via Misconfiguration: Invalid configuration of Offset or ECN_Floor can cause self-inflicted Denial of Service (DoS) behaviors. For instance, an excessively large Offset can pin queues in Region B, where ECN_Thrd stays clamped at ECN_Floor and marking begins at the floor regardless of how much buffer is available. An excessively large ECN_Floor can push queues into Region C, where marking occurs only at the tail drop boundary and early congestion warning is effectively disabled. Implementations SHOULD validate parameter inputs through management interfaces and emit warnings if Offset exceeds typical physical buffer allocations.

Internal Signaling Integrity: The architectural dependency between the memory management unit (MMU) and the ECN marking engine requires deterministic internal signaling. If the internal update of Buf_Thrd is delayed or corrupted under heavy system load, the ECN_Thrd calculation will be based on stale memory constraints, leading to temporary periods of over-marking or under-marking. Hardware designs SHOULD prioritize this internal signaling path.

Buffer Exhaustion Vectors: Malicious, non-responsive flows could intentionally occupy large allocations of the shared buffer pool. In dynamic buffer architectures, this compresses the Buf_Thrd for all other benign queues, pushing them into Region B or Region C. This is an inherent vulnerability of shared memory switches and is not introduced by this ECN algorithm. Operators MUST utilize per-queue maximum caps, port-level QoS scheduling, and admission control to insulate queues from cross-traffic buffer starvation.
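The parameter validation recommended under Operational Degradation via Misconfiguration might look like the following sketch at the management interface. The function name, its arguments, and the notion of a "typical" Buf_Thrd supplied by the operator are hypothetical; the check on ECN_Floor against that typical value is an added assumption beyond what this document requires.

```python
from typing import List


def validate_ecn_config(offset: int, ecn_floor: int, mtu: int,
                        typical_buf_thrd: int) -> List[str]:
    """Reject invalid settings and return warnings for risky ones.

    typical_buf_thrd is an operator-supplied estimate (assumption).
    """
    if offset < 0:
        raise ValueError("Offset must be non-negative")
    if ecn_floor < mtu:
        # ECN_Floor MUST NOT be configured smaller than the link MTU.
        raise ValueError("ECN_Floor must be at least one MTU")

    warnings = []
    if offset > typical_buf_thrd:
        warnings.append("Offset exceeds the typical per-queue buffer "
                        "allocation; Region A will rarely be reached")
    if ecn_floor >= typical_buf_thrd:
        # Extra check, not mandated by the text (assumption).
        warnings.append("ECN_Floor is at or above the typical Buf_Thrd; "
                        "queues will sit in Region C with zero headroom")
    return warnings


print(validate_ecn_config(offset=4_000_000, ecn_floor=100_000,
                          mtu=9_000, typical_buf_thrd=3_000_000))
```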

IANA Considerations

This document has no IANA actions.