<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp "&#160;">
]>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     ipr="trust200902"
     docName="draft-yang-dynamic-ecn-threshold-00"
     category="info"
     submissionType="IETF"
     consensus="true"
     version="3">

  <front>
    <title abbrev="Dynamic ECN Threshold">
      Coupling ECN Marking Thresholds with Dynamic Buffer Allocation
    </title>
    <seriesInfo name="Internet-Draft"
                value="draft-yang-dynamic-ecn-threshold-00"/>

    <author fullname="Jin Yang" initials="J." surname="Yang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>yangjinwl@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Weiqiang Cheng" initials="W." surname="Cheng">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>chengweiqiang@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Yuchi Tian" initials="Y." surname="Tian">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>tianyuchi@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>wangjj@centec.com</email>
      </address>
    </author>

    <author fullname="Guoying Zhang" initials="G." surname="Zhang">
      <organization>Centec</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>zhanggy@centec.com</email>
      </address>
    </author>

    <date year="2026" month="March" day="1"/>
    <area>Transport</area>
    <workgroup>Transport Area Working Group</workgroup>

    <keyword>ECN</keyword>
    <keyword>dynamic threshold</keyword>
    <keyword>buffer management</keyword>
    <keyword>data center</keyword>
    <keyword>AQM</keyword>
    <keyword>shared buffer</keyword>

    <abstract>
      <t>Explicit Congestion Notification (ECN) marking thresholds are
      typically configured statically. In modern network devices that employ
      dynamic buffer allocation -- where the maximum buffer available
      to a queue varies with the number of active queues
      and the remaining shared buffer pool -- a static ECN threshold can
      become misaligned with the buffer actually available to the queue.</t>
      <t>This misalignment leads to one of two failure modes: premature marking
      (which underutilizes available buffers and throttles throughput) or late marking
      (which provides no advance warning before tail drop occurs). This document
      describes an operational framework and a deterministic reference algorithm that
      couple the ECN marking threshold to the dynamic buffer allocation
      limit. Two configurable parameters keep the threshold a fixed headroom below
      the limit while preventing it from collapsing, so congestion signaling remains
      effective across varying load conditions without machine-learning models or
      per-flow tracking.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="introduction">
      <name>Introduction</name>
      <t>Explicit Congestion Notification (ECN) <xref target="RFC3168"/> enables network 
      devices to signal incipient congestion to endpoints without resorting to packet drops. 
      A device marks a packet's IP header with the Congestion Experienced (CE)
      codepoint when a specific queue metric exceeds a configured Active Queue Management 
      (AQM) threshold. The sender, upon learning of the CE mark through transport-layer
      feedback, proactively reduces its sending rate.</t>

      <t>Conventionally, the ECN marking threshold is a static value
      chosen by the network operator. This approach works adequately when the
      maximum buffer available to a given queue is also static and predictable. However,
      modern data center switches rely heavily on dynamic buffer
      allocation. In such architectures, the maximum buffer a queue is permitted to consume
      (Buf_Thrd) varies with the total available shared buffer and the
      instantaneous number of active queues drawing from it. Dynamic buffer allocation schemes,
      such as those based on the alpha parameter model, are widely deployed in commodity
      switching silicon to maximize memory utilization.</t>

      <t>When Buf_Thrd shrinks (e.g., due to an incast event activating many queues), a
      static ECN threshold originally positioned well below the nominal buffer limit may 
      suddenly be equal to or greater than the current Buf_Thrd. In this scenario, the 
      device is forced into tail drop before the queue occupancy ever reaches the ECN 
      threshold. The ECN mechanism effectively fails, yielding severe packet loss and higher 
      tail latency rather than graceful rate reduction.</t>

      <t>Conversely, when the network load decreases and Buf_Thrd expands, the static
      threshold may sit far below the actual buffer capacity. The device then marks
      packets while most of the available buffer is still unused, and these premature
      congestion signals trigger unnecessary rate reduction and reduce link utilization.</t>

      <t>Unlike sojourn-time based AQM algorithms (such as CoDel <xref target="RFC8289"/> 
      or PIE <xref target="RFC8033"/>), which inherently adapt to buffer size variations by 
      measuring delay rather than bytes, queue-depth based marking mechanisms (e.g., standard 
      step-marking in DCTCP <xref target="RFC8257"/> or RoCEv2 environments) are highly 
      vulnerable to dynamic buffer fluctuations.</t>

      <t>This document specifies an operational mechanism that continually derives the ECN
      marking threshold (ECN_Thrd) from the instantaneous value of Buf_Thrd. The 
      computation introduces two operator-configurable parameters to maintain predictable 
      headroom. The approach offers a deterministic, hardware-friendly solution to maintain a 
      consistent relationship between ECN marking and buffer availability.</t>
    </section>

    <section anchor="terminology">
      <name>Terminology</name>
      <t>In the context of this document, a "queue" typically refers to a per-port, 
      per-traffic-class transmission queue within a forwarding device.</t>
      <dl>
        <dt>Buf_Thrd (Buffer Threshold):</dt>
        <dd>
          <t>The dynamic buffer allocation limit for a specific queue. This represents
          the maximum amount of shared buffer memory that the queue is currently authorized
          to occupy. Buf_Thrd is recomputed by the device's buffer management subsystem,
          either periodically or in response to events.</t>
        </dd>

        <dt>ECN_Thrd (ECN Threshold):</dt>
        <dd>
          <t>The active ECN marking threshold for a queue. When the instantaneous or 
          averaged queue occupancy meets or exceeds ECN_Thrd, the device applies the CE 
          codepoint to arriving ECN-capable packets.</t>
        </dd>

        <dt>Offset:</dt>
        <dd>
          <t>A configurable parameter specifying the desired buffer headroom (typically
          measured in bytes or cells) maintained between Buf_Thrd and ECN_Thrd. The Offset
          absorbs packets that are already in flight during the feedback delay of the
          congestion control loop.</t>
        </dd>

        <dt>ECN_Floor:</dt>
        <dd>
          <t>A configurable parameter establishing the minimum permissible boundary for 
          ECN_Thrd. It acts as a safeguard against ECN_Thrd collapsing to excessively low 
          values (e.g., below a single MTU), which would cause catastrophic throughput 
          degradation via aggressive continuous marking.</t>
        </dd>
      </dl>
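
      <t>As a non-normative illustration of the ECN_Thrd definition above, the
      following Python sketch models the per-packet marking decision. The function
      name mark_on_enqueue and the enum are hypothetical and exist only for this
      example; the two-bit codepoint values are those of
      <xref target="RFC3168"/>. It assumes marking is evaluated against the
      instantaneous occupancy in bytes.</t>
      <sourcecode type="python"><![CDATA[
from enum import IntEnum


class EcnCodepoint(IntEnum):
    """Two-bit ECN field values from RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def mark_on_enqueue(occupancy: int, ecn_thrd: int,
                    codepoint: EcnCodepoint) -> EcnCodepoint:
    """Return the codepoint a packet carries after the marking check.

    Hypothetical helper: only ECN-capable packets are marked, and only
    when the queue occupancy meets or exceeds ECN_Thrd.
    """
    ecn_capable = codepoint in (EcnCodepoint.ECT_0, EcnCodepoint.ECT_1)
    if ecn_capable and occupancy >= ecn_thrd:
        return EcnCodepoint.CE
    return codepoint
]]></sourcecode>
      <t>Packets that are not ECN-capable pass through this check unchanged;
      how a device treats them when the threshold is exceeded is outside the
      scope of the sketch.</t>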

      <section anchor="req-lang">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
        "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",
        "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are
        to be interpreted as described in BCP 14
        <xref target="RFC2119"/> <xref target="RFC8174"/> when, and
        only when, they appear in all capitals, as shown here.</t>
      </section>
    </section>

    <section anchor="applicability">
      <name>Applicability Statement</name>
      <t>This document is explicitly applicable to network forwarding devices utilizing 
      queue-depth based ECN marking mechanisms in conjunction with a dynamic buffer 
      allocation scheme. It is primarily targeted at Data Center Networks (DCN) and high-speed 
      interconnects where instantaneous queue length or average queue length is evaluated 
      against a byte-based or cell-based threshold.</t>

      <t>This specification does not target devices employing sojourn-time based AQMs
      (e.g., <xref target="RFC8289"/>, <xref target="RFC8033"/>). Time-based algorithms
      measure delay rather than queue depth, so they are largely insensitive to the
      dynamic shared buffer problem described herein.</t>

      <t>The operational logic defined here is strictly internal to the network device. 
      It does not alter the ECN wire protocol, IP-layer ECN codepoint semantics, or the 
      transport-layer negotiation standardized in <xref target="RFC3168"/>.</t>

      <t>The method is compatible with Classic ECN marking as well as modern scalable 
      congestion controls (e.g., the L4S architecture <xref target="RFC9330"/> and its 
      ECN protocol <xref target="RFC9331"/>). In a DualQ Coupled AQM 
      <xref target="I-D.ietf-tsvwg-aqm-dualq-coupled"/> architecture, the dynamically computed 
      ECN_Thrd may serve as the target threshold for the Classic queue, leaving the L4S 
      queue's specialized marking behavior independent.</t>
    </section>

    <section anchor="problem-statement">
      <name>Problem Statement</name>
      <t>To formalize the problem context, consider a Top-of-Rack (ToR) switch equipped 
      with a 12 MB shared buffer pool and 48 egress ports. Under light traffic conditions with 
      only 4 queues active, the dynamic buffer management may assign a Buf_Thrd of 3 MB to 
      each active queue. Assuming a network operator statically configures an ECN threshold of 
      200 KB, the system operates with 2.8 MB of effective headroom, providing ample shock 
      absorption.</t>

      <t>However, during a coordinated incast event where all 48 ports become heavily congested,
      the shared buffer is divided among many more queues, and the dynamic Buf_Thrd for each
      queue drops to 250 KB. The statically configured 200 KB ECN threshold now leaves only
      50 KB of headroom. In high-speed environments (e.g., 100 Gbps and above), 50 KB is
      significantly smaller than the Bandwidth-Delay Product (BDP) of the control loop.
      Consequently, the queue reaches the tail drop limit (Buf_Thrd) before the transport
      sender has time to react to the CE marks, causing retransmission timeouts and latency
      spikes.</t>

      <t>Conversely, if the operator statically configures the ECN threshold to 2 MB to
      optimize for high throughput under light load, the ECN mechanism fails completely
      during the incast event because the static ECN threshold (2 MB) far exceeds
      the active Buf_Thrd (250 KB), so packets are dropped before any can be marked.</t>
      
      <t>A deterministic, dynamic coupling between Buf_Thrd and ECN_Thrd is necessary to 
      resolve these dual failure modes without relying on static compromises.</t>
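      <t>The arithmetic behind the two failure modes can be reproduced with a few
      lines of Python. This is an illustrative sketch only: the helper name
      static_headroom is hypothetical, and decimal units (1 KB = 1000 bytes) are
      assumed so that 12 MB divided across 48 queues yields 250 KB.</t>
      <sourcecode type="python"><![CDATA[
KB = 1000          # decimal units, so 12 MB / 48 queues = 250 KB
MB = 1000 * KB


def static_headroom(buf_thrd: int, static_ecn_thrd: int) -> int:
    """Bytes between a static ECN threshold and the tail drop limit.

    A result of zero or less means the queue tail-drops before it
    can mark any packet.
    """
    return buf_thrd - static_ecn_thrd


# Walk the four combinations discussed in this section.
for load, buf_thrd in (("light load", 3 * MB), ("incast", 250 * KB)):
    for static_thrd in (200 * KB, 2 * MB):
        headroom = static_headroom(buf_thrd, static_thrd)
        status = "marks before drop" if headroom > 0 else "drops before mark"
        print(f"{load}: ECN={static_thrd // KB} KB, "
              f"headroom={headroom // KB} KB ({status})")
]]></sourcecode>
      <t>Neither static value behaves well in both load conditions, which is the
      motivation for deriving the threshold from Buf_Thrd.</t>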
    </section>

    <section anchor="computation">
      <name>Dynamic Coupling Architecture</name>

      <section anchor="buf-thrd">
        <name>Prerequisite: Buffer State Awareness</name>
        <t>The foundation of this architecture requires the device's forwarding plane to 
        expose the current Buf_Thrd value to the AQM/ECN marking engine. The specific 
        memory management algorithm (e.g., alpha-based proportional allocation) calculating 
        Buf_Thrd is outside the scope of this document. The sole prerequisite is that 
        Buf_Thrd is continuously updated and accessible with low latency.</t>
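        <t>For readers unfamiliar with alpha-based allocation, the sketch below
        shows one common form of it, in which the per-queue limit is a multiple of
        the currently unused shared buffer. It is an assumption made for
        illustration and is not required by this document; the function and
        parameter names are hypothetical, and real devices typically restrict
        alpha to a small set of hardware-friendly values.</t>
        <sourcecode type="python"><![CDATA[
def alpha_buf_thrd(alpha: float, shared_pool: int, pool_in_use: int) -> int:
    """Illustrative alpha model: Buf_Thrd = alpha * free shared buffer.

    All sizes are in bytes. The free space is floored at zero so an
    over-committed pool never yields a negative limit.
    """
    free = max(shared_pool - pool_in_use, 0)
    return int(alpha * free)
]]></sourcecode>
        <t>Under this model Buf_Thrd shrinks as other queues fill the pool, which
        is the behavior the ECN marking engine needs to observe.</t>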
      </section>

      <section anchor="ecn-derivation">
        <name>Reference Algorithm for ECN Threshold</name>
        <t>Network devices SHOULD compute ECN_Thrd continuously based on Buf_Thrd, 
        Offset, and ECN_Floor. To ensure stability across all load extremes, the logic 
        is segmented into three distinct operational regions:</t>

        <t>Region A -- Sufficient Buffer (Nominal State):</t>
        <t>Condition: (Buf_Thrd - Offset) &gt; ECN_Floor.
        The buffer allocation is generous enough to accommodate the full requested 
        headroom (Offset). Here, ECN_Thrd = Buf_Thrd - Offset. The ECN threshold securely 
        tracks the dynamic buffer limit, guaranteeing precisely the configured absorption capacity.</t>

        <t>Region B -- Constrained Buffer (Congested State):</t>
        <t>Condition: (Buf_Thrd - Offset) &lt;= ECN_Floor AND Buf_Thrd &gt; ECN_Floor.
        The shared buffer is highly constrained. Enforcing the full Offset would depress 
        ECN_Thrd below the critical ECN_Floor, risking excessive marking and severe throughput 
        collapse. To mitigate this, the threshold is clamped: ECN_Thrd = ECN_Floor. 
        The available headroom compresses to (Buf_Thrd - ECN_Floor), prioritizing reasonable 
        throughput over optimal packet absorption.</t>

        <t>Region C -- Critical Buffer (Exhaustion State):</t>
        <t>Condition: Buf_Thrd &lt;= ECN_Floor.
        The queue's buffer allocation has collapsed to or below the minimum floor. In this
        state, clamping ECN_Thrd to ECN_Floor would result in ECN_Thrd &gt;= Buf_Thrd,
        so the queue would tail-drop before any packet could be marked. Instead,
        ECN_Thrd = Buf_Thrd. Although no headroom remains, the device marks packets
        right at the tail drop boundary, so that the queue can still emit explicit
        congestion signals.</t>

        <t>The reference logic is expressed as follows:</t>
        <sourcecode type="pseudocode"><![CDATA[
function compute_ecn_threshold(Buf_Thrd, Offset, ECN_Floor):
    IF (Buf_Thrd - Offset) > ECN_Floor:
        RETURN Buf_Thrd - Offset          // Region A: Optimal tracking
    ELSE IF Buf_Thrd > ECN_Floor:
        RETURN ECN_Floor                  // Region B: Floor clamped
    ELSE:
        RETURN Buf_Thrd                   // Region C: Drop boundary
]]></sourcecode>

        <figure anchor="fig-ecn-logic">
          <name>State Transition of Dynamic ECN Threshold</name>
          <artwork><![CDATA[
                     Buf_Thrd Update Event
                           |
                           v
              +---------------------------+
              | (Buf_Thrd - Offset) >     |
              |        ECN_Floor?         |
              +-------------+-------------+
              | YES         | NO
              v             v
    ECN_Thrd =     +------------------+
    Buf_Thrd -     | Buf_Thrd >       |
    Offset         |    ECN_Floor?    |
    [Region A]     +--------+---------+
                   | YES    | NO
                   v        v
         ECN_Thrd =   ECN_Thrd =
         ECN_Floor    Buf_Thrd
         [Region B]   [Region C]
]]></artwork>
        </figure>

        <t>This algorithm requires minimal logic (two comparators and one subtractor),
        so it can be evaluated in standard Application-Specific Integrated Circuit (ASIC)
        pipelines with nanosecond-scale latency.</t>
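        <t>For simulation or unit testing, the reference logic can be transcribed
        into Python as sketched below. The Region enum and the tuple return value
        are additions made for this example. The first comparison is written as
        Buf_Thrd &gt; ECN_Floor + Offset, which is arithmetically equivalent to the
        pseudocode but avoids a negative intermediate value when Offset exceeds
        Buf_Thrd; whether a given hardware design needs that rearrangement is an
        implementation matter.</t>
        <sourcecode type="python"><![CDATA[
from enum import Enum
from typing import Tuple


class Region(Enum):
    A = "sufficient"
    B = "constrained"
    C = "critical"


def compute_ecn_threshold(buf_thrd: int, offset: int,
                          ecn_floor: int) -> Tuple[int, Region]:
    """Return (ECN_Thrd, region) for non-negative byte or cell counts."""
    # Same test as (buf_thrd - offset) > ecn_floor, without underflow.
    if buf_thrd > ecn_floor + offset:
        return buf_thrd - offset, Region.A   # full headroom available
    if buf_thrd > ecn_floor:
        return ecn_floor, Region.B           # clamped to the floor
    return buf_thrd, Region.C                # mark at the drop boundary
]]></sourcecode>
        <t>With an assumed Offset of 100000 bytes and ECN_Floor of 30000 bytes, a
        Buf_Thrd of 3000000 bytes falls in Region A, 120000 bytes in Region B, and
        20000 bytes in Region C.</t>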
      </section>

      <section anchor="invariants">
        <name>Architectural Invariants</name>
        <t>Implementations conforming to this framework SHOULD validate the following 
        invariants to prevent anomalous traffic handling:</t>
        <ol>
          <li>ECN_Thrd MUST NOT exceed Buf_Thrd (ECN_Thrd &lt;= Buf_Thrd). This
          guarantees that ECN marking is always attempted prior to or simultaneously
          with queue tail drop.</li>
          <li>ECN_Thrd MUST NOT fall below ECN_Floor, unless the dynamic buffer
          limit (Buf_Thrd) has itself fallen to or below ECN_Floor.</li>
        </ol>
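        <t>One way to gain confidence in these invariants is to check them over a
        sweep of parameter values. The sketch below is a test aid, not part of the
        mechanism; the helper names and the sweep range are arbitrary choices made
        for this example.</t>
        <sourcecode type="python"><![CDATA[
def compute_ecn_threshold(buf_thrd: int, offset: int, ecn_floor: int) -> int:
    """Python transcription of the reference pseudocode."""
    if buf_thrd - offset > ecn_floor:
        return buf_thrd - offset
    if buf_thrd > ecn_floor:
        return ecn_floor
    return buf_thrd


def invariants_hold(buf_thrd: int, offset: int, ecn_floor: int) -> bool:
    """Check both invariants for one combination of inputs."""
    ecn_thrd = compute_ecn_threshold(buf_thrd, offset, ecn_floor)
    never_above_limit = ecn_thrd <= buf_thrd
    floor_respected = ecn_thrd >= ecn_floor or buf_thrd <= ecn_floor
    return never_above_limit and floor_respected


def sweep(step: int = 10_000, limit: int = 400_000) -> bool:
    """Check every combination of the three inputs on a coarse grid."""
    values = range(0, limit + 1, step)
    return all(invariants_hold(b, o, f)
               for b in values for o in values for f in values)
]]></sourcecode>
        <t>A sweep of this kind does not replace a proof, but it quickly catches
        transcription errors such as a swapped comparison.</t>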
      </section>
    </section>

    <section anchor="op-considerations">
      <name>Operational Considerations</name>
      
      <section anchor="sync-guidance">
        <name>Update Synchronization</name>
        <t>ECN_Thrd MUST be recomputed whenever Buf_Thrd changes.
        Event-driven synchronization is RECOMMENDED over periodic polling. Polling
        introduces a delay between the two updates, leaving ECN_Thrd stale during the
        microsecond-scale onset of transient congestion, when it matters most. If the two
        values cannot be updated atomically in hardware, implementations SHOULD order the
        updates so that any transient inconsistency favors a lower ECN_Thrd (causing a
        premature mark) over a higher ECN_Thrd (causing an unnotified drop).</t>
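        <t>One possible way to realize this bias, sketched below under the
        assumption that the two values are written one after the other, is to lower
        ECN_Thrd before lowering Buf_Thrd and to raise Buf_Thrd before raising
        ECN_Thrd. Because the reference computation never decreases when Buf_Thrd
        increases, every intermediate state then satisfies ECN_Thrd &lt;= Buf_Thrd.
        The class and its history list are hypothetical and exist only to make the
        intermediate states observable.</t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass, field
from typing import List, Tuple


def compute_ecn_threshold(buf_thrd: int, offset: int, ecn_floor: int) -> int:
    """Python transcription of the reference pseudocode."""
    if buf_thrd - offset > ecn_floor:
        return buf_thrd - offset
    if buf_thrd > ecn_floor:
        return ecn_floor
    return buf_thrd


@dataclass
class QueueThresholds:
    offset: int
    ecn_floor: int
    buf_thrd: int = 0
    ecn_thrd: int = 0
    # Every state visible to the data path, as (buf_thrd, ecn_thrd).
    history: List[Tuple[int, int]] = field(default_factory=list)

    def _commit(self) -> None:
        self.history.append((self.buf_thrd, self.ecn_thrd))

    def on_buf_thrd_update(self, new_buf_thrd: int) -> None:
        new_ecn = compute_ecn_threshold(new_buf_thrd, self.offset,
                                        self.ecn_floor)
        if new_buf_thrd < self.buf_thrd:
            # Shrinking: lower the marking threshold first.
            self.ecn_thrd = new_ecn
            self._commit()
            self.buf_thrd = new_buf_thrd
            self._commit()
        else:
            # Growing: raise the buffer limit first.
            self.buf_thrd = new_buf_thrd
            self._commit()
            self.ecn_thrd = new_ecn
            self._commit()
]]></sourcecode>
        <t>In the intermediate states the marking threshold is at most the value
        it would have after the update completes, so the transient error is a
        premature mark rather than an unnotified drop.</t>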
      </section>

      <section anchor="offset-tuning">
        <name>Tuning the Offset Parameter</name>
        <t>The Offset represents the network's required "shock absorber." Operators SHOULD
        calibrate the Offset to slightly exceed the expected Bandwidth-Delay Product (BDP)
        of the typical congestion control feedback loop:</t>
        <t>Offset ≈ Link_Rate * RTT / 8 (Offset in bytes, Link_Rate in bits per second,
        RTT in seconds)</t>
        <t>For example, in contemporary intra-data-center fabrics with 400 Gbps links and an
        RTT of 20 to 50 microseconds, the BDP is 1 MB to 2.5 MB, so Offset values in that
        range are operationally appropriate. Oversizing the Offset throttles flows
        prematurely; undersizing it invites high tail-drop rates despite ECN capability.</t>
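        <t>The BDP arithmetic can be checked with the short sketch below. Integer
        units are used to avoid rounding: 1 Gbps multiplied by 1 microsecond is
        1000 bits, or 125 bytes. The function names and the 10 percent default
        margin are assumptions made for this example, not recommendations of this
        document.</t>
        <sourcecode type="python"><![CDATA[
def bdp_bytes(link_rate_gbps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes.

    1 Gbps x 1 microsecond = 1000 bits = 125 bytes.
    """
    return link_rate_gbps * rtt_us * 125


def suggested_offset(link_rate_gbps: int, rtt_us: int,
                     margin_percent: int = 10) -> int:
    """BDP plus a small margin; the margin value is hypothetical."""
    return bdp_bytes(link_rate_gbps, rtt_us) * (100 + margin_percent) // 100
]]></sourcecode>
        <t>For a 400 Gbps link, bdp_bytes(400, 20) and bdp_bytes(400, 50) give the
        two ends of the BDP range for round-trip times of 20 and 50 microseconds.</t>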
      </section>

      <section anchor="floor-tuning">
        <name>Tuning the ECN_Floor Parameter</name>
        <t>ECN_Floor bounds how low ECN_Thrd can fall and therefore how aggressively
        senders can be throttled. It MUST NOT be configured
        smaller than the Maximum Transmission Unit (MTU) of the link (e.g., 9000 bytes). For
        environments running Data Center TCP (DCTCP) <xref target="RFC8257"/>, ECN_Floor
        SHOULD typically mirror the static thresholds recommended for shallow buffering
        (e.g., 30 KB to 100 KB), which prevents the queue from draining completely while
        maintaining very low queuing delay.</t>
      </section>
    </section>

    <section anchor="implementation-status">
      <name>Implementation Status</name>
      <t>[RFC Editor: Please remove this section before publication.]</t>
      <t>This section records the status of known implementations of the protocol 
      defined by this specification at the time of posting of this Internet-Draft, 
      and is based on a proposal described in RFC 7942. The description of implementations 
      in this section is intended to assist the IETF in its decision processes in progressing 
      drafts to RFCs.</t>
      <t>The dynamic ECN threshold coupling mechanism described in this document has been
      implemented and validated in the data plane of Centec Networks' switching silicon.
      The implementation is designed to mitigate micro-bursts and incast congestion in
      large-scale RDMA over Converged Ethernet (RoCEv2) deployments by China Mobile.</t>
    </section>

    <section anchor="related-work">
      <name>Related Work</name>
      <t>AQM recommendations generalized in <xref target="RFC7567"/> outline the complexities 
      of parameter tuning. While this document aligns with the intent of <xref target="RFC7567"/>, 
      it specifically isolates and resolves the intersection of AQM and dynamic shared buffering, 
      a domain not fully explored in legacy AQM guidelines.</t>

      <t>The AI-based ECN approach proposed in <xref target="I-D.zhuang-tsvwg-ai-ecn-for-dcn"/> 
      targets similar parameter adaptation via machine learning. The framework in this document, 
      conversely, advocates for a mathematically deterministic data-path calculation, demanding 
      no training data, no external control-plane telemetry loop, and zero inference latency.</t>

      <t>TCP Alternative Backoff with ECN (ABE) <xref target="RFC8511"/> optimizes how 
      endpoints react to CE marks. ABE is strictly complementary; it refines the sender 
      response, whereas this architecture ensures the network device generates those marks 
      at structurally correct moments.</t>
    </section>

    <section anchor="security">
      <name>Security Considerations</name>

      <t>This specification introduces an automated internal parameter coupling within 
      the network forwarding plane. It does not exchange new protocol messages across the wire, 
      thus introducing no new cryptographic or protocol-level attack surfaces.</t>

      <t>Operational Degradation via Misconfiguration: Invalid configuration of Offset or
      ECN_Floor can cause self-inflicted Denial of Service (DoS) behaviors. For instance, an
      excessively large Offset makes Region A unreachable, pinning ECN_Thrd at or below
      ECN_Floor and throttling throughput through premature marking, while an excessively
      large ECN_Floor pushes the system into Region C, effectively disabling
      early congestion warning. Implementations SHOULD validate parameter inputs through management
      interfaces and emit warnings if Offset or ECN_Floor exceeds typical buffer allocations.</t>
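      <t>A management-plane check along these lines might look like the following
      sketch. It is illustrative only: the function name, the message texts, and
      the max_buf_thrd input (the largest Buf_Thrd the buffer manager can assign
      to the queue) are assumptions, not an interface defined by this
      document.</t>
      <sourcecode type="python"><![CDATA[
from typing import List


def validate_parameters(offset: int, ecn_floor: int, mtu: int,
                        max_buf_thrd: int) -> List[str]:
    """Return human-readable findings; an empty list means no findings.

    All arguments are in bytes. max_buf_thrd is a hypothetical input
    giving the largest limit the buffer manager can assign.
    """
    findings = []
    if ecn_floor < mtu:
        findings.append("error: ECN_Floor is smaller than the link MTU")
    if max_buf_thrd <= ecn_floor:
        # Buf_Thrd can never exceed the floor: always Region C.
        findings.append("warning: ECN_Floor is not below the largest "
                        "Buf_Thrd; marking only occurs at the drop limit")
    elif max_buf_thrd <= ecn_floor + offset:
        # (Buf_Thrd - Offset) can never exceed the floor: never Region A.
        findings.append("warning: Offset is too large for Region A to be "
                        "reached; ECN_Thrd never rises above ECN_Floor")
    return findings
]]></sourcecode>
      <t>Returning findings rather than raising an error leaves the decision to
      reject or merely log a configuration to the management interface.</t>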

      <t>Internal Signaling Integrity: The architectural dependency between the memory 
      management unit (MMU) and the ECN marking engine requires deterministic internal signaling. 
      If the internal update of Buf_Thrd is delayed or corrupted under heavy system load, the ECN_Thrd 
      calculation will be based on stale memory constraints, leading to temporary periods of 
      over-marking or under-marking. Hardware designs SHOULD prioritize this internal signaling path.</t>

      <t>Buffer Exhaustion Vectors: Malicious, non-responsive flows could intentionally occupy 
      massive allocations of the shared buffer pool. In dynamic buffer architectures, this action 
      compresses the Buf_Thrd for all other benign queues, plunging them into Region B or Region C. 
      This is an inherent vulnerability of shared memory switches, not generated by this ECN algorithm. 
      Operators MUST utilize per-queue maximum caps, port-level QoS scheduling, and admission 
      control to insulate queues from cross-traffic buffer starvation.</t>
    </section>

    <section anchor="iana">
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>
  </middle>

  <back>
    <references>
      <name>Normative References</name>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
    </references>
    <references>
      <name>Informative References</name>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7567.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8033.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8257.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8289.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8511.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9330.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9331.xml"/>

      <reference anchor="I-D.zhuang-tsvwg-ai-ecn-for-dcn"
                 target="https://datatracker.ietf.org/doc/draft-zhuang-tsvwg-ai-ecn-for-dcn/">
        <front>
          <title>Artificial Intelligence (AI) based ECN adaptive
          reconfiguration for datacenter networks</title>
          <author initials="Y." surname="Zhuang"/>
          <author initials="B." surname="Zhang"/>
          <author initials="H." surname="Pan"/>
          <date year="2019" month="October"/>
        </front>
        <seriesInfo name="Internet-Draft"
                    value="draft-zhuang-tsvwg-ai-ecn-for-dcn-00"/>
      </reference>

      <reference anchor="I-D.ietf-tsvwg-aqm-dualq-coupled"
                 target="https://datatracker.ietf.org/doc/draft-ietf-tsvwg-aqm-dualq-coupled/">
        <front>
          <title>DualQ Coupled AQMs for Low Latency, Low Loss and
          Scalable Throughput (L4S)</title>
          <author initials="K." surname="De Schepper"/>
          <author initials="B." surname="Briscoe" role="editor"/>
          <date year="2024"/>
        </front>
        <seriesInfo name="Internet-Draft"
                    value="draft-ietf-tsvwg-aqm-dualq-coupled-24"/>
      </reference>
    </references>
  </back>
</rfc>
