<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp "&#160;">
]>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     ipr="trust200902"
     docName="draft-tian-ccwg-ibcs-datapath-processing-00"
     category="info"
     submissionType="IETF"
     consensus="true"
     version="3">

  <front>
    <title abbrev="IBCS Datapath Processing">
      Datapath Processing Architecture for In-Band Congestion Signaling (IBCS)
    </title>
    <seriesInfo name="Internet-Draft"
                value="draft-tian-ccwg-ibcs-datapath-processing-00"/>

    <author fullname="Yuchi Tian" initials="Y." surname="Tian">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>tianyuchi@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Weiqiang Cheng" initials="W." surname="Cheng">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>chengweiqiang@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Jin Yang" initials="J." surname="Yang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>yangjinwl@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>wangjj@centec.com</email>
      </address>
    </author>

    <author fullname="Guoying Zhang" initials="G." surname="Zhang">
      <organization>Centec</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>zhanggy@centec.com</email>
      </address>
    </author>

    <author fullname="Kan Zhang" initials="K." surname="Zhang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>zhangkan@chinamobile.com</email>
      </address>
    </author>

    <date year="2026" month="March" day="1"/>
    <area>Transport</area>
    <workgroup>Congestion Control Working Group</workgroup>

    <keyword>in-band congestion signaling</keyword>
    <keyword>datapath processing</keyword>
    <keyword>network element</keyword>
    <keyword>compare and replace</keyword>
    <keyword>CCWG</keyword>

    <abstract>
      <t>In-band congestion signaling protocols, such as Congestion Signaling (CSIG) 
      and High Precision Congestion Control (HPCC++), require intermediate Network 
      Elements (NEs) to actively parse scalar congestion metrics from packet headers, 
      evaluate them against local link states, and conditionally rewrite these fields 
      before transmission. To ensure end-to-end algorithmic consistency and avoid 
      unintended interactions with path selection (e.g., packet reordering), the 
      datapath of these NEs must adhere to a standardized logical processing model.</t>

      <t>This document defines the normative datapath processing architecture for 
      Network Elements participating in In-Band Congestion Signaling (IBCS). By 
      establishing abstract topological roles (Edge vs. Transit NEs) and standardizing 
      the "Compare-and-Replace" operational paradigm, this specification abstracts 
      the signal update logic from hardware-specific pipelines. It guarantees strict 
      orthogonality between congestion signaling and Equal-Cost Multi-Path (ECMP) 
      routing invariants, supporting diverse congestion metrics across multi-vendor 
      deployments.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="introduction">
      <name>Introduction</name>
      <t>Modern high-speed data center networks increasingly rely on fine-grained, 
      In-Band Congestion Signaling (IBCS) to achieve ultra-low latency and high throughput. 
      Protocols being discussed in the IETF, such as CSIG <xref target="I-D.ravi-ippm-csig"/> 
      and HPCC++ <xref target="I-D.miao-ccwg-hpcc"/>, utilize packet headers to convey 
      link-level congestion telemetry directly to end-hosts. A fundamental paradigm of these 
      proposals is the "Compare-and-Replace" operation: as a packet traverses the network, 
      each transit Network Element (NE) compares the congestion signal carried in the packet 
      against its own local congestion metric. If the local NE represents a more severe 
      bottleneck, it overwrites the signal field with its local metric.</t>

      <t>Unlike traditional stacking-based telemetry (such as IOAM <xref target="RFC9197"/>) 
      where metadata is appended hop-by-hop, the Compare-and-Replace paradigm maintains a 
      constant header size, avoiding Maximum Transmission Unit (MTU) exhaustion. However, 
      updating a packet header on-the-fly introduces significant architectural challenges for 
      datapath pipelines. If the processing behavior is not rigorously defined, modifying 
      packet fields can inadvertently change the input to hash-based load balancing (ECMP) and 
      cause packets of the same flow to be reordered. Furthermore, inconsistent state handling at 
      domain boundaries can allow spoofed or corrupted signals to reach the congestion control algorithm.</t>

      <t>This document specifies the normative datapath behavior and abstract processing model 
      required to support IBCS safely and efficiently. It introduces a role-based architecture 
      (differentiating edge initialization from transit evaluation) and specifies a protocol-agnostic 
      extremum evaluation model (e.g., evaluating minimum available bandwidth or maximum 
      queue delay). By establishing this unified architectural framework, this document aims to 
      ensure operational interoperability and robust signal delivery across heterogeneous network infrastructures.</t>
    </section>

    <section anchor="terminology">
      <name>Terminology</name>
      <dl>
        <dt>IBCS (In-Band Congestion Signaling):</dt>
        <dd>
          <t>A general mechanism where congestion state metrics are embedded within the 
          data packet header and dynamically updated by Network Elements along the forwarding path.</t>
        </dd>

        <dt>P_Metric (Packet Metric):</dt>
        <dd>
          <t>The congestion signal value currently carried within the packet header. It 
          represents the most severe bottleneck encountered so far on the packet's path.</t>
        </dd>

        <dt>L_Metric (Local Metric):</dt>
        <dd>
          <t>The locally computed congestion metric at the egress port of the NE processing the packet 
          (e.g., residual bandwidth, queue utilization, or link delay).</t>
        </dd>

        <dt>SUF (Signal Update Function):</dt>
        <dd>
          <t>The abstract logical entity within a Network Element's datapath responsible for 
          evaluating P_Metric against L_Metric and executing the conditional header rewrite.</t>
        </dd>

        <dt>Extremum Operator:</dt>
        <dd>
          <t>The mathematical comparison operator (MIN or MAX) dictated by the specific 
          signaling protocol's semantics to determine the tightest bottleneck.</t>
        </dd>
      </dl>

      <section anchor="req-lang">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
        "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",
        "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are
        to be interpreted as described in BCP 14
        <xref target="RFC2119"/> <xref target="RFC8174"/> when, and
        only when, they appear in all capitals, as shown here.</t>
      </section>
    </section>

    <section anchor="applicability">
      <name>Applicability Statement</name>
      <t>This document defines abstract datapath behaviors applicable to single administrative 
      domains (e.g., autonomous data center fabrics) deploying end-to-end congestion control 
      loops based on fixed-length, mutable in-band signals.</t>

      <t>This specification explicitly differentiates itself from In-situ Operations, 
      Administration, and Maintenance (IOAM) <xref target="RFC9197"/> and In-band Network Telemetry (INT) <xref target="P4-INT"/>. 
      While IOAM focuses on comprehensive visibility through metadata stacking (Trace Option), 
      the behavior described herein strictly addresses fixed-length "Compare-and-Replace" 
      updates designed specifically for fast-path congestion control algorithms, where per-hop 
      state history is discarded in favor of the path's bottleneck state.</t>
    </section>

    <section anchor="ne-roles">
      <name>Topological Roles and Boundary Behaviors</name>

      <t>To guarantee the integrity of the IBCS loop, Network Elements MUST apply different 
      processing rules depending on their topological placement relative to the signaling domain. 
      This document defines three distinct abstract NE roles:</t>

      <section anchor="ingress-edge">
        <name>IBCS Ingress Edge NE</name>
        <t>The Ingress Edge NE operates at the boundary where traffic enters the trusted 
        IBCS domain (e.g., a Top-of-Rack switch receiving traffic from a bare-metal server 
        or an untrusted tenant VM). </t>
        <t>The Ingress Edge NE MUST inspect arriving packets for existing IBCS fields. 
        To prevent signal spoofing attacks, it MUST act as a signal scrubber: any recognized 
        IBCS field arriving from an untrusted interface MUST be reset to its protocol-defined 
        <tt>UNINITIALIZED</tt> state before further processing. Only after initialization 
        may the packet be passed to the SUF for its first local evaluation.</t>
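        <t>As an illustration only, the scrubbing rule can be sketched in Python as follows. 
        The <tt>Packet</tt> model, the <tt>interface_trusted</tt> flag, and the use of 
        <tt>None</tt> to stand for <tt>UNINITIALIZED</tt> are assumptions made for this sketch; 
        they are not defined by this document or by any IBCS protocol.</t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass
from typing import Optional

# Assumption: None stands in for the protocol-defined UNINITIALIZED state.
UNINITIALIZED = None


@dataclass
class Packet:
    # Hypothetical packet model used only for this sketch.
    has_ibcs: bool                      # a recognized IBCS field is present
    p_metric: Optional[int] = UNINITIALIZED


def ingress_edge_scrub(pkt: Packet, interface_trusted: bool) -> Packet:
    """Reset a recognized IBCS field that arrives on an untrusted port."""
    if pkt.has_ibcs and not interface_trusted:
        pkt.p_metric = UNINITIALIZED
    return pkt


# A tenant-supplied value is discarded before the first SUF evaluation.
scrubbed = ingress_edge_scrub(Packet(has_ibcs=True, p_metric=5), False)
]]></sourcecode>
        <t>In this sketch a packet arriving on a trusted interface keeps its value, which 
        matches the Transit NE behavior described in the next section.</t>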
      </section>

      <section anchor="transit-ne">
        <name>IBCS Transit NE</name>
        <t>Transit NEs operate entirely within the trusted boundaries of the IBCS domain 
        (e.g., Spine or Core switches). Transit NEs implicitly trust the <tt>P_Metric</tt> 
        carried in the packet header.</t>
        <t>A Transit NE MUST NOT unconditionally reset or scrub the <tt>P_Metric</tt>. 
        Its sole responsibility regarding the IBCS field is to execute the strict Compare-and-Replace 
        logic defined in <xref target="compare-replace"/>, ensuring that the metric is only 
        overwritten if the local datapath represents a tighter bottleneck.</t>
      </section>

      <section anchor="egress-edge">
        <name>IBCS Egress Edge NE</name>
        <t>The Egress Edge NE operates at the boundary where traffic exits the trusted 
        IBCS domain. If the destination is outside the administrative domain and no explicit 
        IBCS peering agreement exists, the Egress Edge NE SHOULD strip or zero out the IBCS 
        field to prevent internal telemetry leakage to external observers.</t>
      </section>
    </section>

    <section anchor="architecture">
      <name>Abstract Datapath Processing Model</name>

      <t>Regardless of the physical hardware pipeline architecture (e.g., run-to-completion, 
      multi-stage ASIC, or programmable switch), the externally observable behavior of any 
      IBCS-enabled Network Element MUST conform to the following abstract sequence. This model 
      ensures that routing invariants are preserved.</t>

      <section anchor="ingress-parsing">
        <name>Phase 1: Header Resolution</name>
        <t>The datapath parses the designated IBCS header field to extract the current 
        <tt>P_Metric</tt>. If the NE does not recognize the protocol or the IBCS field 
        is absent, the packet MUST bypass all subsequent IBCS update logic and be forwarded 
        opaquely.</t>
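        <t>A minimal Python sketch of this phase is shown below. The 4-octet tag layout, the 
        <tt>IBCS_TAG_TYPE</tt> value, and the function name are hypothetical and chosen only to 
        make the example runnable; real IBCS protocols define their own encodings. Here a 
        return value of <tt>None</tt> means "no recognized IBCS field", in which case the packet 
        bypasses the SUF.</t>
        <sourcecode type="python"><![CDATA[
import struct
from typing import Optional

# All constants are hypothetical placeholders for this sketch.
IBCS_TAG_TYPE = 0x1BC5
TAG_FORMAT = "!HH"          # network byte order: 16-bit type, 16-bit metric
TAG_SIZE = struct.calcsize(TAG_FORMAT)


def resolve_header(tag_bytes: bytes) -> Optional[int]:
    """Return P_Metric, or None when the packet must bypass the SUF."""
    if len(tag_bytes) < TAG_SIZE:
        return None                      # field absent or truncated
    tag_type, metric = struct.unpack(TAG_FORMAT, tag_bytes[:TAG_SIZE])
    if tag_type != IBCS_TAG_TYPE:
        return None                      # protocol not recognized
    return metric


p_metric = resolve_header(struct.pack(TAG_FORMAT, IBCS_TAG_TYPE, 77))
]]></sourcecode>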
      </section>

      <section anchor="ecmp-invariance">
        <name>Phase 2: Strict Forwarding Orthogonality</name>
        <t>The packet undergoes routing, access control list (ACL) application, and Equal-Cost 
        Multi-Path (ECMP) or Link Aggregation Group (LAG) path selection. </t>
        
        <t>CRITICAL REQUIREMENT: The IBCS signal update process MUST be strictly orthogonal 
        to path selection. The datapath MUST NOT mutate the <tt>P_Metric</tt> or any related 
        congestion header fields prior to or during the hash computation phase. Altering 
        header values before ECMP hashing breaks flow invariance: packets within the same 
        microflow can be hashed onto different paths, and the resulting reordering degrades 
        transport performance.</t>
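        <t>The ordering constraint can be illustrated with the following Python sketch. It 
        assumes a 5-tuple hash key and uses CRC-32 as a stand-in hash; both choices, and all 
        names in the block, are assumptions for illustration rather than requirements of this 
        document. The point shown is that <tt>P_Metric</tt> never enters the hash input and is 
        only rewritten after the egress member has been chosen.</t>
        <sourcecode type="python"><![CDATA[
import zlib
from typing import List, NamedTuple, Tuple


class FlowKey(NamedTuple):
    # Hypothetical hash key: flow-invariant fields only, no IBCS field.
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


def select_ecmp_member(key: FlowKey, num_paths: int) -> int:
    """Pick a next-hop index from flow-invariant fields only."""
    digest = zlib.crc32(repr(tuple(key)).encode("ascii"))
    return digest % num_paths


def forward(key: FlowKey, p_metric: int,
            l_metrics: List[int]) -> Tuple[int, int]:
    port = select_ecmp_member(key, len(l_metrics))  # Phase 2: no mutation
    new_metric = min(p_metric, l_metrics[port])     # Phase 3: MIN update
    return port, new_metric


key = FlowKey("10.0.0.1", "10.0.1.9", 17, 49152, 4791)
per_port_metric = [40, 50, 60, 70]
port_a, metric_a = forward(key, 100, per_port_metric)
port_b, metric_b = forward(key, 10, per_port_metric)
]]></sourcecode>
        <t>Both calls select the same member even though the packets carry different 
        <tt>P_Metric</tt> values, because the metric is not part of <tt>FlowKey</tt>.</t>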
      </section>

      <section anchor="egress-update">
        <name>Phase 3: The Signal Update Function (SUF)</name>
        <t>Once the deterministic egress port is resolved, the Signal Update Function (SUF) 
        retrieves the real-time <tt>L_Metric</tt> specifically associated with that port. 
        The SUF evaluates <tt>P_Metric</tt> against <tt>L_Metric</tt> and conditionally 
        commits the update to the packet header. This mutation MUST be treated as an atomic 
        transaction applied immediately prior to serialization on the wire.</t>

        <figure anchor="fig-abstract-model">
          <name>Abstract Datapath Processing Model for IBCS</name>
          <artwork><![CDATA[
 +-------------+     +--------------------+     +---------------+
 |             |     | Routing, QoS, &    |     |               |
 |   Header    | --> | ECMP Path Selection| --> | Signal Update |--> Tx
 |  Resolution |     | (Hash Computation) |     | Function (SUF)|
 |             |     |                    |     |               |
 +-------------+     +--------------------+     +---------------+
   Extract             Strictly Orthogonal        Compare & Replace;
   P_Metric            (No Header Mutation)       Checksum Update
]]></artwork>
        </figure>
      </section>
    </section>

    <section anchor="compare-replace">
      <name>Normative Evaluation Rules (Compare-and-Replace)</name>

      <section anchor="generalized-logic">
        <name>Abstract Extremum Evaluation</name>
        <t>Different IBCS protocols characterize congestion semantics differently. For instance, 
        CSIG signals Minimum Available Bandwidth (requiring a MIN operator), whereas HPCC++ 
        may signal Maximum Queue Depth (requiring a MAX operator). The SUF MUST implement a 
        configurable <tt>Extremum_Operator</tt> to accommodate the semantics of the deployed protocol.</t>

        <t>The normative state-machine logic executed by the SUF is defined as follows:</t>

        <sourcecode type="pseudocode"><![CDATA[
Function SUF_Evaluate(P_Metric, L_Metric, Extremum_Operator):
    // Rule 1: Initialization Handling
    IF P_Metric == UNINITIALIZED:
        Rewrite packet header: P_Metric = L_Metric
        RETURN

    // Rule 2: Protocol-Specific Bottleneck Evaluation
    IF Extremum_Operator == MIN:
        IF L_Metric < P_Metric:
            Rewrite packet header: P_Metric = L_Metric
            
    ELSE IF Extremum_Operator == MAX:
        IF L_Metric > P_Metric:
            Rewrite packet header: P_Metric = L_Metric
    
    // Rule 3: Preservation
    // If local state is NOT the tighter bottleneck, 
    // the header MUST NOT be modified.
]]></sourcecode>
        <t>Atomicity: The rewrite operation MUST be atomic. Every transmitted packet carries either 
        the original <tt>P_Metric</tt> or the complete new value; partial byte updates or malformed 
        header emissions MUST NOT occur, even under extreme internal buffer exhaustion or 
        exception path processing.</t>
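        <t>For readers who want to experiment with the rules, the following is a non-normative 
        Python rendering of the pseudocode. It assumes integer metrics and represents 
        <tt>UNINITIALIZED</tt> as <tt>None</tt>; the helper <tt>path_signal</tt> is a hypothetical 
        addition that applies the SUF hop by hop along a path.</t>
        <sourcecode type="python"><![CDATA[
from enum import Enum
from typing import Iterable, Optional


class ExtremumOperator(Enum):
    MIN = "min"
    MAX = "max"


def suf_evaluate(p_metric: Optional[int], l_metric: int,
                 op: ExtremumOperator) -> Optional[int]:
    """Return the value the header carries after this hop."""
    # Rule 1: an UNINITIALIZED field (None here) takes the local value.
    if p_metric is None:
        return l_metric
    # Rule 2: overwrite only when the local state is the tighter bottleneck.
    if op is ExtremumOperator.MIN and l_metric < p_metric:
        return l_metric
    if op is ExtremumOperator.MAX and l_metric > p_metric:
        return l_metric
    # Rule 3: otherwise the header is left unmodified.
    return p_metric


def path_signal(l_metrics: Iterable[int],
                op: ExtremumOperator) -> Optional[int]:
    """Apply the SUF at each hop, starting from UNINITIALIZED."""
    p_metric = None
    for l_metric in l_metrics:
        p_metric = suf_evaluate(p_metric, l_metric, op)
    return p_metric


min_signal = path_signal([40, 10, 25], ExtremumOperator.MIN)
max_signal = path_signal([40, 10, 25], ExtremumOperator.MAX)
]]></sourcecode>
        <t>With per-hop local metrics of 40, 10, and 25, the MIN operator delivers 10 to the 
        receiver and the MAX operator delivers 40.</t>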
      </section>

      <section anchor="checksum">
        <name>Checksum and Integrity Implications</name>
        <t>If the IBCS field is encapsulated within an IPv4 or UDP header, the SUF MUST 
        update the corresponding Layer 3 / Layer 4 checksums. To achieve line-rate processing 
        without introducing significant latency jitter, incremental checksum calculation 
        <xref target="RFC1141"/> is RECOMMENDED.</t>
        <t>If the IBCS field is embedded in a Layer 2 extension or a custom tag (as commonly 
        deployed in closed data center fabrics), no IP or UDP checksum update is required, 
        substantially reducing the silicon processing overhead.</t>
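        <t>Where a checksum does cover the IBCS field, an incremental update for a single 
        rewritten 16-bit word can be sketched as below. The sketch uses the one's-complement 
        identity HC' = ~(~HC + ~m + m'), the formulation given in RFC 1624; choosing that variant, 
        the sample header words, and the assumption that <tt>P_Metric</tt> occupies exactly one 
        aligned 16-bit word are all illustrative.</t>
        <sourcecode type="python"><![CDATA[
def ones_complement_sum(words):
    """16-bit one's-complement sum with end-around carry."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total


def full_checksum(words):
    """Checksum recomputed from scratch over all 16-bit words."""
    return ~ones_complement_sum(words) & 0xFFFF


def incremental_checksum(old_checksum, old_word, new_word):
    """HC' = ~(~HC + ~m + m') for one rewritten 16-bit word."""
    total = ones_complement_sum(
        [~old_checksum & 0xFFFF, ~old_word & 0xFFFF, new_word]
    )
    return ~total & 0xFFFF


# Hypothetical covered words; index 3 plays the role of a 16-bit P_Metric.
header = [0x4500, 0x0030, 0x4422, 0x0190, 0x8006, 0x8C7C, 0x19AC]
old_csum = full_checksum(header)

updated = list(header)
updated[3] = 0x0064                      # SUF rewrites P_Metric
new_csum = incremental_checksum(old_csum, header[3], updated[3])
]]></sourcecode>
        <t>The incremental result matches a full recomputation over the updated words while 
        touching only the old checksum and the old and new field values.</t>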
      </section>
    </section>

    <section anchor="operational">
      <name>Operational Considerations</name>

      <section anchor="metric-stability">
        <name>L_Metric Stability and Sampling Frequencies</name>
        <t>The stability of the congestion control loop is inherently tied to how <tt>L_Metric</tt> 
        is generated. While the specific hardware counter implementation is out of scope for this 
        document, the NE MUST derive <tt>L_Metric</tt> from smoothed measurements so that the value is 
        not dominated by instantaneous micro-burst noise. </t>
        <t>Network Elements SHOULD provide a configurable moving average or sampling window for <tt>L_Metric</tt>. 
        The optimal sampling interval typically corresponds to the baseline Round-Trip Time (RTT) 
        of the network domain (e.g., 5 to 50 microseconds). Sampling too frequently causes 
        signal oscillation; sampling too slowly creates stale telemetry that dampens transport responsiveness.</t>
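        <t>One possible smoothing approach is an exponentially weighted moving average (EWMA), 
        sketched below. This document does not mandate EWMA; the class name, the 
        <tt>alpha</tt> parameter, and the sample values are assumptions for illustration. A 
        smaller <tt>alpha</tt> corresponds to a longer effective averaging window.</t>
        <sourcecode type="python"><![CDATA[
class SmoothedMetric:
    """Exponentially weighted moving average of raw per-port samples."""

    def __init__(self, alpha: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha      # weight given to the newest sample
        self.value = None       # no sample seen yet

    def update(self, sample: float) -> float:
        """Fold one raw sample into the average and return L_Metric."""
        if self.value is None:
            self.value = sample
        else:
            self.value = (self.alpha * sample
                          + (1.0 - self.alpha) * self.value)
        return self.value


queue_depth = SmoothedMetric(alpha=0.25)
first = queue_depth.update(10.0)    # first sample seeds the average
second = queue_depth.update(50.0)   # a burst moves the average only partway
]]></sourcecode>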
      </section>

      <section anchor="fail-open">
        <name>Fail-Open Capability</name>
        <t>If a Network Element experiences an internal architectural fault where the real-time 
        <tt>L_Metric</tt> from the egress port becomes temporarily unavailable to the SUF, 
        the NE MUST NOT drop the packet. Instead, it MUST execute a fail-open behavior, 
        forwarding the packet with the existing <tt>P_Metric</tt> completely unmodified. 
        This guarantees that transient local datapath faults do not sever the end-to-end signaling loop.</t>
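        <t>The fail-open rule can be sketched as a guard in front of the normal update. In the 
        Python sketch below, a MIN-type metric is assumed, <tt>None</tt> for 
        <tt>l_metric</tt> models an unavailable local metric, and <tt>None</tt> for 
        <tt>p_metric</tt> models <tt>UNINITIALIZED</tt>; these conventions and the function 
        name are hypothetical.</t>
        <sourcecode type="python"><![CDATA[
from typing import Optional


def suf_with_fail_open(p_metric: Optional[int],
                       l_metric: Optional[int]) -> Optional[int]:
    """MIN-type SUF that forwards P_Metric untouched on a local fault."""
    if l_metric is None:
        return p_metric          # fail open: forward without modification
    if p_metric is None:
        return l_metric          # UNINITIALIZED takes the local value
    return min(p_metric, l_metric)


unchanged = suf_with_fail_open(30, None)   # L_Metric unavailable
updated = suf_with_fail_open(30, 10)       # normal Compare-and-Replace
]]></sourcecode>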
      </section>
    </section>

    <section anchor="security">
      <name>Security Considerations</name>
      <t>TBD</t>
    </section>

    <section anchor="iana">
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>
  </middle>

  <back>
    <references>
      <name>Normative References</name>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
    </references>
    <references>
      <name>Informative References</name>

      <reference anchor="I-D.ravi-ippm-csig"
                 target="https://datatracker.ietf.org/doc/draft-ravi-ippm-csig/">
        <front>
          <title>Congestion Signaling (CSIG)</title>
          <author initials="A." surname="Ravi"/>
          <author initials="N." surname="Dukkipati"/>
          <author initials="N." surname="Mehta"/>
          <author initials="J." surname="Kumar"/>
          <date year="2024" month="February"/>
        </front>
        <seriesInfo name="Internet-Draft"
                    value="draft-ravi-ippm-csig-01"/>
      </reference>

      <reference anchor="I-D.miao-ccwg-hpcc"
                 target="https://datatracker.ietf.org/doc/draft-miao-ccwg-hpcc/">
        <front>
          <title>HPCC++: Enhanced High Precision Congestion Control</title>
          <author initials="R." surname="Miao"/>
          <date year="2025" month="January"/>
        </front>
        <seriesInfo name="Internet-Draft"
                    value="draft-miao-ccwg-hpcc-03"/>
      </reference>

      <reference anchor="P4-INT"
                 target="https://github.com/p4lang/p4-applications/blob/master/docs/INT_v2_0.pdf">
        <front>
          <title>In-band Network Telemetry (INT) Dataplane Specification, v2.0</title>
          <author><organization>P4.org</organization></author>
          <date year="2020" month="February"/>
        </front>
      </reference>

      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.1141.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4302.xml"/>
      <xi:include
        href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9197.xml"/>
    </references>
  </back>
</rfc>
