SRv6 Behavior Extension for Flow Control in WAN

Introduction

With the continuing emergence of AI workloads such as collaborative training and distributed inference, communication between data centers over the RDMA (Remote Direct Memory Access) protocol has become a new requirement. Because RDMA is highly sensitive to packet loss, Priority-based Flow Control (PFC) is widely deployed in RoCEv2 networks within data centers. The PFC workflow is shown in Figure 1 and consists of the following steps:

Figure 1: Diagram of PFC Working Mechanism

a. Devices supporting PFC have multiple priority queues on the transmit interface, and the receive interface has an equal number of receive buffers.

b. When a receive buffer on a downstream device (such as Device B) becomes congested, that is, the queue buffer fills quickly and exceeds a certain threshold (such as 1/2 or 3/4 of the port queue buffer), the back-pressure mechanism is triggered.

c. Device B detects the congestion and sends a back-pressure signal "STOP" to the upstream device (Device A), that is, towards the direction from which the data enters.

d. After receiving the back-pressure signal, the upstream device (Device A) stops sending packets of the priority queue indicated by the signal and stores the data in its local interface buffer. If the local interface buffer of Device A also exceeds the threshold, Device A in turn applies back-pressure to its own upstream device.

e. When the congestion of the receive buffer is alleviated, that is, the used buffer of the queue falls below the PFC threshold, the receiving device (Device B) sends a PFC message that cancels the back-pressure, notifying the upstream device to resume the traffic of the corresponding priority queue.
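Steps b to e can be illustrated with a small Python sketch of one receive queue on Device B. This is only a sketch under stated assumptions: the class name, the byte counts, and the use of a separate, lower resume threshold (hysteresis) are illustrative choices and are not specified by this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PfcQueue:
    """Receive buffer of one priority queue on the downstream device (Device B)."""
    capacity: int               # buffer size in bytes (hypothetical value)
    stop_ratio: float = 0.75    # pause threshold, e.g. 3/4 of the queue buffer
    resume_ratio: float = 0.5   # resume threshold (assumption: lower than stop_ratio)
    used: int = 0
    paused: bool = False        # True while the upstream device is held in STOP state

    def enqueue(self, nbytes: int) -> Optional[str]:
        """Account for arriving bytes; return "STOP" when the threshold is crossed."""
        # Bytes beyond the capacity are simply not stored in this sketch.
        self.used = min(self.capacity, self.used + nbytes)
        if not self.paused and self.used > self.capacity * self.stop_ratio:
            self.paused = True
            return "STOP"
        return None

    def dequeue(self, nbytes: int) -> Optional[str]:
        """Account for drained bytes; return "RESUME" once congestion is relieved."""
        self.used = max(0, self.used - nbytes)
        if self.paused and self.used < self.capacity * self.resume_ratio:
            self.paused = False
            return "RESUME"
        return None
```

A signal is returned only on a state change, so Device A receives one "STOP" when the threshold is crossed and one "RESUME" when the buffer has drained, rather than one message per packet.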

Requirements

In the scenario of cross-data center communication, back-pressure frames may need to be propagated across wide area networks. The transmission conditions of wide area networks are much more complicated than those of data center networks, which imposes the following constraints.

a. Tenant-Granular Back-pressure

In the wide area network scenario, a physical link may carry the services of multiple tenants simultaneously. To prevent the traffic of different tenants from affecting each other, back-pressure signaling should support tenant-level granularity. This can be achieved by leveraging SRv6 [RFC 8986] and Enhanced VPN [draft-ietf-spring-sr-for-enhanced-vpn-10].

b. Legacy Device Constraints

There is a wide variety of devices in the wide area network, and many of them do not support congestion notification. Upgrading all the equipment is uneconomical and difficult to implement. As a result, in many scenarios back-pressure packets cannot be transmitted hop by hop as in the data center network, and a more flexible method for conveying congestion signals is needed.

Take Figure 2 as an example. The direction of traffic is R1 -> R2 -> R3 -> R4 -> R5. R1 and R5 support the generation and processing of back-pressure signals, while R2, R3, and R4 do not. When congestion occurs on the interface between R5 and DC2, packet loss can be avoided if the back-pressure signal reaches the corresponding interface of R1 in a timely manner. Therefore, a mechanism is needed that enables R5 to learn which upstream devices and interfaces on the path of the current traffic support the processing of back-pressure signals.
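The Figure 2 example can be sketched in Python as a lookup of the closest upstream node that can process back-pressure signals. The function name and the representation of the path as a list of node names are assumptions made for illustration; how R5 actually learns the capability set is the subject of the rest of this document.

```python
def nearest_pfc_upstream(path, pfc_capable, local):
    """Return the closest node upstream of `local` that supports back-pressure.

    path        -- node names in the direction of traffic
    pfc_capable -- set of node names that generate/process back-pressure signals
    local       -- the congested node that wants to signal upstream
    """
    idx = path.index(local)
    # Walk from the node just before `local` back towards the traffic source.
    for node in reversed(path[:idx]):
        if node in pfc_capable:
            return node
    return None


path = ["R1", "R2", "R3", "R4", "R5"]
pfc_capable = {"R1", "R5"}
print(nearest_pfc_upstream(path, pfc_capable, "R5"))  # R1
```

R2, R3, and R4 are skipped because they cannot act on the signal, which is why the signal has to be carried across them rather than relayed hop by hop.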

Figure 2: Topology of the Cross-DC WAN Network

Conventions and Definitions

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

End.X.PFC Behavior

The "Endpoint with L3 cross-connect and Priority-based Flow Control" behavior (abbreviated as "End.X.PFC") is a variant of the End.X behavior defined in [RFC 8986]. In addition to the original End.X semantics, it identifies an interface in the wide area network that is capable of handling PFC packets. Advertising this information across the network makes it easier for other devices and controllers to implement traffic control strategies.

The behavior of a device upon receiving an End.X.PFC packet falls into two cases:

Case 1: When the device acts as an intermediate node on the SRv6 path and the End.X.PFC SID is not the final SID in the segment list, End.X.PFC has the same forwarding characteristics as End.X, that is, it forwards the traffic to the next hop through the specified L3 adjacency interface.

Case 2: When the device is the final destination node of the SRv6 tunnel for PFC back-pressure signals and the End.X.PFC SID is the final SID in the segment list, End.X.PFC parses the inner payload of the encapsulated packet, extracts the PFC back-pressure frame, and executes the corresponding PFC flow control action according to the content of the frame.

When N receives a packet destined to S and S is a local End.X.PFC SID, N does the following:

   ...
   S09.   If ((Last Entry > max_LE) or (Segments Left > Last Entry+1)) {
   S10.      Send an ICMP Parameter Problem to the Source Address
                with Code 0 (Erroneous header field encountered)
                and Pointer set to the Segments Left field,
                interrupt packet processing, and discard the packet.
   S11.   }
   S12.   Decrement IPv6 Hop Limit by 1
   S13.   Decrement Segments Left by 1
   S14.   Update IPv6 DA with Segment List[Segments Left]
   S15.   Submit the packet to the IPv6 module for transmission
             to the new destination via interface J
   S16. }

When processing the Upper-Layer header of a packet matching a FIB entry locally instantiated as an End.X.PFC SID, N does the following:
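As a non-normative illustration only, and not as the Upper-Layer processing steps referred to above, the two cases of this section can be sketched in Python. The class, the function, and the two callbacks are hypothetical names introduced for the example. Hop Limit handling and the SRH validity checks are omitted, and the sketch assumes that "final SID" corresponds to Segments Left being 0.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Srv6Packet:
    segment_list: List[str]   # SRH Segment List; per RFC 8754, index 0 is the last segment
    segments_left: int        # SRH Segments Left field
    dst: str                  # IPv6 destination address
    payload: bytes = b""      # inner payload (an encapsulated PFC frame in case 2)


def process_end_x_pfc(pkt: Srv6Packet,
                      forward_via_adjacency: Callable[[Srv6Packet], None],
                      apply_pfc: Callable[[bytes], None]) -> str:
    """Dispatch a packet whose destination is a local End.X.PFC SID."""
    if pkt.segments_left > 0:
        # Case 1: not the final SID, so behave like End.X.
        pkt.segments_left -= 1
        pkt.dst = pkt.segment_list[pkt.segments_left]
        forward_via_adjacency(pkt)
        return "forwarded"
    # Case 2: final SID, so hand the inner PFC frame to the local
    # flow control logic instead of forwarding the packet.
    apply_pfc(pkt.payload)
    return "pfc-applied"


# Symbolic SID names as used in this document; real SIDs are IPv6 addresses.
sent, applied = [], []
data = Srv6Packet(["R3.End", "R2.End", "R1.End.X.PFC"], 2, "R1.End.X.PFC")
process_end_x_pfc(data, sent.append, applied.append)   # case 1 at R1
signal = Srv6Packet(["R1.End.X.PFC", "R2.End", "R3.End"], 0, "R1.End.X.PFC", b"pfc-frame")
process_end_x_pfc(signal, sent.append, applied.append)  # case 2 at R1
```

The same SID therefore serves both as an ordinary cross-connect for data traffic and as the termination point for tunnelled PFC frames.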

Using End.X.PFC Behavior For Congestion Notification

In the topology shown in Figure 3, it is assumed that the edge devices R1 and R3 of the wide area network support the processing of PFC (Priority-based Flow Control) frames, while R2 does not. R1 and R3 can instantiate the End.X.PFC behavior locally and advertise it via IGP and BGP-LS.

Figure 3: Using the End.X.PFC Behaviour for Flow Control

Assume that a tenant needs to transmit data over the RDMA protocol from DC1 to DC2. The operator can deploy an L3VPN over SRv6 Policy service in the wide area network to carry this traffic. When congestion occurs at the interface between R3 and DC2, or when R3 receives a PFC back-pressure signal from DC2 and its buffer exceeds the configured threshold, R3 needs to propagate a back-pressure signal upstream. There are two scenarios for how R3 can deliver the back-pressure signal to the corresponding interface on R1:

Scenario 1: Controller-Pre-deployed Tunnels for PFC Back-pressure Frames

The controller first computes an SRv6 Policy {R1.End.X.PFC, R2.End, R3.End} between R1 and R3. After the SRv6 Policy is provisioned and programmed, the controller examines the nodes involved in the Policy and determines that R1 and R3 are PFC-capable devices. The controller can identify the upstream-downstream relationship between R1 and R3 from the segment list, and then pre-deploys an SRv6 tunnel from R3 to R1 with a segment list of {R3.End, R2.End, R1.End.X.PFC}. When R3 generates a PFC back-pressure frame, the frame is encapsulated into this tunnel. When the packet reaches the final hop R1, R1 processes the End.X.PFC behavior. Similarly, the controller pre-deploys a reverse tunnel from R1 to R3 for carrying PFC frames.

To improve the orchestration efficiency of the controller, a new End.PFC behavior (e.g., R3.End.PFC), similar to End.X.PFC, can be defined to identify nodes with PFC capability. With this behavior, the controller can more easily identify the nodes along the path that require reverse tunnels.

Scenario 2: Device-Auto-triggered Tunnel Creation for PFC Back-pressure Frames

When R3 (a PFC-capable node) receives the first data packet of an SRv6 Policy, it analyzes the segment list in the SRH (e.g., {R1.End.X.PFC, R2.End, R3.End}). Upon detecting an End.X.PFC behavior in the upstream path, R3 dynamically creates a reverse tunnel with a segment list of {R3.End, R2.End, R1.End.X.PFC} to carry PFC frames.

A reverse tunnel can be based on an SRv6 Policy or on SRv6 BE. In the SRv6 BE scenario, it is only required to place R1.End.X.PFC in the IPv6 destination address of the packet.

Note that, in order to avoid packet loss on devices that do not support PFC, network slicing technology can be used. Tenant-level slices can be deployed on the interfaces traversed by the SRv6 path to provide independent queues and bandwidth resources. If slicing is used, the information of the reverse tunnel should also include the corresponding slicing information, such as the slice ID.
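The derivation of the reverse tunnel in Scenario 2 can be sketched as follows. This is a minimal sketch under assumptions: the function name and the `mode` parameter are hypothetical, SIDs are written with the symbolic names used in this document, and the set of End.X.PFC SIDs is assumed to have been learned from IGP or BGP-LS advertisements.

```python
def build_reverse_tunnel(forward_sids, local_sid, pfc_sids, mode="policy"):
    """Derive the segment list of the reverse tunnel for PFC frames.

    forward_sids -- SIDs of the forward path in traversal order
    local_sid    -- SID of the local (downstream, PFC-capable) node
    pfc_sids     -- set of SIDs known to be End.X.PFC SIDs
    mode         -- "policy" for an SRv6 Policy, "be" for SRv6 BE
    """
    idx = forward_sids.index(local_sid)
    # Search upstream of the local node, nearest SID first.
    for i in range(idx - 1, -1, -1):
        if forward_sids[i] in pfc_sids:
            if mode == "be":
                # SRv6 BE: the End.X.PFC SID is used as the destination address.
                return [forward_sids[i]]
            # SRv6 Policy: retrace the forward path in reverse order.
            return [local_sid] + forward_sids[i:idx][::-1]
    return None  # no PFC-capable upstream node on this path


forward = ["R1.End.X.PFC", "R2.End", "R3.End"]
print(build_reverse_tunnel(forward, "R3.End", {"R1.End.X.PFC"}))
# ['R3.End', 'R2.End', 'R1.End.X.PFC']
print(build_reverse_tunnel(forward, "R3.End", {"R1.End.X.PFC"}, mode="be"))
# ['R1.End.X.PFC']
```

If slicing is used, the slice ID would be carried alongside the returned segment list; that part is left out of the sketch.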

OAM Considerations

Network quality probing is necessary between the two devices that exchange back-pressure signals. As illustrated in Figure 3, the downstream node is required to probe, in real time, the reachability and network status of the relevant behaviours on the upstream node. This capability enables the downstream node to dynamically adjust its own watermarks and the size of its reserved buffers.
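One possible way to turn a probing result into a watermark is sketched below. It assumes that, after the downstream node sends a back-pressure signal, the upstream node can keep delivering traffic for roughly one probed round-trip time, so the reserved buffer should hold about one bandwidth-delay product. This sizing rule, the function names, and the example values are illustrative assumptions and are not defined by this document.

```python
def reserved_buffer_bytes(link_rate_bps: int, rtt_us: int, margin_bytes: int = 0) -> int:
    """Bytes that may still arrive after the back-pressure signal is sent.

    Uses integer arithmetic: bits/s * microseconds / (8 bits * 1e6 us/s).
    """
    return link_rate_bps * rtt_us // 8_000_000 + margin_bytes


def stop_watermark_bytes(buffer_bytes: int, reserved_bytes: int) -> int:
    """Queue depth at which the back-pressure signal should be sent."""
    # A result of 0 means the buffer cannot absorb the in-flight data at all.
    return max(0, buffer_bytes - reserved_bytes)


# Hypothetical example: 100 Gbit/s link, 1 ms probed RTT, 32 MB of queue buffer.
reserved = reserved_buffer_bytes(100_000_000_000, 1_000)
print(reserved)                                    # 12500000
print(stop_watermark_bytes(32_000_000, reserved))  # 19500000
```

A longer probed round-trip time lowers the watermark, so the downstream node signals earlier when the upstream node is farther away.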

Security Considerations

The security considerations of SRv6 in [RFC 8754] apply to this document.

IANA Considerations

This document defines a new SRv6 Endpoint behavior called End.X.PFC. IANA is requested to allocate four new code points from the "SRv6 Endpoint Behaviors" sub-registry in the "Segment-routing with IPv6 data plane (SRv6) Parameters" registry: