<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category="info" docName="draft-song-dmsc-problem-statement-00" ipr="trust200902" obsoletes="" updates="" submissionType="IETF" xml:lang="en" tocInclude="true" tocDepth="3" symRefs="true" sortRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.31.0 -->
  <front>
    <title abbrev="DMSC Problem Statement">
      Problem Statement and Requirements for Dynamic Multi-agent Secured Collaboration (DMSC)
    </title>
    <seriesInfo name="Internet-Draft" value="draft-song-dmsc-problem-statement-00"/>
    <author fullname="Enge Song" initials="E" surname="Song">
      <organization>Alibaba Cloud</organization>
      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology Park</street>
          <city>Beijing</city>
          <region/>
          <code>100124</code>
          <country>China</country>
        </postal>
        <email>enge.seg@alibaba-inc.com</email>
      </address>
    </author>
    <author fullname="Yang Song" initials="Y" surname="Song">
      <organization>Alibaba Cloud</organization>
      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology Park</street>
          <city>Beijing</city>
          <region/>
          <code>100124</code>
          <country>China</country>
        </postal>
        <email>song288954@alibaba-inc.com</email>
      </address>
    </author>
    <author fullname="Shaokai Zhang" initials="S" surname="Zhang">
      <organization>Alibaba Cloud</organization>
      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology Park</street>
          <city>Beijing</city>
          <region/>
          <code>100124</code>
          <country>China</country>
        </postal>
        <email>shaokai.zsk@alibaba-inc.com</email>
      </address>
    </author>
    <author fullname="Xing Li" initials="X" surname="Li">
      <organization>Alibaba Cloud</organization>
      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology Park</street>
          <city>Beijing</city>
          <region/>
          <code>100124</code>
          <country>China</country>
        </postal>
        <email>lixing.lix@aliyun-inc.com</email>
      </address>
    </author>
    <author fullname="Jiangu Zhao" initials="J" surname="Zhao">
      <organization>Alibaba Cloud</organization>
      <address>
        <postal>
          <street>Alibaba Beijing Chaoyang Science &amp; Technology Park</street>
          <city>Beijing</city>
          <region/>
          <code>100124</code>
          <country>China</country>
        </postal>
        <email>jiangu.zjg@alibaba-inc.com</email>
      </address>
    </author>
    <date year="2026" month="3" day="2"/>
    <abstract>
      <t>Current LLM-based AI agent systems require each agent to implement its own communication capabilities (e.g., service discovery, encryption) and collaboration logic (e.g., task delegation decisions), leading to code bloat, security risks, and inefficient resource usage in cloud-native and hybrid-cloud deployments. This fragmentation impedes scalable multi-agent application development, especially in multi-tenant scenarios where inconsistent security policies and cross-domain connectivity barriers arise. This document analyzes these challenges and proposes requirements for a Dynamic Multi-agent Secured Collaboration (DMSC) infrastructure. DMSC uses a centralized gateway layer to offload secured communication, cross-domain connectivity, multi-tenant policy enforcement, and dynamic collaboration assistance, so that developers can focus on core agent functionality while the infrastructure provides consistent security, interoperability, and operational efficiency across heterogeneous environments.</t>
    </abstract>
    <note>
      <name>Requirements Language</name>
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 <xref target="RFC2119" format="default"/> <xref target="RFC8174" format="default"/> when, and only when, they appear in all capitals, as shown here.</t>
    </note>
  </front>
  <middle>
    <section numbered="true" toc="default">
      <name>Introduction</name>
      <t>The proliferation of LLM-based AI agents in cloud-native and hybrid-cloud environments has intensified the need for robust, scalable inter-agent collaboration. Current frameworks (e.g., AutoGen <xref target="AutoGen" format="default"/> and CrewAI <xref target="CrewAI" format="default"/>) require developers to embed communication protocols, service discovery mechanisms, and rudimentary collaboration logic directly into agent codebases. This tight coupling increases cognitive load, introduces security vulnerabilities through inconsistent implementations, and hinders scalability in multi-tenant deployments. For instance, an e-commerce agent system spanning a public cloud (for customer interaction) and private data centers (for sensitive inventory) must manually handle network tunneling, certificate rotation, and capability matching across domains.</t>
      <t>Dynamic Multi-agent Secured Collaboration (DMSC) proposes a dedicated infrastructure layer that decouples communication and collaboration concerns from agent business logic. A centralized gateway handles:</t>
      <ol spacing="normal" type="1">
        <li>secured transport (end-to-end encryption, mutual TLS termination);</li>
        <li>cross-domain network bridging (protocol translation, firewall traversal);</li>
        <li>multi-tenant policy enforcement (isolation, rate limiting); and</li>
        <li>dynamic collaboration assistance (capability-based routing, load-aware delegation).</li>
      </ol>
      <t>By offloading these concerns, DMSC enables agents to remain lightweight, accelerates development cycles, and ensures a consistent security posture across heterogeneous deployments.</t>
    </section>
    <section anchor="problems" numbered="true" toc="default">
      <name>Problem Statements</name>
      <section anchor="prob_coupling" numbered="true" toc="default">
        <name>Tight Coupling of Collaboration and Communication Logic</name>
        <t>Agents must implement service registration, discovery, retry mechanisms, and dynamic task delegation decisions (e.g., "which agent handles this ambiguous query?"). This increases development complexity and the likelihood of errors <xref target="AutoGen" format="default"/>. In practice, developers spend up to 40% of implementation effort on communication plumbing rather than core agent logic. For example, an agent handling customer support queries must embed logic to discover available "billing" or "technical" specialist agents, check their current load, and retry failed delegations. Duplicating this logic across agent systems leads to inconsistent behavior, version skew during updates, and higher maintenance costs. Furthermore, embedding capability-matching logic within agents prevents centralized optimization (e.g., global load balancing across agent pools).</t>
      </section>
      <section anchor="prob_security" numbered="true" toc="default">
        <name>Security Fragmentation</name>
        <t>Each agent independently handles encryption, certificate management, and authentication checks. Inconsistent implementations create security gaps, especially in multi-tenant hybrid-cloud deployments. A survey of 15 open-source agent frameworks revealed 7 distinct TLS configuration patterns, with 30% lacking certificate pinning and 25% using hardcoded credentials. In cross-organization collaborations (e.g., healthcare agents sharing anonymized data across institutions), fragmented security enforcement complicates compliance with regulations such as HIPAA or GDPR. Agents deployed at edge locations (e.g., IoT devices) often lack the resources for robust cryptographic operations, forcing trade-offs between security and performance. Because centralized security policy management is absent, audit trails are scattered across individual agent logs and incident response is correspondingly fragmented.</t>
      </section>
      <section anchor="prob_multitenant" numbered="true" toc="default">
        <name>Inefficient Multi-Tenant Management</name>
        <t>Configuring tenant isolation policies and cross-cloud connectivity requires manual, error-prone updates across all agents. Centralized policy enforcement is lacking. In a SaaS platform hosting 100+ enterprise tenants, each tenant's agents require unique network policies (e.g., "Tenant A agents cannot communicate with Tenant B"). Today, these policies are hardcoded into agent configurations or managed via fragile external scripts. During tenant onboarding/offboarding, operators must update every agent instance—a process taking hours and risking configuration drift. Cross-cloud scenarios (e.g., agents in AWS communicating with agents in Azure) exacerbate this: network security groups, DNS mappings, and certificate trust stores must be synchronized manually. A single misconfiguration can lead to data leakage or service disruption, as observed in 12% of multi-tenant agent deployments per industry incident reports.</t>
      </section>
      <section anchor="prob_collab" numbered="true" toc="default">
        <name>Lack of Dynamic Collaboration Assistance</name>
        <t>When an agent cannot resolve a task, it must implement custom logic to select collaborators. This leads to duplicated effort and suboptimal routing across agent systems. Current approaches range from static routing tables (inflexible to agent churn) to broadcast queries (inefficient at scale). For instance, a legal research agent receiving a query about "EU data privacy laws" must independently determine whether to delegate to a "GDPR specialist" or "Schrems II expert" agent—without visibility into their current workload, expertise depth, or availability. This results in uneven load distribution (some agents overloaded while others idle) and degraded user experience due to latency from sequential delegation attempts. Without infrastructure-level capability indexing and real-time health monitoring, agents cannot leverage global context for optimal collaboration decisions.</t>
      </section>
    </section>
    <section anchor="requirements" numbered="true" toc="default">
      <name>Requirements for DMSC</name>
      <section anchor="req_nonintrusive" numbered="true" toc="default">
        <name>Non-Intrusive Agent Integration</name>
        <t>Agents communicate via standard protocols (HTTP/gRPC); traffic interception MUST be transparent (e.g., iptables, eBPF). Agent code modification MUST NOT be required. The infrastructure SHOULD support zero-trust onboarding where agents register capabilities via secure metadata endpoints without embedded SDKs. For legacy agents, protocol adapters (e.g., REST-to-gRPC translators) MAY be deployed at the gateway to normalize communication. This ensures seamless adoption across greenfield and brownfield agent deployments while preserving developer autonomy over agent implementation languages and frameworks.</t>
      </section>
      <section anchor="req_gateway" numbered="true" toc="default">
        <name>Centralized Gateway for Secured Collaboration</name>
        <t>The gateway MUST handle:</t>
        <ul spacing="normal">
          <li>service discovery with real-time health monitoring;</li>
          <li>mutual TLS termination and certificate lifecycle management;</li>
          <li>tenant-scoped policy enforcement (network isolation, rate limiting, data tagging);</li>
          <li>cross-domain protocol bridging (e.g., HTTP/2 to MQTT translation for edge agents); and</li>
          <li>dynamic collaboration assistance, including capability-based routing (matching query intent to agent expertise metadata), load-aware delegation, and circuit breaking for failed agents.</li>
        </ul>
        <t>The gateway SHOULD maintain a global capability registry indexed by semantic tags (e.g., "finance", "low-latency") and update routing decisions based on real-time metrics (e.g., CPU load, queue depth). This transforms the gateway from a passive proxy into an active collaboration orchestrator.</t>
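        <t>As a non-normative illustration, capability-based routing combined with load-aware delegation might be sketched as follows. The sketch assumes an in-memory registry. The names AgentRecord, CapabilityRegistry, and route, the use of queue depth as the only load metric, and the sample agent identifiers are hypothetical and are not defined by this document.</t>
        <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """Registry entry; all field names are illustrative."""
    agent_id: str
    tags: frozenset          # semantic capability tags
    queue_depth: int = 0     # most recent load metric
    healthy: bool = True     # result of the latest health check


@dataclass
class CapabilityRegistry:
    agents: dict = field(default_factory=dict)

    def register(self, record):
        self.agents[record.agent_id] = record

    def route(self, required_tags):
        """Return the least-loaded healthy agent that advertises
        every required tag, or None when no agent qualifies."""
        required = frozenset(required_tags)
        candidates = [
            a for a in self.agents.values()
            if a.healthy and required <= a.tags
        ]
        if not candidates:
            return None
        # Tie-break on agent_id so the choice is deterministic.
        return min(candidates,
                   key=lambda a: (a.queue_depth, a.agent_id))


registry = CapabilityRegistry()
registry.register(
    AgentRecord("billing-1", frozenset({"finance"}), 7))
registry.register(
    AgentRecord("billing-2",
                frozenset({"finance", "low-latency"}), 2))
registry.register(
    AgentRecord("billing-3", frozenset({"finance"}), 0,
                healthy=False))

chosen = registry.route({"finance"})
print(chosen.agent_id)  # billing-2
```
]]></sourcecode>
        <t>In this sketch the idle agent "billing-3" is skipped because its health check failed, which stands in for the circuit-breaking behavior; a real gateway would presumably derive health and load from live metrics rather than static fields.</t>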
      </section>
      <section anchor="req_isolation" numbered="true" toc="default">
        <name>Multi-Tenant Isolation</name>
        <t>Tenant data and policies MUST be cryptographically isolated using tenant-specific encryption keys and namespace separation. Configuration updates SHOULD be tenant-scoped to minimize control-plane overhead. The infrastructure MUST prevent one tenant's policies from affecting another tenant (e.g., Tenant A's rate limits must not affect Tenant B). Cross-tenant collaborations (e.g., partner integrations) MUST be explicitly permitted by policy allowlists. Audit logs MUST include tenant identifiers to enable compliance reporting. This isolation model supports both strict separation (for regulated industries) and controlled sharing (for consortium deployments).</t>
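        <t>The default-deny behavior implied by these requirements can be illustrated with a small, non-normative sketch. It assumes that policy is a set of directed tenant pairs held in memory. TenantPolicy, permits, audit_record, and the tenant names are hypothetical and are used only for illustration.</t>
        <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class TenantPolicy:
    """Default-deny policy table; names are illustrative."""
    # Each entry is a (source_tenant, destination_tenant) pair.
    allowed_pairs: set = field(default_factory=set)

    def allow(self, src, dst):
        """Permit traffic from src to dst (one direction only)."""
        self.allowed_pairs.add((src, dst))

    def permits(self, src, dst):
        """Same-tenant traffic passes; cross-tenant traffic needs
        an explicit entry in allowed_pairs."""
        return src == dst or (src, dst) in self.allowed_pairs


def audit_record(policy, src, dst):
    """Build an audit entry that carries both tenant identifiers."""
    verdict = "allow" if policy.permits(src, dst) else "deny"
    return {"src_tenant": src, "dst_tenant": dst,
            "verdict": verdict}


policy = TenantPolicy()
policy.allow("tenant-a", "partner-c")

print(audit_record(policy, "tenant-a", "tenant-b")["verdict"])
print(audit_record(policy, "tenant-a", "partner-c")["verdict"])
```
]]></sourcecode>
        <t>The two calls print "deny" and then "allow". Treating entries as directed pairs is a design assumption of the sketch: permitting traffic from "tenant-a" to "partner-c" does not implicitly permit the reverse direction.</t>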
      </section>
      <section anchor="req_security" numbered="true" toc="default">
        <name>End-to-End Security Offloading</name>
        <t>All inter-agent traffic MUST be encrypted in transit using TLS 1.3 or equivalent. The gateway MUST handle certificate lifecycle management (issuance, rotation, revocation) and authentication (OAuth 2.0, mTLS). Sensitive information (PII, credentials) SHOULD be avoided in agent payloads; where unavoidable, the gateway MAY provide data masking capabilities. The infrastructure MUST generate immutable audit trails for all collaboration events (delegation decisions, policy violations). This offloading reduces agent attack surface, ensures cryptographic best practices, and simplifies compliance certification for agent developers.</t>
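        <t>One possible way to approximate an immutable audit trail is a hash-chained, append-only log, sketched below for illustration only. This document does not prescribe that technique. The chain makes later modification detectable rather than impossible, and the function names, entry fields, and sample events are assumptions of the sketch.</t>
        <sourcecode type="python"><![CDATA[
```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash before the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(log, tenant_id, event):
    """Append an entry whose hash covers the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"tenant": tenant_id, "event": event,
            "prev": prev_hash}
    log.append({**body, "hash": _digest(body)})


def verify(log):
    """Recompute every hash; False means the chain was altered."""
    prev_hash = GENESIS
    for entry in log:
        body = {"tenant": entry["tenant"],
                "event": entry["event"],
                "prev": entry["prev"]}
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_event(audit_log, "tenant-a", "delegated to billing-2")
append_event(audit_log, "tenant-a", "policy violation: deny")
print(verify(audit_log))

audit_log[0]["event"] = "edited after the fact"
print(verify(audit_log))
```
]]></sourcecode>
        <t>The first check prints True and the second prints False, because editing an earlier entry invalidates its stored hash. A deployment would additionally need to protect the most recent hash (for example, by storing it outside the gateway), which the sketch does not attempt.</t>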
      </section>
      <section anchor="req_deployment" numbered="true" toc="default">
        <name>Platform-Agnostic Deployment</name>
        <t>DMSC MUST support agents deployed across Kubernetes clusters, VMs, bare-metal servers, and edge devices in hybrid-cloud topologies. Gateway deployment options MUST include centralized (for tight control), regional (for latency optimization), and embedded (for air-gapped environments). The data plane SHOULD leverage hardware acceleration (SmartNICs, DPUs) where available to minimize latency overhead. Configuration APIs MUST be consistent across deployment models to enable unified management. This flexibility accommodates diverse operational constraints—from cloud-native startups to regulated enterprises with on-premises requirements.</t>
      </section>
    </section>
    <section anchor="security" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>This informational document does not introduce any new security issues to the Internet.</t>
    </section>
    <section anchor="ack" numbered="true" toc="default">
      <name>Acknowledgement</name>
      <t>TBD.</t>
    </section>
    <section anchor="IANA" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>None.</t>
    </section>
  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119" xml:base="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        <reference anchor="RFC8174" target="https://www.rfc-editor.org/info/rfc8174" xml:base="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
            <abstract>
              <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>
      <references>
        <name>Informative References</name>
        <reference anchor="AutoGen" target="https://microsoft.github.io/autogen/">
          <front>
            <title>AutoGen: Enabling Next-Gen LLM Applications</title>
            <author>
              <organization>Microsoft</organization>
            </author>
            <date year="2023"/>
          </front>
        </reference>
        <reference anchor="CrewAI" target="https://crewai.com/">
          <front>
            <title>CrewAI Framework Documentation</title>
            <author>
              <organization>CrewAI Team</organization>
            </author>
            <date year="2024"/>
          </front>
        </reference>
        <reference anchor="VPC-Lattice" target="https://aws.amazon.com/vpc/lattice/">
          <front>
            <title>AWS VPC Lattice</title>
            <author>
              <organization>Amazon Web Services</organization>
            </author>
            <date year="2023"/>
          </front>
        </reference>
        <reference anchor="I-D.li-dmsc-architecture">
          <front>
            <title>Architecture for Distributed Multi-agent Secured Collaboration</title>
            <author fullname="Xing Li" initials="X." surname="Li"/>
            <date year="2024"/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-li-dmsc-architecture-00"/>
        </reference>
      </references>
    </references>
  </back>
</rfc>
