<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="lib/rfc2629.xslt"?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes"?>
<?rfc subcompact="no" ?>
<?rfc linkmailto="no" ?>
<?rfc editing="no" ?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc rfcedstyle="yes"?>
<?rfc-ext allow-markup-in-artwork="yes" ?>
<?rfc-ext include-index="no" ?>

<rfc ipr="trust200902"
     category="exp"
     submissionType="IETF"
     docName="draft-thierry-bulk-07">
  <front>
    <title abbrev="BULK1">Binary Uniform Language Kit 1.0</title>

    <author initials="P." surname="Thierry" fullname="Pierre Thierry">
      <organization>Comonad Dev</organization>
      <address>
        <email>pierre@comonad.dev</email>
      </address>
    </author>

    <date day="09" month="02" year="2026" />
    <keyword>binary</keyword>

    <abstract>
      <t>
        This specification describes a uniform, decentrally extensible and efficient format for
        data serialization.
      </t>
    </abstract>

  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <section title="Rationale">
        <t>
          This specification aims to find an original trade-off between uniformity, generality,
          extensibility, decentralization, compactness and processing speed for a data format. It
          is our opinion that every widely used existing format occupies a different position than
          this one in the solution space for formats, that none is better on all axes, and that
          this one is the current best on several axes, hence this new design. It is also our
          opinion that some of those existing formats constitute an optimal solution for their
          specific use case, either in an absolute sense, or at least at the time of their
          design. But the ever-changing field of IT now faces new challenges that call for a new
          approach.
        </t>
	<t>
	  In particular, whereas the previous trend for Internet and Web standards and programming
	  tools has been to create human-readable syntaxes for data and protocols, the advent of
	  technologies like <xref target="protobuf">protocol buffers</xref>, <xref
	  target="Thrift">Thrift</xref>, the various binary serializations for JSON like <xref
	  target="Avro">Avro</xref> or <xref target="Smile">Smile</xref>, or the binary <xref
	  target="HTTP2">HTTP/2</xref> seems to indicate that the time is ripe for a generalized use
	  of binary, reserved until now for the low-level protocols. The lessons about flexibility
	  learnt in the previous switch from binary to plain text can now be applied to efficient
	  binary syntaxes.
	</t>
	<section title="Definitions">
	  <t>
	    By uniformity, we mean the property of a syntax that can be parsed even by an
	    application that doesn't understand the semantics of every part of the processed
	    data. Of course, almost all syntaxes that feature uniformity contain a limited number
	    of non-uniform elements. Also, uniformity really only has value in the face of
	    extension, as a fixed syntax doesn't need uniformity (it only makes the implementation
	    simpler).
	  </t>
	  <t>
	    Almost all extensible syntaxes have their extensible part uniform to a great degree. In
	    this specification, uniformity is hence evaluated on two criteria: first, the number of
	    non-uniform elements (and, incidentally, their diversity); second, the fact that the
	    uniformity of the extensible part is not a limitation to the users (i.e. that the
	    temptation to extend the format in a non-uniform way is as absent as possible).
	  </t>
	  <t>
	    A good counter-example is found in most programming languages. Adding a new branching
	    construct cannot be done in a terse way without modifying the underlying
	    implementation. Such a construct either cannot be defined by user code (because of
	    evaluation rules) or can in a terribly verbose and inconvenient way (with lots of
	    boilerplate code). Notable exceptions to this limitation of programming languages are
	    Lisp, Haskell and stack programming languages.
	  </t>
	  <t>
	    On the other hand, a stack programming language is the canonical example of a
	    non-uniform language. Each operator takes a number of operands from the stack. Not
	    knowing the arity of an operator makes it impossible to continue parsing, even when its
	    evaluation was optional to the final processing. In the design space, stack programming
	    languages completely sacrifice uniformity to achieve one of the highest combinations of
	    extensibility, compactness and speed of processing.
	  </t>
	  <t>
	    By generality, we mean the ability of a syntax to lend itself to describe any kind of
	    data with a reasonable (or better yet, high) level of compactness and simplicity. For
	    example, although both arrays and linked lists could be considered very general as they
	    are both able to store any kind of data, they actually are at the respective cost of
	    complexity (arrays need the embedding of data structure in the data or in the
	    processing logic) and size (in-memory linked lists can waste as much as half or two
	    thirds of the space on the overhead of the data structure).
	  </t>
	  <t>
	    By decentralization, we mean the ability to extend the syntax in a way that avoids
	    naming collisions without the use of a central registry. Note that the DNS, as we use
	    it, is NOT decentralized in this sense, but distributed, as it cannot work without its
	    root servers and prior knowledge of their location.
	  </t>
	</section>
	<section title="State of the art">
	  <t>
	    Uniformity, generality and extensibility are usually highly valued traits in format
	    design. Programming languages obviously feature them foremost, although their
	    generality usually stops at what they are supposed to express: procedures. Most of them
	    are ill-suited to represent arbitrary data, but notable exceptions include Lisp (where
	    "code is data") and JavaScript, from which a subset, JSON, has been extracted to exchange
	    data and has seen tremendous success for this purpose. JSON may lack
	    generality and compactness, but its design makes its parsing really straightforward and
	    fast. All of them, though, lack decentralization. Some of them make it possible to
	    extend them in a distributed way if some discipline is followed (for example, by naming
	    modules after domain names), but the discipline is not mandatory (and even with domain
	    names, a change of ownership makes name collisions possible).
	  </t>
	  <t>
	    The SGML/XML family of formats also features uniformity, generality and extensibility
	    and actually fares much better than programming languages on the three fronts. XML
	    namespaces also make XML naming distributed and there have been attempts at making it
	    compact (e.g. EXI from W3C, Fast Infoset from ISO/ITU or EBML).
	  </t>
	  <t>
	    All the previously cited formats clearly lack compactness. Applying standard
	    compression techniques would sacrifice very little processing time for huge size
	    reductions in most of their intended use cases, but compression may not address their
	    ineffectiveness at storing arbitrary bytes.
	  </t>
	  <t>
	    So-called binary formats pretty much exhibit the opposite trade-offs. Most of them are
	    not uniform to achieve better compactness. Some are specifically designed for a great
	    generality, but many lack extensibility. When they are extensible, it's never in a
	    decentralized way, again for reasons that have to do with compactness. They are usually
	    extremely fast to parse.
	  </t>
	  <t>
	    Actually, many binary formats are not so much formats as they are format frameworks,
	    and exclude extensibility by design. For each use case, an IDL compiler creates a brand
	    new format that is essentially incompatible with all other formats created by the same
	    compiler (EBML specifically cites this property among its own disadvantages). If the
	    IDL compiler and framework are correctly designed, such a format usually represents an
	    optimum in compactness and speed of processing, as the compiler can also automatically
	    generate an ad-hoc optimized parser.
	  </t>
	  <t>
	    Where extensibility has been planned in existing formats, it often doesn't get used
	    that much or at all because of the complications around it. Many binary formats include
	    reserved values meant to extend them to future uses, like the <spanx
	    style="verb">CM</spanx> field in the ZIP format. A case like this one faces a
	    chicken-and-egg problem: if you don't write and get a specification officially adopted,
	    implementations might not want to include your extension, but if your extension is
	    purely theoretical and hasn't been tested in the wild, you may face resistance to get
	    it officially adopted. This is probably why even though most compression or compressed
	    archive formats include the ability to later encode other compression methods, each new
	    compression method usually comes with its own format.
	  </t>
	  <t>
	    When extensions are managed with any form of registry, another issue is that you
	    usually need to reserve a large set of values for free experimentation, and once an
	    extension gains any traction while in experimentation, its authors face the difficulty
	    of switching all existing implementations to the definitive values they'll get. And how
	    experimenters choose their temporary values makes them vulnerable to conflicts with
	    others.
	  </t>
	</section>
      </section>
      <section title="Format overview">
	<t>
	  A BULK stream is a stream of 8-bit bytes, in big-endian order. Parsing a BULK stream
	  yields a sequence of expressions, which can be either atoms or forms, which are sequences
	  of expressions.
	</t>
	<t>
	  The syntax of forms is entirely uniform, without a single exception: a
	  starting byte marker, a sequence of expressions and an ending byte marker.
	</t>
	<t>
	  Atoms have a special syntax, for efficiency purposes: they start with a marker byte,
	  followed by a static or dynamic number of bytes, depending on the type. But there are
	  only 5 kinds: the nil atom, generic arrays, small arrays, small unsigned integers and
	  references.
	</t>
	<t>
	  Even booleans and floating-point numbers follow the uniform syntax that every other
	  expression follows.
	</t>
	<t>
	  References consist of a namespace marker (in almost all cases, a single byte) followed by
	  an identifier within this namespace (a single byte). All in all, a very little sacrifice
	  is made in compactness for the benefit of a very simple syntax: apart from nil and small
	  integers, nothing is smaller than 2 bytes, and as most forms involve a reference followed
	  by some content, a form is usually 4 bytes + its content.
	</t>
	<t>
	  A namespace marker in a BULK stream is associated with a namespace identified by some
	  identifier guaranteed to be unique without coordination (like a UUID or a cryptographic
	  hash), thus ensuring decentralized extensibility. The stream can be processed even if the
	  application doesn't recognize the namespace. Parsing remains possible thanks to the
	  uniform syntax.
	</t>
	<t>
	  Combination of BULK namespaces, BULK streams and even other formats doesn't need any
	  content transformation to work. Here are some examples:
	  <list style="symbols">
	    <t>
	      The content of a BULK stream, enclosed in form starting and ending byte markers,
	      constitutes a valid BULK expression. Thus BULK streams can be packed or annotated
	      within a BULK stream without modification. Annotation use cases include adding
	      metadata or a cryptographic signature.
	    </t>
	    <t>
	      A BULK format could specify in its syntax the place for an expression holding
	      metadata. Whether the specification provides its own metadata forms or not, an
	      application could use a BULK serialization for MARC, TEI Header, XML or RDF for this
	      metadata expression. The vocabulary selected would be univocally expressed by the
	      namespace and every vocabulary would be parsed by the same mechanisms.
	    </t>
	    <t>
	      Whenever content must be stored as-is instead of serialized, or a highly optimized
	      ad hoc serialization exists for some data, anything can always be stored within an
	      array. Arrays can contain arbitrary bytes and there is no limit to their size.
	    </t>
	  </list>
	</t>
	<t>
	  Furthermore, BULK expressions can be evaluated. Most expressions evaluate to themselves,
	  but some evaluate to the result of a pure function call, making it possible to serialize
	  data in an even more compact form, by eliminating boilerplate data and repeated patterns.
	</t>
      </section>
      <section title="Conventions and Terminology">
        <t>
          The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD
          NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as
          described in <xref target="RFC2119">RFC 2119</xref>.
        </t>
        <t>
          Literal numerical values are provided in decimal or hexadecimal as appropriate.
          Hexadecimal literals are prefixed with <spanx style="verb">0x</spanx> to distinguish them
          from decimal literals.
        </t>
	<t>
	  The text notation of the BULK stream uses mnemonics for some byte sequences. Mnemonics
	  are sequences of characters, excluding all capital letters and white space, like <spanx
	  style="verb">this-is-one-mnemonic</spanx> or <spanx
	  style="verb">what-the-%§!?#-is-that?</spanx>. They are always separated by white
	  space. Outside the use of mnemonics, a sequence of bytes (of one or more bytes) can be
	  represented by its hexadecimal value as an unsigned integer prefixed by <spanx
	  style="verb">0x</spanx> (e.g. <spanx style="verb">0x3F</spanx> or <spanx
	  style="verb">0x3A0B770F</spanx>). Such a sequence of bytes can include dashes to make it
	  more readable (e.g. <spanx
	  style="verb">0xDDA37D36-85E6-4E6D-9B51-959E1CCE366C</spanx>). Some types in this
	  specification define a special syntax for their representation in the text notation.
	</t>
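	<t>
	  As an illustration, the hexadecimal part of the text notation can be converted to bytes
	  with the following Python sketch. The function name <spanx
	  style="verb">hex_literal_to_bytes</spanx> is hypothetical, and rejecting an odd number of
	  hexadecimal digits is an assumption of this sketch rather than a rule stated by this
	  specification.
	</t>
	<figure><artwork><![CDATA[
```python
def hex_literal_to_bytes(token: str) -> bytes:
    """Convert a text-notation literal such as 0x3A0B-770F to raw bytes."""
    if not token.startswith("0x"):
        raise ValueError("hexadecimal literals are prefixed with 0x")
    # Dashes only improve readability; they carry no data.
    digits = token[2:].replace("-", "")
    # Assumption of this sketch: a literal denotes a whole number of bytes.
    if not digits or len(digits) % 2 != 0:
        raise ValueError("expected an even, non-zero number of hex digits")
    return bytes.fromhex(digits)
```
]]></artwork></figure>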
	<t>
	  In the grammar, a shape is a pattern of bytes, following the rules of the text notation
	  for a BULK stream. Apart from mnemonics and fixed sequences of bytes, a shape can
	  contain:
	  <list style="symbols">
	    <t>an arbitrary sequence of a fixed number of bytes, represented by its size, i.e. a
	    number of bytes in decimal immediately followed by a B uppercase letter (e.g. <spanx
	    style="verb">4B</spanx>)</t>
	    <t>a typed sequence of bytes, represented by the name of its type, a capitalized word
	    (e.g.  <spanx style="verb">Foo</spanx>); this means a sequence of bytes whose specific
	    yield (cf. <xref target="parsing" format="title"/>) has this type</t>
	    <t>a named sequence of bytes (of zero or more bytes), represented by a sequence of any
	    character excluding '{}' between '{' and '}' (e.g. <spanx style="verb">{quux}</spanx>);
	    a named sequence can be typed or sized, in which case it is immediately followed by ':'
	    and a type or size (e.g. <spanx style="verb">{quux}:Bar</spanx> or <spanx
	    style="verb">{quux}:12B</spanx>)</t>
	  </list>
	</t>
	<t>
	  The shape that describes the byte sequence of an atom is called its parsing shape. When a
	  shape is given for a form, it merely describes the semantics of evaluating forms of that
	  shape. A reference used in such a shape can be used in different shapes, with unrelated
	  semantics.
	</t>
	<t>
	  For example, this specification defines a way to encode a string with an explicit encoding
	  with forms of the shape <spanx style="verb">( string {enc}:Encoding {string}:Bytes
	  )</spanx>. But the shapes <spanx style="verb">( string {arg1}:Int {arg2}:Int )</spanx> or
	  <spanx style="verb">( {arg1}:Int string {arg2}:Int )</spanx> are syntactically valid. They
	  just evaluate to themselves as lists of three expressions, as far as this specification
	  is concerned.
	</t>
      </section>
    </section>

    <section title="BULK syntax">
      <t>
	A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in big-endian order. The
	result of parsing a BULK stream is a list of abstract data, called the abstract yield. BULK
	parsing is a function but not an injective one: a BULK stream has only one abstract yield,
	but different BULK streams can have the same abstract yield (if they associate namespaces
	with different markers, see
	<xref target="nss-pkgs" format="title"/>).
      </t>
      <t>
	A processing application is not expected to actually produce the abstract yield, but an
	adaptation of the abstract yield to its own implementation, called the concrete yield. Also,
	some expressions in a BULK stream may have the semantics of a transformation of the abstract
	yield. A processing application MAY thus produce or retain the result of that transformation
	instead of the concrete yield. This specification deals mainly with the byte sequence and the
	abstract yield and occasionally provides guidelines about the concrete yield. Of course, a
	processing application MAY produce no concrete yield at all and instead produce various data
	structures and side effects from parsing the BULK stream (for example, an event sourced
	application may read its event log from a BULK stream and build its application state by
	applying the events, discarding each of them as soon as it has been applied).
      </t>
      <t>
	The abstract yield is a list of expressions. Expressions can be atoms or forms. Forms
	are lists of expressions. If a byte sequence is parsed as an expression, this byte
	sequence is said to encode this expression.
      </t>
      <t>
	When a sequence of bytes is named in a shape, its name can be used in this specification to
	designate either the byte sequence, or the expression or sequence of expressions it
	encodes. When there could be ambiguity, this specification specifies which is designated.
      </t>

      <section anchor="parsing" title="Parsing algorithm">
	<t>
	  The parser operates with a context, which is a list of expressions. Each time an
	  expression is parsed, it is appended at the end of the context. The initial context is
	  the abstract yield.
	</t>
	<t>
	  At the beginning of a BULK stream and after having consumed the byte sequence encoding a
	  complete expression, the parser is at the dispatch stage. At this stage, the next byte is
	  a marker byte, which tells the parser what kind of expression comes next (the marker byte
	  is the first byte of the sequence that encodes an expression). The expression appended to
	  the context after reading a byte sequence is called the specific yield of the byte
	  sequence.
	</t>
	<t>
	  The <spanx style="verb">0x01</spanx> and <spanx style="verb">0x02</spanx> marker bytes
	  are special cases. When the parser reads <spanx style="verb">0x01</spanx>, it immediately
	  appends an empty list to the current context. This list becomes the new context. This new
	  context has the previous context as parent. Then the parser returns to its dispatch
	  stage. When the parser reads <spanx style="verb">0x02</spanx>, it appends nothing to the
	  context, but instead the parent of the current context becomes the new context and the
	  parser returns to the dispatch stage. Thus it is a parsing error to read <spanx
	  style="verb">0x02</spanx> when the context is the abstract yield.
	</t>
	<t>
	  Some forms have side-effects in their semantics. Those side-effects MUST NOT affect the
	  parsing of any expression. They can affect evaluation, in which case they MUST only
	  affect the evaluation of expressions in the scope of the form. The outer scope of an
	  expression is the part of its context that follows the expression. Some forms MAY define
	  an inner scope in their shape. The scope of an expression is the union of the outer and
	  inner scopes. This makes BULK lexically scoped.
	</t>
	<t>
	  Whenever a parsing error is encountered, parsing of the BULK stream MUST stop.
	</t>
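	<t>
	  The context handling described above can be sketched in Python as follows. This is a
	  partial illustration under stated assumptions, not a reference parser: it covers only
	  nil, forms, small unsigned integers and small arrays, and it treats every other marker
	  byte as an error. Representing nil as <spanx style="verb">None</spanx> and forms as
	  Python lists, the names <spanx style="verb">parse</spanx> and <spanx
	  style="verb">ParseError</spanx>, and the rejection of a stream that ends inside an
	  unclosed form are all choices of this sketch.
	</t>
	<figure><artwork><![CDATA[
```python
class ParseError(Exception):
    """Raised on a parsing error; parsing stops at that point."""


def parse(stream: bytes) -> list:
    """Parse a subset of BULK into nested Python lists (a concrete yield)."""
    top = []        # the initial context is the abstract yield
    context = top   # current context
    parents = []    # parent contexts, innermost last
    pos = 0
    while pos < len(stream):
        marker = stream[pos]  # dispatch stage: read one marker byte
        pos += 1
        if marker == 0x00:
            context.append(None)  # nil atom (None is this sketch's stand-in)
        elif marker == 0x01:
            # Append an empty list and make it the new context.
            form = []
            context.append(form)
            parents.append(context)
            context = form
        elif marker == 0x02:
            # Return to the parent context; nothing is appended.
            if not parents:
                raise ParseError("0x02 read in the abstract yield")
            context = parents.pop()
        elif 0x80 <= marker <= 0xBF:
            context.append(marker & 0x3F)  # small unsigned integer
        elif 0xC0 <= marker <= 0xFF:
            size = marker & 0x3F           # small array
            if pos + size > len(stream):
                raise ParseError("truncated small array")
            context.append(stream[pos:pos + size])
            pos += size
        else:
            # Generic arrays and references are outside this sketch.
            raise ParseError(f"marker 0x{marker:02X} not handled here")
    if parents:
        # Assumption of this sketch: an unclosed form is an error.
        raise ParseError("stream ended inside a form")
    return top
```
]]></artwork></figure>
	<t>
	  With this sketch, the bytes <spanx style="verb">0x01-00-8B-C2-1234-02-00</spanx> yield a
	  form containing nil, the integer 11 and a two-byte array, followed by a nil atom in the
	  top-level context.
	</t>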
	<section title="Summary of marker bytes">
	  <table>
	    <thead><tr><th>marker</th><th>shape</th><th>note</th></tr></thead>
	    <tbody>
	      <tr><td><spanx style="verb">00</spanx></td><td><xref target="nil" format="none"><spanx style="verb">nil</spanx></xref></td><td></td></tr>
	      <tr><td><spanx style="verb">01</spanx></td><td><xref target="form" format="none"><spanx style="verb">(</spanx></xref></td><td></td></tr>
	      <tr><td><spanx style="verb">02</spanx></td><td><xref target="form" format="none"><spanx style="verb">)</spanx></xref></td><td></td></tr>
	      <tr><td><spanx style="verb">03</spanx></td><td><xref target="array" format="none"><spanx style="verb"># Nat {content}</spanx></xref></td><td></td></tr>
	      <tr><td><spanx style="verb">04–0F</spanx></td><td></td><td><xref target="reserved" format="none">reserved</xref></td></tr>
	      <tr><td><spanx style="verb">10–7F</spanx></td><td><xref target="ref" format="none"><spanx style="verb">Ref</spanx></xref></td><td></td></tr>
	      <tr><td><spanx style="verb">80–BF</spanx></td><td><xref target="smallint" format="none"><spanx style="verb">w6[value]</spanx></xref></td><td></td></tr>
	      <tr><td><spanx style="verb">C0–FF</spanx></td><td><xref target="smallarray" format="none"><spanx style="verb">#[size] {content}</spanx></xref></td><td></td></tr>
	    </tbody>
	  </table>
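	  <t>
	    The ranges in this table can be checked with a small Python sketch. The function name
	    <spanx style="verb">classify_marker</spanx> and the category labels it returns are
	    hypothetical; only the byte ranges come from the table.
	  </t>
	  <figure><artwork><![CDATA[
```python
def classify_marker(byte: int) -> str:
    """Map a marker byte to the kind of expression it introduces."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("a marker is a single byte")
    if byte == 0x00:
        return "nil"
    if byte == 0x01:
        return "form-start"
    if byte == 0x02:
        return "form-end"
    if byte == 0x03:
        return "array"
    if byte <= 0x0F:
        return "reserved"      # 0x04 to 0x0F
    if byte <= 0x7F:
        return "reference"     # 0x10 to 0x7F
    if byte <= 0xBF:
        return "small-int"     # 0x80 to 0xBF
    return "small-array"       # 0xC0 to 0xFF
```
]]></artwork></figure>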
	</section>
	<section anchor="eval" title="Evaluation">
	  <t>
	    A processing application MAY implement evaluation of BULK expressions and streams. When
	    evaluating a BULK stream, when the parser gets to the dispatch stage and the context is
	    the abstract yield, the last expression in the context is replaced by what it evaluates
	    to. This description defines the semantics of BULK
	    evaluation; a processing application MAY implement evaluation with a different
	    algorithm as long as it provides the same semantics.
	  </t>
	  <t>
	    The default evaluation rule is that an expression evaluates to itself. A name within a
	    namespace can have a value, which is what a reference associated with this name evaluates
	    to. A reference whose marker value is associated to no namespace or whose name has no
	    value evaluates to itself. How self-evaluating BULK expressions are represented in the
	    concrete yield is application-dependent, but future specifications MAY define a
	    standard API to access it, similar to the Document Object Model for XML.
	  </t>
	  <t>
	    The evaluation of a form obeys a special rule, though: if the first expression of the
	    form has type <spanx style="verb">Function</spanx>, that function is called with an
	    argument list and the form evaluates to the return value if it's an atom or the
	    evaluation of the return value if it is a form. If the function has type <spanx
	    style="verb">LazyFunction</spanx>, the argument list is the rest of the form. If the
	    function has type <spanx style="verb">EagerFunction</spanx>, the argument list is the
	    rest of the form, where each expression is replaced by what it evaluates to. Any
	    expression that has type <spanx style="verb">LazyFunction</spanx> or <spanx
	    style="verb">EagerFunction</spanx> also has type <spanx style="verb">Function</spanx>.
	  </t>
	  <t>
	    A form whose first expression doesn't have type <spanx style="verb">Function</spanx>
	    evaluates to itself.
	  </t>
	  <t>
	    When an application evaluates a BULK expression, it MUST verify that evaluation will
	    terminate in a finite number of evaluation steps. An application MAY verify finite
	    termination statically or dynamically. For example, an application MAY stop evaluation
	    in error after a predetermined number of steps.
	  </t>
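	  <t>
	    The evaluation rules of this section can be sketched in Python as follows. The sketch
	    makes several assumptions that are not part of this specification: forms are Python
	    lists, functions are instances of a hypothetical <spanx
	    style="verb">Function</spanx> class placed directly as the first element of a form
	    (reference resolution is left out), and finite termination is enforced dynamically
	    with a step budget.
	  </t>
	  <figure><artwork><![CDATA[
```python
from dataclasses import dataclass
from typing import Any, Callable


class EvaluationError(Exception):
    """Raised when the step budget is exhausted."""


@dataclass(frozen=True)
class Function:
    call: Callable[[list], Any]
    lazy: bool = False  # True models LazyFunction, False EagerFunction


class Evaluator:
    def __init__(self, max_steps: int = 1000):
        self.remaining = max_steps  # predetermined number of steps

    def evaluate(self, expr):
        self.remaining -= 1
        if self.remaining < 0:
            raise EvaluationError("step limit exceeded")
        # Default rule: atoms, and forms whose first expression is
        # not a function, evaluate to themselves.
        if (not isinstance(expr, list) or not expr
                or not isinstance(expr[0], Function)):
            return expr
        head, rest = expr[0], expr[1:]
        # Lazy functions receive the rest of the form untouched;
        # eager functions receive each expression evaluated.
        args = rest if head.lazy else [self.evaluate(e) for e in rest]
        result = head.call(args)
        # An atom is returned as is; a returned form is evaluated.
        if isinstance(result, list):
            return self.evaluate(result)
        return result
```
]]></artwork></figure>
	  <t>
	    For instance, with an eager function that sums its arguments, a nested form is reduced
	    from the inside out, whereas a lazy function receives the inner form unevaluated.
	  </t>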
	</section>
      </section>

      <section anchor="form" title="Forms">
	<t>
	  <list style="hanging">
	    <t hangText="starting marker"><spanx style="verb">0x01</spanx><br/>mnemonic: <spanx
	    style="verb">(</spanx></t>
	    <t hangText="ending marker"><spanx style="verb">0x02</spanx><br/>mnemonic: <spanx
	    style="verb">)</spanx></t>
	  </list>
	</t>

	<section title="Difference between sequence and form">
	  <t>
	    There is a difference between a byte sequence encoding several expressions among the
	    current context and a byte sequence encoding a form (i.e. a single expression that is a
	    list of expressions). As an example, let's examine several forms of the shape <spanx
	    style="verb">( foo {bar} )</spanx>.
	  </t>
	  <t>
	    <list style="symbols">
	      <t>In the form <spanx style="verb">( foo nil nil nil )</spanx>, {bar} encodes 3
	      expressions, and they are three atoms in the yield.</t>

	      <t>In the form <spanx style="verb">( foo nil )</spanx>, {bar} is a single expression
	      in the yield, and that expression is an atom.</t>

	      <t>In the form <spanx style="verb">( foo ( nil nil nil ) )</spanx>, {bar} is also a
	      single expression in the yield, and that expression is a form, a list in the
	      yield.</t>
	    </list>
	  </t>
	  <t>
	    In a shape, when a byte sequence must yield a single expression, it has the type <spanx
	    style="verb">Expr</spanx>. So the last two examples fit the shape <spanx style="verb">(
	    foo {seq}:Expr )</spanx> but not the first. When a byte sequence must yield a form, it
	    has type <spanx style="verb">Form</spanx>. Thus the shape <spanx style="verb">( foo
	    {bar}:Form )</spanx> is equivalent to <spanx style="verb">( foo ( {bar} )
	    )</spanx>. Either one MAY be used.
	  </t>
	</section>
      </section>

      <section title="Atoms">
	<section anchor="nil" title="nil">
	  <t>
	    <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x00</spanx><br/>mnemonic: <spanx
	      style="verb">nil</spanx></t>
	      <t hangText="parsing shape"><spanx style="verb">nil</spanx></t>
	    </list>
	  </t>
	  <t>
	    Apart from being a possible short marker value, the fact that the <spanx
	    style="verb">0x00</spanx> byte represents a valid atom means that a sequence of null
	    bytes is a valid part of a BULK stream, thus making the format less fragile. In a
	    network communication, nil atoms can be sent to keep the channel open. They can also be
	    used as padding at the end of a form or between forms.
	  </t>
	</section>

	<section title="Arrays">
	  <t>
	    Arrays can be used to store arbitrary bytes.
	  </t>
	  <t>
	    An array can be interpreted either as a bit sequence or as an unsigned integer in
	    binary notation. The choice depends on the context and the application. Actually, many
	    processing applications may not need to make any choice, as most programming language
	    implementations actually also conflate unsigned integers and bit sequences to some
	    extent. Expressions that are unsigned integers (that is, natural numbers) have type
	    <spanx style="verb">Nat</spanx> (whether they are encoded as an array or not).
	  </t>
	  <t>
	    Big arrays typically store the content of a file or a binary message of another
	    format. They can also be used to store a vector or matrix of fixed-size elements.
	  </t>
	  <t>
	    In any case, the semantics of the content must be inferred by the processing
	    application; where ambiguity can appear, an application SHOULD enclose the array in a
	    form that makes the semantics explicit (e.g. <xref target="string" format="none"><spanx
	    style="verb">string</spanx></xref>, <xref target="blob" format="none"><spanx
	    style="verb">blob</spanx></xref>, or <xref target="unsigned-int" format="none"><spanx
	    style="verb">unsigned-int</spanx></xref>).
	  </t>
	  <t>
	    Because BULK arrays have no end markers, the payload of a BULK array can constitute the
	    end of the stream.
	  </t>
	  <t>
	    The start and end of an array are known without reading its content, which means that
	    its content can be skipped in constant time and mapped in memory (or read lazily by any
	    other means).
	  </t>
	  <t>
	    Because BULK can use integers with arbitrary size to store the size of an array, BULK
	    arrays have no limit in size.
	  </t>
	  <t>
	    Any array also has the type <spanx style="verb">String</spanx> if its contents can be
	    decoded as a string in the current encoding.
	  </t>
	  
	  <section anchor="array" title="Generic array">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0x03</spanx><br/>mnemonic: <spanx
		style="verb">#</spanx></t>
		<t hangText="parsing shape"><spanx style="verb"># Nat {content}</spanx></t>
	      </list>
	    </t>
	    <t>
	      After consuming the marker byte, the parser returns to the dispatch stage. It is a
	      parser error if the parsed expression is not of type <spanx style="verb">Nat</spanx>
	      or if its value cannot be recognized. This integer is not added to any context, but
	      the parser consumes as many bytes as this integer and they constitute the content of
	      this array.
	    </t>
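	    <t>
	      A possible reading of this rule is sketched below in Python. The helper names <spanx
	      style="verb">read_nat</spanx> and <spanx style="verb">read_generic_array</spanx> are
	      hypothetical. The sketch assumes that the size is encoded as an atom (a small
	      unsigned integer or an array read as a big-endian unsigned integer) and does not
	      handle a size given by any other kind of expression.
	    </t>
	    <figure><artwork><![CDATA[
```python
def read_nat(data: bytes, pos: int) -> tuple:
    """Read one Nat-typed atom at pos; return (value, next position)."""
    marker = data[pos]
    if 0x80 <= marker <= 0xBF:              # small unsigned integer
        return marker & 0x3F, pos + 1
    if 0xC0 <= marker <= 0xFF:              # small array read as a Nat
        size = marker & 0x3F
        content = data[pos + 1:pos + 1 + size]
        if len(content) != size:
            raise ValueError("truncated small array")
        return int.from_bytes(content, "big"), pos + 1 + size
    if marker == 0x03:                      # generic array read as a Nat
        content, end = read_generic_array(data, pos)
        return int.from_bytes(content, "big"), end
    raise ValueError("expected an expression of type Nat")


def read_generic_array(data: bytes, pos: int) -> tuple:
    """Read a generic array at pos; return (content, next position)."""
    if data[pos] != 0x03:
        raise ValueError("not a generic array marker")
    size, start = read_nat(data, pos + 1)   # the size is not yielded
    content = data[start:start + size]
    if len(content) != size:
        raise ValueError("truncated array")
    return content, start + size
```
]]></artwork></figure>
	    <t>
	      Because the end position is computed from the size alone, a caller can skip the
	      content without inspecting it.
	    </t>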
	    <t>
	      In the text notation, a quoted string is the notation for a generic array containing
	      the encoding of that string in the <xref target="stringenc" format="none">current
	      encoding</xref>, except if the size of the encoding is below 64 bytes, cf. <xref
	      target="smallarray" format="none">small arrays</xref>.
	    </t>
	    <t>
	      In the text notation, some text notation enclosed between balanced <spanx
	      style="verb">([</spanx> and <spanx style="verb">])</spanx> is the notation for a
	      generic array containing the encoding of that text notation, except if the size of
	      the encoding is below 64 bytes, cf. <xref target="smallarray" format="none">small
	      arrays</xref>.
	    </t>
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	  </section>

	  <section anchor="smallarray" title="Small array">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0xC0–0xFF</spanx><br/>mnemonic: <spanx
		style="verb">#[size]</spanx></t>
		<t hangText="parsing shape"><spanx style="verb">#[size] {content}</spanx></t>
	      </list>
	    </t>
	    <t>
	      The 6 least significant bits of the marker byte are treated as an unsigned
	      integer. This integer is not added to any context, but the parser consumes as many
	      bytes as this integer and they constitute the content of this array.
	    </t>
	    <t>
	      In the text notation, the notation of the marker byte of a small array of size X is
	      <spanx style="verb">#[X]</spanx>. For example, <spanx style="verb">#[2]
	      0x1234</spanx> is a notation for the bytes <spanx style="verb">0xC2-1234</spanx>.
	    </t>
	    <t>
	      In the text notation, a quoted string is the notation for a small array containing
	      the encoding of that string in the current encoding if the size of the encoding is
	      below 64 bytes. For example, <spanx style="verb">"abc"</spanx> and <spanx
	      style="verb">#[3] 0x616263</spanx> are the notation for the same byte sequence if the
	      current encoding is UTF-8.
	    </t>
	    <t>
	      In the text notation, some text notation enclosed between balanced <spanx
	      style="verb">([</spanx> and <spanx style="verb">])</spanx> is the notation for a
	      small array containing the encoding of that text notation if the size of the encoding
	      is below 64 bytes. For example, <spanx style="verb">([ nil 0 1 256 ])</spanx> and
	      <spanx style="verb">#[6] nil w6[0] w6[1] #[2] 0x0100</spanx> are the notation for the
	      same byte sequence.
	    </t>
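	    <t>
	      The examples above can be reproduced with a short Python sketch. The function name
	      <spanx style="verb">encode_small_array</spanx> is hypothetical; the marker
	      computation follows the rule that the 6 least significant bits of the marker byte
	      hold the size.
	    </t>
	    <figure><artwork><![CDATA[
```python
def encode_small_array(content: bytes) -> bytes:
    """Prefix content with a small array marker (0xC0 to 0xFF)."""
    if len(content) > 0x3F:
        raise ValueError("a small array holds at most 63 bytes")
    # The 6 least significant bits of the marker carry the size.
    return bytes([0xC0 | len(content)]) + content


# #[2] 0x1234 is the notation for 0xC2-1234
pair = encode_small_array(bytes([0x12, 0x34]))
# "abc" in UTF-8 is the same byte sequence as #[3] 0x616263
text = encode_small_array("abc".encode("utf-8"))
# ([ nil 0 1 256 ]) wraps the encoding of four expressions
nested = encode_small_array(bytes([0x00, 0x80, 0x81, 0xC2, 0x01, 0x00]))
```
]]></artwork></figure>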
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	  </section>

	  <section anchor="smallint" title="Small unsigned integers">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0x80–0xBF</spanx><br/>mnemonic: <spanx
		style="verb">w6[value]</spanx></t>
		<t hangText="parsing shape"><spanx style="verb">w6[value]</spanx></t>
	      </list>
	    </t>
	    <t>
	      Small unsigned integers have a special parsing rule. The 6 least significant bits of
	      the marker byte are the value encoded by this byte (as bits or as an unsigned integer
	      in binary notation).
	    </t>
	    <t>
	      In the text notation, the notation of the marker byte of a small unsigned integer of
	      value X is <spanx style="verb">w6[X]</spanx>. For example, <spanx
	      style="verb">w6[11]</spanx> is a notation for the byte <spanx
	      style="verb">0x8B</spanx> (as is <spanx style="verb">11</spanx>, cf. <xref
	      target="arithmetic"/>).
	    </t>
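	    <t>
	      The rule can be sketched in Python as follows. This is a non-normative illustration;
	      <spanx style="verb">encode_w6</spanx> and <spanx style="verb">decode_w6</spanx> are
	      hypothetical names chosen after the mnemonic.
	    </t>
	    <figure><artwork><![CDATA[
```python
# Hypothetical helper names; only the marker layout comes from the text.

def encode_w6(value: int) -> bytes:
    """Encode a value in 0..63 as a single marker byte in 0x80..0xBF."""
    if not 0 <= value <= 0x3F:
        raise ValueError("w6 only holds values from 0 to 63")
    return bytes([0x80 | value])


def decode_w6(marker: int) -> int:
    """Extract the value carried by a small unsigned integer marker."""
    if not 0x80 <= marker <= 0xBF:
        raise ValueError("not a small unsigned integer marker")
    return marker & 0x3F  # the 6 least significant bits
```
]]></artwork></figure>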
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	  </section>

	  <section anchor="nat-enc" title="Encoding natural numbers">
	    <t>
	      When the syntax of a BULK form mandates that an expression can only be a <spanx
	      style="verb">Nat</spanx>, an application SHOULD encode it as the smallest possible
	      array using one of the following sizes: 6, 8, 16, 32, or any multiple of 64 bits.
	    </t>
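	    <t>
	      A minimal, non-normative Python sketch of this size selection follows. It assumes
	      big-endian byte order, as suggested by the notation of 256 as <spanx
	      style="verb">0xC2-0100</spanx>, and it only covers values that fit in a small array;
	      larger values would need the general array form. The name <spanx
	      style="verb">encode_nat</spanx> is hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
# Hypothetical helper; big-endian order is an assumption based on the
# 256 -> 0xC2-0100 example.

def encode_nat(value: int) -> bytes:
    """Encode a natural number using the smallest recommended size."""
    if value < 0:
        raise ValueError("a Nat cannot be negative")
    if value < 64:  # 6 bits: small unsigned integer
        return bytes([0x80 | value])
    for size in (1, 2, 4):  # 8, 16 and 32 bits
        if value < 1 << (8 * size):
            return bytes([0xC0 | size]) + value.to_bytes(size, "big")
    size = 8  # then any multiple of 64 bits
    while value >= 1 << (8 * size):
        size += 8
    if size > 0x3F:
        raise ValueError("value does not fit in a small array")
    return bytes([0xC0 | size]) + value.to_bytes(size, "big")
```
]]></artwork></figure>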
	  </section>
	</section>

	<section anchor="reserved" title="Reserved marker bytes">
	  <t>
	    Marker bytes <spanx style="verb">0x04−0x0F</spanx> are reserved for future major
	    versions of BULK. It is a parser error if a BULK stream with major version 1 contains
	    such a marker byte.
	  </t>
	</section>

	<section anchor="ref" title="References">
	  <t><list style="hanging">
	    <t hangText="marker"><spanx style="verb">0x10−0x7F</spanx></t>
	    <t hangText="parsing shape">
	      <spanx style="verb">{ns}:1B {name}:1B</spanx>
	      <vspace/>
  	      <spanx style="verb">0x7F {ns'} {name}:1B</spanx>
	    </t>
	  </list>
	  </t>
	  <t>
	    The <spanx style="verb">{ns}</spanx> byte is a value associated with a namespace,
	    called the namespace marker. Values <spanx style="verb">0x10−0x13</spanx> are reserved
	    for namespaces defined by BULK specifications. Greater values can be associated with
	    namespaces identified by a unique identifier.
	  </t>
	  <t>
	    The <spanx style="verb">{name}</spanx> byte is the name within the
	    namespace. Vocabularies with more than 256 names thus need to be spread across several
	    namespaces.
	  </t>
	  <t>
	    The specification of a namespace SHOULD include a mnemonic for the namespace and for
	    each defined name. When descriptions use several namespaces, the mnemonic of a
	    reference SHOULD be the concatenation of the namespace mnemonic, ":" and the name
	    mnemonic if there can be an ambiguity. For example, the <spanx style="verb">fft</spanx>
	    name in namespace <spanx style="verb">math</spanx> becomes <spanx
	    style="verb">math:fft</spanx>.
	  </t>
	  <t>Type: <spanx style="verb">Ref</spanx></t>
	  <section title="Special case">
	    <t>
	      References have a second parsing rule, for the case where a BULK stream needs a large
	      number of namespaces. If the marker byte is <spanx style="verb">0x7F</spanx>, the
	      parser continues to read bytes until it finds a byte different from <spanx
	      style="verb">0xFF</spanx>. The sum of all of those bytes, marker byte included, taken
	      as unsigned integers is the namespace marker. For example, the reference encoded by
	      the bytes <spanx style="verb">0x7F 0xFF 0x8C 0x1A</spanx> is the name 26 in the
	      namespace associated with 522 (127 + 255 + 140).
	    </t>
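	    <t>
	      Both parsing rules for references can be sketched as follows. This Python code is a
	      non-normative illustration: <spanx style="verb">parse_reference</spanx> is a
	      hypothetical name, and counting the <spanx style="verb">0x7F</spanx> marker byte in
	      the sum is an assumption drawn from the example, where the result is 522.
	    </t>
	    <figure><artwork><![CDATA[
```python
# Hypothetical helper. Including the 0x7F marker byte in the sum is an
# assumption that reproduces the example value 522.

def parse_reference(stream: bytes, pos: int = 0) -> tuple[int, int, int]:
    """Return (namespace marker, name, next position)."""
    marker = stream[pos]
    if not 0x10 <= marker <= 0x7F:
        raise ValueError("not a reference marker")
    pos += 1
    namespace = marker
    if marker == 0x7F:
        # Keep adding bytes until one differs from 0xFF (it is added too).
        while True:
            byte = stream[pos]
            pos += 1
            namespace += byte
            if byte != 0xFF:
                break
    name = stream[pos]
    return namespace, name, pos + 1
```
]]></artwork></figure>
	    <t>
	      In this sketch, a truncated stream surfaces as a Python <spanx
	      style="verb">IndexError</spanx>; a real parser would report a parser error instead.
	    </t>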
	  </section>
	</section>

      </section>

    </section>

    <section title="Standard namespaces">
      <t>
	Standard namespaces have a fixed marker value and are not identified by a unique
	identifier.
      </t>

      <section title="BULK core namespace">
	<t>
	  <list style="hanging">
	    <t hangText="marker"><spanx style="verb">0x10</spanx><br/>namespace mnemonic: <spanx
	    style="verb">bulk</spanx></t>
	  </list>
	</t>
	<table>
	  <thead><tr><th>name</th><th>mnemonic</th></tr></thead>
	  <tbody>
	    <tr><td><spanx style="verb">00</spanx></td><td><spanx style="verb">version</spanx></td></tr>
	    <tr><td><spanx style="verb">01</spanx></td><td><spanx style="verb">import</spanx></td></tr>
	    <tr><td><spanx style="verb">02</spanx></td><td><spanx style="verb">namespace</spanx></td></tr>
	    <tr><td><spanx style="verb">03</spanx></td><td><spanx style="verb">package</spanx></td></tr>
	    <tr><td><spanx style="verb">04</spanx></td><td><spanx style="verb">define</spanx></td></tr>
	    <tr><td><spanx style="verb">05</spanx></td><td><spanx style="verb">mnemonic</spanx></td></tr>
	    <tr><td><spanx style="verb">06</spanx></td><td><spanx style="verb">explain</spanx></td></tr>
	    <tr><td><spanx style="verb">07</spanx></td><td><spanx style="verb">string</spanx></td></tr>
	    <tr><td><spanx style="verb">08</spanx></td><td><spanx style="verb">bulk</spanx></td></tr>
	    <tr><td><spanx style="verb">09</spanx></td><td><spanx style="verb">blob</spanx></td></tr>
	    <tr><td><spanx style="verb">0A</spanx></td><td><spanx style="verb">concat</spanx></td></tr>
	    <tr><td><spanx style="verb">0B</spanx></td><td><spanx style="verb">indexable</spanx></td></tr>
	    <tr><td><spanx style="verb">0C</spanx></td><td><spanx style="verb">indexed-bulk</spanx></td></tr>
	    <tr><td><spanx style="verb">0D</spanx></td><td><spanx style="verb">indexed-array</spanx></td></tr>
	    <tr><td><spanx style="verb">0E</spanx></td><td><spanx style="verb">true</spanx></td></tr>
	    <tr><td><spanx style="verb">0F</spanx></td><td><spanx style="verb">false</spanx></td></tr>
	    <tr><td><spanx style="verb">10</spanx></td><td><spanx style="verb">subst</spanx></td></tr>
	    <tr><td><spanx style="verb">11</spanx></td><td><spanx style="verb">arg</spanx></td></tr>
	    <tr><td><spanx style="verb">12</spanx></td><td><spanx style="verb">rest</spanx></td></tr>
	    <tr><td><spanx style="verb">13</spanx></td><td><spanx style="verb">unsigned-int</spanx></td></tr>
	    <tr><td><spanx style="verb">14</spanx></td><td><spanx style="verb">signed-int</spanx></td></tr>
	    <tr><td><spanx style="verb">15</spanx></td><td><spanx style="verb">fraction</spanx></td></tr>
	    <tr><td><spanx style="verb">16</spanx></td><td><spanx style="verb">binary-float</spanx></td></tr>
	    <tr><td><spanx style="verb">17</spanx></td><td><spanx style="verb">decimal-float</spanx></td></tr>
	    <tr><td><spanx style="verb">18</spanx></td><td><spanx style="verb">binary-fixed</spanx></td></tr>
	    <tr><td><spanx style="verb">19</spanx></td><td><spanx style="verb">decimal-fixed</spanx></td></tr>
	    <tr><td><spanx style="verb">1A</spanx></td><td><spanx style="verb">prefix</spanx></td></tr>
	    <tr><td><spanx style="verb">1B</spanx></td><td><spanx style="verb">postfix</spanx></td></tr>
	    <tr><td><spanx style="verb">1C</spanx></td><td><spanx style="verb">arity</spanx></td></tr>
	    <tr><td><spanx style="verb">1D</spanx></td><td><spanx style="verb">iana-charset</spanx></td></tr>
	  </tbody>
	</table>

	<section anchor="version" title="Version">
	  <t>
	    <list style="hanging">
	      <t hangText="shape"><spanx style="verb">( version {major}:Nat {minor}:Nat
	      )</spanx></t>
	    </list>
	  </t>
	  <t>
	    When parsing a BULK stream, a processing application MUST determine explicitly the
	    major and minor version of the BULK specification that the stream obeys. This
	    information MAY be exchanged out-of-band, if BULK is used to exchange a number of very
	    small messages, where repeated headers of 6 bytes might become too big an overhead. A
	    processing application MUST NOT assume a default version.
	  </t>
	  <t>
	    If the version is expressed within a BULK stream, this form MUST be the first in the
	    stream. In any other place, this form has no semantics attached to it. This
	    specification defines BULK 1.0. When writing a BULK stream, an application MUST encode
	    <spanx style="verb">{major}</spanx> and <spanx style="verb">{minor}</spanx> by the
	    smallest byte sequence as described in <xref target="nat-enc"></xref>.
	  </t>
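	  <t>
	    As a non-normative illustration, the following Python sketch builds the 6-byte header
	    mentioned above. The form delimiters <spanx style="verb">0x01</spanx> and <spanx
	    style="verb">0x02</spanx> are taken from the <spanx style="verb">( 31 256 )</spanx>
	    example in <xref target="arithmetic"/>; the name <spanx
	    style="verb">version_form</spanx> is hypothetical, and the sketch only handles version
	    numbers below 64.
	  </t>
	  <figure><artwork><![CDATA[
```python
# Hypothetical helper; limited to version numbers that fit in w6.

FORM_START, FORM_END = 0x01, 0x02   # as in the ( 31 256 ) example
BULK_NS, VERSION_NAME = 0x10, 0x00  # bulk:version reference


def version_form(major: int, minor: int) -> bytes:
    """Encode ( version {major} {minor} ) with w6 integers."""
    for number in (major, minor):
        if not 0 <= number < 64:
            raise ValueError("this sketch only handles values below 64")
    return bytes([FORM_START, BULK_NS, VERSION_NAME,
                  0x80 | major, 0x80 | minor, FORM_END])
```
]]></artwork></figure>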
	  <t>
	    An application writing a BULK stream to long-term storage (e.g. in a file or a database
	    record) SHOULD include a <spanx style="verb">version</spanx> form.
	  </t>
	  <t>
	    Two BULK versions with the same major version MUST share the same parsing rules and the
	    same definitions of marker bytes. Changing the syntax or semantics of existing marker
	    bytes, or using marker bytes in the reserved interval, warrants a new major
	    version. Changing the syntax or semantics of existing names in standard namespaces
	    also warrants a new major version.
	  </t>
	  <t>
	    Adding standard namespaces, adding names in existing standard namespaces, or adding new
	    syntactic uses of existing names that don't overlap with existing uses warrants a new
	    minor version.
	  </t>
	</section>

	<section anchor="nss-pkgs" title="Namespaces and packages">
	  <section anchor="import-ns" title="Importing a namespace">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( import {marker}:Nat ( namespace {id}:Expr
		) )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This associates the namespace identified by <spanx style="verb">{id}</spanx> to the
	      namespace marker <spanx style="verb">{marker}</spanx>, within the scope of this
	      expression.
	    </t>
	  </section>

	  <section anchor="import-pkg" title="Importing a package">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( import {base}:Nat ( package {id}:Expr
		{count}:Nat ) )</spanx></t>
	      </list>
	    </t>

	    <t>
	      This associates the first <spanx style="verb">{count}</spanx> namespaces in the
	      package identified by <spanx style="verb">{id}</spanx> with a contiguous range of
	      marker bytes starting at <spanx style="verb">{base}</spanx>, within the scope of this
	      expression.
	    </t>
	    <t>
	      Example: <spanx style="verb">( import 21 ( package 0x0123456789ABCDEF 3 ) )</spanx>
	      associates the first 3 namespaces of the package identified by <spanx
	      style="verb">0x0123456789ABCDEF</spanx> to the markers 21, 22 and 23.
	    </t>
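	    <t>
	      The marker assignment can be sketched as follows. This Python code is a non-normative
	      illustration; <spanx style="verb">import_package</spanx> is a hypothetical name, and
	      representing a namespace by its zero-based position in the package is an assumption
	      of the sketch.
	    </t>
	    <figure><artwork><![CDATA[
```python
# Hypothetical helper; a namespace is represented here by the pair
# (package identifier, zero-based position in the package).

def import_package(base: int, package_id: bytes,
                   count: int) -> dict[int, tuple[bytes, int]]:
    """Map each marker in base..base+count-1 to a package namespace."""
    return {base + index: (package_id, index) for index in range(count)}
```
]]></artwork></figure>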
	  </section>
	  
	  <section anchor="def-ns" title="Namespace definition">
	    <t>
	      <list style="hanging">
		<t hangText="shape">
		  <spanx style="verb">( define ( namespace {id}:Expr {marker}:Nat ) {def}:Bytes
		  )</spanx>
		</t>
	      </list>
	    </t>
	    <t><spanx style="verb">{def}</spanx> MUST contain a BULK stream. The state of the
	    namespace associated to <spanx style="verb">{marker}</spanx> after evaluating all
	    expressions in <spanx style="verb">{def}</spanx> is made the definition of the
	    namespace identified by <spanx style="verb">{id}</spanx>, within the scope of this
	    expression. This also associates that namespace to the namespace marker <spanx
	    style="verb">{marker}</spanx>, within the scope of this expression.
	    </t>

	    <section anchor="verifiable" title="Verifiable namespace definition">
	      <t>
		When a processing application recognizes that <spanx style="verb">{id}</spanx>
		designates a digest that matches the bytes contained in <spanx
		style="verb">{def}</spanx>, this creates a verifiable namespace.
	      </t>
	      <t>
		If more data than <spanx style="verb">{id}</spanx> is needed to verify <spanx
		style="verb">{id}</spanx> against <spanx style="verb">{def}</spanx> (like the salt
		of a hash function, or the namespace of a UUID), this data should be provided by
		the first expression in <spanx style="verb">{def}</spanx>. If such information is
		not needed or when the namespace is not verifiable, the first expression MUST be
		<spanx style="verb">nil</spanx>.
	      </t>
	      <t>
		Verifiable namespaces are meant to be immutable, but that would be circumvented if
		they were built upon namespaces that aren't. A verifiable namespace that doesn't
		use names from any other namespaces is an immutable namespace. A verifiable
		namespace that only uses names from immutable namespaces is an immutable
		namespace. To that effect, an application SHOULD stop processing this form with an
		error when <spanx style="verb">{def}</spanx> contains references from namespaces
		that cannot be determined to be immutable themselves. The goal is to prevent a user
		or system from being unwittingly vulnerable, so an application MAY provide an option to
		accept a specific verifiable namespace, but an application MUST NOT provide an
		option to accept any vulnerable verifiable namespace. That is, an application MAY
		have an option like <spanx style="verb">--accept-ns 8f82849556d74466</spanx> but
		MUST NOT have an option like <spanx
		style="verb">--disable-immutability-check</spanx>.
	      </t>
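	      <t>
		The recursive definition of immutability can be sketched as follows. This Python
		code is a non-normative illustration: the function name, the dictionary layout
		(a <spanx style="verb">verifiable</spanx> flag and the set of other namespaces
		whose names are used) and the treatment of unknown or cyclic dependencies as not
		immutable are all assumptions of the sketch.
	      </t>
	      <figure><artwork><![CDATA[
```python
# Hypothetical data model: each namespace records whether it is
# verifiable and which *other* namespaces its definition uses.

def is_immutable(ns_id, namespaces, _visiting=None) -> bool:
    """True if ns_id is verifiable and only uses immutable namespaces."""
    visiting = set() if _visiting is None else _visiting
    ns = namespaces.get(ns_id)
    if ns is None or not ns["verifiable"]:
        return False  # unknown or mutable namespace
    if ns_id in visiting:
        return False  # cycle: immutability cannot be determined
    visiting.add(ns_id)
    try:
        return all(is_immutable(dep, namespaces, visiting)
                   for dep in ns["uses"])
    finally:
        visiting.discard(ns_id)
```
]]></artwork></figure>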
	      <section anchor="bootstrap" title="Bootstrapping verification">
		<t>
		  A verifiable namespace will use a form to express its digest. For immutable
		  namespaces to exist, this means that at least one namespace needs to express its
		  own digest with a name in itself. This is called a bootstrapping namespace.
		</t>
		<t>
		  When importing a bootstrapping namespace, the processing application looks up in
		  its known namespaces if there is a namespace identified by a form whose first
		  element is a reference in itself with the same name number as the digest form in
		  the import. If the logic of the digest algorithm determines that the digest form
		  in the import matches the digest in the namespace definition (along with
		  additional configuration data provided there), then the bootstrapping is
		  successful.
		</t>
		<t>
		  The same logic can be used to import a bootstrapping package, which is a package
		  identified with a form from one of its own namespaces.
		</t>
	      </section>
	    </section>

	    
	  </section>

	  <section anchor="def-pkg" title="Package definition">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( define ( package {id}:Expr ) {def}:Bytes
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This creates a package identified by <spanx style="verb">{id}</spanx>. Packages MUST
	      be immutable: <spanx style="verb">{id}</spanx> MUST be a digest verifiable against
	      the bytes contained in <spanx style="verb">{def}</spanx>. <spanx
	      style="verb">{def}</spanx> MUST contain a BULK stream with the shape <spanx
	      style="verb">{data}:Expr {namespaces}</spanx>. <spanx
	      style="verb">{namespaces}</spanx> MUST be a sequence of expressions each identifying
	      a BULK namespace.
	    </t>
	    <t>
	      If more data than <spanx style="verb">{id}</spanx> is needed to verify <spanx
	      style="verb">{id}</spanx> against <spanx style="verb">{def}</spanx> (like the salt of
	      a hash function, or the namespace of a UUID), this data should be provided by <spanx
	      style="verb">{data}</spanx>. Else <spanx style="verb">{data}</spanx> MUST be <spanx
	      style="verb">nil</spanx>.
	    </t>
	  </section>

	  <section anchor="def-name" title="Name definitions">
	    <t>
	      To define a reference is to change the value of its name in its namespace (as
	      identified by its unique identifier, not the marker value) within a certain scope.
	    </t>
	    <t>
	      If a BULK stream is not evaluated, the semantics of a definition are entirely
	      application-dependent.
	    </t>
	    <t>
	      When a BULK stream containing definitions for a namespace comes from a trusted source
	      (e.g. configuration files of the application, or communication with an
	      agent that has been granted the relevant authority), an application MAY give those
	      definitions long-lasting semantics (i.e. keep the values of the names at the end of
	      parsing). This is the preferred mechanism for BULK namespace definition when the
	      semantics of the defined expressions can be expressed completely by BULK forms.
	    </t>

	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( define {ref}:Ref {value}:Expr
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This defines the reference <spanx style="verb">{ref}</spanx> to the yield of <spanx
	      style="verb">{value}</spanx> in the outer scope of this form.
	    </t>
	  </section>

	  <section title="Mnemonic">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( mnemonic ( namespace {marker}:Nat )
		{mnemonic}:String )</spanx></t>
		<t hangText="shape"><spanx style="verb">( mnemonic Ref {mnemonic}:String
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      The first shape declares <spanx style="verb">{mnemonic}</spanx> to be the mnemonic
	      for the namespace associated with the marker <spanx style="verb">{marker}</spanx>.
	    </t>
	    <t>
	      The second shape declares <spanx style="verb">{mnemonic}</spanx> to be the mnemonic
	      for the name designated by the reference.
	    </t>
	  </section>

	  <section title="Explain">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( explain ( namespace {marker}:Nat )
		{doc}:Expr )</spanx></t>
		<t hangText="shape"><spanx style="verb">( explain Ref {doc}:Expr )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The first shape declares <spanx style="verb">{doc}</spanx> to be the documentation
	      for the namespace associated with the marker <spanx style="verb">{marker}</spanx>.
	    </t>
	    <t>
	      The second shape declares <spanx style="verb">{doc}</spanx> to be the documentation
	      for the name designated by the reference.
	    </t>
	    <t>
	      Documentation expressions can be strings with plain text or use any BULK vocabulary
	      to use a richer documentation format (including BULK forms to make a format of plain
	      text explicit).
	    </t>
	  </section>
	</section>

	<section title="Strings and other typed byte arrays">
	  <section anchor="stringenc" title="Current encoding">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( define string {enc}:Encoding
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This tells the processing application that, in the scope of this expression, all
	      expressions that are understood by the application as character strings will be
	      encoded with the encoding designated by <spanx style="verb">{enc}</spanx>.
	    </t>
	    <t>
	      As the abstract yield doesn't contain strings but expressions that will be used as
	      strings by the application, it is not a parsing error if the application doesn't
	      recognize <spanx style="verb">{enc}</spanx>. In this situation, it is a parsing error
	      when the application actually needs to decode a byte sequence as a string. It is not
	      a parsing error when a processing application only transmits a byte sequence encoding
	      a string, if it can accurately convey the encoding to the receiving application.
	    </t>
	  </section>

	  <section anchor="string" title="String">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( string {string}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This form indicates that the bytes contained in <spanx style="verb">{string}</spanx>
	      are meant to be interpreted as a string encoded with the current string encoding.
	    </t>
	  </section>

	  <section anchor="string*" title="String with explicit encoding">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( string {enc}:Encoding {string}:Bytes
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This form indicates that the bytes contained in <spanx style="verb">{string}</spanx>
	      are meant to be interpreted as a string encoded with the encoding designated by the
	      <spanx style="verb">{enc}</spanx>.
	    </t>
	  </section>

	  <section title="IANA registered character set">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( iana-charset {id}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This designates the string encoding registered among the <xref
	      target="IANA-Charsets">IANA Character Sets</xref> whose MIBenum is <spanx
	      style="verb">{id}</spanx>.
	    </t>
	    <t>
	      Type: <spanx style="verb">Encoding</spanx>.
	    </t>
	  </section>
	  <section anchor="nested-bulk" title="Nested BULK stream">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( bulk {bulk}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This form indicates that the bytes contained in <spanx style="verb">{bulk}</spanx>
	      are meant to be interpreted as a BULK stream. If the stream doesn't start with a
	      <spanx style="verb">version</spanx> form, the stream MUST be assumed to have the same
	      version as the parent stream.
	    </t>
	    <t>
	      This form can be useful to let the application reading a BULK stream skip parsing a
	      large section.
	    </t>
	    <t>
	      The evaluation of this form is the same as the evaluation of the BULK stream in
	      <spanx style="verb">{bulk}</spanx> taken as a form. For example, these two forms have
	      the same evaluation:
              <list style="symbols">
		<t><spanx style="verb">( 4 5 )</spanx></t>
		<t><spanx style="verb">( bulk true #[2] 4 5 )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This evaluation obeys the following exception. It could be a security risk if a
	      single BULK stream could be parsed into two different abstract yields by two
	      conformant applications, so the evaluation of the whole stream cannot change whether
	      <spanx style="verb">{bulk}</spanx> is decoded or not. For that reason, any effects in
	      the nested stream that affect how BULK expressions are evaluated (like namespace
	      associations or definitions) MUST be isolated within this form.
	    </t>
	    <t>
	      For the same security reason, there isn't a <spanx style="verb">( bulk-with-size
	      Nat Expr )</spanx> form because it would open up the same risk when the size given
	      is not the size of the enclosed expression, accidentally or maliciously.
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( bulk {exprs} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This form makes it possible to include a BULK stream as BULK expressions instead of
	      an array. This form's semantics is the same as the last expression in the sequence of
	      expressions in <spanx style="verb">{exprs}</spanx>, after evaluating all of them. It
	      makes it possible to isolate evaluation side-effects.
	    </t>
	  </section>

	  <section anchor="blob" title="Blob">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( blob {blob}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This form indicates that the bytes contained in <spanx style="verb">{blob}</spanx>
	      are meant to be interpreted as just a raw sequence of bytes, not to be decoded.
	    </t>
	  </section>

	  <section title="Array concatenation">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( concat {array1}:Bytes {array2}:Bytes
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="Name's type">EagerFunction</t>
		<t hangText="Form's type">Bytes</t>
		<t hangText="Form's value">the concatenation of {array1} and {array2}.</t>
	      </list>
	    </t>
	    <t>
	      The value of this form is an array that contains the bytes in array1 followed by the
	      bytes in array2.
	    </t>
	  </section>	  
	</section>

	<section title="Indexed data">
	  <t>
	    When a stream contains a large number of expressions, an application may want to
	    access one of those expressions without parsing all the expressions before it. One
	    could imagine solving this with pointer-like references that each use the offset of
	    some expression in the stream. This solution creates a security risk: if reading
	    according to the pointers doesn't produce the same result as parsing the stream
	    without using them, an attacker might use this inconsistency to their advantage when
	    they can expect one application to use pointers and another application to use normal
	    parsing. The risk is greater when the stream is big enough that verifying the
	    consistency of the pointers is so costly that it might not be done, or not in time to
	    prevent the attack.
	  </t>
	  <t>
	    Because of that risk, whenever a stream includes indexed BULK expressions, that is,
	    expressions that are meant to be accessed by their byte position, indexed reading
	    SHOULD be the only way used to access them. To that end, indexed data SHOULD be stored
	    in arrays.
	  </t>

	  <section title="Indexable content">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( indexable {container}:Nat {content}:Bytes
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This form contains in <spanx style="verb">{content}</spanx> the data that can be
	      indexed somewhere else in the stream, with <xref target="indexed-bulk"><spanx
	      style="verb">indexed-bulk</spanx></xref> or <xref target="indexed-array"><spanx
	      style="verb">indexed-array</spanx></xref>. <spanx style="verb">{container}</spanx> is
	      the identifier of this container and the value of the unsigned integer MUST be unique
	      across the stream. Indexable containers are identified by the value of their unsigned
	      integer, not by their binary encoding.
	    </t>
	  </section>
	  <section anchor="indexed-bulk" title="Indexed BULK expression">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( indexed-bulk {container}:Nat {start}:Nat
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      The semantics of this form is the same as if it was replaced by the BULK expression
	      starting at offset <spanx style="verb">{start}</spanx> in the indexable container
	      numbered <spanx style="verb">{container}</spanx>.
	    </t>
	    <t>
	      The goal of indexed data is to selectively parse only part of the BULK stream, so the
	      same security risks as with <xref target="nested-bulk"><spanx
	      style="verb">bulk</spanx></xref> are present. It could be a security risk if a single
	      BULK stream could be parsed into two different abstract yields by two conformant
	      applications, so the semantics of the whole stream cannot change whether the indexed
	      expression is decoded or not. For that reason, any effects in the indexed expression
	      that affect how BULK expressions are parsed or evaluated (like namespace associations
	      or definitions) MUST be isolated within this form.
	    </t>
	    <t>
	      Beware that, although evaluations of all other BULK definitions follow lexical
	      scoping, any definition used inside an indexed expression that isn't defined inside
	      that same indexed expression follows dynamic scoping with respect to any places where
	      it's used. Any definition made inside an indexed expression still follows lexical
	      scoping.
	    </t>
	  </section>
	  <section anchor="indexed-array" title="Indexed array">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( indexed-array {container}:Nat {start}:Nat
		{size}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The semantics of this form is an array whose content is <spanx
	      style="verb">{size}</spanx> bytes, starting at offset <spanx
	      style="verb">{start}</spanx> in the indexable container numbered <spanx
	      style="verb">{container}</spanx>.
	    </t>
	    <t>
	      Compared to <spanx style="verb">indexed-bulk</spanx>, which can reference an array
	      expression, <spanx style="verb">indexed-array</spanx> is useful when several
	      different but overlapping sections of the same byte sequence are needed as arrays.
	    </t>
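	    <t>
	      The lookup and the overlapping use case can be sketched as follows. This Python code
	      is a non-normative illustration; <spanx style="verb">indexed_array</spanx> is a
	      hypothetical name, the containers are modelled as a dictionary from container number
	      to content, and rejecting a range that exceeds the container is an assumption of the
	      sketch.
	    </t>
	    <figure><artwork><![CDATA[
```python
# Hypothetical helper; containers maps a container number to the bytes
# of the corresponding indexable content.

def indexed_array(containers: dict[int, bytes], container: int,
                  start: int, size: int) -> bytes:
    """Return size bytes starting at offset start in the container."""
    content = containers[container]
    if start < 0 or size < 0 or start + size > len(content):
        raise ValueError("indexed array exceeds the container")
    return content[start:start + size]
```
]]></artwork></figure>
	    <t>
	      Two calls with the same container and ranges such as (0, 3) and (2, 3) return
	      overlapping sections of the same byte sequence without duplicating it in the stream.
	    </t>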
	  </section>
	</section>

	<section title="Booleans">
	  <t>
	    <list style="hanging">
	      <t hangText="shape"><spanx style="verb">true</spanx></t>
	      <t hangText="shape"><spanx style="verb">false</spanx></t>
	    </list>
	  </t>
	  <t>
	    Type: <spanx style="verb">Boolean</spanx>.
	  </t>
	</section>


	<section title="Substitution">
	  <section title="Substitution function">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( subst {code} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="Name's type">LazyFunction</t>
		<t hangText="Form's type">EagerFunction</t>
		<t hangText="Form's value">A substitution function whose return value is the
		value of <spanx style="verb">{code}</spanx>. Within <spanx
		style="verb">{code}</spanx>'s specific yield, the names <spanx
		style="verb">arg</spanx> and <spanx style="verb">rest</spanx> are defined:</t>
	      </list>
	    </t>
	  </section>
	  <section title="Argument">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( arg {n}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="Name's type">EagerFunction</t>
		<t hangText="Form's type">Expr</t>
		<t hangText="Form's value">the element number <spanx style="verb">{n}</spanx>
		(starting at zero) of the substitution function's arguments list</t>
	      </list>
	    </t>
	  </section>
	  <section title="Rest of arguments list">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( rest {n}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="Name's type">EagerFunction</t>
		<t hangText="Form's type">Expr</t>
		<t hangText="Form's value">the substitution function's arguments list without its
		first <spanx style="verb">{n}</spanx> elements.</t>
	      </list>
	    </t>
	  </section>
	  <section title="Examples">
	    <t>Here is a definition of the inverse followed by the numbers 1/2, 1/3 and 1/4:</t>
	    <t><spanx style="verb">( define inverse ( subst ( frac 1 ( arg 0 ) ) ) ) ( inverse 2
	    ) ( inverse 3 ) ( inverse 4 )</spanx></t>
	    <t>Substitution will splice multiple expressions in place:</t>
	    <t>The evaluation of <spanx style="verb">( ( subst 1 ( rest 0 ) 4 ) 2 3 )</spanx>
	    must yield the same as <spanx style="verb">( 1 2 3 4 )</spanx></t>
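	    <t>
	      The splicing behaviour of <spanx style="verb">arg</spanx> and <spanx
	      style="verb">rest</spanx> can be sketched as follows. This Python code is a
	      non-normative illustration that models forms as nested lists and names as strings;
	      it only shows how arguments are substituted into a list of expressions and leaves
	      out evaluation and the way the result is wrapped into a form. The name <spanx
	      style="verb">substitute</spanx> is hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
# Hypothetical model: a form is a Python list, a name is a string.

def substitute(template: list, args: list) -> list:
    """Replace ( arg n ) and ( rest n ) inside template using args."""
    out = []
    for expr in template:
        is_pair = isinstance(expr, list) and len(expr) == 2
        if is_pair and expr[0] == "arg":
            out.append(args[expr[1]])      # a single argument
        elif is_pair and expr[0] == "rest":
            out.extend(args[expr[1]:])     # spliced in place
        elif isinstance(expr, list):
            out.append(substitute(expr, args))  # nested form
        else:
            out.append(expr)
    return out
```
]]></artwork></figure>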
	  </section>
	</section>

	<section anchor="arithmetic" title="Arithmetic">
	  <t>
	    A processing application must recognize the type of all expressions defined in this
	    specification that have the type Number, but an application MAY consider a number as
	    having an unknown value if it can't decode its value or has no adequate data type to
	    store it.
	  </t>
	  <t>
	    In the text notation of a BULK stream, a decimal integer is the notation for the
	    smallest byte sequence that yields this integer as described in <xref
	    target="nat-enc"></xref>. For example, <spanx style="verb">( 31 256 )</spanx> is a
	    notation for the bytes <spanx style="verb">0x01 0x9F 0xC2-0100 0x02</spanx>.
	  </t>

	  <section anchor="unsigned-int" title="Unsigned integer">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( unsigned-int {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The bits contained in <spanx style="verb">{bits}</spanx> are the value of this integer
	      in binary notation. This form exists in case disambiguation of the semantics is
	      necessary.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>, <spanx style="verb">Int</spanx>, <spanx
	      style="verb">Nat</spanx>.
	    </t>
	  </section>

	  <section title="Signed integer">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( signed-int {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The bits contained in <spanx style="verb">{bits}</spanx> are the value of this integer
	      in two's-complement notation.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Int</spanx>.
	    </t>
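	    <t>
	      As an illustration of the difference between the two integer forms, the following
	      Python sketch decodes the same bytes both ways. The function names and the big-endian
	      byte order are assumptions of this example (big-endian is consistent with the
	      notation <spanx style="verb">0xC2-0100</spanx> for 256 used earlier in this
	      section); they are not definitions made by this specification.
	    </t>
	    <t><figure><artwork>def decode_unsigned(bits: bytes) -> int:
    # Plain binary notation: every bit has a non-negative weight.
    return int.from_bytes(bits, byteorder="big", signed=False)


def decode_signed(bits: bytes) -> int:
    # Two's-complement notation: the top bit carries the sign.
    return int.from_bytes(bits, byteorder="big", signed=True)


# The same byte reads differently under each form.
print(decode_unsigned(b"\xff"))  # 255
print(decode_signed(b"\xff"))    # -1</artwork></figure></t>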
	  </section>

	  <section title="Fraction">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( frac {num}:Int {div}:Int )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is the number <spanx style="verb">{num}</spanx>/<spanx
	      style="verb">{div}</spanx>.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>.
	    </t>
	  </section>

	  <section title="Binary floating-point number">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( binary-float {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a floating-point number expressed in IEEE 754-2008 binary interchange
	      format. <spanx style="verb">{bits}</spanx> can be of size 16, 32, 64, 128 or any
	      bigger multiple of 32 bits, as per IEEE 754-2008 rules.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Real</spanx>, <spanx
	      style="verb">Float</spanx>.
	    </t>
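	    <t>
	      For the sizes that common platforms support natively, decoding can be sketched in
	      Python as below. The function name, the big-endian byte order and the restriction to
	      16, 32 and 64 bits are assumptions of this example; larger sizes would need a
	      dedicated decoder.
	    </t>
	    <t><figure><artwork>import struct

# struct format codes for the sizes Python unpacks natively
# (big-endian assumed; 128 bits and larger are not covered).
_FORMATS = {2: ">e", 4: ">f", 8: ">d"}


def decode_binary_float(bits: bytes) -> float:
    fmt = _FORMATS.get(len(bits))
    if fmt is None:
        raise ValueError("unsupported size: %d bytes" % len(bits))
    return struct.unpack(fmt, bits)[0]


print(decode_binary_float(bytes.fromhex("3e00")))      # 1.5
print(decode_binary_float(bytes.fromhex("3fc00000")))  # 1.5</artwork></figure></t>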
	  </section>

	  <section title="Decimal floating-point number">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( decimal-float {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a floating-point number expressed in IEEE 754-2008 decimal interchange
	      format. <spanx style="verb">{bits}</spanx> can be of size 32, 64, 128 or any bigger
	      multiple of 32 bits, as per IEEE 754-2008 rules.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Real</spanx>, <spanx
	      style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Binary fixed point number">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( binary-fixed {point}:Nat {bits}:Bytes
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a fixed-point binary number. <spanx style="verb">{bits}</spanx> contains an
	      integer in two's complement. That integer divided by 2 raised to the power <spanx
	      style="verb">{point}</spanx> is the value of this
	      form. For example, <spanx style="verb">( binary-fixed 2 15 )</spanx> has value <spanx
	      style="verb">3.75<sub>10</sub></spanx> (<spanx
	      style="verb">11.11<sub>2</sub></spanx>).
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Real</spanx>.
	    </t>
	  </section>

	  <section title="Decimal fixed point number">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( decimal-fixed {point}:Nat {bits}:Bytes
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a fixed-point decimal number. <spanx style="verb">{bits}</spanx> contains an
	      integer in two's complement. That integer divided by 10 raised to the power <spanx
	      style="verb">{point}</spanx> is the value of this
	      form. For example, <spanx style="verb">( decimal-fixed 2 123 )</spanx> has value
	      <spanx style="verb">1.23</spanx>.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Real</spanx>.
	    </t>
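	    <t>
	      The arithmetic of both fixed-point forms can be checked with the following Python
	      sketch. The function name, the <spanx style="verb">base</spanx> parameter (2 for
	      <spanx style="verb">binary-fixed</spanx>, 10 for <spanx
	      style="verb">decimal-fixed</spanx>) and the big-endian byte order are assumptions of
	      this example. Exact fractions are used so that no rounding is introduced.
	    </t>
	    <t><figure><artwork>from fractions import Fraction


def fixed_value(base: int, point: int, bits: bytes) -> Fraction:
    # {bits} holds a two's-complement integer (big-endian assumed).
    integer = int.from_bytes(bits, byteorder="big", signed=True)
    # Dividing by base**point moves the radix point {point} digits
    # to the left.
    return Fraction(integer, base ** point)


print(fixed_value(2, 2, b"\x0f"))   # 15/4, that is 3.75
print(fixed_value(10, 2, b"\x7b"))  # 123/100, that is 1.23
print(fixed_value(2, 2, b"\xf1"))   # -15/4</artwork></figure></t>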
	  </section>
	</section>

	<section title="Compact formats">
	  <t>
	    This specification and other official BULK specifications take the option to use as
	    their basic building block a form with a discriminating reference as first element
	    (basically, they are a binary representation of an abstract syntax tree). As noted
	    previously, this means that most representations weigh 4 bytes plus their actual
	    content, which will in turn have some overhead because of one or several marker bytes.
	  </t>
	  <t>
	    But when there is a special need for compactness, BULK makes it possible to design
	    protocols and formats with different trade-offs, while retaining its property of being
	    parseable by processing applications not knowing the protocol in its entirety.
	  </t>
	  <t>
	    On one end of the spectrum, a format might choose to use an array to encapsulate an ad
	    hoc binary format. An extreme use of this scheme would be to use BULK just to make
	    explicit the binary format used and for nothing else. With a known <xref
	    target="profiles">profile</xref> (for example with a file extension and/or media type
	    for such explicitly typed BLOBs), such a BULK stream can consist solely of the version
	    form, a reference that describes the binary format and an array, which would amount to
	    an overhead between 11 bytes and 20 bytes depending on the size of the content (11, 13,
	    14, 16 and 20 bytes for contents of no more than 63B, 255B, 65kB, 4GB and 18EB
	    respectively). Without a profile, with the namespace associations in a package, the
	    minimum overhead is only between 32 and 41 bytes (the difference is a single <spanx
	    style="verb">import</spanx> form, assuming a digest of 64 bits).
	  </t>
	  <t>
	    Still, even this extreme in the design space retains the ability to insert expressions
	    in the BULK stream, whatever their type. Thus metadata can be added about data that is
	    represented in a format that allows no metadata or only limited metadata.
	  </t>
	  <t>
	    Between these two extremes, several options are available to produce a format that
	    leverages the BULK parser a lot more while being more compact than a basic BULK
	    format. The following forms provide a standard way to create such formats, called BULK
	    bytecodes.
	  </t>
	  <t>
	    A BULK bytecode is a flat sequence of expressions. The evaluation of a bytecode form
	    transforms that sequence to an abstract syntax tree of its contents (and then the
	    resulting expression can be evaluated with the normal BULK evaluation rules). The
	    expressions of the bytecode are divided among operators and operands. Operators are
	    references that will end up as the first expression of a form in the abstract syntax
	    tree. Operands are all other expressions. Prefix bytecodes are those where operators
	    come before their operands; postfix bytecodes are those where operators come after
	    their operands. In the following forms, operators MUST be references.
	  </t>
	  <t>
	    When evaluating a bytecode, a processing application MUST abort the transformation as
	    soon as it encounters a reference for which it cannot determine whether it is an
	    operator or, if it is one, its arity (the number of operands it will have). An
	    expression that is not a reference is always an operand.
	  </t>

	  <section title="Prefix bytecode">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( prefix {bytecode} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a prefix bytecode form. The bytecode to be transformed is the sequence of
	      expressions in <spanx style="verb">{bytecode}</spanx>.
	    </t>
	    <t>
	      To transform a prefix bytecode form, a processing application creates an alternate
	      context. If the first expression of the bytecode is an operand, it is removed from
	      the beginning of the bytecode and appended at the end of the alternate context. If
	      the first expression of the bytecode is an operator, it is removed from the beginning
	      of the bytecode and a list is created with the operator as the first expression, then
	      as many next expressions as its arity are removed from the beginning of the bytecode
	      and appended at the end of this list. Then that resulting list is appended at the end
	      of the alternate context. The transformation continues until the bytecode is empty,
	      in which case the alternate context replaces the bytecode form and the transformation
	      is complete. The resulting form can then be evaluated in turn.
	    </t>
	    <t>Example: the evaluation of</t>
	    <t><figure><artwork>( bulk
 ( define ( arity prefix ) ( 2 go:black ) )
 ( prefix go:game go:black 1 2 go:black 3 4 go:black 5 6 )
	    </artwork></figure></t>
	    <t>is</t>
	    <t><figure><artwork>( go:game
 ( go:black 1 2 )
 ( go:black 3 4 )
 ( go:black 5 6 ) )
	    </artwork></figure></t>
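	    <t>
	      The transformation described in this section can be sketched in Python as follows.
	      This is only an illustration: the <spanx style="verb">Ref</spanx> class, the <spanx
	      style="verb">arities</spanx> mapping (from a reference to its arity, or to <spanx
	      style="verb">None</spanx> for a reference known to be an operand) and the treatment
	      of <spanx style="verb">go:game</spanx> as a known operand are assumptions of the
	      sketch. Raising an error when fewer expressions remain than the arity requires is
	      also a choice of this sketch, as that case is not specified here.
	    </t>
	    <t><figure><artwork>class Ref(str):
    """A reference; any other value stands for a non-reference."""


def transform_prefix(bytecode, arities):
    pending = list(bytecode)
    context = []  # the alternate context
    while pending:
        expr = pending.pop(0)
        if not isinstance(expr, Ref):
            context.append(expr)  # non-references are operands
            continue
        if expr not in arities:
            # Unknown reference: abort the transformation.
            raise LookupError("unknown arity: %s" % expr)
        arity = arities[expr]
        if arity is None:
            context.append(expr)  # reference known as an operand
            continue
        if arity > len(pending):
            raise ValueError("not enough operands for %s" % expr)
        # Operator: the next `arity` expressions are its operands.
        form = [expr] + pending[:arity]
        del pending[:arity]
        context.append(form)
    return context


game, black = Ref("go:game"), Ref("go:black")
arities = {black: 2, game: None}  # go:game as operand: assumed
result = transform_prefix(
    [game, black, 1, 2, black, 3, 4, black, 5, 6], arities)</artwork></figure></t>
	    <t>
	      With these inputs, <spanx style="verb">result</spanx> is the nested list
	      corresponding to the form shown in the example above.
	    </t>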
	  </section>
	  <section title="Postfix bytecode">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( postfix {bytecode} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a postfix bytecode form. The bytecode to be transformed is the sequence of
	      expressions in <spanx style="verb">{bytecode}</spanx>.
	    </t>
	    <t>
	      To transform a postfix bytecode form, a processing application creates a data
	      stack. If the first expression of the bytecode is an operand, it is removed from the
	      beginning of the bytecode and pushed on top of the stack. If the first expression of
	      the bytecode is an operator, it is removed from the beginning of the bytecode and a
	      list is created with the operator as the first expression, then as many next
	      expressions as its arity are popped from the stack and appended at the end of this
	      list (with the top of the stack as the last element). Then that resulting list is
	      pushed on top of the stack. The transformation continues until the bytecode is empty,
	      in which case the list of the elements on the stack (with the top of the stack as the
	      last element) replaces the bytecode form and the transformation is complete. The
	      resulting form can then be evaluated in turn.
	    </t>
	    <t> Example: the evaluation of </t>
	    <t><figure><artwork>( bulk
 ( define ( arity postfix ) ( 2 go:black go:white go:comment go:alternative ) )
 ( postfix
  go:game
  1 2 go:black
  "white tried an unorthodox opening" 3 4 go:white go:comment
  "a more classical opening would be" 8 9 go:white go:comment
  go:alternative
  2 3 go:black
  4 5 go:white )</artwork></figure></t>
            <t>is</t>
	    <t><figure><artwork>( go:game
  ( go:black 1 2 )
  ( go:alternative
    ( go:comment "white tried an unorthodox opening" ( go:white 3 4 ) )
    ( go:comment "a more classical opening would be" ( go:white 8 9 ) ) )
  ( go:black 2 3 )
  ( go:white 4 5 ) )</artwork></figure>
	    </t>
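	    <t>
	      The stack-based transformation can be sketched in Python as follows. As with any
	      illustration of this kind, the names are assumptions: <spanx style="verb">Ref</spanx>
	      marks references, and <spanx style="verb">arities</spanx> maps a reference to its
	      arity, or to <spanx style="verb">None</spanx> when the reference is known to be an
	      operand (which is assumed here for <spanx style="verb">go:game</spanx>). Raising an
	      error when the stack holds fewer expressions than the arity requires is a choice of
	      this sketch.
	    </t>
	    <t><figure><artwork>class Ref(str):
    """A reference; any other value stands for a non-reference."""


def transform_postfix(bytecode, arities):
    stack = []
    for expr in bytecode:
        if not isinstance(expr, Ref):
            stack.append(expr)  # non-references are operands
            continue
        if expr not in arities:
            # Unknown reference: abort the transformation.
            raise LookupError("unknown arity: %s" % expr)
        arity = arities[expr]
        if arity is None:
            stack.append(expr)  # reference known as an operand
            continue
        if arity > len(stack):
            raise ValueError("not enough operands for %s" % expr)
        # Pop `arity` operands; the top of the stack ends up last.
        cut = len(stack) - arity
        operands = stack[cut:]
        del stack[cut:]
        stack.append([expr] + operands)
    return stack


game, black, white = Ref("go:game"), Ref("go:black"), Ref("go:white")
comment, alt = Ref("go:comment"), Ref("go:alternative")
arities = {black: 2, white: 2, comment: 2, alt: 2, game: None}
result = transform_postfix([
    game,
    1, 2, black,
    "white tried an unorthodox opening", 3, 4, white, comment,
    "a more classical opening would be", 8, 9, white, comment,
    alt,
    2, 3, black,
    4, 5, white,
], arities)</artwork></figure></t>
	    <t>
	      With these inputs, <spanx style="verb">result</spanx> is the nested list
	      corresponding to the form shown in the example above.
	    </t>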
            <t>
	      The obvious advantage of postfix bytecode is that it makes it possible to compact
	      nested forms when they have a known arity. When a reference in a vocabulary can be
	      used in a form containing a variable number of expressions, if some arity is used
	      frequently enough, an application can define a specific form for it. The trade-offs
	      for this are explained in <xref target="arityForm"/>.
	    </t>
	  </section>
	  <section title="Arity definition">
	    <t>
	      <list style="hanging">
		<t hangText="shape"><spanx style="verb">( define ( arity {contexts} ) {arities}
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This form defines the arity of references in the context of bytecodes.
	    </t>
	    <t>
	      <spanx style="verb">{contexts}</spanx> can contain <spanx
	      style="verb">prefix</spanx>, <spanx style="verb">postfix</spanx>, or any other
	      reference, to specify in the context of which kind of bytecode the arities are
	      modified. If <spanx style="verb">{contexts}</spanx> is empty, the arities are
	      modified in the context of all kinds of bytecodes.
	    </t>
	    <t>
	      <spanx style="verb">{arities}</spanx> is a sequence of expressions, each of which
	      has one of the following shapes:
	      <list style="symbols">
		<t><spanx style="verb">nil</spanx>: meaning all known arities should be
		forgotten</t>
		<t>
		  <spanx style="verb">( {kind}:Expr {target} )</spanx>:
		  <list style="symbols">
		    <t>
		      if <spanx style="verb">{kind}</spanx> is <spanx style="verb">nil</spanx>, it
		      sets all references designated by <spanx style="verb">{target}</spanx> as
		      operands
		    </t>
		    <t>
		      if <spanx style="verb">{kind}</spanx> is typed <spanx
		      style="verb">Nat</spanx>, it sets all references designated by <spanx
		      style="verb">{target}</spanx> as operators of arity <spanx
		      style="verb">{kind}</spanx>
		    </t>
		    <t>
		      if <spanx style="verb">{target}</spanx> is <spanx style="verb">nil</spanx>,
		      it designates all references with unknown arity
		    </t>
		    <t>
		      if <spanx style="verb">{target}</spanx> is a sequence of references, it
		      designates each of those
		    </t>
		  </list>
		</t>
		
	      </list>
	    </t>
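	    <t>
	      One possible reading of these rules, for a single kind of bytecode, is sketched below
	      in Python. Everything here is an assumption made for illustration: the class and
	      method names, the use of <spanx style="verb">None</spanx> for <spanx
	      style="verb">nil</spanx>, the modelling of each shaped expression as a tuple of
	      <spanx style="verb">{kind}</spanx> followed by its targets, and the interpretation of
	      a <spanx style="verb">nil</spanx> target as a default applied to every reference
	      that has no explicit entry. Handling of <spanx style="verb">{contexts}</spanx> is
	      left out; an application could keep one such table per kind of bytecode.
	    </t>
	    <t><figure><artwork>NIL = None           # stand-in for the nil expression
UNKNOWN = object()   # marks "no arity information"


class ArityTable:
    def __init__(self):
        self.known = {}         # reference -> arity, NIL = operand
        self.default = UNKNOWN  # for references not in `known`

    def apply(self, arities):
        for item in arities:
            if item is NIL:
                # nil: forget every known arity
                self.known.clear()
                self.default = UNKNOWN
                continue
            kind, targets = item[0], list(item[1:])
            if targets == [NIL]:
                # nil target: all references with unknown arity
                self.default = kind
            else:
                for ref in targets:
                    self.known[ref] = kind

    def lookup(self, ref):
        value = self.known.get(ref, self.default)
        if value is UNKNOWN:
            raise LookupError("unknown arity: %s" % ref)
        return value  # an arity, or NIL for an operand


table = ArityTable()
table.apply([(2, "go:black", "go:white"), (NIL, "go:game")])
print(table.lookup("go:black"))  # 2
print(table.lookup("go:game"))   # None, i.e. an operand</artwork></figure></t>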
	  </section>

	  <section title="Going further in compactness">
            <t>
	      If the overhead of several marker bytes in the operands of some operators is too
	      much, even more compactness can be achieved by packing together small operands. For
	      example, instead of an operator with two integers as its operands, one could specify
	      an operator to take a single word as operand and extract the integers from it (while
	      still retaining the ability to operate on many sizes of integers, because it can
	      still deduce the size of the integers by dividing the size of the array by two).
	    </t>
	    <t>
	      For example, a BULK format representing player moves with a pair of coordinates on a
	      large board might represent a single move with the following shapes:
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="basic (8 bytes)"><spanx style="verb">( game:black/2 #[1] 0x41 #[1]
		0x5A )</spanx></t>
		<t hangText="packed basic (7 bytes)"><spanx style="verb">( game:black/1 #[2] 0x41
		0x5A )</spanx></t>
		<t hangText="bytecode (6 bytes)"><spanx style="verb">game:black/2 #[1] 0x41 #[1]
		0x5A</spanx></t>
		<t hangText="packed bytecode (5 bytes)"><spanx style="verb">game:black/1 #[2] 0x41
		0x5A</spanx></t>
	      </list>
	    </t>
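	    <t>
	      Extracting the two coordinates from a packed operand can be sketched in Python as
	      follows. The function name and the reading of each half as an unsigned big-endian
	      integer are assumptions of this example; a vocabulary defining a packed operator
	      would have to state its own rules.
	    </t>
	    <t><figure><artwork>def unpack_pair(word: bytes):
    # Split one array into two equally sized unsigned integers;
    # their size is deduced by halving the size of the array.
    if len(word) % 2 != 0:
        raise ValueError("cannot halve an odd number of bytes")
    half = len(word) // 2
    first = int.from_bytes(word[:half], "big")
    second = int.from_bytes(word[half:], "big")
    return first, second


print(unpack_pair(b"\x41\x5a"))          # (65, 90)
print(unpack_pair(b"\x00\x41\x00\x5a"))  # (65, 90)</artwork></figure></t>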
	    <t>
	      The transformation defined for the bytecode forms makes it possible to mix literal
	      expressions and operations represented by a sequence of operators and operands. In
	      the previous Go example, for instance, one might represent each alternating move by
	      the two players as two integers, lowering the weight of each move to 2 bytes as
	      coordinates are below 64:
	    </t>
	    <t><figure><artwork>( postfix
  ( ( 2 go:white go:comment go:alternative ) )
  go:game
  1 2
  "white tried an unorthodox opening" 3 4 go:white go:comment
  "a more classical opening would be" 8 9 go:white go:comment
  go:alternative
  2 3
  4 5 )</artwork></figure></t>
            <t>
              The difference between all these schemes and an array containing fixed-size elements
              is that they keep the ability to insert other forms, used here to represent comments
              on the game or variants.
	    </t>
	    <t>
	      The cost of the bytecode format is that if it contains operators whose arity is
	      unknown to a processing application, the whole list after the first occurrence of
	      them is unreadable to that processing application, whereas in the basic format, the
	      processing application can still process all the forms it understands, and that
	      requires no anticipation by the application creating the BULK stream.
	    </t>
	    <t>
	      The only case where operators could have an unknown arity is when the application
	      writing the stream didn't include the arities of every operator used in the stream to
	      avoid the redundancy with their previous definition (typically in the definition of
	      their respective namespaces). That redundancy would be offset by the space reduction
	      of postfix bytecode for streams containing a few dozen forms. At that point, with
	      all arities explicit, with packing and with literals for the most used forms, postfix
	      bytecode gets on the Pareto front for size and generality, retaining the full
	      generality of BULK while saving a lot of space.
	    </t>
	  </section>
	</section>
      </section>
    </section>

    <section title="Extension namespaces">
      <t>
	Extension namespaces are defined with a unique identifier, to be associated to a marker
	value.
      </t>
      <t>
	Because BULK is decentralized, as far as a processing application is concerned, apart from
	standard namespaces, there is no difference between a namespace defined as part of the
	official BULK suite and a user-defined one.
      </t>
    </section>

    <section anchor="profiles" title="Profiles">
      <t>
	A profile is a byte sequence parsed by a processing application just after the <spanx
	style="verb">version</spanx> form or before the first expression if there is no <spanx
	style="verb">version</spanx> form. Thus a parser SHOULD look ahead at the beginning of a
	stream to see if the first three bytes are <spanx style="verb">( bulk:version</spanx>. With
	respect to the BULK stream, the profile is out-of-band information, usually implicit.
      </t>
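      <t>
	The look-ahead can be sketched in Python as follows. The three bytes are those listed as
	the magic number of any BULK stream in the media type registration of this document; the
	function name and the use of a buffered reader are assumptions of this example. Note that
	<spanx style="verb">peek</spanx> may return fewer bytes than requested on a slow source,
	in which case this sketch conservatively answers that no version form was seen.
      </t>
      <t><figure><artwork>import io

# Bytes of "( bulk:version" (see the magic numbers listed in the
# media type registration).
VERSION_PREFIX = bytes([0x01, 0x10, 0x00])


def starts_with_version(stream: io.BufferedReader) -> bool:
    # peek() looks ahead without consuming, so parsing can still
    # start from offset 0 whatever the answer is.
    return stream.peek(3)[:3] == VERSION_PREFIX


data = bytes.fromhex("011000818002")  # ( bulk:version 1 0 )
stream = io.BufferedReader(io.BytesIO(data))
print(starts_with_version(stream))  # True
print(stream.read(1))               # b'\x01', nothing was consumed</artwork></figure></t>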
      <t>
	A processing application doesn't need to include the profile in the concrete yield, as long
	as the semantics of the abstract yield are maintained.
      </t>
      <t>
	The same BULK stream might be processed with different profiles.
      </t>
      <t>
	A processing application MUST NOT deduce the profile from the content of a BULK stream.
      </t>

      <section title="Profile redundancy">
	<t>
	  A processing application SHOULD only rely on the use of a profile when it is a safe
	  assumption that the profile is known, for example within a communication where the
	  protocol dictates the profile.
	</t>
	<t>
	  In particular, long-term storage of a BULK stream SHOULD preserve profile information,
	  for example with a media type that dictates the profile.
	</t>
	<t>
	  Otherwise, an application writing a BULK stream in a long-term storage SHOULD include the
	  profile after the version form. For this reason, the expressions in a profile SHOULD have
	  idempotent semantics.
	</t>
      </section>

      <section title="Standard profile">
	<t>
	  This specification defines the default profile that a processing application MUST use
	  when it is not using a specific profile:
	</t>
	<t>
	  <spanx style="verb">( define string ( iana-charset 106 ) )</spanx>
	</t>
	<t>
	  This means that the default string encoding in a BULK stream is UTF-8.
	</t>
      </section>
    </section>

    <section title="Security Considerations" anchor="sec">
      <section title="Parsing">
	<t>
	  Parsing a BULK stream is designed to be free of side-effects for the processing
	  application, apart from storing the parsed results.
	</t>
	<t>
	  Arrays in BULK carry their size, to avoid the need for escaping their content.
	  Malicious software, however, may announce an array with a size chosen to get an
	  application to exhaust its available memory. When a BULK stream has been completely
	  received, an array bigger than the remaining data MUST trigger an error. When a BULK
	  stream's size is not known in advance, the application SHOULD use a growable data
	  structure.
	</t>
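	<t>
	  A minimal Python sketch of this defensive reading strategy follows. The function name
	  and the chunk size are assumptions of this example. The buffer grows only as data
	  actually arrives, so an announced size alone cannot force a large allocation, and an
	  announced size exceeding the available data ends in an error.
	</t>
	<t><figure><artwork>import io

CHUNK = 64 * 1024  # read granularity, an arbitrary choice


def read_array(stream, declared_size: int) -> bytes:
    # Grow the buffer as data arrives instead of allocating
    # declared_size bytes up front.
    content = bytearray()
    while declared_size > len(content):
        wanted = min(CHUNK, declared_size - len(content))
        chunk = stream.read(wanted)
        if not chunk:
            # The stream ended before the announced size was met.
            raise ValueError("array larger than remaining data")
        content.extend(chunk)
    return bytes(content)


print(read_array(io.BytesIO(b"abcdef"), 4))  # b'abcd'</artwork></figure></t>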
	<t>
	  Evaluation opens up some known attacks that appear whenever a format provides a way to
	  express abstraction, like the billion laughs attack. As explained in <xref
	  target="eval" format="title"/>, an implementation MAY stop evaluation after a predefined
	  number of evaluation steps. As this has been demonstrated not to be sufficient to prevent
	  attacks based on expansion, an implementation SHOULD also put predefined limits on the
	  space that the abstract yield can take on disk or in memory.
	</t>
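	<t>
	  Both limits can be tracked together, as in the following Python sketch. The class name,
	  the method names and the choice of raising an exception are assumptions of this example;
	  an evaluator would call <spanx style="verb">step</spanx> once per evaluation step and
	  <spanx style="verb">grow</spanx> before adding data to the abstract yield.
	</t>
	<t><figure><artwork>class EvaluationBudget:
    def __init__(self, max_steps: int, max_yield_bytes: int):
        self.steps_left = max_steps
        self.bytes_left = max_yield_bytes

    def step(self):
        # Call once per evaluation step.
        self.steps_left -= 1
        if 0 > self.steps_left:
            raise RuntimeError("evaluation step limit exceeded")

    def grow(self, size: int):
        # Call before adding `size` bytes to the abstract yield.
        self.bytes_left -= size
        if 0 > self.bytes_left:
            raise RuntimeError("abstract yield size limit exceeded")


budget = EvaluationBudget(max_steps=1000, max_yield_bytes=1024)
budget.step()
budget.grow(512)  # fine; another 513 bytes would raise</artwork></figure></t>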
	<t>
	  Applications MAY use out-of-band information to select size limits (like HTTP
	  attributes), or a BULK namespace MAY provide hints. An implementation SHOULD emit
	  warnings when the size of the abstract yield would exceed the size limits set by such
	  out-of-band or in-band information.
	</t>
      </section>
      <section title="Forwarding">
	<t>
	  When a processing application forwards all or part of the data in a BULK stream to
	  another application, care must be taken if part of the forwarded data was not entirely
	  recognized, as it could be used by an attacker to benefit from the authority the
	  forwarding application has on the recipient of the data.
	</t>
      </section>
      <section title="Definitions">
	<t>
	  The architecture of a processing application SHOULD ensure that a malicious agent cannot
	  abuse authority given to it to define a namespace in order to modify associations in
	  other namespaces. Depending on the use of data structures storing BULK expressions, this
	  could amount to giving an attacker a way to manipulate the application's state. See <xref
	  target="robustNS"/> for an example of architecture that is resistant to that kind of
	  attack.
	</t>
      </section>
    </section>

    <section title="IANA Considerations">
      <t>
	This specification defines a new media type, application/bulk. The information for its
	registration with IANA is as follows:
      </t>
      <t>
	<list style="hanging">
	  <t hangText="Type name">application</t>
	  <t hangText="Subtype name">bulk</t>
	  <t hangText="Required parameters">none</t>
	  <t hangText="Optional parameters">none</t>
	  <t hangText="Encoding considerations">none, content is self-describing</t>
	  <t hangText="Security considerations">cf. <xref target="sec"/></t>
	  <t hangText="Interoperability considerations">the constraint to start any BULK file with
	  a version form has the side-effect that classes of BULK streams can be identified by a
	  sequence of bytes acting as "magic number", at offset 0:
	  <list style="hanging">
	    <t hangText="0x011000">any BULK stream</t>
	    <t hangText="0x01100081">a BULK stream of major version 1</t>
	    <t hangText="0x011000818002">a BULK stream of version 1.0</t>
	  </list>
	  </t>
	  <t hangText="Published specification">this document</t>
	  <t hangText="Applications that use this media type">none so far</t>
	  <t hangText="Fragment identifier considerations">this specification defines no semantics
	  for addressing the data with a fragment identifier; a future specification MAY define
	  fragment identifier syntaxes to address the content by byte offset or the parsed results
	  by their position in the yielded list</t>
	  <t hangText="Additional information">a future specification MAY define a naming
	  convention for media types based on bulk with a +bulk suffix, as for XML with +xml</t>
	</list>
      </t>
    </section>

    <section title="Acknowledgements">
      <t>
	The original author of this specification read <eref
	target="http://www.schnada.de/grapt/eriknaggum-xmlrant.html">Erik Naggum's famous rant
	about XML</eref> several years before, and while forgotten as such for a time, it clearly
	was the seed that slowly bloomed into the design of BULK. This format is dedicated to Erik.
      </t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <reference anchor="RFC2119">
        <front>
          <title>
            Key words for use in RFCs to Indicate Requirement Levels
          </title>
          <author initials="S." surname="Bradner" fullname="Scott Bradner">
            <organization>Harvard University</organization>
            <address><email>sob@harvard.edu</email></address>
          </author>
          <date month="March" year="1997"/>
        </front>
        <seriesInfo name="BCP" value="14"/>
        <seriesInfo name="RFC" value="2119"/>
      </reference>

      <reference anchor="IANA-Charsets" target="http://www.iana.org/assignments/character-sets">
        <front>
          <title>
	    IANA Charset Registry (archived at):
          </title>
	  <author/>
	  <date/>
        </front>
      </reference>

    </references>


    <references title="Informative references">

      <reference anchor="HTTP2">
	<front>
	  <title>Hypertext Transfer Protocol version 2 (HTTP/2)</title>
	  <author initials="M." surname="Belshe" fullname="Mike Belshe">
	    <organization>BitGo</organization>
	    <address>
	      <email>mike@belshe.com</email>
	    </address>
	  </author>

	  <author initials="R." surname="Peon" fullname="Roberto Peon">
	    <organization>Google, Inc</organization>
	    <address>
	      <email>fenix@google.com</email>
	    </address>
	  </author>

	  <author initials="M." surname="Thomson" fullname="Martin Thomson" role="editor">
	    <organization>Mozilla</organization>
	    <address>
	      <postal>
		<street>331 E Evelyn Street</street>
		<city>Mountain View, CA</city>
		<code>94041</code>
		<country>US</country>
	      </postal>
	      <email>martin.thomson@gmail.com</email>
	    </address>
	  </author>
	  <date month="May" year="2015"/>
	</front>
        <seriesInfo name="RFC" value="7540"/>
      </reference>

      <reference anchor="Avro" target="http://avro.apache.org/docs/1.7.4/spec.html">
	<front>
	  <title>Apache Avro™ 1.7.4 Specification</title>
	  <author initials="D." surname ="Cutting" fullname="Doug Cutting">
            <organization>Cloudera</organization>
	  </author>
	  <date month="February" year="2013"/>
	</front>
      </reference>

      <reference anchor="protobuf" target="https://developers.google.com/protocol-buffers/">
	<front>
	  <title>Protocol Buffers</title>
	  <author/>
	  <date month="July" year="2008"/>
	</front>
      </reference>

      <reference anchor="Smile" target="http://wiki.fasterxml.com/SmileFormat">
	<front>
	  <title>Smile Data Format</title>
	  <author initials="T." surname ="Saloranta" fullname="Tatu Saloranta">
	    <address><email>tsaloranta@gmail.com</email></address>
	  </author>
	  <date month="September" year="2010"/>
	</front>
      </reference>

      <reference anchor="Thrift" target="http://thrift.apache.org/static/files/thrift-20070401.pdf">
	<front>
	  <title>Thrift: Scalable Cross-Language Services Implementation</title>
	  <author initials="M." surname ="Slee" fullname="Mark Slee">
	    <organization>Facebook</organization>
	    <address><email>mcslee@facebook.com</email></address>
	  </author>
	  <author initials="A." surname ="Agarwal" fullname="Aditya Agarwal">
	    <organization>Facebook</organization>
	    <address><email>aditya@facebook.com</email></address>
	  </author>
	  <author initials="M." surname ="Kwiatkowski" fullname="Marc Kwiatkowski">
	    <organization>Facebook</organization>
	    <address><email>marc@facebook.com</email></address>
	  </author>
	  <date month="April" year="2007"/>
	</front>
      </reference>

    </references>

    <section anchor="robustNS" title="Robust namespace definition">
      <t>
	This constitutes a suggestion of architecture for a BULK processing application. It has the
	advantage that an agent cannot modify the values of names to which it has not specifically
	been given authority. This architecture doesn't ensure this property by checking the
	validity of definitions but by adhering to the Principle Of Least Authority, thus ensuring
	no false positives or TOCTOU race conditions.
      </t>
      <t>
	For each new context (including the abstract yield when parsing starts), the parser creates
	a new copy of each known namespace. These copies are available in this context to retrieve
	and define values. It implements the lexical scoping of definitions on top of providing the
	robustness properties discussed here.
      </t>
      <t>
	By default, all namespaces created in a context are discarded at the end of this context.
      </t>
      <t>
	Of course, an implementation of the architecture presented here can be optimized compared
	to the abstract algorithm, for example by using copy-on-demand.
      </t>
      <t>
	Any namespace that is not a copy made for its context, but the object retained by the
	application afterwards, gives authority to make long-lasting definitions. Such a namespace
	is called lasting here.
      </t>
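      <t>
	The copy-on-demand optimization can be approximated in Python with a layered mapping, as
	sketched below. The function name and the modelling of a namespace as a dictionary are
	assumptions of this example. Lookups fall through to the retained object, while
	definitions made in the context land in a fresh layer that is discarded with the context.
      </t>
      <t><figure><artwork>from collections import ChainMap


def enter_context(namespace: dict) -> ChainMap:
    # new_child() puts an empty mapping in front of `namespace`:
    # reads fall through to it, writes stay in the new layer.
    return ChainMap(namespace).new_child()


# The object retained by the application afterwards.
retained = {"shake128": "original definition"}

view = enter_context(retained)
view["shake128"] = "redefined inside the context"

print(view["shake128"])      # redefined inside the context
print(retained["shake128"])  # original definition</artwork></figure></t>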
      <section title="Selective authority">
	<t>
	  A number of lasting namespaces are included for the abstract yield. Their unique
	  identifiers are agreed out-of-band. The disadvantage of this solution is that it needs
	  prior agreement on the definable namespaces.
	</t>
      </section>
      <section title="Open authority">
	<t>
	  Any namespace definition for a unique identifier unknown to the processing application
	  triggers the creation of a lasting namespace.
	</t>
	<t>
	  The disadvantage of this solution is that it opens a denial of service vulnerability. If
	  Bob is a processing application and Carol and Dave are agents communicating with Bob with
	  an open authority, Dave can prevent Carol from defining a namespace if it manages to know
	  the unique identifier and to start a communication with Bob before Carol.
	</t>
	<t>
	  If an agent uses a secure way to create unique identifiers, this solution is both
	  flexible and safe (the burden is not on the BULK processing application). This
	  specification thus encourages the use of open authority restricted to verifiable
	  namespaces (in which case several agents can present the same definition to a processing
	  application without conflict).
	</t>
      </section>
    </section>

    <section title="Forward compatibility">
      <t>
	BULK makes it possible to create new versions of vocabularies that encompass previous
	versions, in a way that minimizes implementation complexity.
      </t>

      <t>
	The first tool is aliasing: reuse names and values from existing namespaces, even in <xref
	target="bootstrap" format="none">bootstrapping namespaces</xref>:
      </t>

      <t><figure><artwork>( import ( namespace 20 ( oldhash:shake128 {oldhashid} ) ) )
( define ( namespace ( newhash:shake128 {newhashid} ) 21 )
([ nil
( explain ( namespace 21 ) "The new, shiny hash namespace!" )
( define newhash:shake128 oldhash:shake128 ) ]) )</artwork></figure></t>
	  
      <t>
	With this, new namespaces can be created and applications don't need to change the existing
	code.
      </t>

      <t>
	One possible downside of aliasing is that if the number of aliasing namespaces grows, you
	might end up with the implementation of an important namespace scattered across a bunch of
	aliased legacy namespaces. In that case, a second tool is to reverse the direction of
	aliasing: all the implementation lives in the current namespace, cohesively, and old
	namespaces are aliases to the new:
      </t>

      <t><figure><artwork>( import ( namespace 20 ( newhash:shake128 {newhashid} ) ) )
( define ( namespace ( oldhash:shake128 {oldhashid} ) 21 )
([ nil
( explain ( namespace 20 ) "Ye olde hashing namespace." )
( define oldhash:shake128 newhash:shake128 ) ]) )</artwork></figure></t>

      <t>
	Such a new namespace definition would obviously have a different digest than the original,
	and namespace verification, as is, would fail, so reverse aliasing would need the
	application to provide a dedicated escape hatch for it. Following the Principle of Least
	Authority, it should not be possible for the escape hatch to be something that can be
	requested from a BULK stream, so there is no provision for it in BULK syntax.
      </t>

      <t>
	One obvious design would be for the application to have a privileged storage for reverse
	aliasing namespace definitions. Where this could still not be deemed safe enough, reverse
	aliasing namespaces could be defined in the application's code.
      </t>
    </section>

    <section anchor="arityForm" title="Arity-carrying forms">
      <t>
	Sometimes a vocabulary will include forms that can contain an arbitrary number of
	expressions. When such a form is used in postfix bytecode, the simplest solution is just to
	use a nested <spanx style="verb">postfix</spanx> form:
      </t>
      <t><figure><artwork>( define ( arity ) ( 2 go:black go:white go:comment ) )
( postfix
  go:game
  1 2 go:black
  ( postfix go:alternative
    "white tried an unorthodox opening" 3 4 go:white go:comment
    "a more classical opening would be" 8 9 go:white go:comment )
  2 3 go:black
  ( postfix go:alternative 
    "white played a bad move" 4 5 go:white go:comment
    "white could have played a decent move" 5 6 go:white go:comment
    "white could have played a great move" 5 7 go:white go:comment ) )</artwork></figure></t>
      <t>
	The nested <spanx style="verb">postfix</spanx> form costs 4 extra bytes compared to the
	equivalent flat postfix bytecode.
      </t>
      <t>
	If those 4 bytes add up to too much space through repetition, an application could define a
	form for the sole purpose of assigning it an arity, while the evaluation of the
	arity-carrying form would just replace it with the original one. For example, after
	evaluating the postfix bytecode transformation and the resulting form of the last
	expression of
      </t>
      <t><figure><artwork>( define go2:alt/2 go:alternative )
( define go2:alt/3 go:alternative )
( define ( arity ) ( 2 go:black go:white go:comment go2:alt/2 ) ( 3 go2:alt/3 ) )
( postfix
  go:game
  1 2 go:black
  "white tried an unorthodox opening" 3 4 go:white go:comment
  "a more classical opening would be" 8 9 go:white go:comment
  go2:alt/2
  2 3 go:black
  "white played a bad move" 4 5 go:white go:comment
  "white could have played a decent move" 5 6 go:white go:comment
  "white could have played a great move" 5 7 go:white go:comment
  go2:alt/3
  )</artwork></figure></t>
      <t>it would be transformed into</t>
      <t><figure><artwork>( go:game
  ( go:black 1 2 )
  ( go:alternative
    ( go:comment "white tried an unorthodox opening" ( go:white 3 4 ) )
    ( go:comment "a more classical opening would be" ( go:white 8 9 ) ) )
  ( go:black 2 3 )
  ( go:alternative
    ( go:comment "white played a bad move" ( go:white 4 5 ) )
    ( go:comment "white could have played a decent move" ( go:white 5 6 ) )
    ( go:comment "white could have played a great move" ( go:white 5 7 ) ) ) )</artwork></figure>
      </t>
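      <t>
	The transformation above can be sketched in a few lines of Python. This is a
	non-normative illustration that assumes the stack-based reading suggested by the examples
	in this section: operands are pushed on a stack, and a name with a declared arity pops that
	many entries to build a form. The <spanx style="verb">Name</spanx> class, the <spanx
	style="verb">postfix_to_forms</spanx> function and the <spanx
	style="verb">aliases</spanx> table are hypothetical and not part of BULK; the table merely
	stands in for the evaluation step that replaces an arity-carrying form with the original
	one.
      </t>
      <t><figure><artwork>class Name(str):
    """A BULK name such as go:white (hypothetical stand-in)."""


def postfix_to_forms(head, items, arities, aliases=None):
    """Rebuild nested forms from a flat postfix sequence.

    Assumed reading: operands are pushed on a stack; a name with
    a declared arity n pops the n topmost entries and pushes the
    form made of that name followed by them. What remains at the
    end becomes the content of the form headed by `head`.
    """
    aliases = aliases or {}
    stack = []
    for item in items:
        if isinstance(item, Name) and item in arities:
            n = arities[item]
            if n > len(stack):
                raise ValueError("not enough operands for " + item)
            start = len(stack) - n
            args = stack[start:]
            del stack[start:]
            # Stand-in for evaluation: an arity-carrying name is
            # replaced with the original name it was defined as.
            stack.append((aliases.get(item, item), *args))
        else:
            stack.append(item)
    return (head, *stack)


arities = {"go:black": 2, "go:white": 2, "go:comment": 2,
           "go2:alt/2": 2}
aliases = {"go2:alt/2": Name("go:alternative")}
items = [
    1, 2, Name("go:black"),
    "white tried an unorthodox opening",
    3, 4, Name("go:white"), Name("go:comment"),
    "a more classical opening would be",
    8, 9, Name("go:white"), Name("go:comment"),
    Name("go2:alt/2"),
]
form = postfix_to_forms(Name("go:game"), items, arities, aliases)</artwork></figure></t>
      <t>
	With these inputs, <spanx style="verb">form</spanx> holds a <spanx
	style="verb">go:game</spanx> form containing the <spanx style="verb">go:black</spanx> move
	followed by a single <spanx style="verb">go:alternative</spanx> form wrapping the two
	comments, which mirrors the first half of the transformed expression.
      </t>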
      <t>
	Such an arity-carrying form costs 10 or 13 bytes to be usable when it is added to an
	existing form defining arities. Since each use saves the 4 bytes of a nested <spanx
	style="verb">postfix</spanx> form, it pays for itself from the third or fourth use
	onwards, respectively.
      </t>
    </section>

  </back>
</rfc>

