RTP Payload Format for H.264 Video

Status of This Memo

   This document specifies an Internet standards track protocol for the

   Internet community, and requests discussion and suggestions for

   improvements.  Please refer to the current edition of the "Internet

   Official Protocol Standards" (STD 1) for the standardization state

   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This memo describes an RTP Payload format for the ITU-T

   Recommendation H.264 video codec and the technically identical

   ISO/IEC International Standard 14496-10 video codec.  The RTP payload

   format allows for packetization of one or more Network Abstraction

   Layer Units (NALUs), produced by an H.264 video encoder, in each RTP

   payload.  The payload format has wide applicability, as it supports

   applications from simple low bit-rate conversational usage, to

   Internet video streaming with interleaved transmission, to high bit-

   rate video-on-demand.

Table of Contents

   1.  Introduction
       1.1.  The H.264 Codec
       1.2.  Parameter Set Concept
       1.3.  Network Abstraction Layer Unit Types
   2.  Conventions
   3.  Scope
   4.  Definitions and Abbreviations
       4.1.  Definitions
   5.  RTP Payload Format
       5.1.  RTP Header Usage
       5.2.  Common Structure of the RTP Payload Format
       5.3.  NAL Unit Octet Usage
       5.4.  Packetization Modes
       5.5.  Decoding Order Number (DON)
       5.6.  Single NAL Unit Packet
       5.7.  Aggregation Packets
       5.8.  Fragmentation Units (FUs)
   6.  Packetization Rules
       6.1.  Common Packetization Rules
       6.2.  Single NAL Unit Mode
       6.3.  Non-Interleaved Mode
       6.4.  Interleaved Mode
   7.  De-Packetization Process (Informative)
       7.1.  Single NAL Unit and Non-Interleaved Mode
       7.2.  Interleaved Mode
       7.3.  Additional De-Packetization Guidelines
   8.  Payload Format Parameters
       8.1.  MIME Registration
       8.2.  SDP Parameters
       8.3.  Examples
       8.4.  Parameter Set Considerations
   9.  Security Considerations
   10. Congestion Control
   11. IANA Considerations
   12. Informative Appendix: Application Examples
       12.1. Video Telephony according to ITU-T H.241 Annex A
       12.2. Video Telephony, No Slice Data Partitioning, No NAL Unit
             Aggregation
       12.3. Video Telephony, Interleaved Packetization Using NAL Unit
             Aggregation
       12.4. Video Telephony with Data Partitioning
       12.5. Video Telephony or Streaming with FUs and Forward Error
             Correction
       12.6. Low Bit-Rate Streaming
       12.7. Robust Packet Scheduling in Video Streaming
   13. Informative Appendix: Rationale for Decoding Order Number
       13.1. Introduction
       13.2. Example of Multi-Picture Slice Interleaving
       13.3. Example of Robust Packet Scheduling
       13.4. Example of Robust Transmission Scheduling of Redundant
             Coded Slices
       13.5. Remarks on Other Design Possibilities
   14. Acknowledgements
   15. References
       15.1. Normative References
       15.2. Informative References
   Authors' Addresses
   Full Copyright Statement

1.  Introduction

1.1.  H.264 Codec

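   This memo specifies an RTP payload specification for the video
   coding standard known as ITU-T Recommendation H.264 and ISO/IEC
   International Standard 14496 Part 10 [2] (both also known as
   Advanced Video Coding, or AVC).  Recommendation H.264 was approved
   by ITU-T in May 2003, and the approved draft specification is
   available for public review [8].  In this memo the H.264 acronym is
   used for the codec and the standard, but the memo is equally
   applicable to the ISO/IEC counterpart of the coding standard.

   The H.264 video codec has a very broad application range that covers
   all forms of digital compressed video, from low bit-rate Internet
   streaming applications to HDTV broadcast and Digital Cinema
   applications.  Compared to the current state of technology, the
   overall performance of H.264 is such that bit-rate savings of 50%
   are reported.  Digital Satellite TV quality, for example, was
   reported to be achievable at 1.5 Mbit/s, compared to the current
   operation point of MPEG 2 video at around 3.5 Mbit/s [9].

   The codec specification [1] itself distinguishes conceptually
   between a video coding layer (VCL) and a network abstraction layer
   (NAL).  The VCL contains the signal processing functionality of the
   codec; mechanisms such as transform, quantization, and motion
   compensated prediction; and a loop filter.  It follows the general
   concept of most of today's video codecs, a macroblock-based coder
   that uses inter picture prediction with motion compensation and
   transform coding of the residual signal.  The VCL encoder outputs
   slices: a bit string that contains the macroblock data of an integer
   number of macroblocks, and the information of the slice header
   (containing the spatial address of the first macroblock in the
   slice, the initial quantization parameter, and similar information).
   Macroblocks in slices are arranged in scan order unless a different
   macroblock allocation is specified by using the so-called Flexible
   Macroblock Ordering syntax.  In-picture prediction is used only
   within a slice.  More information is provided in [9].

   The Network Abstraction Layer (NAL) encoder encapsulates the slice
   output of the VCL encoder into Network Abstraction Layer Units (NAL
   units), which are suitable for transmission over packet networks or
   use in packet oriented multiplex environments.  Annex B of H.264
   defines an encapsulation process to transmit such NAL units over
   byte-stream oriented networks.  In the scope of this memo, Annex B
   is not relevant.

   Internally, the NAL uses NAL units.  A NAL unit consists of a
   one-byte header and the payload byte string.  The header indicates
   the type of the NAL unit, the (potential) presence of bit errors or
   syntax violations in the NAL unit payload, and information regarding
   the relative importance of the NAL unit for the decoding process.
   This RTP payload specification is designed to be unaware of the bit
   string in the NAL unit payload.

   One of the main properties of H.264 is the complete decoupling of
   the transmission time, the decoding time, and the sampling or
   presentation time of slices and pictures.  The decoding process
   specified in H.264 is unaware of time, and the H.264 syntax does not
   carry information such as the number of skipped frames (as is common
   in the form of the Temporal Reference in earlier video compression
   standards).  Also, there are NAL units that affect many pictures and
   that are, therefore, inherently timeless.  For this reason, the
   handling of the RTP timestamp requires some special considerations
   for NAL units for which the sampling or presentation time is not
   defined or, at transmission time, unknown.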

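1.2.  Parameter Set Concept

   One very fundamental design concept of H.264 is to generate
   self-contained packets, to make mechanisms such as the header
   duplication of RFC 2429 or MPEG-4's Header Extension Code (HEC) [11]
   unnecessary.  This was achieved by decoupling information relevant
   to more than one slice from the media stream.  This higher layer
   meta information should be sent reliably, asynchronously, and in
   advance from the RTP packet stream that contains the slice packets.
   (Provisions for sending this information in-band are also available
   for applications that do not have an out-of-band transport channel
   appropriate for the purpose.)  The combination of the higher-level
   parameters is called a parameter set.  The H.264 specification
   includes two types of parameter sets: sequence parameter set and
   picture parameter set.  An active sequence parameter set remains
   unchanged throughout a coded video sequence, and an active picture
   parameter set remains unchanged within a coded picture.  The
   sequence and picture parameter set structures contain information
   such as picture size, optional coding modes employed, and macroblock
   to slice group map.

   To be able to change picture parameters (such as the picture size)
   without having to transmit parameter set updates synchronously to
   the slice packet stream, the encoder and decoder can maintain a list
   of more than one sequence and picture parameter set.  Each slice
   header contains a codeword that indicates the sequence and picture
   parameter set to be used.

   This mechanism allows the decoupling of the transmission of
   parameter sets from the packet stream, and the transmission of them
   by external means (e.g., as a side effect of the capability
   exchange), or through a (reliable or unreliable) control protocol.
   It may even be possible that they are never transmitted but are
   fixed by an application design specification.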

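1.3.  Network Abstraction Layer Unit Types

   Tutorial information on the NAL design can be found in [12], [13],
   and [14].

   All NAL units consist of a single NAL unit type octet, which also
   co-serves as the payload header of this RTP payload format.  The
   payload of a NAL unit follows immediately.

   The syntax and semantics of the NAL unit type octet are specified in
   [1], but the essential properties of the NAL unit type octet are
   summarized below.  The NAL unit type octet has the following format:

      +---------------+
      |0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+
      |F|NRI|  Type   |
      +---------------+

   The semantics of the components of the NAL unit type octet, as
   specified in the H.264 specification, are described briefly below.

   F: 1 bit
      forbidden_zero_bit.  The H.264 specification declares a value of
      1 as a syntax violation.

   NRI: 2 bits
      nal_ref_idc.  A value of 00 indicates that the content of the NAL
      unit is not used to reconstruct reference pictures for inter
      picture prediction.  Such NAL units can be discarded without
      risking the integrity of the reference pictures.  Values greater
      than 00 indicate that the decoding of the NAL unit is required to
      maintain the integrity of the reference pictures.

   Type: 5 bits
      nal_unit_type.  This component specifies the NAL unit payload
      type as defined in table 7-1 of [1], and later within this memo.
      For a reference of all currently defined NAL unit types and their
      semantics, please refer to section 7.4.1 in [1].

   This memo introduces new NAL unit types, which are presented in
   section 5.2.  The NAL unit types defined in this memo are marked as
   unspecified in [1].  Moreover, this specification extends the
   semantics of F and NRI as described in section 5.3.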
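   As an illustration of the octet layout described in this section,
   the following Python sketch splits a NAL unit type octet into its F,
   NRI, and Type fields and rebuilds it.  The names NalHeader,
   parse_nal_header, and build_nal_header are hypothetical helpers
   introduced here for illustration; they are not part of this memo or
   of any library.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    f: int         # forbidden_zero_bit, 1 bit
    nri: int       # nal_ref_idc, 2 bits
    nal_type: int  # nal_unit_type, 5 bits


def parse_nal_header(octet: int) -> NalHeader:
    """Split the one-octet NAL unit header into F, NRI and Type."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("header must be a single octet")
    return NalHeader(f=octet >> 7, nri=(octet >> 5) & 0x03, nal_type=octet & 0x1F)


def build_nal_header(f: int, nri: int, nal_type: int) -> int:
    """Pack F, NRI and Type back into one octet."""
    if f not in (0, 1) or not 0 <= nri <= 3 or not 0 <= nal_type <= 31:
        raise ValueError("field out of range")
    return (f << 7) | (nri << 5) | nal_type
```

   For example, the octet 0x67 decodes to F=0, NRI=3 (binary 11), and
   Type=7, which is a sequence parameter set.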

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",

   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this

   document are to be interpreted as described in BCP 14, RFC 2119 [3].

   This specification uses the notion of setting and clearing a bit when

   bit fields are handled.  Setting a bit is the same as assigning that

   bit the value of 1 (On).  Clearing a bit is the same as assigning

   that bit the value of 0 (Off).

3.  Scope

   This payload specification can only be used to carry the "naked"

   H.264 NAL unit stream over RTP, and not the bitstream format

   discussed in Annex B of H.264.  Likely, the first applications of

   this specification will be in the conversational multimedia field,

   video telephony or video conferencing, but the payload format also

   covers other applications, such as Internet streaming and TV over IP.

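4.  Definitions and Abbreviations

4.1.  Definitions

   This document uses the definitions of [1].  The following terms,
   defined in [1], are summed up for convenience:

      access unit: A set of NAL units always containing a primary coded
      picture.  In addition to the primary coded picture, an access
      unit may also contain one or more redundant coded pictures or
      other NAL units not containing slices or slice data partitions of
      a coded picture.  The decoding of an access unit always results
      in a decoded picture.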

      coded video sequence: A sequence of access units that consists, in

      decoding order, of an instantaneous decoding refresh (IDR) access

      unit followed by zero or more non-IDR access units including all

      subsequent access units up to but not including any subsequent IDR

      access unit.

      IDR access unit: An access unit in which the primary coded picture

      is an IDR picture.

      IDR picture: A coded picture containing only slices with I or SI

      slice types that causes a "reset" in the decoding process.  After

      the decoding of an IDR picture, all following coded pictures in

      decoding order can be decoded without inter prediction from any

      picture decoded prior to the IDR picture.

      primary coded picture: The coded representation of a picture to be

      used by the decoding process for a bitstream conforming to H.264.

      The primary coded picture contains all macroblocks of the picture.

      redundant coded picture: A coded representation of a picture or a

      part of a picture.  The content of a redundant coded picture shall

      not be used by the decoding process for a bitstream conforming to

      H.264.  The content of a redundant coded picture may be used by

      the decoding process for a bitstream that contains errors or

      losses.

      VCL NAL unit: A collective term used to refer to coded slice and

      coded data partition NAL units.

   In addition, the following definitions apply:

      decoding order number (DON): A field in the payload structure, or

      a derived variable indicating NAL unit decoding order.  Values of

      DON are in the range of 0 to 65535, inclusive.  After reaching the

      maximum value, the value of DON wraps around to 0.

      NAL unit decoding order: A NAL unit order that conforms to the

      constraints on NAL unit order given in section 7.4.1.2 in [1].

      transmission order: The order of packets in ascending RTP sequence

      number order (in modulo arithmetic).  Within an aggregation

      packet, the NAL unit transmission order is the same as the order

      of appearance of NAL units in the packet.

      media aware network element (MANE): A network element, such as a

      middlebox or application layer gateway that is capable of parsing

      certain aspects of the RTP payload headers or the RTP payload and

      reacting to the contents.

         Informative note: The concept of a MANE goes beyond normal

         routers or gateways in that a MANE has to be aware of the

         signaling (e.g., to learn about the payload type mappings of

         the media streams), and in that it has to be trusted when

         working with SRTP.  The advantage of using MANEs is that they

         allow packets to be dropped according to the needs of the media

         coding.  For example, if a MANE has to drop packets due to

         congestion on a certain link, it can identify those packets

         whose dropping has the smallest negative impact on the user

         experience and remove them in order to remove the congestion

         and/or keep the delay low.

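   Abbreviations

      DON:        Decoding Order Number
      DONB:       Decoding Order Number Base
      DOND:       Decoding Order Number Difference
      FEC:        Forward Error Correction
      FU:         Fragmentation Unit
      IDR:        Instantaneous Decoding Refresh
      IEC:        International Electrotechnical Commission
      ISO:        International Organization for Standardization
      ITU-T:      International Telecommunication Union,
                  Telecommunication Standardization Sector
      MANE:       Media Aware Network Element
      MTAP:       Multi-Time Aggregation Packet
      MTAP16:     MTAP with 16-bit timestamp offset
      MTAP24:     MTAP with 24-bit timestamp offset
      NAL:        Network Abstraction Layer
      NALU:       NAL Unit
      SEI:        Supplemental Enhancement Information
      STAP:       Single-Time Aggregation Packet
      STAP-A:     STAP type A
      STAP-B:     STAP type B
      TS:         Timestamp
      VCL:        Video Coding Layer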

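5.  RTP Payload Format

5.1.  RTP Header Usage

   The format of the RTP header is specified in RFC 3550 [4] and
   reprinted in Figure 1 for convenience.  This payload format uses the
   fields of the header in a manner consistent with that specification.

   When one NAL unit is encapsulated per RTP packet, the RECOMMENDED
   RTP payload format is specified in section 5.6.  The RTP payload
   (and the settings for some RTP header bits) for aggregation packets
   and fragmentation units are specified in sections 5.7 and 5.8,
   respectively.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
      |            contributing source (CSRC) identifiers             |
      |                             ....                              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 1.  RTP header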

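   The RTP header information to be set according to this RTP payload
   format is set as follows:

   Marker bit (M): 1 bit
      Set for the very last packet of the access unit indicated by the
      RTP timestamp, in line with the normal use of the M bit in video
      formats, to allow an efficient playout buffer handling.  For
      aggregation packets (STAP and MTAP), the marker bit in the RTP
      header MUST be set to the value that the marker bit of the last
      NAL unit of the aggregation packet would have been if it were
      transported in its own RTP packet.  Decoders MAY use this bit as
      an early indication of the last packet of an access unit, but
      MUST NOT rely on this property.

         Informative note: Only one M bit is associated with an
         aggregation packet carrying multiple NAL units.  Thus, if a
         gateway has re-packetized an aggregation packet into several
         packets, it cannot reliably set the M bit of those packets.

   Payload type (PT): 7 bits
      The assignment of an RTP payload type for this new packet format
      is outside the scope of this document and will not be specified
      here.  The assignment of a payload type has to be performed
      either through the profile used or in a dynamic way.

   Sequence number (SN): 16 bits
      Set and used in accordance with RFC 3550.  For the single NALU
      and non-interleaved packetization mode, the sequence number is
      used to determine decoding order for the NALU.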

   Timestamp: 32 bits

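      The RTP timestamp is set to the sampling timestamp of the
      content.  A 90 kHz clock rate MUST be used.

      If the NAL unit has no timing properties of its own (e.g.,
      parameter set and SEI NAL units), the RTP timestamp is set to the
      RTP timestamp of the primary coded picture of the access unit in
      which the NAL unit is included, according to section 7.4.1.2 of
      [1].

      The setting of the RTP timestamp for MTAPs is defined in section
      5.7.2.

      Receivers SHOULD ignore any picture timing SEI messages included
      in access units that have only one display timestamp.  Instead,
      receivers SHOULD use the RTP timestamp for synchronizing the
      display process.

      RTP senders SHOULD NOT transmit picture timing SEI messages for
      pictures that are not supposed to be displayed as multiple
      fields.

      If one access unit has more than one display timestamp carried in
      a picture timing SEI message, then the information in the SEI
      message SHOULD be treated as relative to the RTP timestamp, with
      the earliest event occurring at the time given by the RTP
      timestamp, and subsequent events later, as given by the
      difference in SEI message picture timing values.  Let tSEI1,
      tSEI2, ..., tSEIn be the display timestamps carried in the SEI
      message of an access unit, where tSEI1 is the earliest of all
      such timestamps.  Let tmadjst() be a function that adjusts the
      SEI message time scale to a 90-kHz time scale.  Let TS be the RTP
      timestamp.  Then, the display time for the event associated with
      tSEI1 is TS.  The display time for the event associated with
      tSEIx, where x is [2..n], is TS + tmadjst (tSEIx - tSEI1).

         Informative note: Displaying coded frames as fields is needed
         commonly in an operation known as 3:2 pulldown, in which film
         content that consists of coded frames is displayed on a
         display using interlaced scanning.  The picture timing SEI
         message enables carriage of multiple timestamps for the same
         coded picture, and therefore the 3:2 pulldown process is
         perfectly controlled.  The picture timing SEI message
         mechanism is necessary because only one timestamp per coded
         frame can be conveyed in the RTP timestamp.

         Informative note: Because H.264 allows the decoding order to
         be different from the display order, values of RTP timestamps
         may not be monotonically non-decreasing as a function of RTP
         sequence numbers.  Furthermore, the value for interarrival
         jitter reported in the RTCP reports may not be a trustworthy
         indication of the network performance, as the calculation
         rules for interarrival jitter (section 6.4.1 of RFC 3550)
         assume that the RTP timestamp of a packet is directly
         proportional to its transmission time.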

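5.2.  Common Structure of the RTP Payload Format

   The payload format defines three different basic payload structures.
   A receiver can identify the payload structure by the first byte of
   the RTP payload, which co-serves as the RTP payload header and, in
   some cases, as the first byte of the payload.  This byte is always
   structured as a NAL unit header.  The NAL unit type field indicates
   which structure is present.  The possible structures are as follows:

   Single NAL Unit Packet: Contains only a single NAL unit in the
   payload.  The NAL header type field will be equal to the original
   NAL unit type; i.e., in the range of 1 to 23, inclusive.  Specified
   in section 5.6.

   Aggregation packet: Packet type used to aggregate multiple NAL units
   into a single RTP payload.  This packet exists in four versions, the
   Single-Time Aggregation Packet type A (STAP-A), the Single-Time
   Aggregation Packet type B (STAP-B), Multi-Time Aggregation Packet
   (MTAP) with 16-bit offset (MTAP16), and Multi-Time Aggregation
   Packet (MTAP) with 24-bit offset (MTAP24).  The NAL unit type
   numbers assigned for STAP-A, STAP-B, MTAP16, and MTAP24 are 24, 25,
   26, and 27, respectively.  Specified in section 5.7.

   Fragmentation unit: Used to fragment a single NAL unit over multiple
   RTP packets.  Exists with two versions, FU-A and FU-B, identified
   with the NAL unit type numbers 28 and 29, respectively.  Specified
   in section 5.8.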

   Table 1.  Summary of NAL unit types and their payload structures

      Type   Packet    Type name                        Section
      ---------------------------------------------------------
      0      undefined                                    -
      1-23   NAL unit  Single NAL unit packet per H.264   5.6
      24     STAP-A    Single-time aggregation packet     5.7.1
      25     STAP-B    Single-time aggregation packet     5.7.1
      26     MTAP16    Multi-time aggregation packet      5.7.2
      27     MTAP24    Multi-time aggregation packet      5.7.2
      28     FU-A      Fragmentation unit                 5.8
      29     FU-B      Fragmentation unit                 5.8
      30-31  undefined                                    -

      Informative note: This specification does not limit the size of
      NAL units encapsulated in single NAL unit packets and
      fragmentation units.  The maximum size of a NAL unit encapsulated
      in any aggregation packet is 65535 bytes.
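   A receiver could apply Table 1 with a lookup on the low five bits of
   the first payload octet.  The Python sketch below is one possible
   way to do this; the function name payload_structure and the returned
   strings are illustrative choices, not defined by this memo.

```python
# Packet names for the type values that this memo assigns (Table 1).
_PACKET_NAMES = {
    24: "STAP-A",
    25: "STAP-B",
    26: "MTAP16",
    27: "MTAP24",
    28: "FU-A",
    29: "FU-B",
}


def payload_structure(first_octet: int) -> str:
    """Classify an RTP payload by the Type field of its first octet."""
    nal_type = first_octet & 0x1F
    if 1 <= nal_type <= 23:
        return "NAL unit"  # single NAL unit packet, section 5.6
    # Types 0, 30 and 31 are undefined in Table 1.
    return _PACKET_NAMES.get(nal_type, "undefined")
```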

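5.3.  NAL Unit Octet Usage

   The structure and semantics of the NAL unit type octet were
   introduced in section 1.3.  For convenience, the format of the NAL
   unit type octet is reprinted below:

      +---------------+
      |0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+
      |F|NRI|  Type   |
      +---------------+

   This section specifies the semantics of F and NRI according to this
   specification.

   F: 1 bit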

      forbidden_zero_bit.  A value of 0 indicates that the NAL unit type

      octet and payload should not contain bit errors or other syntax

      violations.  A value of 1 indicates that the NAL unit type octet

      and payload may contain bit errors or other syntax violations.

      MANEs SHOULD set the F bit to indicate detected bit errors in the

      NAL unit.  The H.264 specification requires that the F bit is

      equal to 0.  When the F bit is set, the decoder is advised that

      bit errors or any other syntax violations may be present in the

      payload or in the NAL unit type octet.  The simplest decoder

      reaction to a NAL unit in which the F bit is equal to 1 is to

      discard such a NAL unit and to conceal the lost data in the

      discarded NAL unit.

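   NRI: 2 bits
      nal_ref_idc.  The semantics of value 00 and a non-zero value
      remain unchanged from the H.264 specification.  In other words, a
      value of 00 indicates that the content of the NAL unit is not
      used to reconstruct reference pictures for inter picture
      prediction.  Such NAL units can be discarded without risking the
      integrity of the reference pictures.  Values greater than 00
      indicate that the decoding of the NAL unit is required to
      maintain the integrity of the reference pictures.

      In addition to the specification above, according to this RTP
      payload specification, values of NRI greater than 00 indicate the
      relative transport priority, as determined by the encoder.  MANEs
      can use this information to protect more important NAL units
      better than they do less important NAL units.  The highest
      transport priority is 11, followed by 10, and then by 01;
      finally, 00 is the lowest.

         Informative note: Any non-zero value of NRI is handled
         identically in H.264 decoders.  Therefore, receivers need not
         manipulate the value of NRI when passing NAL units to the
         decoder.

      An H.264 encoder MUST set the value of NRI according to the H.264
      specification (subclause 7.4.1) when the value of nal_unit_type
      is in the range of 1 to 12, inclusive.  In particular, the H.264
      specification requires that the value of NRI SHALL be equal to 0
      for all NAL units having nal_unit_type equal to 6, 9, 10, 11, or
      12.

      For NAL units having nal_unit_type equal to 7 or 8 (indicating a
      sequence parameter set or a picture parameter set, respectively),
      an H.264 encoder SHOULD set the value of NRI to 11 (in binary
      format).  For coded slice NAL units of a primary coded picture
      having nal_unit_type equal to 5 (indicating a coded slice
      belonging to an IDR picture), an H.264 encoder SHOULD set the
      value of NRI to 11 (in binary format).

      For a mapping of the remaining nal_unit_types to NRI values, the
      following example MAY be used and has been shown to be efficient
      in a certain environment [13].  Other mappings MAY also be
      desirable, depending on the application and the H.264/AVC Annex A
      profile in use.

         Informative note: Data partitioning is not available in
         certain profiles; e.g., in the Main or Baseline profiles.
         Consequently, the NAL unit types 2, 3, and 4 can occur only if
         the video bitstream conforms to a profile in which data
         partitioning is allowed and not in streams that conform to the
         Main or Baseline profiles.

      Table 2.  Example of NRI values for coded slices and coded slice
                data partitions of primary coded reference pictures

      NAL Unit Type     Content of NAL unit              NRI (binary)
      ----------------------------------------------------------------
       1              non-IDR coded slice                         10
       2              Coded slice data partition A                10
       3              Coded slice data partition B                01
       4              Coded slice data partition C                01

         Informative note: As mentioned before, the NRI value of
         non-reference pictures is 00.

      An H.264 encoder SHOULD set the value of NRI for coded slice and
      coded slice data partition NAL units of redundant coded reference
      pictures equal to 01 (in binary format).

      Definitions of the values for NRI for NAL unit types 24 to 29,
      inclusive, are given in sections 5.7 and 5.8 of this memo.

      No recommendation for the value of NRI is given for NAL units
      having nal_unit_type in the range of 13 to 23, inclusive, because
      these values are reserved for ITU-T and ISO/IEC.  No
      recommendation for the value of NRI is given for NAL units having
      nal_unit_type equal to 0 or in the range of 30 to 31, inclusive,
      as the semantics of these values are not specified in this memo.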

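5.4.  Packetization Modes

   This memo specifies three cases of packetization modes:

      o  Single NAL unit mode

      o  Non-interleaved mode

      o  Interleaved mode

   The single NAL unit mode is targeted for conversational systems that
   comply with ITU-T Recommendation H.241 [15] (see section 12.1).  The
   non-interleaved mode is targeted for conversational systems that may
   not comply with ITU-T Recommendation H.241.  In the non-interleaved
   mode, NAL units are transmitted in NAL unit decoding order.  The
   interleaved mode is targeted for systems that do not require very
   low end-to-end latency.  The interleaved mode allows transmission of
   NAL units out of NAL unit decoding order.

   The packetization mode in use MAY be signaled by the value of the
   OPTIONAL packetization-mode MIME parameter or by external means.
   The used packetization mode governs which NAL unit types are allowed
   in RTP payloads.  Table 3 summarizes the allowed NAL unit types for
   each packetization mode.  Some NAL unit type values (indicated as
   undefined in Table 3) are reserved for future extensions.  NAL units
   of those types SHOULD NOT be sent by a sender and MUST be ignored by
   a receiver.  For example, the Types 1-23, with the associated packet
   type "NAL unit", are allowed in "Single NAL Unit Mode" and in
   "Non-Interleaved Mode", but disallowed in "Interleaved Mode".
   Packetization modes are explained in more detail in section 6.

   Table 3.  Summary of allowed NAL unit types for each packetization
             mode (yes = allowed, no = disallowed, ig = ignore)

      Type   Packet    Single NAL    Non-Interleaved    Interleaved
                       Unit Mode           Mode             Mode
      -------------------------------------------------------------
      0      undefined     ig               ig               ig
      1-23   NAL unit     yes              yes               no
      24     STAP-A        no              yes               no
      25     STAP-B        no               no              yes
      26     MTAP16        no               no              yes
      27     MTAP24        no               no              yes
      28     FU-A          no              yes              yes
      29     FU-B          no               no              yes
      30-31  undefined     ig               ig               ig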
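   Table 3 can be encoded as a small lookup keyed by the value of the
   packetization-mode parameter (0 = single NAL unit mode, 1 =
   non-interleaved mode, 2 = interleaved mode, as used in section 6).
   The Python sketch below is illustrative only; the names
   ALLOWED_TYPES and check_type are hypothetical.

```python
# Allowed NAL unit types per packetization-mode value, from Table 3.
ALLOWED_TYPES = {
    0: set(range(1, 24)),             # single NAL unit mode
    1: set(range(1, 24)) | {24, 28},  # non-interleaved: plus STAP-A, FU-A
    2: {25, 26, 27, 28, 29},          # interleaved: STAP-B, MTAPs, FU-A, FU-B
}


def check_type(mode: int, nal_type: int) -> str:
    """Return 'yes', 'no' or 'ig' for a NAL unit type in a given mode."""
    if mode not in ALLOWED_TYPES:
        raise ValueError("unknown packetization mode")
    if nal_type in (0, 30, 31):
        return "ig"  # undefined types: receivers ignore them
    if not 1 <= nal_type <= 29:
        raise ValueError("type is a 5-bit field")
    return "yes" if nal_type in ALLOWED_TYPES[mode] else "no"
```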

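5.5.  Decoding Order Number (DON)

   In the interleaved packetization mode, the transmission order of NAL
   units is allowed to differ from the decoding order of the NAL units.
   Decoding order number (DON) is a field in the payload structure or a
   derived variable that indicates the NAL unit decoding order.
   Rationale and examples of use cases for transmission out of decoding
   order and for the use of DON are given in section 13.

   The coupling of transmission and decoding order is controlled by the
   OPTIONAL sprop-interleaving-depth MIME parameter as follows.  When
   the value of the OPTIONAL sprop-interleaving-depth MIME parameter is
   equal to 0 (explicitly or per default) or transmission of NAL units
   out of their decoding order is disallowed by external means, the
   transmission order of NAL units MUST conform to the NAL unit
   decoding order.  When the value of the OPTIONAL
   sprop-interleaving-depth MIME parameter is greater than 0 or
   transmission of NAL units out of their decoding order is allowed by
   external means,

   o  the order of NAL units in an MTAP16 and an MTAP24 is NOT REQUIRED
      to be the NAL unit decoding order, and

   o  the order of NAL units generated by decapsulating STAP-Bs, MTAPs,
      and FUs in two consecutive packets is NOT REQUIRED to be the NAL
      unit decoding order.

   The RTP payload structures for a single NAL unit packet, an STAP-A,
   and an FU-A do not include DON.  STAP-B and FU-B structures include
   DON, and the structure of MTAPs enables derivation of DON as
   specified in section 5.7.2.

      Informative note: When an FU-A occurs in interleaved mode, it
      always follows an FU-B, which sets its DON.

      Informative note: If a transmitter wants to encapsulate a single
      NAL unit per packet and transmit packets out of their decoding
      order, STAP-B packet type can be used.

   In the single NAL unit packetization mode, the transmission order of
   NAL units, determined by the RTP sequence number, MUST be the same
   as their NAL unit decoding order.  In the non-interleaved
   packetization mode, the transmission order of NAL units in single
   NAL unit packets, STAP-As, and FU-As MUST be the same as their NAL
   unit decoding order.  The NAL units within an STAP MUST appear in
   the NAL unit decoding order.  Thus, the decoding order is first
   provided through the implicit order within an STAP, and second
   provided through the RTP sequence number for the order between
   STAPs, FUs, and single NAL unit packets.

   Signaling of the value of DON for NAL units carried in STAP-B, MTAP,
   and a series of fragmentation units starting with an FU-B is
   specified in sections 5.7.1, 5.7.2, and 5.8, respectively.  The DON
   value of the first NAL unit in transmission order MAY be set to any
   value.  Values of DON are in the range of 0 to 65535, inclusive.
   After reaching the maximum value, the value of DON wraps around to
   0.

   The decoding order of two NAL units contained in any STAP-B, MTAP,
   or a series of fragmentation units starting with an FU-B is
   determined as follows.  Let DON(i) be the decoding order number of
   the NAL unit having index i in the transmission order.  Function
   don_diff(m,n) is specified as follows: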

      If DON(m) == DON(n), don_diff(m,n) = 0

      If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),

      don_diff(m,n) = DON(n) - DON(m)

      If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),

      don_diff(m,n) = 65536 - DON(m) + DON(n)

      If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),

      don_diff(m,n) = - (DON(m) + 65536 - DON(n))

      If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),

      don_diff(m,n) = - (DON(m) - DON(n))

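   A positive value of don_diff(m,n) indicates that the NAL unit having
   transmission order index n follows, in decoding order, the NAL unit
   having transmission order index m.  When don_diff(m,n) is equal to
   0, then the NAL unit decoding order of the two NAL units can be in
   either order.  A negative value of don_diff(m,n) indicates that the
   NAL unit having transmission order index n precedes, in decoding
   order, the NAL unit having transmission order index m.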
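   The five cases of don_diff translate directly into code.  The Python
   sketch below takes the two DON values themselves, DON(m) and DON(n),
   rather than transmission order indices; that calling convention is a
   simplification made for this illustration.

```python
def don_diff(don_m: int, don_n: int) -> int:
    """Signed decoding-order distance from DON(m) to DON(n), with wraparound."""
    for don in (don_m, don_n):
        if not 0 <= don <= 65535:
            raise ValueError("DON is a 16-bit unsigned value")
    if don_m == don_n:
        return 0
    if don_m < don_n and don_n - don_m < 32768:
        return don_n - don_m
    if don_m > don_n and don_m - don_n >= 32768:
        return 65536 - don_m + don_n   # n is ahead of m across the wrap
    if don_m < don_n and don_n - don_m >= 32768:
        return -(don_m + 65536 - don_n)  # n is behind m across the wrap
    return -(don_m - don_n)            # don_m > don_n, distance < 32768
```

   For example, don_diff(65535, 0) is 1: a NAL unit with DON 0 follows,
   in decoding order, one with DON 65535, because DON wraps around.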

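   Values of DON related fields (DON, DONB, and DOND; see section 5.7)
   MUST be such that the decoding order determined by the values of
   DON, as specified above, conforms to the NAL unit decoding order.
   If the order of two NAL units in NAL unit decoding order is switched
   and the new order does not conform to the NAL unit decoding order,
   the NAL units MUST NOT have the same value of DON.  If the order of
   two consecutive NAL units in the NAL unit stream is switched and the
   new order still conforms to the NAL unit decoding order, the NAL
   units MAY have the same value of DON.  For example, when arbitrary
   slice order is allowed by the video coding profile in use, all the
   coded slice NAL units of a coded picture are allowed to have the
   same value of DON.  Consequently, NAL units having the same value of
   DON can be decoded in any order, and two NAL units having a
   different value of DON should be passed to the decoder in the order
   specified above.  When two consecutive NAL units in the NAL unit
   decoding order have a different value of DON, the value of DON for
   the second NAL unit in decoding order SHOULD be the value of DON for
   the first, incremented by one.

   An example of the decapsulation process to recover the NAL unit
   decoding order is given in section 7.

      Informative note: Receivers should not expect that the absolute
      difference of values of DON for two consecutive NAL units in the
      NAL unit decoding order will be equal to one, even in error-free
      transmission.  An increment by one is not required, as at the
      time of associating values of DON to NAL units, it may not be
      known whether all NAL units are delivered to the receiver.  For
      example, a gateway may not forward coded slice NAL units of
      non-reference pictures or SEI NAL units when there is a shortage
      of bit rate in the network to which the packets are forwarded.
      In another example, a live broadcast is interrupted by
      pre-encoded content, such as commercials, from time to time.  The
      first intra picture of a pre-encoded clip is transmitted in
      advance to ensure that it is readily available in the receiver.
      When transmitting the first intra picture, the originator does
      not exactly know how many NAL units will be encoded before the
      first intra picture of the pre-encoded clip follows in decoding
      order.  Thus, the values of DON for the NAL units of the first
      intra picture of the pre-encoded clip have to be estimated when
      they are transmitted, and gaps in values of DON may occur.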

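5.6.  Single NAL Unit Packet

   The single NAL unit packet defined here MUST contain only one NAL
   unit, of the types defined in [1].  This means that neither an
   aggregation packet nor a fragmentation unit can be used within a
   single NAL unit packet.  A NAL unit stream composed by decapsulating
   single NAL unit packets in RTP sequence number order MUST conform to
   the NAL unit decoding order.  The structure of the single NAL unit
   packet is shown in Figure 2.

      Informative note: The first byte of a NAL unit co-serves as the
      RTP payload header.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |F|NRI|  type   |                                               |
      +-+-+-+-+-+-+-+-+                                               |
      |                                                               |
      |               Bytes 2..n of a Single NAL unit                 |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 2.  RTP payload format for single NAL unit packet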

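5.7.  Aggregation Packets

   Aggregation packets are the NAL unit aggregation scheme of this
   payload specification.  The scheme is introduced to reflect the
   dramatically different MTU sizes of two key target networks:
   wireline IP networks (with an MTU size that is often limited by the
   Ethernet MTU size; roughly 1500 bytes), and IP or non-IP (e.g.,
   ITU-T H.324/M) based wireless communication systems with preferred
   transmission unit sizes of 254 bytes or less.  To prevent media
   transcoding between the two worlds, and to avoid undesirable
   packetization overhead, a NAL unit aggregation scheme is introduced.

   Two types of aggregation packets are defined by this specification:

   o  Single-time aggregation packet (STAP): aggregates NAL units with
      identical NALU-time.  Two types of STAPs are defined, one without
      DON (STAP-A) and another including DON (STAP-B).

   o  Multi-time aggregation packet (MTAP): aggregates NAL units with
      potentially differing NALU-time.  Two different MTAPs are
      defined, differing in the length of the NAL unit timestamp
      offset.

   The term NALU-time is defined as the value that the RTP timestamp
   would have if that NAL unit would be transported in its own RTP
   packet.

   Each NAL unit to be carried in an aggregation packet is encapsulated
   in an aggregation unit.  Please see below for the four different
   aggregation units and their characteristics.

   The structure of the RTP payload format for aggregation packets is
   presented in Figure 3.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |F|NRI|  type   |                                               |
      +-+-+-+-+-+-+-+-+                                               |
      |                                                               |
      |             one or more aggregation units                     |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 3.  RTP payload format for aggregation packets

   MTAPs and STAPs share the following packetization rules: The RTP
   timestamp MUST be set to the earliest of the NALU times of all the
   NAL units to be aggregated.  The type field of the NAL unit type
   octet MUST be set to the appropriate value, as indicated in Table 4.
   The F bit MUST be cleared if all F bits of the aggregated NAL units
   are zero; otherwise, it MUST be set.  The value of NRI MUST be the
   maximum of all the NAL units carried in the aggregation packet.

      Table 4.  Type field for STAPs and MTAPs

      Type   Packet    Timestamp offset   DON related fields
                       field length       (DON, DONB, DOND)
                       (in bits)          present
      --------------------------------------------------------
      24     STAP-A       0                 no
      25     STAP-B       0                 yes
      26     MTAP16      16                 yes
      27     MTAP24      24                 yes

   The marker bit in the RTP header is set to the value that the marker
   bit of the last NAL unit of the aggregated packet would have if it
   were transported in its own RTP packet.

   The payload of an aggregation packet consists of one or more
   aggregation units.  See sections 5.7.1 and 5.7.2 for the four
   different types of aggregation units.  An aggregation packet can
   carry as many aggregation units as necessary; however, the total
   amount of data in an aggregation packet obviously MUST fit into an
   IP packet, and the size SHOULD be chosen so that the resulting IP
   packet is smaller than the MTU size.  An aggregation packet MUST NOT
   contain fragmentation units specified in section 5.8.  Aggregation
   packets MUST NOT be nested; i.e., an aggregation packet MUST NOT
   contain another aggregation packet.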

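5.7.1.  Single-Time Aggregation Packet

   Single-time aggregation packet (STAP) SHOULD be used whenever NAL
   units are aggregated that all share the same NALU-time.  The payload
   of an STAP-A does not include DON and consists of at least one
   single-time aggregation unit, as presented in Figure 4.  The payload
   of an STAP-B consists of a 16-bit unsigned decoding order number
   (DON) (in network byte order) followed by at least one single-time
   aggregation unit, as presented in Figure 5.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      :                                               |
      +-+-+-+-+-+-+-+-+                                               |
      |                                                               |
      |                single-time aggregation units                  |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 4.  Payload format for STAP-A

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      :  decoding order number (DON)  |               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
      |                                                               |
      |                single-time aggregation units                  |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 5.  Payload format for STAP-B

   The DON field specifies the value of DON for the first NAL unit in
   an STAP-B in transmission order.  For each successive NAL unit in
   appearance order in an STAP-B, the value of DON is equal to (the
   value of DON of the previous NAL unit in the STAP-B + 1) % 65536, in
   which '%' stands for the modulo operation.

   A single-time aggregation unit consists of 16-bit unsigned size
   information (in network byte order) that indicates the size of the
   following NAL unit in bytes (excluding these two octets, but
   including the NAL unit type octet of the NAL unit), followed by the
   NAL unit itself, including its NAL unit type byte.  A single-time
   aggregation unit is byte aligned within the RTP payload, but it may
   not be aligned on a 32-bit word boundary.  Figure 6 presents the
   structure of the single-time aggregation unit.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      :        NAL unit size          |               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
      |                                                               |
      |                           NAL unit                            |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 6.  Structure for single-time aggregation unit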

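   Figure 7 presents an example of an RTP packet that contains an
   STAP-A.  The STAP contains two single-time aggregation units,
   labeled as 1 and 2 in the figure.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                          RTP Header                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                         NALU 1 Data                           |
      :                                                               :
      +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               | NALU 2 Size                   | NALU 2 HDR    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                         NALU 2 Data                           |
      :                                                               :
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 7.  An example of an RTP packet including an STAP-A that
                 contains two single-time aggregation units

   Figure 8 presents an example of an RTP packet that contains an
   STAP-B.  The STAP contains two single-time aggregation units,
   labeled as 1 and 2 in the figure.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                          RTP Header                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |STAP-B NAL HDR | DON                           | NALU 1 Size   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | NALU 1 Size   | NALU 1 HDR    | NALU 1 Data                   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
      :                                                               :
      +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               | NALU 2 Size                   | NALU 2 HDR    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                       NALU 2 Data                             |
      :                                                               :
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 8.  An example of an RTP packet including an STAP-B that
                 contains two single-time aggregation units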
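   To make the STAP-A layout concrete, the following Python sketch
   builds and parses an STAP-A payload (without the RTP header and
   without RTP padding).  It applies the F and NRI rules of section 5.7
   and the 16-bit size prefix of the single-time aggregation unit.  The
   function names pack_stap_a and unpack_stap_a are hypothetical, and
   the error handling is only one possible choice.

```python
import struct

STAP_A_TYPE = 24


def pack_stap_a(nal_units: list[bytes]) -> bytes:
    """Aggregate NAL units that share one NALU-time into an STAP-A payload."""
    if not nal_units or any(not 1 <= len(n) <= 65535 for n in nal_units):
        raise ValueError("need non-empty NAL units of at most 65535 bytes")
    f = max(n[0] >> 7 for n in nal_units)             # set if any F bit is set
    nri = max((n[0] >> 5) & 0x03 for n in nal_units)  # maximum NRI
    payload = bytearray([(f << 7) | (nri << 5) | STAP_A_TYPE])
    for n in nal_units:
        payload += struct.pack("!H", len(n)) + n      # size, then NAL unit
    return bytes(payload)


def unpack_stap_a(payload: bytes) -> list[bytes]:
    """Return the NAL units of an STAP-A payload in appearance order."""
    if not payload or payload[0] & 0x1F != STAP_A_TYPE:
        raise ValueError("not an STAP-A payload")
    units, pos = [], 1
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        if size == 0 or pos + size > len(payload):
            raise ValueError("bad NAL unit size")
        units.append(payload[pos:pos + size])
        pos += size
    return units
```

   A sender would typically use this to carry a sequence parameter set
   and a picture parameter set in a single packet.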

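5.7.2.  Multi-Time Aggregation Packets (MTAPs)

   The NAL unit payload of MTAPs consists of a 16-bit unsigned decoding
   order number base (DONB) (in network byte order) and one or more
   multi-time aggregation units, as presented in Figure 9.  DONB MUST
   contain the value of DON for the first NAL unit in the NAL unit
   decoding order among the NAL units of the MTAP.

      Informative note: The first NAL unit in the NAL unit decoding
      order is not necessarily the first NAL unit in the order in which
      the NAL units are encapsulated in an MTAP.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                      :  decoding order number base   |               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
      |                                                               |
      |                 multi-time aggregation units                  |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 9.  NAL unit payload format for MTAPs

   Two different multi-time aggregation units are defined in this
   specification.  Both of them consist of 16 bits of unsigned size
   information of the following NAL unit (in network byte order), an
   8-bit unsigned decoding order number difference (DOND), and n bits
   (in network byte order) of timestamp offset (TS offset) for this NAL
   unit, whereby n can be 16 or 24.  The choice between the different
   MTAP types (MTAP16 and MTAP24) is application dependent: the larger
   the timestamp offset is, the higher the flexibility of the MTAP, but
   the overhead is also higher.

   The structure of the multi-time aggregation units for MTAP16 and
   MTAP24 are presented in Figures 10 and 11, respectively.  The
   starting or ending position of an aggregation unit within a packet
   is NOT REQUIRED to be on a 32-bit word boundary.  The DON of the
   following NAL unit is equal to (DONB + DOND) % 65536, in which %
   denotes the modulo operation.  This memo does not specify how the
   NAL units within an MTAP are ordered, but, in most cases, NAL unit
   decoding order SHOULD be used.

   The timestamp offset field MUST be set to a value equal to the value
   of the following formula: If the NALU-time is larger than or equal
   to the RTP timestamp of the packet, then the timestamp offset equals
   (the NALU-time of the NAL unit - the RTP timestamp of the packet).
   If the NALU-time is smaller than the RTP timestamp of the packet,
   then the timestamp offset is equal to the NALU-time + (2^32 - the
   RTP timestamp of the packet).

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      :        NAL unit size          |      DOND     |  TS offset    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  TS offset    |                                               |
      +-+-+-+-+-+-+-+-+              NAL unit                         |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 10.  Multi-time aggregation unit for MTAP16

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      :        NALU unit size         |      DOND     |  TS offset    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |         TS offset             |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
      |                              NAL unit                         |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 11.  Multi-time aggregation unit for MTAP24

   For the "earliest" multi-time aggregation unit in an MTAP the
   timestamp offset MUST be zero.  Hence, the RTP timestamp of the MTAP
   itself is identical to the earliest NALU-time.

      Informative note: The "earliest" multi-time aggregation unit is
      the one that would have the smallest extended RTP timestamp among
      all the aggregation units of an MTAP if the aggregation units
      were encapsulated in single NAL unit packets.  An extended
      timestamp is a timestamp that has more than 32 bits and is
      capable of counting the wraparound of the timestamp field, thus
      enabling one to determine the smallest value if the timestamp
      wraps.  Such an "earliest" aggregation unit may not be the first
      one in the order in which the aggregation units are encapsulated
      in an MTAP.  The "earliest" NAL unit need not be the same as the
      first NAL unit in the NAL unit decoding order either.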
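   The two MTAP derivations above, the timestamp offset and the DON of
   an aggregated NAL unit, are plain modular arithmetic.  The Python
   sketch below writes them out; the function names and the range check
   against the 16-bit or 24-bit field width are additions made for this
   illustration.

```python
def mtap_ts_offset(nalu_time: int, rtp_timestamp: int, offset_bits: int = 16) -> int:
    """Timestamp offset of a NAL unit relative to the MTAP's RTP timestamp."""
    if offset_bits not in (16, 24):
        raise ValueError("MTAP16 and MTAP24 use 16 or 24 offset bits")
    if nalu_time >= rtp_timestamp:
        offset = nalu_time - rtp_timestamp
    else:
        # The 32-bit RTP timestamp wrapped between the packet and this unit.
        offset = nalu_time + (2**32 - rtp_timestamp)
    if offset >= 1 << offset_bits:
        raise ValueError("offset does not fit the TS offset field")
    return offset


def mtap_don(donb: int, dond: int) -> int:
    """DON of an aggregated NAL unit: (DONB + DOND) % 65536."""
    if not 0 <= donb <= 65535 or not 0 <= dond <= 255:
        raise ValueError("DONB is 16 bits, DOND is 8 bits")
    return (donb + dond) % 65536
```

   With the 90 kHz clock, a NAL unit sampled one frame after the
   earliest unit at 30 frames per second has a timestamp offset of
   3000.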

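   Figure 12 presents an example of an RTP packet that contains a
   multi-time aggregation packet of type MTAP16 that contains two
   multi-time aggregation units, labeled as 1 and 2 in the figure.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                          RTP Header                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |MTAP16 NAL HDR |  decoding order number base   | NALU 1 Size   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  NALU 1 Size  |  NALU 1 DOND  |       NALU 1 TS offset        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  NALU 1 HDR   |  NALU 1 DATA                                  |
      +-+-+-+-+-+-+-+-+                                               +
      :                                                               :
      +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               | NALU 2 SIZE                   |  NALU 2 DOND  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |       NALU 2 TS offset        |  NALU 2 HDR   |  NALU 2 DATA  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
      :                                                               :
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 12.  An RTP packet including a multi-time aggregation
                  packet of type MTAP16 and two multi-time aggregation
                  units

   Figure 13 presents an example of an RTP packet that contains a
   multi-time aggregation packet of type MTAP24 that contains two
   multi-time aggregation units, labeled as 1 and 2 in the figure.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                          RTP Header                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |MTAP24 NAL HDR |  decoding order number base   | NALU 1 Size   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  NALU 1 Size  |  NALU 1 DOND  |       NALU 1 TS offs          |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |NALU 1 TS offs |  NALU 1 HDR   |  NALU 1 DATA                  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
      :                                                               :
      +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               | NALU 2 SIZE                   |  NALU 2 DOND  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |       NALU 2 TS offset                        |  NALU 2 HDR   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  NALU 2 DATA                                                  |
      :                                                               :
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 13.  An RTP packet including a multi-time aggregation
                  packet of type MTAP24 and two multi-time aggregation
                  units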

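5.8.  Fragmentation Units (FUs)

   This payload type allows fragmenting a NAL unit into several RTP
   packets.  Doing so on the application layer instead of relying on
   lower layer fragmentation (e.g., by IP) has the following
   advantages:

   o  The payload format is capable of transporting NAL units bigger
      than 64 kbytes over an IPv4 network that may be present in
      pre-recorded video, particularly in High Definition formats
      (there is a limit of the number of slices per picture, which
      results in a limit of NAL units per picture, which may result in
      big NAL units).

   o  The fragmentation mechanism allows fragmenting a single picture
      and applying generic forward error correction as described in
      section 12.5.

   Fragmentation is defined only for a single NAL unit and not for any
   aggregation packets.  A fragment of a NAL unit consists of an
   integer number of consecutive octets of that NAL unit.  Each octet
   of the NAL unit MUST be part of exactly one fragment of that NAL
   unit.  Fragments of the same NAL unit MUST be sent in consecutive
   order with ascending RTP sequence numbers (with no other RTP packets
   being sent between the first and last fragment).  Similarly, a NAL
   unit MUST be reassembled in RTP sequence number order.

   When a NAL unit is fragmented and conveyed within fragmentation
   units (FUs), it is referred to as a fragmented NAL unit.  STAPs and
   MTAPs MUST NOT be fragmented.  FUs MUST NOT be nested; i.e., an FU
   MUST NOT contain another FU.

   The RTP timestamp of an RTP packet carrying an FU is set to the NALU
   time of the fragmented NAL unit.

   Figure 14 presents the RTP payload format for FU-As.  An FU-A
   consists of a fragmentation unit indicator of one octet, a
   fragmentation unit header of one octet, and a fragmentation unit
   payload.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | FU indicator  |   FU header   |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
      |                                                               |
      |                         FU payload                            |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 14.  RTP payload format for FU-A

   Figure 15 presents the RTP payload format for FU-Bs.  An FU-B
   consists of a fragmentation unit indicator of one octet, a
   fragmentation unit header of one octet, a decoding order number
   (DON) (in network byte order), and a fragmentation unit payload.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | FU indicator  |   FU header   |               DON             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-|
      |                                                               |
      |                         FU payload                            |
      |                                                               |
      |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                               :...OPTIONAL RTP padding        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 15.  RTP payload format for FU-B

   NAL unit type FU-B MUST be used in the interleaved packetization
   mode for the first fragmentation unit of a fragmented NAL unit.  NAL
   unit type FU-B MUST NOT be used in any other case.  In other words,
   in the interleaved packetization mode, each NALU that is fragmented
   has an FU-B as the first fragment, followed by one or more FU-A
   fragments.

   The FU indicator octet has the following format:

      +---------------+
      |0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+
      |F|NRI|  Type   |
      +---------------+

   Values equal to 28 and 29 in the Type field of the FU indicator
   octet identify an FU-A and an FU-B, respectively.  The use of the F
   bit is described in section 5.3.  The value of the NRI field MUST be
   set according to the value of the NRI field in the fragmented NAL
   unit.

   The FU header has the following format:

      +---------------+
      |0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+
      |S|E|R|  Type   |
      +---------------+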

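   S: 1 bit
      When set to one, the Start bit indicates the start of a
      fragmented NAL unit.  When the following FU payload is not the
      start of a fragmented NAL unit payload, the Start bit is set to
      zero.

   E: 1 bit
      When set to one, the End bit indicates the end of a fragmented
      NAL unit, i.e., the last byte of the payload is also the last
      byte of the fragmented NAL unit.  When the following FU payload
      is not the last fragment of a fragmented NAL unit, the End bit is
      set to zero.

   R: 1 bit
      The Reserved bit MUST be equal to 0 and MUST be ignored by the
      receiver.

   Type: 5 bits
      The NAL unit payload type as defined in table 7-1 of [1].

   The value of DON in FU-Bs is selected as described in section 5.5.

      Informative note: The DON field in FU-Bs allows gateways to
      fragment NAL units to FU-Bs without organizing the incoming NAL
      units to the NAL unit decoding order.

   A fragmented NAL unit MUST NOT be transmitted in one FU; i.e., the
   Start bit and End bit MUST NOT both be set to one in the same FU
   header.

   The FU payload consists of fragments of the payload of the
   fragmented NAL unit so that if the fragmentation unit payloads of
   consecutive FUs are sequentially concatenated, the payload of the
   fragmented NAL unit can be reconstructed.  The NAL unit type octet
   of the fragmented NAL unit is not included as such in the
   fragmentation unit payload, but rather the information of the NAL
   unit type octet of the fragmented NAL unit is conveyed in the F and
   NRI fields of the FU indicator octet of the fragmentation unit and
   in the type field of the FU header.  An FU payload MAY have any
   number of octets and MAY be empty.

      Informative note: Empty FUs are allowed to reduce the latency of
      a certain class of senders in nearly lossless environments.
      These senders can be characterized in that they packetize NALU
      fragments before the NALU is completely generated and, hence,
      before the NALU size is known.  If zero-length NALU fragments
      were not allowed, the sender would have to generate at least one
      bit of data of the following fragment before the current fragment
      could be sent.  Due to the characteristics of H.264, where
      sometimes several macroblocks occupy zero bits, this is
      undesirable and can add delay.  However, the (potential) use of
      zero-length fragments should be carefully weighed against the
      increased risk of the loss of the NALU because of the additional
      packets employed for its transmission.

   If a fragmentation unit is lost, the receiver SHOULD discard all
   following fragmentation units in transmission order corresponding to
   the same fragmented NAL unit.

   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
   fragments of a NAL unit to an (incomplete) NAL unit, even if
   fragment n of that NAL unit is not received.  In this case, the
   forbidden_zero_bit of the NAL unit MUST be set to one to indicate a
   syntax violation.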
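   The FU-A rules above can be illustrated with a short Python sketch
   that fragments one NAL unit into FU-A payloads and reassembles it.
   It assumes the fragments arrive complete and in RTP sequence number
   order; loss handling and FU-B are left out.  The function names
   fragment_fu_a and reassemble_fu_a and the max_fragment parameter are
   hypothetical.

```python
FU_A_TYPE = 28


def fragment_fu_a(nal_unit: bytes, max_fragment: int) -> list[bytes]:
    """Split one NAL unit into FU-A payloads of at most max_fragment data bytes."""
    if max_fragment < 1 or len(nal_unit) < 2:
        raise ValueError("nothing to fragment")
    header, body = nal_unit[0], nal_unit[1:]
    chunks = [body[i:i + max_fragment] for i in range(0, len(body), max_fragment)]
    if len(chunks) < 2:
        # S and E must not both be set: send a single NAL unit packet instead.
        raise ValueError("NAL unit fits in one packet")
    indicator = (header & 0xE0) | FU_A_TYPE  # keep F and NRI, Type = 28
    nal_type = header & 0x1F                 # original type goes in FU header
    payloads = []
    for i, chunk in enumerate(chunks):
        start = 0x80 if i == 0 else 0
        end = 0x40 if i == len(chunks) - 1 else 0
        payloads.append(bytes([indicator, start | end | nal_type]) + chunk)
    return payloads


def reassemble_fu_a(payloads: list[bytes]) -> bytes:
    """Rebuild the NAL unit from a complete, in-order run of FU-A payloads."""
    if len(payloads) < 2 or any(len(p) < 2 or p[0] & 0x1F != FU_A_TYPE for p in payloads):
        raise ValueError("expected at least two FU-A payloads")
    if not payloads[0][1] & 0x80 or not payloads[-1][1] & 0x40:
        raise ValueError("missing start or end fragment")
    header = (payloads[0][0] & 0xE0) | (payloads[0][1] & 0x1F)
    return bytes([header]) + b"".join(p[2:] for p in payloads)
```

   Note that the original NAL unit type octet is not copied into any
   fragment; the receiver rebuilds it from the FU indicator and the FU
   header.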

6.  打包規則

   打包方式在5.2節介紹.  對于多于一個打包方式的公共打包規則在6.1節指定. 單個NAL單元方式

   的打包規則,非交錯方式,交錯方式的打包規則分别在6.2, 6.3,6.4節指定。

6.1.  公共打包規則

   不管使用那種打包方式,所有發送者必須遵守以下打包規則:

   o  屬于同一編碼圖像(共享相同RTP時戳值)的編碼NAL單元片斷或者編碼資料分區NAL單元片斷可以

      按照定義在[1]中的應用Profile允許的任何順序發送; 但是,對于延遲敏感的系統,他們應該按照

      他們原始編碼順序發送,以減少延遲。注意:編碼順序不必要是掃描順序,而是NAL包對RTP協定

      棧可用的順序。

   o  參數集根據8.4節給定的規則和建議處理。

   o  MANEs 不可以重複任何NAL單元,除了順序或圖像參數集NAL單元,同樣本文或者H.264規範也沒有提供

      手段識别重複的NAL單元。順序和圖像參數集NAL單元可以重複使得他們的糾錯接收更可靠,但是,任何

      這樣的重複不可以影響任何活動順序或圖像參數集的内容。重複應該在應用層進行,不應通過複制RTP

      包進行(相同序号)。                          

   使用非交錯方式和交錯方式的發送者必須遵守以下打包規則:

   o  MANEs可以轉換單個NAL單元包到一個聚合包,轉換一個聚合包到幾個單個NAL單元包,或在RTP轉換器中混合

      兩個概念。RTP轉換器至少應該考慮如下參數:路徑MTU大小, 不平等的保護機制(即,根據RFC 2733通過

      基于包的FEC,特别對于順序和圖像參數集NAL單元以及編碼片斷資料分區NAL單元),系統可以忍受的延遲

      以及接收者緩沖能力。

      注:RTP轉換器要求按照每個RFC3550處理RTCP.

6.2.  單個NAL單元模式

   本方式應用在OPTIONAL打包方式MIME參數值等于0,不包含打包方式,或者沒有外部手段訓示其他的打包方式的時候。

   所有的接收者必須支援本方式。它主要用于低延遲應用(和使用ITU-T H.241建議相容的系統)。(見12.1節). 

   隻有單個NAL單元包可以用在這種方式。STAPs, MTAPs, and FUs 不可以使用。單個NAL單元的傳輸順序必須和NAL

   解碼順序一緻。

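6.3.  Non-Interleaved Mode

   This mode is in use when the value of the OPTIONAL packetization-mode
   MIME parameter is equal to 1 or the mode is turned on by external
   means.  This mode SHOULD be supported.  It is primarily intended for
   low-delay applications.  Only single NAL unit packets, STAP-As, and
   FU-As MAY be used in this mode.  STAP-Bs, MTAPs, and FU-Bs MUST NOT
   be used.  The transmission order of NAL units MUST comply with the
   NAL unit decoding order.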

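6.4.  Interleaved Mode

   This mode is in use when the value of the OPTIONAL packetization-mode
   MIME parameter is equal to 2 or the mode is turned on by external
   means.  Some receivers MAY support this mode.  STAP-Bs, MTAPs,
   FU-As, and FU-Bs MAY be used.  STAP-As and single NAL unit packets
   MUST NOT be used.  The transmission order of packets and NAL units
   is constrained as specified in section 5.5.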
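   The packet types permitted in each of the three modes (sections 6.2
   to 6.4) can be summarized as a lookup table.  The sketch below is
   illustrative only; the names ALLOWED and packet_allowed are
   hypothetical, and the numeric packet type values (1-23 single NAL
   unit, 24 STAP-A, 25 STAP-B, 26 MTAP16, 27 MTAP24, 28 FU-A, 29 FU-B)
   are assumed from the type table in section 5.2.

```python
# Packet type values, assumed from the table in section 5.2.
STAP_A, STAP_B, MTAP16, MTAP24, FU_A, FU_B = 24, 25, 26, 27, 28, 29
SINGLE_NAL = frozenset(range(1, 24))

# packetization-mode value -> packet types that may appear in that mode
ALLOWED = {
    0: SINGLE_NAL,                                       # single NAL unit mode
    1: SINGLE_NAL | {STAP_A, FU_A},                      # non-interleaved mode
    2: frozenset({STAP_B, MTAP16, MTAP24, FU_A, FU_B}),  # interleaved mode
}


def packet_allowed(mode, payload):
    """Return True if the RTP payload's packet type may be used in `mode`."""
    if mode not in ALLOWED or not payload:
        return False
    packet_type = payload[0] & 0x1F  # low five bits of the first octet
    return packet_type in ALLOWED[mode]
```

   A receiver could use such a check to reject packets that violate the
   negotiated mode before handing them to the de-packetizer.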

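7.  De-Packetization Process (Informative)

   The de-packetization process is implementation dependent.
   Therefore, the following description should be seen as an example of
   a suitable implementation.  Other schemes may be used as well.
   Optimizations relative to the described algorithms are likely
   possible.  Section 7.1 presents the de-packetization process for the
   single NAL unit and non-interleaved packetization modes, whereas
   section 7.2 describes the process for the interleaved mode.  Section
   7.3 includes additional decapsulation guidelines for intelligent
   receivers.

   All normal RTP mechanisms related to buffer management apply.  In
   particular, duplicated or outdated RTP packets (as indicated by the
   RTP sequence number and the RTP timestamp) are removed.  To
   determine the exact time for decoding, factors such as a possible
   intentional delay to allow for proper inter-stream synchronization
   must be taken into account.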

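7.1.  Single NAL Unit and Non-Interleaved Mode

   The receiver includes a receiver buffer to compensate for
   transmission delay and jitter.  The receiver stores incoming packets
   in reception order in the receiver buffer.  Packets are decapsulated
   in RTP sequence number order.  If a decapsulated packet is a single
   NAL unit packet, the NAL unit contained in the packet is passed
   directly to the decoder.  If a decapsulated packet is an STAP-A, the
   NAL units contained in the packet are passed to the decoder in the
   order in which they are encapsulated in the packet.  If a
   decapsulated packet is an FU-A, all the fragments of the fragmented
   NAL unit are concatenated and passed to the decoder.

      Informative note: If the decoder supports Arbitrary Slice Order,
      coded slices of a picture can be passed to the decoder in any
      order regardless of their reception and transmission order.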
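   The STAP-A step of this process can be sketched as below.  The
   sketch assumes the STAP-A layout of section 5.7 (one STAP-A header
   octet followed by repeated pairs of a 16-bit NAL unit size in
   network byte order and the NAL unit itself); the function name
   split_stap_a is hypothetical.

```python
import struct


def split_stap_a(payload):
    """Return the NAL units of an STAP-A payload in encapsulation order."""
    if not payload or payload[0] & 0x1F != 24:
        raise ValueError("not an STAP-A payload")
    nal_units = []
    pos = 1  # skip the STAP-A header octet
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated NAL unit size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        if pos + size > len(payload):
            raise ValueError("NAL unit extends past the end of the payload")
        nal_units.append(bytes(payload[pos:pos + size]))
        pos += size
    return nal_units
```

   The returned list preserves the order inside the packet, which is
   the order in which the NAL units are passed to the decoder.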

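7.2.  Interleaved Mode

   The general concept behind these de-packetization rules is to
   reorder NAL units from transmission order to the NAL unit decoding
   order.

   The receiver includes a receiver buffer, which is used to compensate
   for transmission delay jitter and to reorder packets from
   transmission order to the NAL unit decoding order.  In this section,
   the receiver operation is described under the assumption that there
   is no transmission delay jitter.  To distinguish it from a practical
   receiver buffer that is also used to compensate for transmission
   delay jitter, the receiver buffer is hereafter called the
   deinterleaving buffer in this section.  Receivers SHOULD also
   prepare for transmission delay jitter; i.e., either reserve separate
   buffers for transmission delay jitter buffering and deinterleaving
   buffering or use one receiver buffer for both.  Moreover, receivers
   SHOULD take transmission delay jitter into account in the buffering
   operation; e.g., by additional initial buffering before starting
   decoding and playback.

   This section is organized as follows: subsection 7.2.1 presents how
   to calculate the size of the deinterleaving buffer.  Subsection
   7.2.2 specifies the receiver process for organizing received NAL
   units into the NAL unit decoding order.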

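7.2.1.  Size of the Deinterleaving Buffer

   When the SDP Offer/Answer model or any other capability exchange
   procedure is used, the properties of the received stream SHOULD be
   such that the receiver capabilities are not exceeded.  In the SDP
   Offer/Answer model, the receiver can indicate its capability to
   allocate a deinterleaving buffer with the deint-buf-cap MIME
   parameter.  The sender indicates the requirement for the
   deinterleaving buffer size with the sprop-deint-buf-req MIME
   parameter.  It is therefore RECOMMENDED to set the deinterleaving
   buffer size, in terms of number of bytes, equal to or greater than
   the value of the sprop-deint-buf-req MIME parameter.  See section
   8.1 for further information on the deint-buf-cap and
   sprop-deint-buf-req MIME parameters and section 8.2.2 for further
   information on their use in the SDP Offer/Answer model.

   When a declarative session description is used in session setup, the
   sprop-deint-buf-req MIME parameter signals the requirement for the
   deinterleaving buffer size.  It is therefore RECOMMENDED to set the
   deinterleaving buffer size, in terms of number of bytes, equal to or
   greater than the value of the sprop-deint-buf-req MIME parameter.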

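7.2.2.  Deinterleaving Process

   There are two buffering states in the receiver: initial buffering
   and buffering while playing.  Initial buffering occurs when the RTP
   session is initialized.  After initial buffering, decoding and
   playback are started, and the buffering-while-playing mode is used.

   Regardless of the buffering state, the receiver stores incoming NAL
   units, in reception order, in the deinterleaving buffer.  NAL units
   of aggregation packets are stored in the deinterleaving buffer
   individually.  The value of DON is calculated and stored for all NAL
   units.

   The receiver operation is described below with the help of the
   following functions and constants:

   o  Function AbsDON is specified in section 8.1.

   o  Function don_diff is specified in section 5.5.

   o  Constant N is the value of the OPTIONAL sprop-interleaving-depth
      MIME type parameter (see section 8.1) incremented by 1.

   Initial buffering lasts until one of the following conditions is
   fulfilled:

   o  There are N VCL NAL units in the deinterleaving buffer.

   o  If sprop-max-don-diff is present, don_diff(m,n) is greater than
      the value of sprop-max-don-diff, in which n corresponds to the
      NAL unit having the greatest value of AbsDON among the received
      NAL units and m corresponds to the NAL unit having the smallest
      value of AbsDON among the received NAL units.

   o  Initial buffering has lasted for a duration equal to or greater
      than the value of the OPTIONAL sprop-init-buf-time MIME
      parameter.

   The NAL units to be removed from the deinterleaving buffer are
   determined as follows:

   o  If the deinterleaving buffer contains at least N VCL NAL units,
      NAL units are removed from the deinterleaving buffer and passed
      to the decoder in the order specified below until the buffer
      contains N-1 VCL NAL units.

   o  If sprop-max-don-diff is present, all NAL units m for which
      don_diff(m,n) is greater than sprop-max-don-diff are removed from
      the deinterleaving buffer and passed to the decoder in the order
      specified below.  Herein, n corresponds to the NAL unit having
      the greatest value of AbsDON among the received NAL units.

   The order in which NAL units are passed to the decoder is specified
   as follows:

   o  Let PDON be a variable that is initialized to 0 at the beginning
      of the RTP session.

   o  For each NAL unit associated with a value of DON, a DON distance
      is calculated as follows.  If the value of DON of the NAL unit is
      larger than the value of PDON, the DON distance is equal to
      DON - PDON.  Otherwise, the DON distance is equal to
      65535 - PDON + DON + 1.

   o  NAL units are delivered to the decoder in ascending order of DON
      distance.  If several NAL units share the same value of DON
      distance, they can be passed to the decoder in any order.

   o  When a desired number of NAL units have been passed to the
      decoder, the value of PDON is set to the value of DON for the
      last NAL unit passed to the decoder.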
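   The DON distance rule and the resulting delivery order can be
   illustrated with a short sketch.  The function names don_distance
   and delivery_order are hypothetical, and NAL units are represented
   only by their 16-bit DON values; a real receiver would carry the NAL
   unit data alongside.

```python
def don_distance(don, pdon):
    """DON distance of a NAL unit relative to PDON (both 0..65535)."""
    if don > pdon:
        return don - pdon
    # DON wrapped around past 65535 (or equals PDON)
    return 65535 - pdon + don + 1


def delivery_order(dons, pdon=0):
    """Order DON values by ascending DON distance from PDON.

    Ties may be delivered in any order; Python's stable sort keeps
    them in reception order here.
    """
    return sorted(dons, key=lambda don: don_distance(don, pdon))
```

   For example, with PDON equal to 65533, buffered DON values 1, 65535,
   and 0 have distances 4, 2, and 3, so delivery_order([1, 65535, 0],
   pdon=65533) yields [65535, 0, 1].  After delivery, the caller would
   set PDON to the DON of the last NAL unit passed to the decoder.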

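7.3.  Additional De-Packetization Guidelines

   The following additional de-packetization rules may be used to
   implement an operational H.264 de-packetizer:

   o  Intelligent RTP receivers (e.g., in gateways) may identify lost
      coded slice data partitions A (DPAs).  If a lost DPA is found, a
      gateway may decide not to send the corresponding coded slice data
      partitions B and C, as their information is meaningless for H.264
      decoders.  In this way, a MANE can reduce network load by
      discarding useless packets without parsing a complex bitstream.

   o  Intelligent RTP receivers (e.g., in gateways) may identify lost
      FUs.  If a lost FU is found, a gateway may decide not to send the
      following FUs of the same fragmented NAL unit, as their
      information is meaningless for H.264 decoders.  In this way, a
      MANE can reduce network load by discarding useless packets
      without parsing a complex bitstream.

   o  Intelligent receivers having to discard packets or NALUs should
      first discard all packets/NALUs in which the value of the NRI
      field of the NAL unit type octet is equal to 0.  This minimizes
      the impact on user experience and keeps the reference pictures
      intact.  If more packets have to be discarded, then packets with
      a numerically lower NRI value should be discarded before packets
      with a numerically higher NRI value.  However, discarding any
      packet with an NRI greater than 0 is likely to lead to decoder
      drift and SHOULD be avoided.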
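   The NRI-based discard preference in the last guideline can be
   sketched as follows.  This is an illustration under simplifying
   assumptions: each item is a complete NAL unit whose first octet is
   the NAL unit type octet, and the function names nri and
   discard_candidates are hypothetical.

```python
def nri(nal_unit):
    """Return the 2-bit NRI value from the NAL unit type octet."""
    return (nal_unit[0] >> 5) & 0x03


def discard_candidates(nal_units, count):
    """Pick `count` NAL units to drop, lowest NRI first.

    NAL units with NRI equal to 0 are chosen before any others; among
    the rest, a numerically lower NRI is chosen before a higher one.
    """
    ranked = sorted(nal_units, key=nri)  # stable: keeps arrival order per NRI
    return ranked[:count]
```

   Because dropping anything with NRI greater than 0 risks decoder
   drift, a cautious caller would cap count at the number of NAL units
   with NRI equal to 0 whenever that is enough to relieve the overload.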

8.  Payload Format Parameters

   This section specifies the parameters that MAY be used to select

   optional features of the payload format and certain features of the

   bitstream.  The parameters are specified here as part of the MIME

   subtype registration for the ITU-T H.264 | ISO/IEC 14496-10 codec.  A

   mapping of the parameters into the Session Description Protocol (SDP)

   [5] is also provided for applications that use SDP.  Equivalent

   parameters could be defined elsewhere for use with control protocols

   that do not use MIME or SDP.

   Some parameters provide a receiver with the properties of the stream

   that will be sent.  The name of all these parameters starts with

   "sprop" for stream properties.  Some of these "sprop" parameters are

   limited by other payload or codec configuration parameters.  For

   example, the sprop-parameter-sets parameter is constrained by the

   profile-level-id parameter.  The media sender selects all "sprop"

   parameters rather than the receiver.  This uncommon characteristic of

   the "sprop" parameters may not be compatible with some signaling

   protocol concepts, in which case the use of these parameters SHOULD

   be avoided.

8.1.  MIME Registration

   The MIME subtype for the ITU-T H.264 | ISO/IEC 14496-10 codec is

   allocated from the IETF tree.

   The receiver MUST ignore any unspecified parameter.

   Media Type name:     video

   Media subtype name:  H264

   Required parameters: none

Wenger, et al.              Standards Track                    [Page 37]

RFC 3984           RTP Payload Format for H.264 Video      February 2005

   OPTIONAL parameters:

       profile-level-id:

                        A base16 [6] (hexadecimal) representation of

                        the following three bytes in the sequence

                        parameter set NAL unit specified in [1]: 1)

                        profile_idc, 2) a byte herein referred to as

                        profile-iop, composed of the values of

                        constraint_set0_flag, constraint_set1_flag,

                        constraint_set2_flag, and reserved_zero_5bits

                        in bit-significance order, starting from the

                        most significant bit, and 3) level_idc.  Note

                        that reserved_zero_5bits is required to be

                        equal to 0 in [1], but other values for it may

                        be specified in the future by ITU-T or ISO/IEC.

                        If the profile-level-id parameter is used to

                        indicate properties of a NAL unit stream, it

                        indicates the profile and level that a decoder

                        has to support in order to comply with [1] when

                        it decodes the stream.  The profile-iop byte

                        indicates whether the NAL unit stream also

                        obeys all constraints of the indicated profiles

                        as follows.  If bit 7 (the most significant

                        bit), bit 6, or bit 5 of profile-iop is equal

                        to 1, all constraints of the Baseline profile,

                        the Main profile, or the Extended profile,

                        respectively, are obeyed in the NAL unit

                        stream.

                        If the profile-level-id parameter is used for

                        capability exchange or session setup procedure,

                        it indicates the profile that the codec

                        supports and the highest level

                        supported for the signaled profile.  The

                        profile-iop byte indicates whether the codec

                        has additional limitations whereby only the

                        common subset of the algorithmic features and

                        limitations of the profiles signaled with the

                        profile-iop byte and of the profile indicated

                        by profile_idc is supported by the codec.  For

                        example, if a codec supports only the common

                        subset of the coding tools of the Baseline

                        profile and the Main profile at level 2.1 and

                        below, the profile-level-id becomes 42E015, in

                        which 42 stands for the Baseline profile, E0

                        indicates that only the common subset for all

                        profiles is supported, and 15 indicates level

                        2.1.
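   The three-byte structure of profile-level-id can be unpacked as in
   the sketch below.  The function name parse_profile_level_id and the
   dictionary keys are illustrative choices; only the byte layout
   described above is taken from the text.

```python
def parse_profile_level_id(value):
    """Split a profile-level-id hex string into its three byte fields."""
    raw = bytes.fromhex(value)  # raises ValueError on non-hex input
    if len(raw) != 3:
        raise ValueError("profile-level-id must encode exactly three bytes")
    profile_idc, profile_iop, level_idc = raw
    return {
        "profile_idc": profile_idc,
        # profile-iop bits, most significant bit first
        "constraint_set0_flag": bool(profile_iop & 0x80),
        "constraint_set1_flag": bool(profile_iop & 0x40),
        "constraint_set2_flag": bool(profile_iop & 0x20),
        "reserved_zero_5bits": profile_iop & 0x1F,
        "level_idc": level_idc,
    }
```

   For the example value 42E015, this gives profile_idc 66 (hex 42),
   all three constraint flags set, reserved_zero_5bits equal to 0, and
   level_idc 21 (hex 15), which corresponds to level 2.1.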

                            Informative note: Capability exchange and

                            session setup procedures should provide

                            means to list the capabilities for each

                            supported codec profile separately.  For

                            example, the one-of-N codec selection

                            procedure of the SDP Offer/Answer model can

                            be used (section 10.2 of [7]).

                        If no profile-level-id is present, the Baseline

                        Profile without additional constraints at Level

                        1 MUST be implied.

       max-mbps, max-fs, max-cpb, max-dpb, and max-br:

                        These parameters MAY be used to signal the

                        capabilities of a receiver implementation.

                        These parameters MUST NOT be used for any other

                        purpose.  The profile-level-id parameter MUST

                        be present in the same receiver capability

                        description that contains any of these

                        parameters.  The level conveyed in the value of

                        the profile-level-id parameter MUST be such

                        that the receiver is fully capable of

                        supporting.  max-mbps, max-fs, max-cpb, max-

                        dpb, and max-br MAY be used to indicate

                        capabilities of the receiver that extend the

                        required capabilities of the signaled level, as

                        specified below.

                        When more than one parameter from the set (max-

                        mbps, max-fs, max-cpb, max-dpb, max-br) is

                        present, the receiver MUST support all signaled

                        capabilities simultaneously.  For example, if

                        both max-mbps and max-br are present, the

                        signaled level with the extension of both the

                        frame rate and bit rate is supported.  That is,

                        the receiver is able to decode NAL unit

                        streams in which the macroblock processing rate

                        is up to max-mbps (inclusive), the bit rate is

                        up to max-br (inclusive), the coded picture

                        buffer size is derived as specified in the

                        semantics of the max-br parameter below, and

                        other properties comply with the level

                        specified in the value of the profile-level-id

                        parameter.

                        A receiver MUST NOT signal values of max-

                        mbps, max-fs, max-cpb, max-dpb, and max-br that

                        meet the requirements of a higher level,

                        referred to as level A herein, compared to the

                        level specified in the value of the profile-

                        level-id parameter, if the receiver can support

                        all the properties of level A.

                            Informative note: When the OPTIONAL MIME

                            type parameters are used to signal the

                            properties of a NAL unit stream, max-mbps,

                            max-fs, max-cpb, max-dpb, and max-br are

                            not present, and the value of profile-

                            level-id must always be such that the NAL

                            unit stream complies fully with the

                            specified profile and level.

       max-mbps:        The value of max-mbps is an integer indicating

                        the maximum macroblock processing rate in units

                        of macroblocks per second.  The max-mbps

                        parameter signals that the receiver is capable

                        of decoding video at a higher rate than is

                        required by the signaled level conveyed in the

                        value of the profile-level-id parameter.  When

                        max-mbps is signaled, the receiver MUST be able

                        to decode NAL unit streams that conform to the

                        signaled level, with the exception that the

                        MaxMBPS value in Table A-1 of [1] for the

                        signaled level is replaced with the value of

                        max-mbps.  The value of max-mbps MUST be

                        greater than or equal to the value of MaxMBPS

                        for the level given in Table A-1 of [1].

                        Senders MAY use this knowledge to send pictures

                        of a given size at a higher picture rate than

                        is indicated in the signaled level.

       max-fs:          The value of max-fs is an integer indicating

                        the maximum frame size in units of macroblocks.

                        The max-fs parameter signals that the receiver

                        is capable of decoding larger picture sizes

                        than are required by the signaled level conveyed

                        in the value of the profile-level-id parameter.

                        When max-fs is signaled, the receiver MUST be

                        able to decode NAL unit streams that conform to

                        the signaled level, with the exception that the

                        MaxFS value in Table A-1 of [1] for the

                        signaled level is replaced with the value of

                        max-fs.  The value of max-fs MUST be greater

                        than or equal to the value of MaxFS for the

                        level given in Table A-1 of [1].  Senders MAY

                        use this knowledge to send larger pictures at a

                        proportionally lower frame rate than is

                        indicated in the signaled level.

       max-cpb:         The value of max-cpb is an integer indicating

                        the maximum coded picture buffer size in units

                        of 1000 bits for the VCL HRD parameters (see

                        A.3.1 item i of [1]) and in units of 1200 bits

                        for the NAL HRD parameters (see A.3.1 item j of

                        [1]).  The max-cpb parameter signals that the

                        receiver has more memory than the minimum

                        amount of coded picture buffer memory required

                        by the signaled level conveyed in the value of

                        the profile-level-id parameter.  When max-cpb

                        is signaled, the receiver MUST be able to

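                        decode NAL unit streams that conform to the

                        signaled level, with the exception that the

                        MaxCPB value in Table A-1 of [1] for the

                        signaled level is replaced with the value of

                        max-cpb.  The value of max-cpb MUST be greater

                        than or equal to the value of MaxCPB for the

                        level given in Table A-1 of [1].  Senders MAY

                        use this knowledge to construct coded video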

                        streams with greater variation of bit rate

                        than can be achieved with the

                        MaxCPB value in Table A-1 of [1].

                            Informative note: The coded picture buffer

                            is used in the hypothetical reference

                            decoder (Annex C) of H.264.  The use of the

                            hypothetical reference decoder is

                            recommended in H.264 encoders to verify

                            that the produced bitstream conforms to the

                            standard and to control the output bitrate.

                            Thus, the coded picture buffer is

                            conceptually independent of any other

                            potential buffers in the receiver,

                            including de-interleaving and de-jitter

                            buffers.  The coded picture buffer need not

                            be implemented in decoders as specified in

                            Annex C of H.264, but rather standard-

                            compliant decoders can have any buffering

                            arrangements provided that they can decode

                            standard-compliant bitstreams.  Thus, in

                            practice, the input buffer for video

                            decoder can be integrated with de-

                            interleaving and de-jitter buffers of the

                            receiver.

       max-dpb:         The value of max-dpb is an integer indicating

                        the maximum decoded picture buffer size in

                        units of 1024 bytes.  The max-dpb parameter

                        signals that the receiver has more memory than

                        the minimum amount of decoded picture buffer

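                        memory required by the signaled level conveyed

                        in the value of the profile-level-id parameter.

                        When max-dpb is signaled, the receiver MUST be

                        able to decode NAL unit streams that conform to

                        the signaled level, with the exception that the

                        MaxDPB value in Table A-1 of [1] for the

                        signaled level is replaced with the value of

                        max-dpb.  Consequently, a receiver that signals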

                        max-dpb MUST be capable of storing the

                        following number of decoded frames,

                        complementary field pairs, and non-paired

                        fields in its decoded picture buffer:

                        Min(1024 * max-dpb / ( PicWidthInMbs *

                        FrameHeightInMbs * 256 * ChromaFormatFactor ),

                        16)

                        PicWidthInMbs, FrameHeightInMbs, and

                        ChromaFormatFactor are defined in [1].
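                        The storage formula above can be evaluated as
                        in the following sketch.  The function name
                        dpb_capacity is hypothetical, the default
                        ChromaFormatFactor of 3/2 assumes 4:2:0 video,
                        and rounding down to a whole number of frames
                        is an assumption of this sketch.

```python
from fractions import Fraction


def dpb_capacity(max_dpb, pic_width_in_mbs, frame_height_in_mbs,
                 chroma_format_factor=Fraction(3, 2)):
    """Frames a receiver signaling max-dpb must be able to store.

    max_dpb is in units of 1024 bytes. Exact rational arithmetic avoids
    floating-point rounding before the result is truncated.
    """
    frame_bytes = (pic_width_in_mbs * frame_height_in_mbs * 256
                   * Fraction(chroma_format_factor))
    frames = Fraction(1024 * max_dpb) / frame_bytes
    return min(int(frames), 16)  # int() truncates toward zero
```

                        For instance, max-dpb equal to 891 with a
                        picture of 22 x 18 macroblocks gives 6 frames,
                        while an 11 x 9 macroblock picture hits the
                        cap of 16.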

                        The value of max-dpb MUST be greater than or

                        equal to the value of MaxDPB for the level

                        given in Table A-1 of [1].  Senders MAY use

                        this knowledge to construct coded video streams

                        with improved compression.

                            Informative note: This parameter was added

                            primarily to complement a similar codepoint

                            in the ITU-T Recommendation H.245, so as to

                            facilitate signaling gateway designs.  The

                            decoded picture buffer stores reconstructed

                            samples and is a property of the video

                            decoder only.  There is no relationship

                            between the size of the decoded picture

                            buffer and the buffers used in RTP,

                            especially de-interleaving and de-jitter

                            buffers.

       max-br:          The value of max-br is an integer indicating

                        the maximum video bit rate in units of 1000

                        bits per second for the VCL HRD parameters (see

                        A.3.1 item i of [1]) and in units of 1200 bits

                        per second for the NAL HRD parameters (see

                        A.3.1 item j of [1]).

                        The max-br parameter signals that the video

                        decoder of the receiver is capable of decoding

                        video at a higher bit rate than is required by

                        the signaled level conveyed in the value of the

                        profile-level-id parameter.  The value of max-

                        br MUST be greater than or equal to the value

                        of MaxBR for the level given in Table A-1 of

                        [1].

                        When max-br is signaled, the video codec of the

                        receiver MUST be able to decode NAL unit

                        streams that conform to the signaled level,

                        conveyed in the profile-level-id parameter,

                        with the following exceptions in the limits

                        specified by the level:

                        o The value of max-br replaces the MaxBR value

                          of the signaled level (in Table A-1 of [1]).

                        o When the max-cpb parameter is not present,

                          the result of the following formula replaces

                          the value of MaxCPB in Table A-1 of [1]:

                          (MaxCPB of the signaled level) * max-br /

                          (MaxBR of the signaled level).

                        For example, if a receiver signals capability

                        for Level 1.2 with max-br equal to 1550, this

                        indicates a maximum video bitrate of 1550

                        kbits/sec for VCL HRD parameters, a maximum

                        video bitrate of 1860 kbits/sec for NAL HRD

                        parameters, and a CPB size of 4036458 bits

                        (1550000 / 384000 * 1000 * 1000).
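                        The arithmetic of this example can be
                        reproduced with the sketch below.  The function
                        name max_br_limits is hypothetical.  The Level
                        1.2 inputs (MaxBR 384 and MaxCPB 1000, both in
                        the units of Table A-1) are read off the
                        example's own calculation rather than from the
                        table itself, and truncation to whole bits is
                        an assumption.

```python
def max_br_limits(max_br, level_max_br, level_max_cpb):
    """Derive bit rate and CPB limits implied by a signaled max-br.

    Applies when max-cpb is absent: the level's MaxCPB is scaled by
    max-br / MaxBR. All inputs use the units of Table A-1.
    """
    return {
        "vcl_bitrate_bps": max_br * 1000,  # units of 1000 bits/s
        "nal_bitrate_bps": max_br * 1200,  # units of 1200 bits/s
        # MaxCPB is in units of 1000 bits; result truncated to whole bits
        "cpb_size_bits": level_max_cpb * 1000 * max_br // level_max_br,
    }
```

                        Calling max_br_limits(1550, 384, 1000) gives
                        1550000 bits/s for VCL, 1860000 bits/s for NAL,
                        and a CPB size of 4036458 bits, matching the
                        figures in the example.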

                        The value of max-br MUST be greater than or

                        equal to the value MaxBR for the signaled level

                        given in Table A-1 of [1].

                        Senders MAY use this knowledge to send higher

                        bitrate video as allowed in the level

                        definition of Annex A of H.264, to achieve

                        improved video quality.

                            Informative note: This parameter was added

                            primarily to complement a similar codepoint

                            in the ITU-T Recommendation H.245, so as to

                            facilitate signaling gateway designs.  No

                            assumption can be made from the value of

                            this parameter that the network is capable

                            of handling such bit rates at any given

                            time.  In particular, no conclusion can be

                            drawn that the signaled bit rate is

                            possible under congestion control

                            constraints.

       redundant-pic-cap:

                        This parameter signals the capabilities of a

                        receiver implementation.  When equal to 0, the

                        parameter indicates that the receiver makes no

                        attempt to use redundant coded pictures to

                        correct incorrectly decoded primary coded

                        pictures.  When equal to 0, the receiver is not

                        capable of using redundant slices; therefore, a

                        sender SHOULD avoid sending redundant slices to

                        save bandwidth.  When equal to 1, the receiver

                        is capable of decoding any such redundant slice

                        that covers a corrupted area in a primary

                        decoded picture (at least partly), and therefore

                        a sender MAY send redundant slices.  When the

                        parameter is not present, then a value of 0

                        MUST be used for redundant-pic-cap.  When

                        present, the value of redundant-pic-cap MUST be

                        either 0 or 1.

                        When the profile-level-id parameter is present

                        in the same capability signaling as the

                        redundant-pic-cap parameter, and the profile

                        indicated in profile-level-id is such that it

                        disallows the use of redundant coded pictures

                        (e.g., Main Profile), the value of redundant-

                        pic-cap MUST be equal to 0.  When a receiver

                        indicates redundant-pic-cap equal to 0, the

                        received stream SHOULD NOT contain redundant

                        coded pictures.

                            Informative note: Even if redundant-pic-cap

                            is equal to 0, the decoder is able to

                            ignore redundant coded pictures provided

                            that the decoder supports such a profile

                            (Baseline, Extended) in which redundant

                            coded pictures are allowed.

                            Informative note: Even if redundant-pic-cap

                            is equal to 1, the receiver may also choose

                            other error concealment strategies to

                            replace or complement decoding of redundant

                            slices.

       sprop-parameter-sets:

                        This parameter MAY be used to convey

                        any sequence and picture parameter set NAL

                        units (herein referred to as the initial

                        parameter set NAL units) that MUST precede any

                        other NAL units in decoding order.  The

                        parameter MUST NOT be used to indicate codec

                        capability in any capability exchange

                        procedure.  The value of the parameter is the

                        base64 [6] representation of the initial

                        parameter set NAL units as specified in

                        sections 7.3.2.1 and 7.3.2.2 of [1].  The

                        parameter sets are conveyed in decoding order,

                        and no framing of the parameter set NAL units

                        takes place.  A comma is used to separate any

                        pair of parameter sets in the list.  Note that

                        the number of bytes in a parameter set NAL unit

                        is typically less than 10, but a picture

                        parameter set NAL unit can contain several

                        hundreds of bytes.

                           Informative note: When several payload

                           types are offered in the SDP Offer/Answer

                           model, each with its own sprop-parameter-

                           sets parameter, then the receiver cannot

                           assume that those parameter sets do not use

                           conflicting storage locations (i.e.,

                           identical values of parameter set

                           identifiers).  Therefore, a receiver should

                           double-buffer all sprop-parameter-sets and

                           make them available to the decoder instance

                           that decodes a certain payload type.
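                        Decoding the comma-separated base64 list can be
                        sketched as follows.  The function name
                        decode_sprop_parameter_sets is hypothetical,
                        and the sketch performs no check that the
                        decoded octets are valid parameter sets.

```python
import base64


def decode_sprop_parameter_sets(value):
    """Return the parameter set NAL units carried in sprop-parameter-sets.

    The units come back in list order, which is their decoding order.
    """
    nal_units = []
    for item in value.split(","):
        # validate=True rejects characters outside the base64 alphabet
        nal = base64.b64decode(item.strip(), validate=True)
        if not nal:
            raise ValueError("empty parameter set in list")
        nal_units.append(nal)
    return nal_units
```

                        The low five bits of the first octet of each
                        decoded unit give its NAL unit type, so a
                        receiver can tell sequence parameter sets (type
                        7) from picture parameter sets (type 8) before
                        handing them to the decoder instance for that
                        payload type.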

       parameter-add:   This parameter MAY be used to signal whether

                        the receiver of this parameter is allowed to

                        add parameter sets in its signaling response

                        using the sprop-parameter-sets MIME parameter.

                        The value of this parameter is either 0 or 1.

                        0 is equal to false; i.e., it is not allowed to

                        add parameter sets.  1 is equal to true; i.e.,

                        it is allowed to add parameter sets.  If the

                        parameter is not present, its value MUST be 1.

       packetization-mode:

                        This parameter signals the properties of an

                        RTP payload type or the capabilities of a

                        receiver implementation.  Only a single

                        configuration point can be indicated; thus,

                        when capabilities to support more than one

                        packetization-mode are declared, multiple

                        configuration points (RTP payload types) must

                        be used.

                        When the value of packetization-mode is equal

                        to 0 or packetization-mode is not present, the

                        single NAL mode, as defined in section 6.2 of

                        RFC 3984, MUST be used.  This mode is in use in

                        standards using ITU-T Recommendation H.241 [15]

                        (see section 12.1).  When the value of

                        packetization-mode is equal to 1, the non-

                        interleaved mode, as defined in section 6.3 of

                        RFC 3984, MUST be used.  When the value of

                        packetization-mode is equal to 2, the

                        interleaved mode, as defined in section 6.4 of

                        RFC 3984, MUST be used.  The value of

                        packetization mode MUST be an integer in the

                        range of 0 to 2, inclusive.

       sprop-interleaving-depth:

                        This parameter MUST NOT be present

                        when packetization-mode is not present or the

                        value of packetization-mode is equal to 0 or 1.

                        This parameter MUST be present when the value

                        of packetization-mode is equal to 2.

                        This parameter signals the properties of a NAL

                        unit stream.  It specifies the maximum number

                        of VCL NAL units that precede any VCL NAL unit

                        in the NAL unit stream in transmission order

                        and follow the VCL NAL unit in decoding order.

                        Consequently, it is guaranteed that receivers

                        can reconstruct NAL unit decoding order when

                        the buffer size for NAL unit decoding order

                        recovery is at least the value of sprop-

                        interleaving-depth + 1 in terms of VCL NAL

                        units.

                        The value of sprop-interleaving-depth MUST be

                        an integer in the range of 0 to 32767,

                        inclusive.

       sprop-deint-buf-req:

                        This parameter MUST NOT be present when

                        packetization-mode is not present or the value

                        of packetization-mode is equal to 0 or 1.  It

                        MUST be present when the value of

                        packetization-mode is equal to 2.

                        sprop-deint-buf-req signals the required size

                        of the deinterleaving buffer for the NAL unit

                        stream.  The value of the parameter MUST be

                        greater than or equal to the maximum buffer

                        occupancy (in units of bytes) required in such

                        a deinterleaving buffer that is specified in

                        section 7.2 of RFC 3984.  It is guaranteed that

                        receivers can perform the deinterleaving of

                        interleaved NAL units into NAL unit decoding

                        order, when the deinterleaving buffer size is

                        at least the value of sprop-deint-buf-req in

                        terms of bytes.

                        The value of sprop-deint-buf-req MUST be an

                        integer in the range of 0 to 4294967295,

                        inclusive.

                            Informative note: sprop-deint-buf-req

                            indicates the required size of the

                            deinterleaving buffer only.  When network

                            jitter can occur, an appropriately sized

                            jitter buffer has to be provisioned for

                            as well.

       deint-buf-cap:   This parameter signals the capabilities of a

                        receiver implementation and indicates the

                        amount of deinterleaving buffer space in units

                        of bytes that the receiver has available for

                        reconstructing the NAL unit decoding order.  A

                        receiver is able to handle any stream for which

                        the value of the sprop-deint-buf-req parameter

                        is smaller than or equal to this parameter.

                        If the parameter is not present, then a value

                        of 0 MUST be used for deint-buf-cap.  The value

                        of deint-buf-cap MUST be an integer in the

                        range of 0 to 4294967295, inclusive.

                            Informative note: deint-buf-cap indicates

                            the maximum possible size of the

                            deinterleaving buffer of the receiver only.

                            When network jitter can occur, an

                            appropriately sized jitter buffer has to

                            be provisioned for as well.

       sprop-init-buf-time:

                        This parameter MAY be used to signal the

                        properties of a NAL unit stream.  The parameter

                        MUST NOT be present, if the value of

                        packetization-mode is equal to 0 or 1.

                        The parameter signals the initial buffering

                        time that a receiver MUST buffer before

                        starting decoding to recover the NAL unit

                        decoding order from the transmission order.

                        The parameter is the maximum value of

                        (transmission time of a NAL unit - decoding

                        time of the NAL unit), assuming reliable and

                        instantaneous transmission, the same

                        timeline for transmission and decoding, and

                        that decoding starts when the first packet

                        arrives.

                        An example of specifying the value of sprop-

                        init-buf-time follows.  A NAL unit stream is

                        sent in the following interleaved order, in

                        which the value corresponds to the decoding

                        time and the transmission order is from left to

                        right:

                        0  2  1  3  5  4  6  8  7 ...

                        Assuming a steady transmission rate of NAL

                        units, the transmission times are:

                        0  1  2  3  4  5  6  7  8 ...

                        Subtracting the decoding time from the

                        transmission time column-wise results in the

                        following series:

                        0 -1  1  0 -1  1  0 -1  1 ...

                        Thus, in terms of intervals of NAL unit

                        transmission times, the value of

                        sprop-init-buf-time in this

                        example is 1.

                        The parameter is coded as a non-negative base10

                        integer representation in clock ticks of a 90-

                        kHz clock.  If the parameter is not present,

                        then no initial buffering time value is

                        defined.  Otherwise the value of sprop-init-

                        buf-time MUST be an integer in the range of 0

                        to 4294967295, inclusive.
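
The calculation in the example can be sketched as follows. The sketch assumes a steady transmission rate, so that the i-th transmitted NAL unit has transmission time i, and it assumes the caller knows how many NAL units are sent per second in order to convert intervals into 90 kHz clock ticks. Both function names, the `nal_units_per_second` argument, and the choice to round up are illustrative assumptions.

```python
import math

CLOCK_RATE_HZ = 90_000  # sprop-init-buf-time is coded in 90 kHz clock ticks


def init_buf_time_intervals(decoding_times):
    """Return max(transmission time - decoding time) in NAL unit intervals.

    decoding_times[i] is the decoding time of the i-th transmitted NAL
    unit; transmission times are taken to be 0, 1, 2, ... (steady rate).
    """
    return max(tx - dec for tx, dec in enumerate(decoding_times))


def to_clock_ticks(intervals, nal_units_per_second):
    """Convert intervals to 90 kHz ticks, rounding up (an assumption)."""
    ticks = math.ceil(intervals * CLOCK_RATE_HZ / nal_units_per_second)
    if not 0 <= ticks <= 4294967295:
        raise ValueError("sprop-init-buf-time out of range")
    return ticks
```

With the interleaved order 0 2 1 3 5 4 6 8 7 the first function returns 1. At an assumed rate of 30 NAL units per second, that interval corresponds to 3000 clock ticks.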

                        In addition to the signaled sprop-init-buf-

                        time, receivers SHOULD take into account the

                        transmission delay jitter buffering, including

                        buffering for the delay jitter caused by

                        mixers, translators, gateways, proxies,

                        traffic-shapers, and other network elements.

       sprop-max-don-diff:

                        This parameter MAY be used to signal the

                        properties of a NAL unit stream.  It MUST NOT

                        be used to signal transmitter or receiver or

                        codec capabilities.  The parameter MUST NOT be

                        present if the value of packetization-mode is

                        equal to 0 or 1.  sprop-max-don-diff is an

                        integer in the range of 0 to 32767, inclusive.

                        If sprop-max-don-diff is not present, the value

                        of the parameter is unspecified.  sprop-max-

                        don-diff is calculated as follows:

                        sprop-max-don-diff = max{AbsDON(i) -

                        AbsDON(j)},

                        for any i and any j>i,

                        where i and j indicate the index of the NAL

                        unit in the transmission order and AbsDON

                        denotes a decoding order number of the NAL

                        unit that does not wrap around to 0 after

                        65535.  In other words, AbsDON is calculated as

                        follows: Let m and n be consecutive NAL units

                        in transmission order.  For the very first NAL

                        unit in transmission order (whose index is 0),

                        AbsDON(0) = DON(0).  For other NAL units,

                        AbsDON is calculated as follows:

                        If DON(m) == DON(n), AbsDON(n) = AbsDON(m)

                        If (DON(m) < DON(n) and DON(n) - DON(m) <

                        32768),

                        AbsDON(n) = AbsDON(m) + DON(n) - DON(m)

                        If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),

                        AbsDON(n) = AbsDON(m) + 65536 - DON(m) + DON(n)

                        If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),

                        AbsDON(n) = AbsDON(m) - (DON(m) + 65536 -

                        DON(n))

                        If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),

                        AbsDON(n) = AbsDON(m) - (DON(m) - DON(n))

                        where DON(i) is the decoding order number of

                        the NAL unit having index i in the transmission

                        order.  The decoding order number is specified

                        in section 5.5 of RFC 3984.

                            Informative note: Receivers may use sprop-

                            max-don-diff to trigger which NAL units in

                            the receiver buffer can be passed to the

                            decoder.
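
The AbsDON rules and the sprop-max-don-diff formula can be sketched in code as below. The input is a list of 16-bit DON values in transmission order. The function names are illustrative. Returning 0 when no NAL unit is transmitted ahead of an earlier one in decoding order is an assumption of this sketch, chosen because 0 is the lower bound of the permitted range.

```python
def abs_don_sequence(dons):
    """Unwrap 16-bit DON values (transmission order) into AbsDON values."""
    if not dons:
        return []
    abs_dons = [dons[0]]  # AbsDON(0) = DON(0)
    for don_m, don_n in zip(dons, dons[1:]):
        abs_m = abs_dons[-1]
        if don_m == don_n:
            abs_n = abs_m
        elif don_m < don_n and don_n - don_m < 32768:
            abs_n = abs_m + don_n - don_m
        elif don_m > don_n and don_m - don_n >= 32768:
            abs_n = abs_m + 65536 - don_m + don_n  # forward wrap past 65535
        elif don_m < don_n and don_n - don_m >= 32768:
            abs_n = abs_m - (don_m + 65536 - don_n)  # backward wrap past 0
        else:  # don_m > don_n and don_m - don_n < 32768
            abs_n = abs_m - (don_m - don_n)
        abs_dons.append(abs_n)
    return abs_dons


def sprop_max_don_diff(dons):
    """Return max{AbsDON(i) - AbsDON(j)} over all i and all j > i."""
    max_diff = 0  # assumption: clamp at the lower bound of the range
    highest_so_far = None
    for value in abs_don_sequence(dons):
        if highest_so_far is not None:
            max_diff = max(max_diff, highest_so_far - value)
        if highest_so_far is None or value > highest_so_far:
            highest_so_far = value
    return max_diff
```

Tracking the highest AbsDON seen so far keeps the calculation linear in the number of NAL units instead of comparing every pair.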

       max-rcmd-nalu-size:

                        This parameter MAY be used to signal the

                        capabilities of a receiver.  The parameter MUST

                        NOT be used for any other purposes.  The value

                        of the parameter indicates the largest NALU

                        size in bytes that the receiver can handle

                        efficiently.  The parameter value is a

                        recommendation, not a strict upper boundary.

                        The sender MAY create larger NALUs but must be

                        aware that the handling of these may come at a

                        higher cost than NALUs conforming to the

                        limitation.

                        The value of max-rcmd-nalu-size MUST be an

                        integer in the range of 0 to 4294967295,

                        inclusive.  If this parameter is not specified,

                        no known limitation to the NALU size exists.

                        Senders still have to consider the MTU size

                        available between the sender and the receiver

                        and SHOULD run MTU discovery for this purpose.

                        This parameter is motivated by, for example, an

                        IP to H.223 video telephony gateway, where

                        NALUs smaller than the H.223 transport data

                        unit will be more efficient.  A gateway may

                        terminate IP; thus, MTU discovery will normally

                        not work beyond the gateway.

                            Informative note: Setting this parameter to

                            a lower than necessary value may have a

                            negative impact.

   Encoding considerations:

                        This type is only defined for transfer via RTP

                        (RFC 3550).

                        A file format of H.264/AVC video is defined in

                        [29].  This definition is utilized by other

                        file formats, such as the 3GPP multimedia file

                        format (MIME type video/3gpp) [30] or the MP4

                        file format (MIME type video/mp4).

   Security considerations:

                        See section 9 of RFC 3984.

   Public specification:

                        Please refer to RFC 3984 and its section 15.

   Additional information:

                        None

   File extensions:     none

   Macintosh file type code: none

   Object identifier or OID: none

   Person & email address to contact for further information:

                        [email protected]

   Intended usage:      COMMON

   Author:

   Change controller:

                        IETF Audio/Video Transport working group

                        delegated from the IESG.

8.2.  SDP Parameters

8.2.1.  Mapping of MIME Parameters to SDP

   The MIME media type video/H264 string is mapped to fields in the

   Session Description Protocol (SDP) [5] as follows:

   o  The media name in the "m=" line of SDP MUST be video.

   o  The encoding name in the "a=rtpmap" line of SDP MUST be H264 (the

      MIME subtype).

   o  The clock rate in the "a=rtpmap" line MUST be 90000.

   o  The OPTIONAL parameters "profile-level-id", "max-mbps", "max-fs",

      "max-cpb", "max-dpb", "max-br", "redundant-pic-cap", "sprop-

      parameter-sets", "parameter-add", "packetization-mode", "sprop-

      interleaving-depth", "deint-buf-cap", "sprop-deint-buf-req",

      "sprop-init-buf-time", "sprop-max-don-diff", and "max-rcmd-nalu-

      size", when present, MUST be included in the "a=fmtp" line of SDP.

      These parameters are expressed as a MIME media type string, in the

      form of a semicolon separated list of parameter=value pairs.

   An example of media representation in SDP is as follows (Baseline

   Profile, Level 3.0, some of the constraints of the Main profile may

   not be obeyed):

      m=video 49170 RTP/AVP 98

      a=rtpmap:98 H264/90000

      a=fmtp:98 profile-level-id=42A01E;

                sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==
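
A minimal sketch of reading such an "a=fmtp" line is shown below. It assumes that the wrapped SDP line has already been joined into a single line, and it relies on the three-byte layout of "profile-level-id" (profile_idc, profile-iop, level_idc) given in that parameter's definition earlier in this memo. The function names are hypothetical.

```python
def parse_fmtp(line):
    """Split 'a=fmtp:<pt> name=value; name=value' into (pt, params)."""
    prefix, _, rest = line.partition(":")
    if prefix != "a=fmtp":
        raise ValueError("not an a=fmtp line")
    pt_text, _, params_text = rest.strip().partition(" ")
    params = {}
    for item in params_text.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only: base64 values may end in '='.
        name, _, value = item.partition("=")
        params[name.strip()] = value.strip()
    return int(pt_text), params


def decode_profile_level_id(value):
    """Return (profile_idc, profile_iop, level_idc) from the hex string."""
    raw = bytes.fromhex(value)
    if len(raw) != 3:
        raise ValueError("profile-level-id must encode exactly three bytes")
    profile_idc, profile_iop, level_idc = raw
    return profile_idc, profile_iop, level_idc
```

For the example above, parse_fmtp yields payload type 98, and decode_profile_level_id("42A01E") returns (66, 160, 30): profile_idc 66 is the Baseline Profile and level_idc 30 corresponds to Level 3.0.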

8.2.2.  Usage with the SDP Offer/Answer Model

   When H.264 is offered over RTP using SDP in an Offer/Answer model [7]

   for negotiation for unicast usage, the following limitations and

   rules apply:

   o  The parameters identifying a media format configuration for H.264

      are "profile-level-id", "packetization-mode", and, if required by

      "packetization-mode", "sprop-deint-buf-req".  These three

      parameters MUST be used symmetrically; i.e., the answerer MUST

      either maintain all configuration parameters or remove the media

      format (payload type) completely, if one or more of the parameter

      values are not supported.

         Informative note: The requirement for symmetric use applies

         only for the above three parameters and not for the other

         stream properties and capability parameters.

      To simplify handling and matching of these configurations, the

      same RTP payload type number used in the offer SHOULD also be used

      in the answer, as specified in [7].  An answer MUST NOT contain a

      payload type number used in the offer unless the configuration

      ("profile-level-id", "packetization-mode", and, if present,

      "sprop-deint-buf-req") is the same as in the offer.

         Informative note: An offerer, when receiving the answer, has to

         compare payload types not declared in the offer based on media

         type (i.e., video/h264) and the above three parameters with any

         payload types it has already declared, in order to determine

         whether the configuration in question is new or equivalent to a

         configuration already offered.

   o  The parameters "sprop-parameter-sets", "sprop-deint-buf-req",

      "sprop-interleaving-depth", "sprop-max-don-diff", and "sprop-

      init-buf-time" describe the properties of the NAL unit stream that

      the offerer or answerer is sending for this media format

      configuration.  This differs from the normal usage of the

      Offer/Answer parameters: normally such parameters declare the

      properties of the stream that the offerer or the answerer is able

      to receive.  When dealing with H.264, the offerer assumes that the

      answerer will be able to receive media encoded using the

      configuration being offered.

         Informative note: The above parameters apply for any stream

         sent by the declaring entity with the same configuration; i.e.,

         they are dependent on their source.  Rather than being bound to

         the payload type, the values may have to be applied to another

         payload type when being sent, as they apply for the

         configuration.

   o  The capability parameters ("max-mbps", "max-fs", "max-cpb", "max-

      dpb", "max-br", "redundant-pic-cap", "max-rcmd-nalu-size") MAY be

      used to declare further capabilities.  Their interpretation

      depends on the direction attribute.  When the direction attribute

      is sendonly, then the parameters describe the limits of the RTP

      packets and the NAL unit stream that the sender is capable of

      producing.  When the direction attribute is sendrecv or recvonly,

      then the parameters describe the limitations of what the receiver

      accepts.

   o  As specified above, an offerer has to include the size of the

      deinterleaving buffer in the offer for an interleaved H.264

      stream.  To enable the offerer and answerer to inform each other

      about their capabilities for deinterleaving buffering, both

      parties are RECOMMENDED to include "deint-buf-cap".  This

      information MAY be used when the value for "sprop-deint-buf-req"

      is selected in a second round of offer and answer.  For

      interleaved streams, it is also RECOMMENDED to consider offering

      multiple payload types with different buffering requirements when

      the capabilities of the receiver are unknown.

   o  The "sprop-parameter-sets" parameter is used as described above.

      In addition, an answerer MUST maintain all parameter sets received

      in the offer in its answer.  Depending on the value of the

      "parameter-add" parameter, different rules apply: If "parameter-

      add" is false (0), the answer MUST NOT add any additional

      parameter sets.  If "parameter-add" is true (1), the answerer, in

      its answer, MAY add additional parameter sets to the "sprop-

      parameter-sets" parameter.  The answerer MUST also, independent of

      the value of "parameter-add", accept to receive a video stream

      using the sprop-parameter-sets it declared in the answer.

         Informative note: care must be taken when parameter sets are

         added not to cause overwriting of already transmitted parameter

         sets by using conflicting parameter set identifiers.
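
The "sprop-parameter-sets" rules for an answer can be checked with a sketch like the following. It treats the parameter value as a comma-separated list of base64 strings and compares the strings verbatim. The function name and the boolean `parameter_add` argument are assumptions made for illustration.

```python
def answer_parameter_sets_ok(offer_sets, answer_sets, parameter_add):
    """Check an answer's sprop-parameter-sets against the offer.

    offer_sets / answer_sets: comma-separated base64 parameter sets.
    parameter_add: True if "parameter-add" is 1, False if it is 0.
    """
    offered = [s for s in offer_sets.split(",") if s]
    answered = [s for s in answer_sets.split(",") if s]
    # The answerer MUST maintain every parameter set from the offer.
    if any(s not in answered for s in offered):
        return False
    # Additional sets are only allowed when parameter-add is true.
    added = [s for s in answered if s not in offered]
    if added and not parameter_add:
        return False
    return True
```

The check does not detect conflicting parameter set identifiers; that would require decoding the parameter sets themselves, which is outside the scope of this sketch.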

   For streams being delivered over multicast, the following rules apply

   in addition:

   o  The stream properties parameters ("sprop-parameter-sets", "sprop-

      deint-buf-req", "sprop-interleaving-depth", "sprop-max-don-diff",

      and "sprop-init-buf-time") MUST NOT be changed by the answerer.

      Thus, a payload type can either be accepted unaltered or removed.

   o  The receiver capability parameters "max-mbps", "max-fs", "max-

      cpb", "max-dpb", "max-br", and "max-rcmd-nalu-size" MUST be

      supported by the answerer for all streams declared as sendrecv or

      recvonly; otherwise, one of the following actions MUST be

      performed: the media format is removed, or the session rejected.

   o  The receiver capability parameter redundant-pic-cap SHOULD be

      supported by the answerer for all streams declared as sendrecv or

      recvonly as follows:  The answerer SHOULD NOT include redundant

      coded pictures in the transmitted stream if the offerer indicated

      redundant-pic-cap equal to 0.  Otherwise (when redundant-pic-cap

      is equal to 1), it is beyond the scope of this memo to recommend

      how the answerer should use redundant coded pictures.

   Below are the complete lists of how the different parameters shall be

   interpreted in the different combinations of offer or answer and

   direction attribute.

   o  In offers and answers for which "a=sendrecv" or no direction

      attribute is used, or in offers and answers for which "a=recvonly"

      is used, the following interpretation of the parameters MUST be

      used.

      Declaring actual configuration or properties for receiving:

         - profile-level-id

         - packetization-mode

      Declaring actual properties of the stream to be sent (applicable

      only when "a=sendrecv" or no direction attribute is used):

         - sprop-deint-buf-req

         - sprop-interleaving-depth

         - sprop-parameter-sets

         - sprop-max-don-diff

         - sprop-init-buf-time

      Declaring receiver implementation capabilities:

         - max-mbps

         - max-fs

         - max-cpb

         - max-dpb

         - max-br

         - redundant-pic-cap

         - deint-buf-cap

         - max-rcmd-nalu-size

      Declaring how Offer/Answer negotiation shall be performed:

         - parameter-add

   o  In an offer or answer for which the direction attribute

      "a=sendonly" is included for the media stream, the following

      interpretation of the parameters MUST be used:

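      Declaring actual configuration and properties of stream proposed

      to be sent:

         - profile-level-id

         - packetization-mode

         - sprop-deint-buf-req

         - sprop-max-don-diff

         - sprop-init-buf-time

         - sprop-parameter-sets

         - sprop-interleaving-depth

      Declaring the capabilities of the sender when it receives a

      stream:

         - max-mbps

         - max-fs

         - max-cpb

         - max-dpb

         - max-br

         - redundant-pic-cap

         - deint-buf-cap

         - max-rcmd-nalu-size

      Declaring how Offer/Answer negotiation shall be performed:

         - parameter-add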

   Furthermore, the following considerations are necessary:

   o  Parameters used for declaring receiver capabilities are in general

      downgradable; i.e., they express the upper limit for a sender's

      possible behavior.  Thus a sender MAY select to set its encoder

      using only lower/lesser or equal values of these parameters.

      "sprop-parameter-sets" MUST NOT be used in a sender's declaration

      of its capabilities, as the limits of the values that are carried

      inside the parameter sets are implicit with the profile and level

      used.

   o  Parameters declaring a configuration point are not downgradable,

      with the exception of the level part of the "profile-level-id"

      parameter.  This expresses values a receiver expects to be used

      and must be used verbatim on the sender side.

   o  When a sender's capabilities are declared, and non-downgradable

      parameters are used in this declaration, then these parameters

      express a configuration that is acceptable.  In order to achieve

      high interoperability levels, it is often advisable to offer

      multiple alternative configurations; e.g., for the packetization

      mode.  It is impossible to offer multiple configurations in a

      single payload type.  Thus, when multiple configuration offers are

      made, each offer requires its own RTP payload type associated with

      the offer.

   o  A receiver SHOULD understand all MIME parameters, even if it only

      supports a subset of the payload format's functionality.  This

      ensures that a receiver is capable of understanding when an offer

      to receive media can be downgraded to what is supported by the

      receiver of the offer.

   o  An answerer MAY extend the offer with additional media format

      configurations.  However, to enable their usage, in most cases a

      second offer is required from the offerer to provide the stream

      properties parameters that the media sender will use.  This also

      has the effect that the offerer has to be able to receive this

      media format configuration, not only to send it.

   o  If an offerer wishes to have non-symmetric capabilities between

      sending and receiving, the offerer has to offer different RTP

      sessions; i.e., different media lines declared as "recvonly" and

      "sendonly", respectively.  This may have further implications on

      the system.

8.2.3.  Usage in Declarative Session Descriptions

   When H.264 over RTP is offered with SDP in a declarative style, as in

   RTSP [27] or SAP [28], the following considerations are necessary.

   o  All parameters capable of indicating the properties of both a NAL

      unit stream and a receiver are used to indicate the properties of

      a NAL unit stream.  For example, in this case, the parameter

      "profile-level-id" declares the values used by the stream, instead

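      of the capabilities of the sender.  As a result, the following

      interpretation of the parameters MUST be used:

      Declaring actual configuration or properties:

         - profile-level-id

         - sprop-parameter-sets

         - packetization-mode

         - sprop-interleaving-depth

         - sprop-deint-buf-req

         - sprop-max-don-diff

         - sprop-init-buf-time

      Not usable:

         - max-mbps

         - max-fs

         - max-cpb

         - max-dpb

         - max-br

         - redundant-pic-cap

         - max-rcmd-nalu-size

         - parameter-add

         - deint-buf-cap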

   o  A receiver of the SDP is required to support all parameters and

      values of the parameters provided; otherwise, the receiver MUST

      reject (RTSP) or not participate in (SAP) the session.  It falls

      on the creator of the session to use values that are expected to

      be supported by the receiving application.

8.3.  Examples

   A SIP Offer/Answer exchange wherein both parties are expected to both

   send and receive could look like the following.  Only the media codec

   specific parts of the SDP are shown.  Some lines are wrapped due to

   text constraints.

      Offerer -> Answerer SDP message:

      m=video 49170 RTP/AVP 100 99 98

      a=rtpmap:98 H264/90000

      a=fmtp:98 profile-level-id=42A01E; packetization-mode=0;

                sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==

      a=rtpmap:99 H264/90000

      a=fmtp:99 profile-level-id=42A01E; packetization-mode=1;

                sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==

      a=rtpmap:100 H264/90000

      a=fmtp:100 profile-level-id=42A01E; packetization-mode=2;

                 sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==;

                 sprop-interleaving-depth=45; sprop-deint-buf-req=64000;

                 sprop-init-buf-time=102478; deint-buf-cap=128000

   The above offer presents the same codec configuration in three

   different packetization formats.  PT 98 represents single NALU mode,

   PT 99 non-interleaved mode; PT 100 indicates the interleaved mode.

   In the interleaved mode case, the interleaving parameters that the

   offerer would use if the answer indicates support for PT 100 are also

   included.  In all three cases the parameter "sprop-parameter-sets"

   conveys the initial parameter sets that are required for the answerer

   when receiving a stream from the offerer when this configuration

   (profile-level-id and packetization mode) is accepted.  Note that the

   value for "sprop-parameter-sets", although identical in the example

   above, could be different for each payload type.

     Answerer -> Offerer SDP message:

     m=video 49170 RTP/AVP 100 99 97

     a=rtpmap:97 H264/90000

     a=fmtp:97 profile-level-id=42A01E; packetization-mode=0;

               sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,

               KyzFGleR

     a=rtpmap:99 H264/90000

     a=fmtp:99 profile-level-id=42A01E; packetization-mode=1;

               sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,

               KyzFGleR; max-rcmd-nalu-size=3980

     a=rtpmap:100 H264/90000

     a=fmtp:100 profile-level-id=42A01E; packetization-mode=2;

               sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,

               KyzFGleR; sprop-interleaving-depth=60;

               sprop-deint-buf-req=86000; sprop-init-buf-time=156320;

               deint-buf-cap=128000; max-rcmd-nalu-size=3980

   As the Offer/Answer negotiation covers both sending and receiving

   streams, an offer indicates the exact parameters for what the offerer

   is willing to receive, whereas the answer indicates the same for what

   the answerer accepts to receive.  In this case the offerer declared

   that it is willing to receive payload type 98.  The answerer accepts

   this by declaring an equivalent payload type 97; i.e., it has

   identical values for the three parameters "profile-level-id",

   "packetization-mode", and "sprop-deint-buf-req".  This has the

   following implications for both the offerer and the answerer

   concerning the parameters that declare properties.  The offerer

   initially declared a certain value of the "sprop-parameter-sets" in

   the payload definition for PT=98.  However, as the answerer accepted

   this as PT=97, the values of "sprop-parameter-sets" in PT=98 must now

   be used instead when the offerer sends PT=97.  Similarly, when the

   answerer sends PT=98 to the offerer, it has to use the properties

   parameters it declared in PT=97.
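
The matching of equivalent payload types described above can be sketched as follows. Each payload type is represented by a dictionary of its "a=fmtp" parameters. Treating an absent "packetization-mode" as 0 follows the definition of that parameter; comparing "profile-level-id" case-insensitively and the function names themselves are assumptions of this sketch.

```python
def configuration(params):
    """Return the tuple that identifies a media format configuration."""
    profile = params.get("profile-level-id")
    deint = params.get("sprop-deint-buf-req")
    return (
        profile.upper() if profile is not None else None,
        int(params.get("packetization-mode", 0)),
        int(deint) if deint is not None else None,
    )


def match_answer_payload_types(offer, answer):
    """Map each answered payload type to the offered one it is equivalent to.

    offer / answer: {payload_type: fmtp parameter dict}.
    A value of None marks a configuration that was not in the offer.
    """
    offered = {configuration(params): pt for pt, params in offer.items()}
    return {pt: offered.get(configuration(params)) for pt, params in answer.items()}
```

Applied to payload types 98 and 99 of the offer and 97 and 99 of the answer, the sketch maps 97 to 98 and 99 to 99, which tells the offerer to apply the stream properties it declared for PT 98 when it sends PT 97.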

   The answerer also accepts the reception of the two configurations

   that payload types 99 and 100 represent.  It provides the initial

   parameter sets for the answerer-to-offerer direction, and for

   buffering related parameters that it will use to send the payload

   types.  It also provides the offerer with its memory limit for

   deinterleaving operations by providing a "deint-buf-cap" parameter.

   This is only useful if the offerer decides on making a second offer,

   where it can take the new value into account.  The "max-rcmd-nalu-

   size" indicates that the answerer can efficiently process NALUs up to

   the size of 3980 bytes.  However, there is no guarantee that the

   network supports this size.

   Please note that the parameter sets in the above example do not

   represent a legal operation point of an H.264 codec.  The base64

   strings are only used for illustration.

8.4.  Parameter Set Considerations

   The H.264 parameter sets are a fundamental part of the video codec

   and vital to its operation; see section 1.2.  Due to their

   characteristics and their importance for the decoding process, lost

   or erroneously transmitted parameter sets can hardly be concealed

   locally at the receiver.  A reference to a corrupt parameter set

   normally has fatal results for the decoding process.  Corruption could

   occur, for example, due to the erroneous transmission or loss of a

   parameter set data structure, but also due to the untimely

   transmission of a parameter set update.  Therefore, the following

   recommendations are provided as a guideline for the implementer of

   the RTP sender.

   Parameter set NALUs can be transported using three different

   principles:

   A. Using a session control protocol (out-of-band) prior to the actual

      RTP session.

   B. Using a session control protocol (out-of-band) during an ongoing

      RTP session.

   C. Within the RTP stream in the payload (in-band) during an ongoing

      RTP session.

   It is necessary to implement principles A and B within a session

   control protocol.  SIP and SDP can be used as described in the SDP

   Offer/Answer model and in the previous sections of this memo.  This

   section contains guidelines on how principles A and B must be

   implemented within session control protocols.  It is independent of

   the particular protocol used.  Principle C is supported by the RTP

   payload format defined in this specification.

   The picture and sequence parameter set NALUs SHOULD NOT be

   transmitted in the RTP payload unless reliable transport is provided

   for RTP, as a loss of a parameter set of either type will likely

   prevent decoding of a considerable portion of the corresponding RTP

   stream.  Thus, the transmission of parameter sets using a reliable

   session control protocol (i.e., usage of principle A or B above) is

   RECOMMENDED.

   In the rest of the section it is assumed that out-of-band signaling

   provides reliable transport of parameter set NALUs and that in-band

   transport does not.  If in-band signaling of parameter sets is used,

   the sender SHOULD take the error characteristics into account and use

   mechanisms to provide a high probability for delivering the parameter

   sets correctly.  Mechanisms that increase the probability for a

   correct reception include packet repetition, FEC, and retransmission.

   The use of an unreliable, out-of-band control protocol has similar

   disadvantages as the in-band signaling (possible loss) and, in

   addition, may also lead to difficulties in the synchronization (see

   below).  Therefore, it is NOT RECOMMENDED.

   Parameter sets MAY be added or updated during the lifetime of a

   session using principles B and C.  It is required that parameter sets

   are present at the decoder prior to the NAL units that refer to them.

   Updating or adding of parameter sets can result in further problems,

   and therefore the following recommendations should be considered.

   -  When parameter sets are added or updated, principle C is

      vulnerable to transmission errors as described above, and

      therefore principle B is RECOMMENDED.

   -  When parameter sets are added or updated, care SHOULD be taken to

      ensure that any parameter set is delivered prior to its usage.  It

      is common that no synchronization is present between out-of-band

      signaling and in-band traffic.  If out-of-band signaling is used,

      it is RECOMMENDED that a sender does not start sending NALUs

      requiring the updated parameter sets prior to acknowledgement of

      delivery from the signaling protocol.

   -  When parameter sets are updated, the following synchronization

      issue should be taken into account.  When overwriting a parameter

      set at the receiver, the sender has to ensure that the parameter

      set in question is not needed by any NALU present in the network

      or receiver buffers.  Otherwise, decoding with a wrong parameter

      set may occur.  To lessen this problem, it is RECOMMENDED either

      to overwrite only those parameter sets that have not been used for

      a sufficiently long time (to ensure that all related NALUs have

      been consumed), or to add a new parameter set instead (which may

      have negative consequences for the efficiency of the video

      coding).

   -  When new parameter sets are added, previously unused parameter set

      identifiers are used.  This avoids the problem identified in the

      previous paragraph.  However, in a multiparty session, unless a

      synchronized control protocol is used, there is a risk that

      multiple entities try to add different parameter sets for the same

      identifier, which has to be avoided.

   -  Adding or modifying parameter sets by using both principles B and

      C in the same RTP session may lead to inconsistencies of the

      parameter sets because of the lack of synchronization between the

      control and the RTP channel.  Therefore, principles B and C MUST

      NOT both be used in the same session unless sufficient

      synchronization can be provided.

   In some scenarios (e.g., when only the subset of this payload format

   specification corresponding to H.241 is used), it is not possible to

   employ out-of-band parameter set transmission.  In this case,

   parameter sets have to be transmitted in-band.  Here, the

   synchronization with the non-parameter-set-data in the bitstream is

   implicit, but the possibility of a loss has to be taken into account.

   The loss probability should be reduced using the mechanisms discussed

   above.

   -  When parameter sets are initially provided using principle A and

      then later added or updated in-band (principle C), there is a risk

      associated with updating the parameter sets delivered out-of-band.

      If receivers miss some in-band updates (for example, because of a

      loss or a late tune-in), those receivers attempt to decode the

      bitstream using out-dated parameters.  It is RECOMMENDED that

      parameter set IDs be partitioned between the out-of-band and in-

      band parameter sets.

   To allow for maximum flexibility and best performance from the H.264

   coder, it is recommended, if possible, to allow any sender to add its

   own parameter sets to be used in a session.  Setting the

   "parameter-add" parameter to false should only be done in cases

   where the session topology prevents a participant from adding its

   own parameter sets.

9.  Security Considerations

   RTP packets using the payload format defined in this specification

   are subject to the security considerations discussed in the RTP

   specification [4], and in any appropriate RTP profile (for example,

   [16]).  This implies that confidentiality of the media streams is

   achieved by encryption; for example, through the application of SRTP

   [26].  Because the data compression used with this payload format is

   applied end-to-end, any encryption needs to be performed after

   compression.

   A potential denial-of-service threat exists for data encodings using

   compression techniques that have non-uniform receiver-end

   computational load.  The attacker can inject pathological datagrams

   into the stream that are complex to decode and that cause the

   receiver to be overloaded.  H.264 is particularly vulnerable to such

   attacks, as it is extremely simple to generate datagrams containing

   NAL units that affect the decoding process of many future NAL units.

   Therefore, the usage of data origin authentication and data integrity

   protection of at least the RTP packet is RECOMMENDED; for example,

   with SRTP [26].

   Note that the appropriate mechanism to ensure confidentiality and

   integrity of RTP packets and their payloads is very dependent on the

   application and on the transport and signaling protocols employed.

   Thus, although SRTP is given as an example above, other possible

   choices exist.

   Decoders MUST exercise caution with respect to the handling of user

   data SEI messages, particularly if they contain active elements, and

   MUST restrict their domain of applicability to the presentation

   containing the stream.

   End-to-end security with authentication, integrity, or

   confidentiality protection prevents a MANE from performing

   media-aware operations other than discarding complete packets.  With

   confidentiality protection, a MANE cannot even discard packets in a

   media-aware way.  For a MANE to perform its operations, it must be a

   trusted entity that is included in the security context

   establishment.

10.  Congestion Control

   Congestion control for RTP SHALL be used in accordance with RFC 3550

   [4], and with any applicable RTP profile; e.g., RFC 3551 [16].  An

   additional requirement if best-effort service is being used is:

   users of this payload format MUST monitor packet loss to ensure that

   the packet loss rate is within acceptable parameters.  Packet loss is

   considered acceptable if a TCP flow across the same network path, and

   experiencing the same network conditions, would achieve an average

   throughput, measured on a reasonable timescale, that is not less than

   the RTP flow is achieving.  This condition can be satisfied by

   implementing congestion control mechanisms to adapt the transmission

   rate (or the number of layers subscribed for a layered multicast

   session), or by arranging for a receiver to leave the session if the

   loss rate is unacceptably high.
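
   The acceptability test above can be sketched numerically.  The
   example below is only an illustration: it estimates the throughput
   of a comparable TCP flow with the simplified square-root model
   (rate = MSS / (RTT * sqrt(2p/3))), which is an assumption of this
   sketch and is not mandated by this specification.  The function
   names and input values are hypothetical.

```python
import math


def tcp_friendly_rate_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput in bits per second.

    Assumption of this sketch: the simplified square-root model,
    rate = MSS / (RTT * sqrt(2p/3)).  Real TCP behavior differs,
    especially at high loss rates.
    """
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return 8 * mss_bytes / (rtt_s * math.sqrt(2.0 * loss_rate / 3.0))


def loss_is_acceptable(rtp_rate_bps: float, mss_bytes: int,
                       rtt_s: float, loss_rate: float) -> bool:
    """True if a TCP flow on the same path would achieve a throughput
    that is not less than what the RTP flow is achieving."""
    return tcp_friendly_rate_bps(mss_bytes, rtt_s, loss_rate) >= rtp_rate_bps
```

   A sender for which loss_is_acceptable() returns False would be
   expected to lower its transmission rate, or the receiver to leave
   the session, as described above.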

   The bit rate adaptation necessary for obeying the congestion control

   principle is easily achievable when real-time encoding is used.

   However, when pre-encoded content is being transmitted, bandwidth

   adaptation requires the availability of more than one coded

   representation of the same content, at different bit rates, or the

   existence of non-reference pictures or sub-sequences [22] in the

   bitstream.  The switching between the different representations can

   normally be performed in the same RTP session; e.g., by employing a

   concept known as SI/SP slices of the Extended Profile, or by

   switching streams at IDR picture boundaries.  Only when non-

   downgradable parameters (such as the profile part of the

   profile/level ID) are required to be changed does it become necessary

   to terminate and re-start the media stream.  This may be accomplished

   by using a different RTP payload type.

   MANEs MAY follow the suggestions outlined in section 7.3 and remove

   certain unusable packets from the packet stream when that stream was

   damaged due to previous packet losses.  This can help reduce the

   network load in certain special cases.

11.  IANA Consideration

   IANA has registered one new MIME type; see section 8.1.

12.  Informative Appendix: Application Examples

   This payload specification is very flexible in its use, in order to

   cover the extremely wide application space anticipated for H.264.

   However, this great flexibility also makes it difficult for an

   implementer to decide on a reasonable packetization scheme.  Some

   information on how to apply this specification to real-world

   scenarios is likely to appear in the form of academic publications

   and a test model software and description in the near future.

   However, some preliminary usage scenarios are described here as well.

12.1.  Video Telephony according to ITU-T Recommendation H.241

       Annex A

   H.323-based video telephony systems that use H.264 as an optional

   video compression scheme are required to support H.241 Annex A [15]

   as a packetization scheme.  The packetization mechanism defined in

   this Annex is technically identical with a small subset of this

   specification.

   When a system operates according to H.241 Annex A, parameter set NAL

   units are sent in-band.  Only Single NAL unit packets are used.  Many

   such systems are not sending IDR pictures regularly, but only when

   required by user interaction or by control protocol means; e.g., when

   switching between video channels in a Multipoint Control Unit or for

   error recovery requested by feedback.

12.2.  Video Telephony, No Slice Data Partitioning, No NAL Unit

       Aggregation

   The RTP part of this scheme is implemented and tested (though not the

   control-protocol part; see below).

   In most real-world video telephony applications, picture parameters

   such as picture size or optional modes never change during the

   lifetime of a connection.  Therefore, all necessary parameter sets

   (usually only one) are sent as a side effect of the capability

   exchange/announcement process, e.g., according to the SDP syntax

   specified in section 8.2 of this document.  As all necessary

   parameter set information is established before the RTP session

   starts, there is no need for sending any parameter set NAL units.

   Slice data partitioning is not used, either.  Thus, the RTP packet

   stream basically consists of NAL units that carry single coded

   slices.

   The encoder chooses the size of coded slice NAL units so that they

   offer the best performance.  Often, this is done by adapting the

   coded slice size to the MTU size of the IP network.  For small

   picture sizes, this may result in a one-picture-per-one-packet

   strategy.  Intra refresh algorithms clean up the loss of packets and

   the resulting drift-related artifacts.

12.3.  Video Telephony, Interleaved Packetization Using NAL Unit

       Aggregation

   This scheme allows better error concealment and is used in H.263

   based designs using RFC 2429 packetization [10].  It has been

   implemented, and good results were reported [12].

   The VCL encoder codes the source picture so that all macroblocks

   (MBs) of one MB line are assigned to one slice.  All slices with even

   MB row addresses are combined into one STAP, and all slices with odd

   MB row addresses into another.  Those STAPs are transmitted as RTP

   packets.  The establishment of the parameter sets is performed as

   discussed above.

   Note that the use of STAPs is essential here, as the high number of

   individual slices (18 for a CIF picture) would lead to unacceptably

   high IP/UDP/RTP header overhead (unless the source coding tool FMO is

   used, which is not assumed in this scenario).  Furthermore, some

   wireless video transmission systems, such as H.324M and the IP-based

   video telephony specified in 3GPP, are likely to use relatively small

   transport packet size.  For example, a typical MTU size of H.223 AL3

   SDU is around 100 bytes [17].  Coding individual slices according to

   this packetization scheme provides further advantage in communication

   between wired and wireless networks, as individual slices are likely

   to be smaller than the preferred maximum packet size of wireless

   systems.  Consequently, a gateway can convert the STAPs used in a

   wired network into several RTP packets with only one NAL unit, which

   are preferred in a wireless network, and vice versa.
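
   The header overhead argument can be made concrete with a small
   calculation.  The sketch below assumes IPv4 without options (20
   bytes), UDP (8 bytes), and the fixed RTP header (12 bytes), and it
   assumes STAP-A aggregation (one STAP header octet plus a 16-bit
   NALU size per aggregated unit).  The slice sizes and function names
   are hypothetical.

```python
IP_UDP_RTP_HEADER = 20 + 8 + 12  # bytes; IPv4 without options, UDP, fixed RTP header
STAP_A_HEADER = 1                # STAP-A NAL unit header octet
NALU_SIZE_FIELD = 2              # 16-bit size field in front of each aggregated NAL unit


def split_even_odd(slice_sizes):
    """Group one-slice-per-MB-row sizes into two STAPs: even rows, odd rows."""
    even = [size for row, size in enumerate(slice_sizes) if row % 2 == 0]
    odd = [size for row, size in enumerate(slice_sizes) if row % 2 == 1]
    return even, odd


def overhead_single_nal_packets(slice_sizes):
    """Header bytes per picture when every slice travels in its own RTP packet."""
    return len(slice_sizes) * IP_UDP_RTP_HEADER


def overhead_staps(groups):
    """Header bytes per picture when each non-empty group is sent as one STAP-A."""
    return sum(IP_UDP_RTP_HEADER + STAP_A_HEADER + NALU_SIZE_FIELD * len(group)
               for group in groups if group)


# Hypothetical CIF picture: 18 macroblock rows, one 60-byte slice per row.
cif_slices = [60] * 18
print(overhead_single_nal_packets(cif_slices))         # 720
print(overhead_staps(split_even_odd(cif_slices)))      # 118
```

   Under these assumptions the two STAPs spend 118 header bytes per
   picture instead of 720, which is the effect the text above relies
   on.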

12.4.  Video Telephony with Data Partitioning

   This scheme has been implemented and has been shown to offer good

   performance, especially at higher packet loss rates [12].

   Data Partitioning is known to be useful only when some form of

   unequal error protection is available.  Normally, in single-session

   RTP environments, even error characteristics are assumed; i.e., the

   packet loss probability of all packets of the session is the same

   statistically.  However, there are means to reduce the packet loss

   probability of individual packets in an RTP session.  A FEC packet

   according to RFC 2733 [18], for example, specifies which media

   packets are associated with the FEC packet.

   In all cases, the incurred overhead is substantial but is in the same

   order of magnitude as the number of bits that have otherwise been

   spent for intra information.  However, this mechanism does not add

   any delay to the system.

   Again, the complete parameter set establishment is performed through

   control protocol means.

12.5.  Video Telephony or Streaming with FUs and Forward Error

       Correction

   This scheme has been implemented and has been shown to provide good

   performance, especially at higher packet loss rates [19].

   The most efficient means to combat packet losses for scenarios where

   retransmissions are not applicable is forward error correction (FEC).

   Although application layer, end-to-end use of FEC is often less

   efficient than an FEC-based protection of individual links

   (especially when links of different characteristics are in the

   transmission path), application layer, end-to-end FEC is unavoidable

   in some scenarios.  RFC 2733 [18] provides means to use generic,

   application layer, end-to-end FEC in packet-loss environments.  A

   binary forward error correcting code is generated by applying the XOR

   operation to the bits at the same bit position in different packets.

   The binary code can be specified by the parameters (n,k) in which k

   is the number of information packets used in the connection and n is

   the total number of packets generated for k information packets;

   i.e., n-k parity packets are generated for k information packets.

   When a code is used with parameters (n,k) within the RFC 2733

   framework, the following properties are well known:

   a) If applied over one RTP packet, RFC 2733 provides only packet

      repetition.

   b) RFC 2733 is most bit rate efficient if XOR-connected packets have

      equal length.

   c) At the same packet loss probability p and for a fixed k, the

      greater the value of n is, the smaller the residual error

      probability becomes.  For example, for a packet loss probability

      of 10%, k=1, and n=2, the residual error probability is about 1%,

      whereas for n=3, the residual error probability is about 0.1%.

   d) At the same packet loss probability p and for a fixed code rate

      k/n, the greater the value of n is, the smaller the residual error

      probability becomes.  For example, at a packet loss probability of

      p=10%, k=1 and n=2, the residual error rate is about 1%, whereas

      for an extended Golay code with k=12 and n=24, the residual error

      rate is about 0.01%.
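
   The k=1 figures in c) follow directly from property a): with a
   single information packet, every parity packet is a copy of it, so
   the information is lost only if all n packets are lost.  A minimal
   sketch, assuming independent packet losses (the function name is
   hypothetical):

```python
def residual_loss_repetition(p: float, n: int) -> float:
    """Residual loss probability of an (n, k=1) code.

    With k=1 the XOR parity equals the packet itself, so the packet is
    unrecoverable only if all n copies are lost.  Independent losses
    with probability p are assumed.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return p ** n


print(residual_loss_repetition(0.10, 2))  # about 0.01  (1%)
print(residual_loss_repetition(0.10, 3))  # about 0.001 (0.1%)
```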

   For applying RFC 2733 in combination with H.264 baseline coded video

   without using FUs, several options might be considered:

   1) The video encoder produces NAL units for which each video frame is

      coded in a single slice.  Applying FEC, one could use a simple

      code; e.g., (n=2, k=1).  That is, each NAL unit would basically

      just be repeated.  The disadvantage is obviously the bad code

      performance according to d), above, and the low flexibility, as

      only (n, k=1) codes can be used.

   2) The video encoder produces NAL units for which each video frame is

      encoded in one or more consecutive slices.  Applying FEC, one

      could use a better code, e.g., (n=24, k=12), over a sequence of

      NAL units.  Depending on the number of RTP packets per frame, a

      loss may introduce a significant delay, which is reduced when more

      RTP packets are used per frame.  Packets of completely different

      length might also be connected, which decreases bit rate

      efficiency according to b), above.  However, with some care and

      for slices of 1kb or larger, similar length (100-200 bytes

      difference) may be produced, which will not lower the bit

      efficiency catastrophically.

   3) The video encoder produces NAL units, for which a certain frame

      contains k slices of possibly almost equal length.  Then, applying

      FEC, a better code, e.g., (n=24, k=12), can be used over the

      sequence of NAL units for each frame.  The delay compared to that

      of 2), above,  may be reduced, but several disadvantages are

      obvious.  First, the coding efficiency of the encoded video is

      lowered significantly, as slice-structured coding reduces intra-

      frame prediction and additional slice overhead is necessary.

      Second, pre-encoded content or, when operating over a gateway, the

      video is usually not appropriately coded with k slices such that

      FEC can be applied.  Finally, the encoding of video producing k

      slices of equal length is not straightforward and might require

      more than one encoding pass.

   Many of the mentioned disadvantages can be avoided by applying FUs in

   combination with FEC.  Each NAL unit can be split into any number of

   FUs of basically equal length; therefore, FEC with a reasonable k and

   n can be applied, even if the encoder made no effort to produce

   slices of equal length.  For example, a coded slice NAL unit

   containing an entire frame can be split to k FUs, and a parity check

   code (n=k+1, k) can be applied.  However, this has the disadvantage

   that unless all created fragments can be recovered, the whole slice

   will be lost.  Thus a larger section is lost than would be if the

   frame had been split into several slices.
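
   The parity check code (n=k+1, k) over equal-length fragments can be
   sketched as follows.  This is an illustration of the XOR principle
   only: it does not produce FU headers or RFC 2733 packets, it pads
   the last fragment with zero bytes, and it does not convey the
   original length, which a real scheme would have to do.  All
   function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def split_into_fragments(payload: bytes, k: int) -> List[bytes]:
    """Split payload into k fragments of equal length (last one zero-padded)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    frag_len = -(-len(payload) // k)  # ceiling division
    padded = payload.ljust(frag_len * k, b"\x00")
    return [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(fragments: List[bytes]) -> bytes:
    """Single parity fragment: XOR of all k information fragments."""
    return reduce(xor_bytes, fragments)


def recover(received: List[Optional[bytes]], parity_frag: bytes) -> List[bytes]:
    """Rebuild at most one missing fragment (marked None) from the parity."""
    missing = [i for i, frag in enumerate(received) if frag is None]
    if len(missing) > 1:
        # As noted in the text, the whole slice is lost in this case.
        raise ValueError("single parity cannot repair more than one loss")
    repaired = list(received)
    if missing:
        present = [frag for frag in received if frag is not None]
        repaired[missing[0]] = reduce(xor_bytes, present, parity_frag)
    return repaired
```

   Losing two of the k+1 packets raises an error here, which mirrors
   the disadvantage stated above: unless all fragments can be
   recovered, the whole slice is lost.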

   The presented technique makes it possible to achieve good

   transmission error tolerance, even if no additional source coding

   layer redundancy (such as periodic intra frames) is present.

   Consequently, the same coded video sequence can be used to achieve

   the maximum compression efficiency and quality over error-free

   transmission and for transmission over error-prone networks.

   Furthermore, the technique allows the application of FEC to pre-

   encoded sequences without adding delay.  In this case, pre-encoded

   sequences that are not encoded for error-prone networks can still be

   transmitted almost reliably without adding extensive delays.  In

   addition, FUs of equal length result in a bit rate efficient use of

   RFC 2733.

   If the error probability depends on the length of the transmitted

   packet (e.g., in case of mobile transmission [14]), the benefits of

   applying FUs with FEC are even more obvious.  Basically, the

   flexibility of the size of FUs allows appropriate FEC to be applied

   for each NAL unit and unequal error protection of NAL units.

   When FUs and FEC are used, the incurred overhead is substantial but

   is in the same order of magnitude as the number of bits that have to

   be spent for intra-coded macroblocks if no FEC is applied.  In [19],

   it was shown that the FEC-based approach improved quality at the

   same error rate and the same overall bit rate, including the

   overhead.

12.6.  Low Bit-Rate Streaming

   This scheme has been implemented with H.263 and non-standard RTP

   packetization and has given good results [20].  There is no technical

   reason why similarly good results could not be achievable with H.264.

   In today's Internet streaming, some of the offered bit rates are

   relatively low in order to allow terminals with dial-up modems to

   access the content.  In wired IP networks, relatively large packets,

   say 500 - 1500 bytes, are preferred to smaller and more frequently

   occurring packets in order to reduce network congestion.  Moreover,

   use of large packets decreases the amount of RTP/UDP/IP header

   overhead.  For low bit-rate video, the use of large packets means

   that sometimes up to a few pictures should be encapsulated in one

   packet.

   However, loss of a packet including many coded pictures would have

   drastic consequences for visual quality, as there is practically no

   other way to conceal a loss of an entire picture than to repeat the

   previous one.  One way to construct relatively large packets and

   maintain possibilities for successful loss concealment is to

   construct MTAPs that contain interleaved slices from several

   pictures.  An MTAP should not contain spatially adjacent slices from

   the same picture or spatially overlapping slices from any picture.

   If a packet is lost, it is likely that a lost slice is surrounded by

   spatially adjacent slices of the same picture and spatially

   corresponding slices of the temporally previous and succeeding

   pictures.  Consequently, concealment of the lost slice is likely to

   be relatively successful.

12.7.  Robust Packet Scheduling in Video Streaming

   Robust packet scheduling has been implemented with MPEG-4 Part 2 and

   simulated in a wireless streaming environment [21].  There is no

   technical reason why similar or better results could not be

   achievable with H.264.

   Streaming clients typically have a receiver buffer that is capable of

   storing a relatively large amount of data.  Initially, when a

   streaming session is established, a client does not start playing the

   stream back immediately.  Rather, it typically buffers the incoming

   data for a few seconds.  This buffering helps maintain continuous

   playback, as, in case of occasional increased transmission delays or

   network throughput drops, the client can decode and play buffered

   data.  Otherwise, without initial buffering, the client has to freeze

   the display, stop decoding, and wait for incoming data.  The

   buffering is also necessary for either automatic or selective

   retransmission in any protocol level.  If any part of a picture is

   lost, a retransmission mechanism may be used to resend the lost data.

   If the retransmitted data is received before its scheduled decoding

   or playback time, the loss is recovered perfectly.  Coded pictures

   can be ranked according to their importance in the subjective quality

   of the decoded sequence.  For example, non-reference pictures, such

   as conventional B pictures, are subjectively least important, as

   their absence does not affect decoding of any other pictures.  In

   addition to non-reference pictures, the ITU-T H.264 | ISO/IEC

   14496-10 standard includes a temporal scalability method called sub-

   sequences [22].  Subjective ranking can also be made on coded slice

   data partition or slice group basis.  Coded slices and coded slice

   data partitions that are subjectively the most important can be sent

   earlier than their decoding order indicates, whereas coded slices and

   coded slice data partitions that are subjectively the least important

   can be sent later than their natural coding order indicates.

   Consequently, any retransmitted parts of the most important slices

   and coded slice data partitions are more likely to be received before

   their scheduled decoding or playback time compared to the least

   important slices and slice data partitions.

13.  Informative Appendix: Rationale for Decoding Order Number

13.1.  Introduction

   The Decoding Order Number (DON) concept was introduced mainly to

   enable efficient multi-picture slice interleaving (see section 12.6)

   and robust packet scheduling (see section 12.7).  In both of these

   applications, NAL units are transmitted out of decoding order.  DON

   indicates the decoding order of NAL units and should be used in the

   receiver to recover the decoding order.  Example use cases for

   efficient multi-picture slice interleaving and for robust packet

   scheduling are given in sections 13.2 and 13.3, respectively.

   Section 13.4 describes the benefits of the DON concept in error

   resiliency achieved by redundant coded pictures.  Section 13.5

   summarizes the alternatives to DON that were considered and explains

   why DON was chosen for this RTP payload specification.

13.2.  Example of Multi-Picture Slice Interleaving

   An example of multi-picture slice interleaving follows.  A subset of

   a coded video sequence is depicted below in output order.  R denotes

   a reference picture, N denotes a non-reference picture, and the

   number indicates a relative output time.

      ... R1 N2 R3 N4 R5 ...

   The decoding order of these pictures from left to right is as

   follows:

      ... R1 R3 N2 R5 N4 ...

   The NAL units of pictures R1, R3, N2, R5, and N4 are marked with a

   DON equal to 1, 2, 3, 4, and 5, respectively.

   Each reference picture consists of three slice groups that are

   scattered as follows (a number denotes the slice group number for

   each macroblock in a QCIF frame):

      0 1 2 0 1 2 0 1 2 0 1

      2 0 1 2 0 1 2 0 1 2 0

      1 2 0 1 2 0 1 2 0 1 2

   For the sake of simplicity, we assume that all the macroblocks of a

   slice group are included in one slice.  Three MTAPs are constructed

   from three consecutive reference pictures so that each MTAP contains

   three aggregation units, each of which contains all the macroblocks

   from one slice group.  The first MTAP contains slice group 0 of

   picture R1, slice group 1 of picture R3, and slice group 2 of

   picture R5.  The second MTAP contains slice group 1 of picture R1,

   slice group 2 of picture R3, and slice group 0 of picture R5.  The

   third MTAP contains slice group 2 of picture R1, slice group 0 of

   picture R3, and slice group 1 of picture R5.  Each non-reference

   picture is encapsulated into an STAP-B.
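
   The assignment of slice groups to MTAPs described above follows a
   simple rotation: MTAP m carries slice group (m + j) mod 3 of the
   j-th reference picture.  The sketch below reproduces that
   assignment; the function name and the tuple representation are
   hypothetical, and it assumes as many pictures as slice groups.

```python
def build_mtaps(pictures, num_slice_groups=3):
    """Return one list per MTAP of (picture, slice_group) aggregation units.

    MTAP m carries slice group (m + j) mod num_slice_groups of the
    j-th picture, so no MTAP holds two slice groups of one picture.
    """
    return [
        [(pic, (m + j) % num_slice_groups) for j, pic in enumerate(pictures)]
        for m in range(num_slice_groups)
    ]


for index, mtap in enumerate(build_mtaps(["R1", "R3", "R5"])):
    print(index, mtap)
```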

   Consequently, the transmission order of NAL units is the following:

      R1, slice group 0, DON 1, carried in MTAP,   RTP SN: N

      R3, slice group 1, DON 2, carried in MTAP,   RTP SN: N

      R5, slice group 2, DON 4, carried in MTAP,   RTP SN: N

      R1, slice group 1, DON 1, carried in MTAP,   RTP SN: N+1

      R3, slice group 2, DON 2, carried in MTAP,   RTP SN: N+1

      R5, slice group 0, DON 4, carried in MTAP,   RTP SN: N+1

      R1, slice group 2, DON 1, carried in MTAP,   RTP SN: N+2

      R3, slice group 0, DON 2, carried in MTAP,   RTP SN: N+2

      R5, slice group 1, DON 4, carried in MTAP,   RTP SN: N+2

      N2,                DON 3, carried in STAP-B, RTP SN: N+3

      N4,                DON 5, carried in STAP-B, RTP SN: N+4

   The receiver is able to organize the NAL units back in decoding order

   based on the value of DON associated with each NAL unit.
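
   Recovering the decoding order at the receiver amounts to ordering
   the received NAL units by DON.  The sketch below is deliberately
   simplified: it ignores the wrap-around of the 16-bit DON value and
   any buffering limits, and the record type and field names are
   hypothetical.

```python
from typing import List, NamedTuple, Optional


class ReceivedNalu(NamedTuple):
    picture: str                 # e.g. "R1"
    slice_group: Optional[int]   # None for the non-reference pictures
    don: int                     # decoding order number
    sn_offset: int               # RTP sequence number relative to N


def to_decoding_order(nalus: List[ReceivedNalu]) -> List[ReceivedNalu]:
    """Order NAL units by DON (wrap-around is NOT handled in this sketch).

    sorted() is stable, so NAL units sharing a DON keep their
    reception order.
    """
    return sorted(nalus, key=lambda nalu: nalu.don)


# Transmission order of the example: three MTAPs, then two STAP-Bs.
received = [
    ReceivedNalu("R1", 0, 1, 0), ReceivedNalu("R3", 1, 2, 0), ReceivedNalu("R5", 2, 4, 0),
    ReceivedNalu("R1", 1, 1, 1), ReceivedNalu("R3", 2, 2, 1), ReceivedNalu("R5", 0, 4, 1),
    ReceivedNalu("R1", 2, 1, 2), ReceivedNalu("R3", 0, 2, 2), ReceivedNalu("R5", 1, 4, 2),
    ReceivedNalu("N2", None, 3, 3),
    ReceivedNalu("N4", None, 5, 4),
]
print([nalu.picture for nalu in to_decoding_order(received)])
```

   The resulting picture sequence is R1, R3, N2, R5, N4, with the three
   slice groups of each reference picture adjacent to each other.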

   If one of the MTAPs is lost, the spatially adjacent and temporally

   co-located macroblocks are received and can be used to conceal the

   loss efficiently.  If one of the STAPs is lost, the effect of the

   loss does not propagate temporally.

13.3.  Example of Robust Packet Scheduling

   An example of robust packet scheduling follows.  The communication

   system used in the example consists of the following components in

   the order that the video is processed from source to sink:

      o camera and capturing

      o pre-encoding buffer

      o encoder

      o encoded picture buffer

      o transmitter

      o transmission channel

      o receiver

      o receiver buffer

      o decoder

      o decoded picture buffer

      o display

   The video communication system used in the example operates as

   follows.  Note that processing of the video stream happens gradually

   and at the same time in all components of the system.  The source

   video sequence is shot and captured to a pre-encoding buffer.  The

   pre-encoding buffer can be used to order pictures from sampling order

   to encoding order or to analyze multiple uncompressed frames for bit

   rate control purposes, for example.  In some cases, the pre-encoding

   buffer may not exist; instead, the sampled pictures are encoded right

   away.  The encoder encodes pictures from the pre-encoding buffer and

   stores the output; i.e., coded pictures, to the encoded picture

   buffer.  The transmitter encapsulates the coded pictures from the

   encoded picture buffer to transmission packets and sends them to a

   receiver through a transmission channel.  The receiver stores the

   received packets to the receiver buffer.  The receiver buffering

   process typically includes buffering for transmission delay jitter.

   The receiver buffer can also be used to recover correct decoding

   order of coded data.  The decoder reads coded data from the receiver

   buffer and produces decoded pictures as output into the decoded

   picture buffer.  The decoded picture buffer is used to recover the

   output (or display) order of pictures.  Finally, pictures are

   displayed.

   In the following example figures, I denotes an IDR picture, R denotes

   a reference picture, N denotes a non-reference picture, and the

   number after I, R, or N indicates the sampling time relative to the

   previous IDR picture in decoding order.  Values below the sequence of

   pictures indicate scaled system clock timestamps.  The system clock

   is initialized arbitrarily in this example, and time runs from left

   to right.  Each I, R, and N picture is mapped into the same timeline

   compared to the previous processing step, if any, assuming that

   encoding, transmission, and decoding take no time.  Thus, events

   happening at the same time are located in the same column throughout

   all example figures.

   A subset of a sequence of coded pictures is depicted below in

   sampling order.

       ...  N58 N59 I00 N01 N02 R03 N04 N05 R06 ... N58 N59 I00 N01 ...

       ... --|---|---|---|---|---|---|---|---|- ... -|---|---|---|- ...

       ...  58  59  60  61  62  63  64  65  66  ... 128 129 130 131 ...

      Figure 16.  Sequence of pictures in sampling order

   The sampled pictures are buffered in the pre-encoding buffer to

   arrange them in encoding order.  In this example, we assume that the

   non-reference pictures are predicted from both the previous and the

   next reference picture in output order, except for the non-reference

   pictures immediately preceding an IDR picture, which are predicted

   only from the previous reference picture in output order.  Thus, the

   pre-encoding buffer has to contain at least two pictures, and the

   buffering causes a delay of two picture intervals.  The output of the

   pre-encoding buffering process and the encoding (and decoding) order

   of the pictures are as follows:

                ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...

                ... -|---|---|---|---|---|---|---|---|- ...

                ... 60  61  62  63  64  65  66  67  68  ...

      Figure 17.  Re-ordered pictures in the pre-encoding buffer
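
   The re-ordering from Figure 16 to Figure 17 can be sketched as
   follows, under the prediction structure assumed in this example
   only: non-reference pictures wait for the next reference picture,
   except that those immediately preceding an IDR picture are emitted
   before it.  The function name and the use of the first letter of
   the picture label as its type are hypothetical conventions.

```python
def sampling_to_encoding_order(pictures):
    """Re-order picture labels from sampling order to encoding order.

    Labels start with "I" (IDR), "R" (reference), or "N" (non-reference).
    """
    ordered, pending = [], []
    for pic in pictures:
        kind = pic[0]
        if kind == "N":
            pending.append(pic)        # wait for the next reference picture
        elif kind == "R":
            ordered.append(pic)        # reference picture is coded first
            ordered.extend(pending)    # then the pictures predicted from it
            pending.clear()
        elif kind == "I":
            ordered.extend(pending)    # predicted only from the previous reference
            pending.clear()
            ordered.append(pic)
        else:
            raise ValueError("unknown picture type: " + pic)
    ordered.extend(pending)            # flush whatever is left at the end
    return ordered


sampled = ["N58", "N59", "I00", "N01", "N02", "R03", "N04", "N05", "R06"]
print(sampling_to_encoding_order(sampled))
```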

   The encoder or the transmitter can set the value of DON for each

   picture to the value of DON of the previous picture in decoding

   order plus one.

   For the sake of simplicity, let us assume that:

   o  the frame rate of the sequence is constant,

   o  each picture consists of only one slice,

   o  each slice is encapsulated in a single NAL unit packet,

   o  there is no transmission delay, and

   o  pictures are transmitted at constant intervals (that is, 1 / frame

      rate).

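   When pictures are transmitted in decoding order, they are received as

   follows:

                ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...

                ... -|---|---|---|---|---|---|---|---|- ...

                ... 60  61  62  63  64  65  66  67  68  ...

      Figure 18.  Received pictures in decoding order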

   The OPTIONAL sprop-interleaving-depth MIME type parameter is set to

   0, as the transmission (or reception) order is identical to the

   decoding order.

   The decoder has to buffer for one picture interval initially in its

   decoded picture buffer to organize pictures from decoding order to

   output order as depicted below:

                    ... N58 N59 I00 N01 N02 R03 N04 N05 R06 ...

                    ... -|---|---|---|---|---|---|---|---|- ...

                    ... 61  62  63  64  65  66  67  68  69  ...

      Figure 19.  Output order

   The amount of required initial buffering in the decoded picture

   buffer can be signaled in the buffering period SEI message or with

   the num_reorder_frames syntax element of H.264 video usability

   information.  num_reorder_frames indicates the maximum number of

   frames, complementary field pairs, or non-paired fields that precede

   any frame, complementary field pair, or non-paired field in the

   sequence in decoding order and that follow it in output order.  For

   the sake of simplicity, we assume that num_reorder_frames is used to

   indicate the initial buffer in the decoded picture buffer.  In this

   example, num_reorder_frames is equal to 1.

   It can be observed that if the IDR picture I00 is lost during

   transmission and a retransmission request is issued when the value of

   the system clock is 62, there is one picture interval of time (until

   the system clock reaches timestamp 63) to receive the retransmitted

   IDR picture I00.

   Let us then assume that IDR pictures are transmitted two frame

   intervals earlier than their decoding position; i.e., the pictures

   are transmitted as follows:

                       ...  I00 N58 N59 R03 N01 N02 R06 N04 N05 ...

                       ... --|---|---|---|---|---|---|---|---|- ...

                       ...  62  63  64  65  66  67  68  69  70  ...

      Figure 20.  Interleaving: Early IDR pictures in sending order

   The OPTIONAL sprop-interleaving-depth MIME type parameter is set

   equal to 1 according to its definition.  (The value of sprop-

   interleaving-depth in this example can be derived as follows:

   Picture I00 is the only picture preceding picture N58 or N59 in

   transmission order and following it in decoding order.  Except for

   pictures I00, N58, and N59, the transmission order is the same as the

   decoding order of pictures.  As a coded picture is encapsulated into

   exactly one NAL unit, the value of sprop-interleaving-depth is equal

   to the maximum number of pictures preceding any picture in

   transmission order and following the picture in decoding order.)
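   The derivation in the parenthetical above can be checked with a short
   sketch.  It assumes, as in this example, that each coded picture is
   exactly one NAL unit and that picture labels are unique; the function
   name interleaving_depth is hypothetical.

```python
def interleaving_depth(transmission_order, decoding_order):
    """Largest number of pictures sent before a picture but decoded after it."""
    decode_pos = {picture: i for i, picture in enumerate(decoding_order)}
    depth = 0
    for i, picture in enumerate(transmission_order):
        sent_earlier_decoded_later = sum(
            1
            for other in transmission_order[:i]
            if decode_pos[other] > decode_pos[picture]
        )
        depth = max(depth, sent_earlier_decoded_later)
    return depth


decoding = ["N58", "N59", "I00", "R03", "N01", "N02", "R06", "N04", "N05"]
sending = ["I00", "N58", "N59", "R03", "N01", "N02", "R06", "N04", "N05"]

early_idr_depth = interleaving_depth(sending, decoding)     # 1
in_order_depth = interleaving_depth(decoding, decoding)     # 0
```

   Only I00 is sent ahead of pictures that precede it in decoding order,
   which gives the value 1; sending in decoding order gives 0.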

   The receiver buffering process contains two pictures at a time

   according to the value of the sprop-interleaving-depth parameter and

   orders pictures from the reception order to the correct decoding

   order based on the value of DON associated with each picture.  The

   output of the receiver buffering process is as follows:

                            ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...

                            ... -|---|---|---|---|---|---|---|---|- ...

                            ... 63  64  65  66  67  68  69  70  71  ...

      Figure 21.  Interleaving: Receiver buffer
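   The receiver buffering step can be sketched as follows.  This is a
   simplified illustration, not the de-packetization process itself: it
   compares DON values as plain integers and therefore ignores DON
   wrap-around, and the DON values used are simply the positions of the
   pictures in decoding order.  The function name deinterleave is
   hypothetical.

```python
import heapq


def deinterleave(received, depth):
    """Yield pictures in DON order, holding at most depth + 1 pictures.

    `received` is an iterable of (DON, picture) pairs in reception order.
    Assumption: DON values do not wrap within the sequence.
    """
    buffer = []
    for don, picture in received:
        heapq.heappush(buffer, (don, picture))
        if len(buffer) > depth:
            # The buffer is full: release the picture with the smallest DON.
            yield heapq.heappop(buffer)[1]
    while buffer:
        # End of stream: drain the remaining pictures in DON order.
        yield heapq.heappop(buffer)[1]


received = [
    (2, "I00"), (0, "N58"), (1, "N59"), (3, "R03"), (4, "N01"),
    (5, "N02"), (6, "R06"), (7, "N04"), (8, "N05"),
]
decoding_order = list(deinterleave(received, depth=1))
```

   With a depth of 1, the first picture leaves the buffer when the
   second one arrives, which corresponds to N58 appearing at time 63 in
   Figure 21.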

   Again, an initial buffering delay of one picture interval is needed

   to organize pictures from decoding order to output order, as depicted

   below:

                                ... N58 N59 I00 N01 N02 R03 N04 N05 ...

                                ... -|---|---|---|---|---|---|---|- ...

                                ... 64  65  66  67  68  69  70  71  ...

      Figure 22.  Interleaving: Receiver buffer after reordering

   Note that the maximum delay that IDR pictures can undergo during

   transmission, including possible application, transport, or link

   layer retransmission, is equal to three picture intervals.  Thus, the

   loss resiliency of IDR pictures is improved in systems supporting

   retransmission compared to the case in which pictures were

   transmitted in their decoding order.

13.4.  Robust Transmission Scheduling of Redundant Coded Slices

   A redundant coded picture is a coded representation of a picture or a

   part of a picture that is not used in the decoding process if the

   corresponding primary coded picture is correctly decoded.  There

   should be no noticeable difference between any area of the decoded

   primary picture and a corresponding area that would result from

   application of the H.264 decoding process for any redundant picture

   in the same access unit.  A redundant coded slice is a coded slice

   that is a part of a redundant coded picture.

   Redundant coded pictures can be used to provide unequal error

   protection in error-prone video transmission.  If a primary coded

   representation of a picture is decoded incorrectly, a corresponding

   redundant coded picture can be decoded.  Examples of applications and

   coding techniques using the redundant coded picture feature include

   the video redundancy coding [23] and the protection of "key pictures"

   in multicast streaming [24].

   One property of many error-prone video communications systems is that

   transmission errors are often bursty.  Therefore, they may affect

   several consecutive packets in transmission order.

   In low bit-rate video communication, it is relatively common that an

   entire coded picture can be encapsulated into one transmission

   packet.  Consequently, a primary coded picture and the corresponding

   redundant coded pictures may be transmitted in consecutive packets in

   transmission order.  To make the transmission scheme more tolerant of

   bursty transmission errors, it is beneficial to transmit the primary

   coded picture and redundant coded picture separated by more than a

   single packet.  The DON concept enables this.
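   One possible scheduling of this kind is sketched below.  The scheme
   is purely hypothetical: the function name, the lag parameter, the
   labels P (primary) and R (redundant), and the choice of giving each
   redundant coded picture the DON immediately after its primary coded
   picture are assumptions made for illustration, and one packet per
   coded picture is assumed.

```python
def schedule_with_redundancy(num_pictures, lag):
    """Return (DON, label) packets in a hypothetical transmission order.

    The redundant copy of picture j is sent right after primary picture
    j + lag, so the two are not adjacent on the wire.
    """
    packets = []
    for i in range(num_pictures):
        packets.append((2 * i, f"P{i}"))          # primary coded picture i
        if i >= lag:
            j = i - lag
            packets.append((2 * j + 1, f"R{j}"))  # delayed redundant copy
    for j in range(max(num_pictures - lag, 0), num_pictures):
        packets.append((2 * j + 1, f"R{j}"))      # flush the remaining copies
    return packets


sent = schedule_with_redundancy(num_pictures=4, lag=2)
sent_labels = [label for _, label in sent]
restored_labels = [label for _, label in sorted(sent)]
```

   Sorting the received packets by DON restores the decoding order, in
   which each redundant coded picture directly follows its primary coded
   picture, although the two were sent at least two packets apart.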

13.5.  Remarks on Other Design Possibilities

   The slice header syntax structure of the H.264 coding standard

   contains the frame_num syntax element that can indicate the decoding

   order of coded frames.  However, using the frame_num syntax element

   to recover the decoding order is neither feasible nor desirable, for

   the following reasons:

   o  The receiver is required to parse at least one slice header per

      coded picture (before passing the coded data to the decoder).

   o  Coded slices from multiple coded video sequences cannot be

      interleaved, as the frame number syntax element is reset to 0 in

      each IDR picture.

   o  The coded fields of a complementary field pair share the same

      value of the frame_num syntax element.  Thus, the decoding order

      of the coded fields of a complementary field pair cannot be

      recovered based on the frame_num syntax element or any other

      syntax element of the H.264 coding syntax.

   The RTP payload format for transport of MPEG-4 elementary streams

   [25] enables interleaving of access units and transmission of

   multiple access units in the same RTP packet.  An access unit is

   specified in the H.264 coding standard to comprise all NAL units

   associated with a primary coded picture according to subclause

   7.4.1.2 of [1].  Consequently, with that payload format, slices of

   different pictures cannot be interleaved, and the multi-picture slice

   interleaving technique (see section 12.6) for improved error

   resilience cannot be used.

14.  Acknowledgements

   The authors thank Roni Even, Dave Lindbergh, Philippe Gentric,

   Gonzalo Camarillo, Gary Sullivan, Joerg Ott, and Colin Perkins for

   careful review.

15.  References

15.1.  Normative References

   [1]  ITU-T Recommendation H.264, "Advanced video coding for generic

        audiovisual services", May 2003.

   [2]  ISO/IEC International Standard 14496-10:2003.

   [3]  Bradner, S., "Key words for use in RFCs to Indicate Requirement

        Levels", BCP 14, RFC 2119, March 1997.

   [4]  Schulzrinne, H.,  Casner, S., Frederick, R., and V. Jacobson,

        "RTP: A Transport Protocol for Real-Time Applications", STD 64,

        RFC 3550, July 2003.

   [5]  Handley, M. and V. Jacobson, "SDP: Session Description

        Protocol", RFC 2327, April 1998.

   [6]  Josefsson, S., "The Base16, Base32, and Base64 Data Encodings",

        RFC 3548, July 2003.

   [7]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with

        Session Description Protocol (SDP)", RFC 3264, June 2002.

15.2.  Informative References

   [8]  "Draft ITU-T Recommendation and Final Draft International

        Standard of Joint Video Specification (ITU-T Rec. H.264 |

        ISO/IEC 14496-10 AVC)", available from http://ftp3.itu.int/av-

        arch/jvt-site/2003_03_Pattaya/JVT-G050r1.zip, May 2003.

   [9]  Luthra, A., Sullivan, G.J., and T. Wiegand (eds.), Special Issue

        on H.264/AVC. IEEE Transactions on Circuits and Systems for Video

        Technology, July 2003.

   [10] Bormann, C., Cline, L., Deisher, G., Gardos, T., Maciocco, C.,

        Newell, D., Ott, J., Sullivan, G., Wenger, S., and C. Zhu, "RTP

        Payload Format for the 1998 Version of ITU-T Rec. H.263 Video

        (H.263+)", RFC 2429, October 1998.

   [11] ISO/IEC IS 14496-2.

   [12] Wenger, S., "H.26L over IP", IEEE Transactions on Circuits and

        Systems for Video Technology, Vol. 13, No. 7, July 2003.

   [13] Wenger, S., "H.26L over IP: The IP Network Adaptation Layer",

        Proceedings Packet Video Workshop 02, April 2002.

   [14] Stockhammer, T., Hannuksela, M.M., and S. Wenger, "H.26L/JVT

        Coding Network Abstraction Layer and IP-based Transport" in

        Proc. ICIP 2002, Rochester, NY, September 2002.

   [15] ITU-T Recommendation H.241, "Extended video procedures and

        control signals for H.300 series terminals", 2004.

   [16] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video

        Conferences with Minimal Control", STD 65, RFC 3551, July 2003.

   [17] ITU-T Recommendation H.223, "Multiplexing protocol for low bit

        rate multimedia communication", July 2001.

   [18] Rosenberg, J. and H. Schulzrinne, "An RTP Payload Format for

        Generic Forward Error Correction", RFC 2733, December 1999.

   [19] Stockhammer, T., Wiegand, T., Oelbaum, T., and F. Obermeier,

        "Video Coding and Transport Layer Techniques for H.264/AVC-Based

        Transmission over Packet-Lossy Networks", IEEE International

        Conference on Image Processing (ICIP 2003), Barcelona, Spain,

        September 2003.

   [20] Varsa, V. and M. Karczewicz, "Slice interleaving in compressed

        video packetization", Packet Video Workshop 2000.

   [21] Kang, S.H. and A. Zakhor, "Packet scheduling algorithm for

        wireless video streaming," International Packet Video Workshop

        2002.

   [22] Hannuksela, M.M., "Enhanced concept of GOP", JVT-B042, available

        from http://ftp3.itu.int/av-arch/video-site/0201_Gen/JVT-B042.doc,

        January 2002.

   [23] Wenger, S., "Video Redundancy Coding in H.263+", 1997

        International Workshop on Audio-Visual Services over Packet

        Networks, September 1997.

   [24] Wang, Y.-K., Hannuksela, M.M., and M. Gabbouj, "Error Resilient

        Video Coding Using Unequally Protected Key Pictures", in Proc.

        International Workshop VLBV03, September 2003.

   [25] van der Meer, J., Mackie, D., Swaminathan, V., Singer, D., and

        P. Gentric, "RTP Payload Format for Transport of MPEG-4

        Elementary Streams", RFC 3640, November 2003.

   [26] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.

        Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC

        3711, March 2004.

   [27] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming

        Protocol (RTSP)", RFC 2326, April 1998.

   [28] Handley, M., Perkins, C., and E. Whelan, "Session Announcement

        Protocol", RFC 2974, October 2000.

   [29] ISO/IEC 14496-15: "Information technology - Coding of audio-

        visual objects - Part 15: Advanced Video Coding (AVC) file

        format".

   [30] Castagno, R. and D. Singer, "MIME Type Registrations for 3rd

        Generation Partnership Project (3GPP) Multimedia files", RFC

        3839, July 2004.

Authors' Addresses

   Stephan Wenger

   TU Berlin / Teles AG

   Franklinstr. 28-29

   D-10587 Berlin

   Germany

   Phone: +49-172-300-0813

   EMail: [email protected]

   Miska M. Hannuksela

   Nokia Corporation

   P.O. Box 100

   33721 Tampere

   Finland

   Phone: +358-7180-73151

   EMail: [email protected]

   Thomas Stockhammer

   Nomor Research

   D-83346 Bergen

   Phone: +49-8662-419407

   EMail: [email protected]

   Magnus Westerlund

   Multimedia Technologies

   Ericsson Research EAB/TVA/A

   Ericsson AB

   Torshamsgatan 23

   SE-164 80 Stockholm

   Sweden

   Phone: +46-8-7190000

   EMail: [email protected]

   David Singer

   QuickTime Engineering

   Apple

   1 Infinite Loop MS 302-3MT

   Cupertino

   CA 95014

   USA

   Phone: +1 408 974-3162

   EMail: [email protected]

Full Copyright Statement

   This document is subject to the rights, licenses and restrictions

   contained in BCP 78, and except as set forth therein, the authors

   retain all their rights.

   This document and the information contained herein are provided on an

   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS

   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET

   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,

   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE

   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED

   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any

   Intellectual Property Rights or other rights that might be claimed to

   pertain to the implementation or use of the technology described in

   this document or the extent to which any license under such rights

   might or might not be available; nor does it represent that it has

   made any independent effort to identify any such rights.  Information

   on the IETF's procedures with respect to rights in IETF Documents can

   be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any

   assurances of licenses to be made available, or the result of an

   attempt made to obtain a general license or permission for the use of

   such proprietary rights by implementers or users of this

   specification can be obtained from the IETF on-line IPR repository at

   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any

   copyrights, patents or patent applications, or other proprietary

   rights that may cover technology that may be required to implement

   this standard.  Please address the information to the IETF at ietf-

   [email protected].

Acknowledgement

   Funding for the RFC Editor function is currently provided by the

   Internet Society.
