RTP Payload Format for H.264 Video

Status of This Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Abstract

This memo describes an RTP Payload format for the ITU-T
Recommendation H.264 video codec and the technically identical
ISO/IEC International Standard 14496-10 video codec. The RTP payload
format allows for packetization of one or more Network Abstraction
Layer Units (NALUs), produced by an H.264 video encoder, in each RTP
payload. The payload format has wide applicability, as it supports
applications from simple low bit-rate conversational usage, to
Internet video streaming with interleaved transmission, to high
bit-rate video-on-demand.

Table of Contents

1. Introduction
1.1. The H.264 Codec
1.2. Parameter Set Concept
1.3. Network Abstraction Layer Unit Types
2. Conventions
3. Scope
4. Definitions and Abbreviations
4.1. Definitions
5. RTP Payload Format
5.1. RTP Header Usage
5.2. Common Structure of the RTP Payload Format
5.3. NAL Unit Octet Usage
5.4. Packetization Modes
5.5. Decoding Order Number (DON)
5.6. Single NAL Unit Packet
5.7. Aggregation Packets
5.8. Fragmentation Units (FUs)
6. Packetization Rules
6.1. Common Packetization Rules
6.2. Single NAL Unit Mode
6.3. Non-Interleaved Mode
6.4. Interleaved Mode
7. De-Packetization Process (Informative)
7.1. Single NAL Unit and Non-Interleaved Mode
7.2. Interleaved Mode
7.3. Additional De-Packetization Guidelines
8. Payload Format Parameters
8.1. MIME Registration
8.2. SDP Parameters
8.3. Examples
8.4. Parameter Set Considerations
9. Security Considerations
10. Congestion Control
11. IANA Considerations
12. Informative Appendix: Application Examples
12.1. Video Telephony according to ITU-T Recommendation H.241 Annex A
12.2. Video Telephony, No Slice Data Partitioning, No NAL Unit
      Aggregation
12.3. Video Telephony, Interleaved Packetization Using NAL Unit
      Aggregation
12.4. Video Telephony with Data Partitioning
12.5. Video Telephony or Streaming with FUs and Forward Error
      Correction
12.6. Low Bit-Rate Streaming
12.7. Robust Packet Scheduling in Video Streaming
13. Informative Appendix: Rationale for Decoding Order Number
13.1. Introduction
13.2. Example of Multi-Picture Slice Interleaving
13.3. Example of Robust Packet Scheduling
13.4. Robust Transmission Scheduling of Redundant Coded Slices
13.5. Remarks on Other Design Possibilities
14. Acknowledgements
15. References
15.1. Normative References
15.2. Informative References
Authors' Addresses

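1. Introduction

1.1. The H.264 Codec

This memo specifies an RTP payload specification for the video coding
standard known as ITU-T Recommendation H.264 [1] and ISO/IEC
International Standard 14496 Part 10 [2] (both also known as Advanced
Video Coding, or AVC). Recommendation H.264 was approved by ITU-T in
May 2003, and the approved draft specification is available for
public review [8]. In this memo the H.264 acronym is used for the
codec and the standard, but the memo is equally applicable to the
ISO/IEC counterpart of the coding standard.

The H.264 video codec has a very broad application range that covers
all forms of digital compressed video, from low bit-rate Internet
streaming applications to HDTV broadcast and Digital Cinema
applications. Compared to the current state of technology, the
overall performance of H.264 is such that bit rate savings of 50% are
reported. Digital Satellite TV quality, for example, was reported to
be achievable at 1.5 Mbit/s, compared to the current operation point
of MPEG 2 video at around 3.5 Mbit/s [9].

The codec specification [1] itself distinguishes conceptually between
a video coding layer (VCL) and a network abstraction layer (NAL).
The VCL contains the signal processing functionality of the codec;
mechanisms such as transform, quantization, and motion compensated
prediction; and a loop filter. It follows the general concept of
most of today's video codecs, a macroblock-based coder that uses
inter picture prediction with motion compensation and transform
coding of the residual signal. The VCL encoder outputs slices: a bit
string that contains the macroblock data of an integer number of
macroblocks, and the information of the slice header (containing the
spatial address of the first macroblock in the slice, the initial
quantization parameter, and similar information). Macroblocks in
slices are arranged in scan order unless a different macroblock
allocation is specified, by using the so-called Flexible Macroblock
Ordering syntax. In-picture prediction is used only within a slice.
More information is provided in [9].

The Network Abstraction Layer (NAL) encoder encapsulates the slice
output of the VCL encoder into Network Abstraction Layer Units (NAL
units), which are suitable for transmission over packet networks or
use in packet oriented multiplex environments. Annex B of H.264
defines an encapsulation process to transmit such NAL units over
byte-stream oriented networks. In the scope of this memo, Annex B is
not relevant.

Internally, the NAL uses NAL units. A NAL unit consists of a
one-byte header and the payload byte string. The header indicates
the type of the NAL unit, the (potential) presence of bit errors or
syntax violations in the NAL unit payload, and information regarding
the relative importance of the NAL unit for the decoding process.
This RTP payload specification is designed to be unaware of the bit
string in the NAL unit payload.

One of the main properties of H.264 is the complete decoupling of the
transmission time, the decoding time, and the sampling or
presentation time of slices and pictures. The decoding process
specified in H.264 is unaware of time, and the H.264 syntax does not
carry information such as the number of skipped frames (as is common
in the form of the Temporal Reference in earlier video compression
standards). Also, there are NAL units that affect many pictures and
that are, therefore, inherently timeless. For this reason, the
handling of the RTP timestamp requires some special considerations
for NAL units for which the sampling or presentation time is not
defined or, at transmission time, unknown.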

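1.2. Parameter Set Concept

One very fundamental design concept of H.264 is to generate
self-contained packets, to make mechanisms such as the header
duplication of RFC 2429 or MPEG-4's Header Extension Code (HEC) [11]
unnecessary. This was achieved by decoupling information relevant to
more than one slice from the media stream. This higher layer meta
information should be sent reliably, asynchronously, and in advance
from the RTP packet stream that contains the slice packets.
(Provisions for sending this information in-band are also available
for applications that do not have an out-of-band transport channel
appropriate for the purpose.) The combination of the higher-level
parameters is called a parameter set.

The H.264 specification includes two types of parameter sets:
sequence parameter set and picture parameter set. An active sequence
parameter set remains unchanged throughout a coded video sequence,
and an active picture parameter set remains unchanged within a coded
picture. The sequence and picture parameter set structures contain
information such as picture size, optional coding modes employed, and
macroblock to slice group map.

To be able to change picture parameters (such as the picture size)
without having to transmit parameter set updates synchronously to the
slice packet stream, the encoder and decoder can maintain a list of
more than one sequence and picture parameter set. Each slice header
contains a codeword that indicates the sequence and picture parameter
set to be used.

This mechanism allows the decoupling of the transmission of parameter
sets from the packet stream, and the transmission of them by external
means (e.g., as a side effect of the capability exchange), or through
a (reliable or unreliable) control protocol. It may even be possible
that they are never transmitted but are fixed by an application
design specification.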

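1.3. Network Abstraction Layer Unit Types

Tutorial information on the NAL design can be found in [12], [13],
and [14].

All NAL units consist of a single NAL unit type octet, which also
co-serves as the payload header of this RTP payload format. The
payload of a NAL unit follows immediately.

The syntax and semantics of the NAL unit type octet are specified in
[1], but the essential properties of the NAL unit type octet are
summarized below. The NAL unit type octet has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

The semantics of the components of the NAL unit type octet, as
specified in the H.264 specification, are described briefly below.

F: 1 bit

forbidden_zero_bit. The H.264 specification declares a value of 1 as
a syntax violation.

NRI: 2 bits

nal_ref_idc. A value of 00 indicates that the content of the NAL
unit is not used to reconstruct reference pictures for inter picture
prediction. Such NAL units can be discarded without risking the
integrity of the reference pictures. Values greater than 00 indicate
that the decoding of the NAL unit is required to maintain the
integrity of the reference pictures.

Type: 5 bits

nal_unit_type. This component specifies the NAL unit payload type as
defined in table 7-1 of [1], and later within this memo. For a
reference of all currently defined NAL unit types and their
semantics, please refer to section 7.4.1 in [1].

This memo introduces new NAL unit types, which are presented in
section 5.2. The NAL unit types defined in this memo are marked as
unspecified in [1]. Moreover, this specification extends the
semantics of F and NRI as described in section 5.3.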
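
The bit layout of the NAL unit type octet can be illustrated with a
small sketch. The names NalHeader, parse_nal_header, and
build_nal_header are hypothetical and chosen only for this example;
the masks follow the F (1 bit), NRI (2 bits), Type (5 bits) layout
described in this section.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    """Fields of the one-octet NAL unit header (names are illustrative)."""
    f: int         # forbidden_zero_bit, 1 bit
    nri: int       # nal_ref_idc, 2 bits
    nal_type: int  # nal_unit_type, 5 bits


def parse_nal_header(octet: int) -> NalHeader:
    """Split a NAL unit type octet into F, NRI, and Type."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("the NAL unit type octet must fit in one byte")
    return NalHeader(f=(octet >> 7) & 0x01,
                     nri=(octet >> 5) & 0x03,
                     nal_type=octet & 0x1F)


def build_nal_header(f: int, nri: int, nal_type: int) -> int:
    """Pack F, NRI, and Type back into a single octet."""
    if f not in (0, 1) or not 0 <= nri <= 3 or not 0 <= nal_type <= 31:
        raise ValueError("field value out of range")
    return (f << 7) | (nri << 5) | nal_type
```

For instance, the octet 0x67 decodes to F = 0, NRI = 3 (binary 11),
and Type = 7, which is a sequence parameter set in table 7-1 of [1].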

2. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",

"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this

document are to be interpreted as described in BCP 14, RFC 2119 [3].

This specification uses the notion of setting and clearing a bit when

bit fields are handled. Setting a bit is the same as assigning that

bit the value of 1 (On). Clearing a bit is the same as assigning

that bit the value of 0 (Off).

3. Scope

This payload specification can only be used to carry the "naked"

H.264 NAL unit stream over RTP, and not the bitstream format

discussed in Annex B of H.264. Likely, the first applications of

this specification will be in the conversational multimedia field,

video telephony or video conferencing, but the payload format also

covers other applications, such as Internet streaming and TV over IP.

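4. Definitions and Abbreviations

4.1. Definitions

This document uses the definitions of [1]. The following terms,
defined in [1], are summed up for convenience:

access unit: A set of NAL units always containing a primary coded
picture. In addition to the primary coded picture, an access unit
may also contain one or more redundant coded pictures or other NAL
units not containing slices or slice data partitions of a coded
picture. The decoding of an access unit always results in a decoded
picture.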

coded video sequence: A sequence of access units that consists, in

decoding order, of an instantaneous decoding refresh (IDR) access

unit followed by zero or more non-IDR access units including all

subsequent access units up to but not including any subsequent IDR

access unit.

IDR access unit: An access unit in which the primary coded picture

is an IDR picture.

IDR picture: A coded picture containing only slices with I or SI

slice types that causes a "reset" in the decoding process. After

the decoding of an IDR picture, all following coded pictures in

decoding order can be decoded without inter prediction from any

picture decoded prior to the IDR picture.

primary coded picture: The coded representation of a picture to be

used by the decoding process for a bitstream conforming to H.264.

The primary coded picture contains all macroblocks of the picture.

redundant coded picture: A coded representation of a picture or a

part of a picture. The content of a redundant coded picture shall

not be used by the decoding process for a bitstream conforming to

H.264. The content of a redundant coded picture may be used by

the decoding process for a bitstream that contains errors or

losses.

VCL NAL unit: A collective term used to refer to coded slice and

coded data partition NAL units.

In addition, the following definitions apply:

decoding order number (DON): A field in the payload structure, or

a derived variable indicating NAL unit decoding order. Values of

DON are in the range of 0 to 65535, inclusive. After reaching the

maximum value, the value of DON wraps around to 0.

NAL unit decoding order: A NAL unit order that conforms to the

constraints on NAL unit order given in section 7.4.1.2 in [1].

transmission order: The order of packets in ascending RTP sequence

number order (in modulo arithmetic). Within an aggregation

packet, the NAL unit transmission order is the same as the order

of appearance of NAL units in the packet.

media aware network element (MANE): A network element, such as a

middlebox or application layer gateway that is capable of parsing

certain aspects of the RTP payload headers or the RTP payload and

reacting to the contents.

Informative note: The concept of a MANE goes beyond normal

routers or gateways in that a MANE has to be aware of the

signaling (e.g., to learn about the payload type mappings of

the media streams), and in that it has to be trusted when

working with SRTP. The advantage of using MANEs is that they

allow packets to be dropped according to the needs of the media

coding. For example, if a MANE has to drop packets due to

congestion on a certain link, it can identify those packets

whose dropping has the smallest negative impact on the user

experience and remove them in order to remove the congestion

and/or keep the delay low.

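Abbreviations

DON: Decoding Order Number
DONB: Decoding Order Number Base
DOND: Decoding Order Number Difference
FEC: Forward Error Correction
FU: Fragmentation Unit
IDR: Instantaneous Decoding Refresh
IEC: International Electrotechnical Commission
ISO: International Organization for Standardization
ITU-T: International Telecommunication Union, Telecommunication
Standardization Sector
MANE: Media Aware Network Element
MTAP: Multi-Time Aggregation Packet
MTAP16: MTAP with 16-bit timestamp offset
MTAP24: MTAP with 24-bit timestamp offset
NAL: Network Abstraction Layer
NALU: NAL Unit
SEI: Supplemental Enhancement Information
STAP: Single-Time Aggregation Packet
STAP-A: STAP type A
STAP-B: STAP type B
TS: Timestamp
VCL: Video Coding Layer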

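5. RTP Payload Format

5.1. RTP Header Usage

The format of the RTP header is specified in RFC 3550 [4] and
reprinted in Figure 1 for convenience. This payload format uses the
fields of the header in a manner consistent with that specification.

When one NAL unit is encapsulated per RTP packet, the RECOMMENDED RTP
payload format is specified in section 5.6. The RTP payload (and the
settings for some RTP header bits) for aggregation packets and
fragmentation units are specified in sections 5.7 and 5.8,
respectively.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1. RTP header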
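
As an illustration of the layout in Figure 1, the following sketch
unpacks the 12-octet fixed part of an RTP header. It is not part of
the payload format; the names RtpFixedHeader and
parse_rtp_fixed_header are hypothetical, and CSRC entries and header
extensions are deliberately left unparsed.

```python
import struct
from typing import NamedTuple


class RtpFixedHeader(NamedTuple):
    """The fixed 12 octets of an RTP header (illustrative container)."""
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_fixed_header(packet: bytes) -> RtpFixedHeader:
    """Unpack the fields of Figure 1 from the start of an RTP packet."""
    if len(packet) < 12:
        raise ValueError("an RTP packet has at least 12 header octets")
    # Network byte order: two octets, a 16-bit and two 32-bit fields.
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", packet, 0)
    return RtpFixedHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )
```

The H.264 payload starts after the fixed header plus 4 octets per
CSRC entry (and after any header extension, if the X bit is set).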

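The RTP header information to be set according to this RTP payload
format is set as follows:

Marker bit (M): 1 bit

Set for the very last packet of the access unit indicated by the RTP
timestamp, in line with the normal use of the M bit in video formats,
to allow an efficient playout buffer handling. For aggregation
packets (STAP and MTAP), the marker bit in the RTP header MUST be set
to the value that the marker bit of the last NAL unit of the
aggregation packet would have been if it were transported in its own
RTP packet. Decoders MAY use this bit as an early indication of the
last packet of an access unit, but MUST NOT rely on this property.

Informative note: Only one M bit is associated with an aggregation
packet carrying multiple NAL units. Thus, if a gateway has
re-packetized an aggregation packet into several packets, it cannot
reliably set the M bit of those packets.

Payload type (PT): 7 bits

The assignment of an RTP payload type for this new packet format is
outside the scope of this document and will not be specified here.
The assignment of a payload type has to be performed either through
the profile used or in a dynamic way.

Sequence number (SN): 16 bits

Set and used in accordance with RFC 3550. For the single NALU and
non-interleaved packetization mode, the sequence number is used to
determine decoding order for the NALU.

Timestamp: 32 bits

The RTP timestamp is set to the sampling timestamp of the content. A
90 kHz clock rate MUST be used.

If the NAL unit has no timing properties of its own (e.g., parameter
set and SEI NAL units), the RTP timestamp is set to the RTP timestamp
of the primary coded picture of the access unit in which the NAL unit
is included, according to section 7.4.1.2 of [1].

The setting of the RTP timestamp for MTAPs is defined in section
5.7.2.

Receivers SHOULD ignore any picture timing SEI messages included in
access units that have only one display timestamp. Instead,
receivers SHOULD use the RTP timestamp for synchronizing the display
process.

RTP senders SHOULD NOT transmit picture timing SEI messages for
pictures that are not supposed to be displayed as multiple fields.

If one access unit has more than one display timestamp carried in a
picture timing SEI message, then the information in the SEI message
SHOULD be treated as relative to the RTP timestamp, with the earliest
event occurring at the time given by the RTP timestamp, and
subsequent events later, as given by the difference in SEI message
picture time values. Let tSEI1, tSEI2, ..., tSEIn be the display
timestamps carried in the SEI message of an access unit, where tSEI1
is the earliest of all such timestamps. Let tmadjst() be a function
that adjusts the SEI message time scale to a 90-kHz time scale. Let
TS be the RTP timestamp. Then, the display time for the event
associated with tSEI1 is TS. The display time for the event
associated with tSEIx, where x is [2..n], is
TS + tmadjst (tSEIx - tSEI1).

Informative note: Displaying coded frames as fields is needed
commonly in an operation known as 3:2 pulldown, in which film content
that consists of coded frames is displayed on a display using
interlaced scanning. The picture timing SEI message enables carriage
of multiple timestamps for the same coded picture, and therefore the
3:2 pulldown process is perfectly controlled. The picture timing SEI
message mechanism is necessary because only one timestamp per coded
frame can be conveyed in the RTP timestamp.

Informative note: Because H.264 allows the decoding order to be
different from the display order, values of RTP timestamps may not be
monotonically non-decreasing as a function of RTP sequence numbers.
Furthermore, the value for interarrival jitter reported in the RTCP
reports may not be a trustworthy indication of the network
performance, as the calculation rules for interarrival jitter
(section 6.4.1 of RFC 3550) assume that the RTP timestamp of a packet
is directly proportional to its transmission time.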

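5.2. Common Structure of the RTP Payload Format

The payload format defines three different basic payload structures.
A receiver can identify the payload structure by the first byte of
the RTP payload, which co-serves as the RTP payload header and, in
some cases, as the first byte of the payload. This byte is always
structured as a NAL unit header. The NAL unit type field indicates
which structure is present. The possible structures are as follows:

Single NAL Unit Packet: Contains only a single NAL unit in the
payload. The NAL header type field will be equal to the original NAL
unit type; i.e., in the range of 1 to 23, inclusive. Specified in
section 5.6.

Aggregation packet: Packet type used to aggregate multiple NAL units
into a single RTP payload. This packet exists in four versions, the
Single-Time Aggregation Packet type A (STAP-A), the Single-Time
Aggregation Packet type B (STAP-B), Multi-Time Aggregation Packet
(MTAP) with 16-bit offset (MTAP16), and Multi-Time Aggregation Packet
(MTAP) with 24-bit offset (MTAP24). The NAL unit type numbers
assigned for STAP-A, STAP-B, MTAP16, and MTAP24 are 24, 25, 26, and
27, respectively. Specified in section 5.7.

Fragmentation unit: Used to fragment a single NAL unit over multiple
RTP packets. Exists with two versions, FU-A and FU-B, identified
with the NAL unit type numbers 28 and 29, respectively. Specified in
section 5.8.

Table 1. Summary of NAL unit types and their payload structures

Type   Packet     Type name                         Section
---------------------------------------------------------
0      undefined                                    -
1-23   NAL unit   Single NAL unit packet per H.264  5.6
24     STAP-A     Single-time aggregation packet    5.7.1
25     STAP-B     Single-time aggregation packet    5.7.1
26     MTAP16     Multi-time aggregation packet     5.7.2
27     MTAP24     Multi-time aggregation packet     5.7.2
28     FU-A       Fragmentation unit                5.8
29     FU-B       Fragmentation unit                5.8
30-31  undefined                                    -

Informative note: This specification does not limit the size of NAL
units encapsulated in single NAL unit packets and fragmentation
units. The maximum size of a NAL unit encapsulated in any
aggregation packet is 65535 bytes.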
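
A receiver's first dispatch step, as summarized in Table 1, can be
sketched as follows. The function name payload_structure and the
returned strings are assumptions made for this example only.

```python
# Packet names for the NAL unit type values that this memo defines.
_PACKET_NAMES = {
    24: "STAP-A",
    25: "STAP-B",
    26: "MTAP16",
    27: "MTAP24",
    28: "FU-A",
    29: "FU-B",
}


def payload_structure(first_octet: int) -> str:
    """Classify an RTP payload by the Type field of its first octet."""
    nal_type = first_octet & 0x1F  # low 5 bits of the payload header
    if 1 <= nal_type <= 23:
        return "single NAL unit packet"
    # Types 0, 30, and 31 are undefined in Table 1.
    return _PACKET_NAMES.get(nal_type, "undefined")
```

A payload starting with 0x78 (Type 24) is therefore handled as an
STAP-A, while one starting with 0x7C (Type 28) is handled as an FU-A.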

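5.3. NAL Unit Octet Usage

The structure and semantics of the NAL unit octet were introduced in
section 1.3. For convenience, the format of the NAL unit type octet
is reprinted below:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

This section specifies the semantics of F and NRI according to this
specification.

F: 1 bit

forbidden_zero_bit. A value of 0 indicates that the NAL unit type
octet and payload should not contain bit errors or other syntax
violations. A value of 1 indicates that the NAL unit type octet and
payload may contain bit errors or other syntax violations.

MANEs SHOULD set the F bit to indicate detected bit errors in the NAL
unit. The H.264 specification requires that the F bit is equal to 0.
When the F bit is set, the decoder is advised that bit errors or any
other syntax violations may be present in the payload or in the NAL
unit type octet. The simplest decoder reaction to a NAL unit in
which the F bit is equal to 1 is to discard such a NAL unit and to
conceal the lost data in the discarded NAL unit.

NRI: 2 bits

nal_ref_idc. The semantics of value 00 and a non-zero value remain
unchanged from the H.264 specification. In other words, a value of
00 indicates that the content of the NAL unit is not used to
reconstruct reference pictures for inter picture prediction. Such
NAL units can be discarded without risking the integrity of the
reference pictures. Values greater than 00 indicate that the
decoding of the NAL unit is required to maintain the integrity of the
reference pictures.

In addition to the specification above, according to this RTP payload
specification, values of NRI greater than 00 indicate the relative
transport priority, as determined by the encoder. MANEs can use this
information to protect more important NAL units better than they do
less important NAL units. The highest transport priority is 11,
followed by 10, and then by 01; finally, 00 is the lowest.

Informative note: Any non-zero value of NRI is handled identically in
H.264 decoders. Therefore, receivers need not manipulate the value
of NRI when passing NAL units to the decoder.

An H.264 encoder MUST set the value of NRI according to the H.264
specification (subclause 7.4.1) when the value of nal_unit_type is in
the range of 1 to 12, inclusive. In particular, the H.264
specification requires that the value of NRI SHALL be equal to 0 for
all NAL units having nal_unit_type equal to 6, 9, 10, 11, or 12.

For NAL units having nal_unit_type equal to 7 or 8 (indicating a
sequence parameter set or a picture parameter set, respectively), an
H.264 encoder SHOULD set the value of NRI to 11 (in binary format).
For coded slice NAL units of a primary coded picture having
nal_unit_type equal to 5 (indicating a coded slice belonging to an
IDR picture), an H.264 encoder SHOULD set the value of NRI to 11 (in
binary format).

For a mapping of the remaining nal_unit_types to NRI values, the
following example MAY be used and has been shown to be efficient in a
certain environment [13]. Other mappings MAY also be desirable,
depending on the application and the H.264/AVC Annex A profile in
use.

Informative note: Data partitioning is not available in certain
profiles; e.g., in the Main or Baseline profiles. Consequently, the
NAL unit types 2, 3, and 4 can occur only if the video bitstream
conforms to a profile in which data partitioning is allowed and not
in streams that conform to the Main or Baseline profiles.

Table 2. Example of NRI values for coded slices and coded slice data
partitions of primary coded reference pictures

NAL Unit Type   Content of NAL unit             NRI (binary)
----------------------------------------------------------------
1               non-IDR coded slice             10
2               Coded slice data partition A    10
3               Coded slice data partition B    01
4               Coded slice data partition C    01

Informative note: As mentioned before, the NRI value of non-reference
pictures is 00.

An H.264 encoder SHOULD set the value of NRI for coded slice and
coded slice data partition NAL units of redundant coded reference
pictures equal to 01 (in binary format).

Definitions of the values for NRI for NAL unit types 24 to 29,
inclusive, are given in sections 5.7 and 5.8 of this memo.

No recommendation for the value of NRI is given for NAL units having
nal_unit_type in the range of 13 to 23, inclusive, because these
values are reserved for ITU-T and ISO/IEC. No recommendation for the
value of NRI is given for NAL units having nal_unit_type equal to 0
or in the range of 30 to 31, inclusive, as the semantics of these
values are not specified in this memo.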

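5.4. Packetization Modes

This memo specifies three cases of packetization modes:

o Single NAL unit mode

o Non-interleaved mode

o Interleaved mode

The single NAL unit mode is targeted for conversational systems that
comply with ITU-T Recommendation H.241 [15] (see section 12.1). The
non-interleaved mode is targeted for conversational systems that may
not comply with ITU-T Recommendation H.241. In the non-interleaved
mode, NAL units are transmitted in NAL unit decoding order. The
interleaved mode is targeted for systems that do not require very low
end-to-end latency. The interleaved mode allows transmission of NAL
units out of NAL unit decoding order.

The packetization mode in use MAY be signaled by the value of the
OPTIONAL packetization-mode MIME parameter or by external means. The
used packetization mode governs which NAL unit types are allowed in
RTP payloads. Table 3 summarizes the allowed NAL unit types for each
packetization mode. Some NAL unit type values (indicated as
undefined in Table 3) are reserved for future extensions. NAL units
of those types SHOULD NOT be sent by a sender and MUST be ignored by
a receiver. For example, the Types 1-23, with the associated packet
type "NAL unit", are allowed in "Single NAL Unit Mode" and in
"Non-Interleaved Mode", but disallowed in "Interleaved Mode".
Packetization modes are explained in more detail in section 6.

Table 3. Summary of allowed NAL unit types for each packetization
mode (yes = allowed, no = disallowed, ig = ignore)

Type   Packet     Single NAL   Non-Interleaved   Interleaved
                  Unit Mode    Mode              Mode
-------------------------------------------------------------
0      undefined  ig           ig                ig
1-23   NAL unit   yes          yes               no
24     STAP-A     no           yes               no
25     STAP-B     no           no                yes
26     MTAP16     no           no                yes
27     MTAP24     no           no                yes
28     FU-A       no           yes               yes
29     FU-B       no           no                yes
30-31  undefined  ig           ig                ig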
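
Table 3 lends itself to a simple lookup. The sketch below assumes
that the packetization mode is known as the integer 0, 1, or 2 (the
values of the packetization-mode parameter described in section 6);
the names ALLOWED_TYPES, IGNORED_TYPES, and check_payload_type are
hypothetical.

```python
# NAL unit types allowed as RTP payload type per packetization mode:
# 0 = single NAL unit, 1 = non-interleaved, 2 = interleaved (Table 3).
ALLOWED_TYPES = {
    0: frozenset(range(1, 24)),
    1: frozenset(range(1, 24)) | {24, 28},
    2: frozenset({25, 26, 27, 28, 29}),
}

# Types marked "undefined" in Table 3; receivers ignore them.
IGNORED_TYPES = frozenset({0, 30, 31})


def check_payload_type(mode: int, first_octet: int) -> str:
    """Return 'yes', 'no', or 'ig' for a payload under the given mode."""
    if mode not in ALLOWED_TYPES:
        raise ValueError("packetization mode must be 0, 1, or 2")
    nal_type = first_octet & 0x1F
    if nal_type in IGNORED_TYPES:
        return "ig"
    return "yes" if nal_type in ALLOWED_TYPES[mode] else "no"
```

A receiver could call check_payload_type() before any further
parsing and drop payloads for which the result is not "yes".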

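5.5. Decoding Order Number (DON)

In the interleaved packetization mode, the transmission order of NAL
units is allowed to differ from the decoding order of the NAL units.
Decoding order number (DON) is a field in the payload structure or a
derived variable that indicates the NAL unit decoding order.
Rationale and examples of use cases for transmission out of decoding
order and for the use of DON are given in section 13.

The coupling of transmission and decoding order is controlled by the
OPTIONAL sprop-interleaving-depth MIME parameter as follows. When
the value of the OPTIONAL sprop-interleaving-depth MIME parameter is
equal to 0 (explicitly or per default) or transmission of NAL units
out of their decoding order is disallowed by external means, the
transmission order of NAL units MUST conform to the NAL unit decoding
order. When the value of the OPTIONAL sprop-interleaving-depth MIME
parameter is greater than 0 or transmission of NAL units out of their
decoding order is allowed by external means,

o the order of NAL units in an MTAP16 and an MTAP24 is NOT REQUIRED
to be the NAL unit decoding order, and

o the order of NAL units generated by decapsulating STAP-Bs, MTAPs,
and FUs in two consecutive packets is NOT REQUIRED to be the NAL unit
decoding order.

The RTP payload structures for a single NAL unit packet, an STAP-A,
and an FU-A do not include DON. STAP-B and FU-B structures include
DON, and the structure of MTAPs enables derivation of DON as
specified in section 5.7.2.

Informative note: When an FU-A occurs in interleaved mode, it always
follows an FU-B, which sets its DON.

Informative note: If a transmitter wants to encapsulate a single NAL
unit per packet and transmit packets out of their decoding order, the
STAP-B packet type can be used.

In the single NAL unit packetization mode, the transmission order of
NAL units, determined by the RTP sequence number, MUST be the same as
their NAL unit decoding order. In the non-interleaved packetization
mode, the transmission order of NAL units in single NAL unit packets,
STAP-As, and FU-As MUST be the same as their NAL unit decoding order.
The NAL units within an STAP MUST appear in the NAL unit decoding
order. Thus, the decoding order is first provided through the
implicit order within an STAP, and second provided through the RTP
sequence number for the order between STAPs, FUs, and single NAL unit
packets.

Signaling of the value of DON for NAL units carried in STAP-B, MTAP,
and a series of fragmentation units starting with an FU-B is
specified in sections 5.7.1, 5.7.2, and 5.8, respectively. The DON
value of the first NAL unit in transmission order MAY be set to any
value. Values of DON are in the range of 0 to 65535, inclusive.
After reaching the maximum value, the value of DON wraps around to 0.

The decoding order of two NAL units contained in any STAP-B, MTAP, or
a series of fragmentation units starting with an FU-B is determined
as follows. Let DON(i) be the decoding order number of the NAL unit
having index i in the transmission order. Function don_diff(m,n) is
specified as follows: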

If DON(m) == DON(n), don_diff(m,n) = 0

If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),

don_diff(m,n) = DON(n) - DON(m)

If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),

don_diff(m,n) = 65536 - DON(m) + DON(n)

If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),

don_diff(m,n) = - (DON(m) + 65536 - DON(n))

If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),

don_diff(m,n) = - (DON(m) - DON(n))

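A positive value of don_diff(m,n) indicates that the NAL unit having
transmission order index n follows, in decoding order, the NAL unit
having transmission order index m. When don_diff(m,n) is equal to 0,
the NAL unit decoding order of the two NAL units can be in either
order. A negative value of don_diff(m,n) indicates that the NAL unit
having transmission order index n precedes, in decoding order, the
NAL unit having transmission order index m.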
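
The five cases of don_diff translate directly into code. The sketch
below takes the two DON values themselves rather than transmission
order indices, which is a simplification made for this example.

```python
def don_diff(don_m: int, don_n: int) -> int:
    """Signed decoding-order distance from DON(m) to DON(n), with wrap.

    A positive result means n follows m in decoding order, a negative
    result means n precedes m, and 0 means either order is acceptable.
    """
    for don in (don_m, don_n):
        if not 0 <= don <= 65535:
            raise ValueError("DON values are in the range 0..65535")
    if don_m == don_n:
        return 0
    if don_m < don_n and don_n - don_m < 32768:
        return don_n - don_m
    if don_m > don_n and don_m - don_n >= 32768:
        return 65536 - don_m + don_n
    if don_m < don_n and don_n - don_m >= 32768:
        return -(don_m + 65536 - don_n)
    # Remaining case: don_m > don_n and don_m - don_n < 32768.
    return -(don_m - don_n)
```

For example, don_diff(65535, 1) is 2 because the DON value wraps
around from 65535 to 0, and don_diff(1, 65535) is -2.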

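Values of DON related fields (DON, DONB, and DOND; see section 5.7)
MUST be such that the decoding order determined by the values of DON,
as specified above, conforms to the NAL unit decoding order. If the
order of two NAL units in NAL unit decoding order is switched and the
new order does not conform to the NAL unit decoding order, the NAL
units MUST NOT have the same value of DON. If the order of two
consecutive NAL units in the NAL unit stream is switched and the new
order still conforms to the NAL unit decoding order, the NAL units
MAY have the same value of DON. For example, when arbitrary slice
order is allowed by the video coding profile in use, all the coded
slice NAL units of a coded picture are allowed to have the same value
of DON. Consequently, NAL units having the same value of DON can be
decoded in any order, and two NAL units having a different value of
DON should be passed to the decoder in the order specified above.
When two consecutive NAL units in the NAL unit decoding order have a
different value of DON, the value of DON for the second NAL unit in
decoding order SHOULD be the value of DON for the first, incremented
by one.

An example of the decapsulation process to recover the NAL unit
decoding order is given in section 7.

Informative note: Receivers should not expect that the absolute
difference of values of DON for two consecutive NAL units in the NAL
unit decoding order is equal to one, even in error-free transmission.
An increment by one is not required, as at the time of associating
values of DON to NAL units, it may not be known whether all NAL units
are delivered to the receiver. For example, a gateway may not
forward coded slice NAL units of non-reference pictures or SEI NAL
units when there is a shortage of bit rate in the network to which
the packets are forwarded. In another example, a live broadcast is
interrupted by pre-encoded content, such as commercials, from time to
time. The first intra picture of a pre-encoded clip is transmitted
in advance to ensure that it is readily available in the receiver.
When transmitting the first intra picture, the originator does not
exactly know how many NAL units will be encoded before the first
intra picture of the pre-encoded clip follows in decoding order.
Thus, the values of DON for the NAL units of the first intra picture
of the pre-encoded clip have to be estimated when they are
transmitted, and gaps in values of DON may occur.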

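5.6. Single NAL Unit Packet

The single NAL unit packet defined here MUST contain only one NAL
unit, of the types defined in [1]. This means that neither an
aggregation packet nor a fragmentation unit can be used within a
single NAL unit packet. A NAL unit stream composed by decapsulating
single NAL unit packets in RTP sequence number order MUST conform to
the NAL unit decoding order. The structure of the single NAL unit
packet is shown in Figure 2.

Informative note: The first byte of a NAL unit co-serves as the RTP
payload header.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|NRI|  type   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|               Bytes 2..n of a Single NAL unit                 |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2. RTP payload format for single NAL unit packet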

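5.7. Aggregation Packets

Aggregation packets are the NAL unit aggregation scheme of this
payload specification. The scheme is introduced to reflect the
dramatically different MTU sizes of two key target networks: wireline
IP networks (with an MTU size that is often limited by the Ethernet
MTU size; roughly 1500 bytes), and IP or non-IP (e.g., ITU-T H.324/M)
based wireless communication systems with preferred transmission unit
sizes of 254 bytes or less. To prevent media transcoding between the
two worlds, and to avoid undesirable packetization overhead, a NAL
unit aggregation scheme is introduced.

Two types of aggregation packets are defined by this specification:

o Single-time aggregation packet (STAP): aggregates NAL units with
identical NALU-time. Two types of STAPs are defined, one without DON
(STAP-A) and another including DON (STAP-B).

o Multi-time aggregation packet (MTAP): aggregates NAL units with
potentially differing NALU-time. Two different MTAPs are defined,
differing in the length of the NAL unit timestamp offset.

The term NALU-time is defined as the value that the RTP timestamp
would have if that NAL unit were transported in its own RTP packet.

Each NAL unit to be carried in an aggregation packet is encapsulated
in an aggregation unit. Please see below for the four different
aggregation units and their characteristics.

The structure of the RTP payload format for aggregation packets is
presented in Figure 3.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|NRI|  type   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|                 one or more aggregation units                 |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 3. RTP payload format for aggregation packets

MTAPs and STAPs share the following packetization rules: The RTP
timestamp MUST be set to the earliest of the NALU-times of all the
NAL units to be aggregated. The type field of the NAL unit type
octet MUST be set to the appropriate value, as indicated in Table 4.
The F bit MUST be cleared if all F bits of the aggregated NAL units
are zero; otherwise, it MUST be set. The value of NRI MUST be the
maximum of all the NAL units carried in the aggregation packet.

Table 4. Type field for STAPs and MTAPs

Type   Packet    Timestamp offset      DON related fields
                 field length (bits)   (DON, DONB, DOND) present
--------------------------------------------------------
24     STAP-A    0                     no
25     STAP-B    0                     yes
26     MTAP16    16                    yes
27     MTAP24    24                    yes

The marker bit in the RTP header is set to the value that the marker
bit of the last NAL unit of the aggregation packet would have if it
were transported in its own RTP packet.

The payload of an aggregation packet consists of one or more
aggregation units. See sections 5.7.1 and 5.7.2 for the four
different types of aggregation units. An aggregation packet can
carry as many aggregation units as necessary; however, the total
amount of data in an aggregation packet obviously MUST fit into an IP
packet, and the size SHOULD be chosen so that the resulting IP packet
is smaller than the MTU size. An aggregation packet MUST NOT contain
fragmentation units specified in section 5.8. Aggregation packets
MUST NOT be nested; i.e., an aggregation packet MUST NOT contain
another aggregation packet.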

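5.7.1. Single-Time Aggregation Packet

Single-time aggregation packet (STAP) SHOULD be used whenever NAL
units are aggregated that all share the same NALU-time. The payload
of an STAP-A does not include DON and consists of at least one
single-time aggregation unit, as presented in Figure 4. The payload
of an STAP-B consists of a 16-bit unsigned decoding order number
(DON) (in network byte order) followed by at least one single-time
aggregation unit, as presented in Figure 5.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                :                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|                 single-time aggregation units                 |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 4. Payload format for STAP-A

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                :  decoding order number (DON)  |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                 single-time aggregation units                 |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 5. Payload format for STAP-B

The DON field specifies the value of DON for the first NAL unit in an
STAP-B in transmission order. For each successive NAL unit in
appearance order in an STAP-B, the value of DON is equal to (the
value of DON of the previous NAL unit in the STAP-B + 1) % 65536, in
which '%' stands for the modulo operation.

A single-time aggregation unit consists of 16-bit unsigned size
information (in network byte order) that indicates the size of the
following NAL unit in bytes (excluding these two octets, but
including the NAL unit type octet of the NAL unit), followed by the
NAL unit itself, including its NAL unit type byte. A single-time
aggregation unit is byte aligned within the RTP payload, but it may
not be aligned on a 32-bit word boundary. Figure 6 presents the
structure of the single-time aggregation unit.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                :        NAL unit size          |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                           NAL unit                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 6. Structure for single-time aggregation unit

Figure 7 presents an example of an RTP packet that contains an
STAP-A. The STAP contains two single-time aggregation units, labeled
as 1 and 2 in the figure.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|STAP-A NAL HDR |         NALU 1 Size           | NALU 1 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 1 Data                           |
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 2 Data                           |
:                                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7. An example of an RTP packet including an STAP-A and two
single-time aggregation units

Figure 8 presents an example of an RTP packet that contains an
STAP-B. The STAP contains two single-time aggregation units, labeled
as 1 and 2 in the figure.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|STAP-B NAL HDR | DON                           | NALU 1 Size   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NALU 1 Size   | NALU 1 HDR    | NALU 1 Data                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         NALU 2 Data                           |
:                                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 8. An example of an RTP packet including an STAP-B and two
single-time aggregation units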
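
The STAP layouts above can be walked with a short loop. The following
is a minimal sketch, assuming that the RTP header and any RTP padding
have already been removed from the payload; the function name
parse_stap is hypothetical, and rejecting a zero size field is a
choice of this sketch rather than a rule stated in this section.

```python
import struct
from typing import List, Optional, Tuple


def parse_stap(payload: bytes) -> List[Tuple[Optional[int], bytes]]:
    """Split an STAP-A or STAP-B payload into (DON, NAL unit) pairs.

    DON is None for STAP-A, which carries no decoding order number.
    """
    if not payload:
        raise ValueError("empty payload")
    nal_type = payload[0] & 0x1F
    if nal_type == 24:            # STAP-A: units follow the header octet
        don, offset = None, 1
    elif nal_type == 25:          # STAP-B: 16-bit DON, then the units
        if len(payload) < 3:
            raise ValueError("STAP-B is too short to hold a DON")
        (don,) = struct.unpack_from("!H", payload, 1)
        offset = 3
    else:
        raise ValueError("not an STAP-A or STAP-B payload")

    units: List[Tuple[Optional[int], bytes]] = []
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated NAL unit size field")
        (size,) = struct.unpack_from("!H", payload, offset)
        offset += 2
        if size == 0 or offset + size > len(payload):
            raise ValueError("NAL unit size does not fit the payload")
        units.append((don, payload[offset:offset + size]))
        offset += size
        if don is not None:
            don = (don + 1) % 65536  # DON of the next unit in the STAP-B
    if not units:
        raise ValueError("an STAP carries at least one aggregation unit")
    return units
```

Each returned NAL unit starts with its own NAL unit type octet, so
it can be handed to the decoder (or to the deinterleaving buffer,
together with its DON) without further rewriting.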

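5.7.2. Multi-Time Aggregation Packets (MTAPs)

The NAL unit payload of MTAPs consists of a 16-bit unsigned decoding
order number base (DONB) (in network byte order) and one or more
multi-time aggregation units, as presented in Figure 9. DONB MUST
contain the value of DON for the first NAL unit in the NAL unit
decoding order among the NAL units of the MTAP.

Informative note: The first NAL unit in the NAL unit decoding order
is not necessarily the first NAL unit in the order in which the NAL
units are encapsulated in an MTAP.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                :  decoding order number base   |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                 multi-time aggregation units                  |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 9. NAL unit payload format for MTAPs

Two different multi-time aggregation units are defined in this
specification. Both of them consist of 16 bits of unsigned size
information of the following NAL unit (in network byte order), an
8-bit unsigned decoding order number difference (DOND), and n bits
(in network byte order) of timestamp offset (TS offset) for this NAL
unit, whereby n can be 16 or 24. The choice between the different
MTAP types (MTAP16 and MTAP24) is application dependent: the larger
the timestamp offset is, the higher the flexibility of the MTAP, but
the overhead is also higher.

The structure of the multi-time aggregation units for MTAP16 and
MTAP24 are presented in Figures 10 and 11, respectively. The
starting or ending position of an aggregation unit within a packet is
NOT REQUIRED to be on a 32-bit word boundary. The DON of the
following NAL unit is equal to (DONB + DOND) % 65536, in which %
denotes the modulo operation. This memo does not specify how the NAL
units within an MTAP are ordered, but, in most cases, NAL unit
decoding order SHOULD be used.

The timestamp offset field MUST be set to a value equal to the value
of the following formula: If the NALU-time is larger than or equal to
the RTP timestamp of the packet, then the timestamp offset equals
(the NALU-time of the NAL unit - the RTP timestamp of the packet).
If the NALU-time is smaller than the RTP timestamp of the packet,
then the timestamp offset is equal to the NALU-time + (2^32 - the RTP
timestamp of the packet).

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:        NAL unit size          |      DOND     |  TS offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  TS offset    |                                               |
+-+-+-+-+-+-+-+-+              NAL unit                         |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 10. Multi-time aggregation unit for MTAP16

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:        NALU unit size         |      DOND     |  TS offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         TS offset             |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                              NAL unit                         |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 11. Multi-time aggregation unit for MTAP24

For the "earliest" multi-time aggregation unit in an MTAP the
timestamp offset MUST be zero. Hence, the RTP timestamp of the MTAP
itself is identical to the earliest NALU-time.

Informative note: The "earliest" multi-time aggregation unit is the
one that would have the smallest extended RTP timestamp among all the
aggregation units of an MTAP if the aggregation units were
encapsulated in single NAL unit packets. An extended timestamp is a
timestamp that has more than 32 bits and is capable of counting the
wraparound of the timestamp field, thus enabling one to determine the
smallest value if the timestamp wraps. Such an "earliest"
aggregation unit may not be the first one in the order in which the
aggregation units are encapsulated in an MTAP. The "earliest" NAL
unit need not be the same as the first NAL unit in the NAL unit
decoding order either.

Figure 12 presents an example of an RTP packet that contains a
multi-time aggregation packet of type MTAP16 that contains two
multi-time aggregation units, labeled as 1 and 2 in the figure.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|MTAP16 NAL HDR |  decoding order number base   | NALU 1 Size   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  NALU 1 Size  |  NALU 1 DOND  |       NALU 1 TS offset        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  NALU 1 HDR   |  NALU 1 DATA                                  |
+-+-+-+-+-+-+-+-+                                               +
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 SIZE                   |  NALU 2 DOND  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       NALU 2 TS offset        |  NALU 2 HDR   |  NALU 2 DATA  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
:                                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 12. An RTP packet including a multi-time aggregation packet of
type MTAP16 and two multi-time aggregation units

Figure 13 presents an example of an RTP packet that contains a
multi-time aggregation packet of type MTAP24 that contains two
multi-time aggregation units, labeled as 1 and 2 in the figure.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|MTAP24 NAL HDR |  decoding order number base   | NALU 1 Size   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  NALU 1 Size  |  NALU 1 DOND  |       NALU 1 TS offs          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|NALU 1 TS offs |  NALU 1 HDR   |  NALU 1 DATA                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
:                                                               :
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               | NALU 2 SIZE                   |  NALU 2 DOND  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       NALU 2 TS offset                        |  NALU 2 HDR   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  NALU 2 DATA                                                  |
:                                                               :
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 13. An RTP packet including a multi-time aggregation packet of
type MTAP24 and two multi-time aggregation units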
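
The DON and timestamp arithmetic of MTAPs can be sketched in a few
lines. The function names below are hypothetical, and the range check
against the 16- or 24-bit field width is an addition of this sketch;
the text above only states the formulas themselves.

```python
RTP_TS_MODULUS = 2 ** 32   # RTP timestamps are 32-bit values
DON_MODULUS = 65536        # DON values are 16-bit values


def mtap_ts_offset(nalu_time: int, rtp_timestamp: int,
                   offset_bits: int = 16) -> int:
    """Timestamp offset for one aggregation unit (sender side)."""
    if offset_bits not in (16, 24):
        raise ValueError("MTAP16 and MTAP24 use 16 or 24 offset bits")
    if nalu_time >= rtp_timestamp:
        offset = nalu_time - rtp_timestamp
    else:
        # The NALU-time has wrapped around relative to the packet's
        # RTP timestamp.
        offset = nalu_time + (RTP_TS_MODULUS - rtp_timestamp)
    if offset >= 1 << offset_bits:
        raise ValueError("offset does not fit the TS offset field")
    return offset


def mtap_nalu_time(rtp_timestamp: int, ts_offset: int) -> int:
    """Recover the NALU-time of an aggregation unit (receiver side)."""
    return (rtp_timestamp + ts_offset) % RTP_TS_MODULUS


def mtap_don(donb: int, dond: int) -> int:
    """DON of an aggregation unit from DONB and its 8-bit DOND."""
    if not 0 <= dond <= 255:
        raise ValueError("DOND is an 8-bit unsigned value")
    return (donb + dond) % DON_MODULUS
```

With a 90 kHz clock, a 16-bit offset spans roughly 0.73 seconds and
a 24-bit offset roughly 186 seconds, which is one way to weigh the
flexibility against the overhead mentioned above.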

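5.8. Fragmentation Units (FUs)

This payload type allows fragmenting a NAL unit into several RTP
packets. Doing so on the application layer instead of relying on
lower layer fragmentation (e.g., by IP) has the following advantages:

o The payload format is capable of transporting NAL units bigger than
64 kbytes over an IPv4 network that may be present in pre-recorded
video, particularly in High Definition formats (there is a limit of
the number of slices per picture, which results in a limit of NAL
units per picture, which may result in big NAL units).

o The fragmentation mechanism allows fragmenting a single picture and
applying generic forward error correction as described in section
12.5.

Fragmentation is defined only for a single NAL unit and not for any
aggregation packets. A fragment of a NAL unit consists of an integer
number of consecutive octets of that NAL unit. Each octet of the NAL
unit MUST be part of exactly one fragment of that NAL unit.
Fragments of the same NAL unit MUST be sent in consecutive order with
ascending RTP sequence numbers (with no other RTP packets being sent
between the first and last fragment). Similarly, a NAL unit MUST be
reassembled in RTP sequence number order.

When a NAL unit is fragmented and conveyed within fragmentation units
(FUs), it is referred to as a fragmented NAL unit. STAPs and MTAPs
MUST NOT be fragmented. FUs MUST NOT be nested; i.e., an FU MUST NOT
contain another FU.

The RTP timestamp of an RTP packet carrying an FU is set to the
NALU-time of the fragmented NAL unit.

Figure 14 presents the RTP payload format for FU-As. An FU-A
consists of a fragmentation unit indicator of one octet, a
fragmentation unit header of one octet, and a fragmentation unit
payload.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU indicator  |   FU header   |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                          FU payload                           |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 14. RTP payload format for FU-A

Figure 15 presents the RTP payload format for FU-Bs. An FU-B
consists of a fragmentation unit indicator of one octet, a
fragmentation unit header of one octet, a decoding order number
(DON), and a fragmentation unit payload.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| FU indicator  |   FU header   |               DON             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-|
|                                                               |
|                          FU payload                           |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 15. RTP payload format for FU-B

NAL unit type FU-B MUST be used in the interleaved packetization mode
for the first fragmentation unit of a fragmented NAL unit. NAL unit
type FU-B MUST NOT be used in any other case. In other words, in the
interleaved packetization mode, each NALU that is fragmented has an
FU-B as the first fragment, followed by one or more FU-A fragments.

The FU indicator octet has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|F|NRI|  Type   |
+---------------+

Values equal to 28 and 29 in the Type field of the FU indicator octet
identify an FU-A and an FU-B, respectively. The use of the F bit is
described in section 5.3. The value of the NRI field MUST be set
according to the value of the NRI field in the fragmented NAL unit.

The FU header has the following format:

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|R|  Type   |
+---------------+

S: 1 bit

When set to one, the Start bit indicates the start of a fragmented
NAL unit. When the following FU payload is not the start of a
fragmented NAL unit payload, the Start bit is set to zero.

E: 1 bit

When set to one, the End bit indicates the end of a fragmented NAL
unit, i.e., the last byte of the payload is also the last byte of the
fragmented NAL unit. When the following FU payload is not the last
fragment of a fragmented NAL unit, the End bit is set to zero.

R: 1 bit

The Reserved bit MUST be equal to 0 and MUST be ignored by the
receiver.

Type: 5 bits

The NAL unit payload type as defined in table 7-1 of [1].

The value of DON in FU-Bs is selected as described in section 5.5.

Informative note: The DON field in FU-Bs allows gateways to fragment
NAL units to FU-Bs without organizing the incoming NAL units to the
NAL unit decoding order.

A fragmented NAL unit MUST NOT be transmitted in one FU; i.e., the
Start bit and End bit MUST NOT both be set to one in the same FU
header.

The FU payload consists of fragments of the payload of the fragmented
NAL unit so that if the fragmentation unit payloads of consecutive
FUs are sequentially concatenated, the payload of the fragmented NAL
unit can be reconstructed. The NAL unit type octet of the fragmented
NAL unit is not included as such in the fragmentation unit payload,
but rather the information of the NAL unit type octet of the
fragmented NAL unit is conveyed in the F and NRI fields of the FU
indicator octet of the fragmentation unit and in the type field of
the FU header. An FU payload MAY have any number of octets and MAY
be empty.

Informative note: Empty FUs are allowed to reduce the latency of a
certain class of senders in nearly lossless environments. These
senders can be characterized in that they packetize NALU fragments
before the NALU is completely generated and, hence, before the NALU
size is known. If zero-length NALU fragments were not allowed, the
sender would have to generate at least one bit of data of the
following fragment before the current fragment could be sent. Due to
the characteristics of H.264, where sometimes several macroblocks
occupy zero bits, this is undesirable and can add delay. However,
the (potential) use of zero-length NALUs should be carefully weighed
against the increased risk of the loss of the NALU because of the
additional packets employed for its transmission.

If a fragmentation unit is lost, the receiver SHOULD discard all
following fragmentation units in transmission order corresponding to
the same fragmented NAL unit.

A receiver in an endpoint or in a MANE MAY aggregate the first n-1
fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
n of that NAL unit is not received. In this case, the
forbidden_zero_bit of the NAL unit MUST be set to one to indicate a
syntax violation.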
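
The FU-A rules above can be illustrated with a sketch that fragments
one NAL unit and reassembles it again. The names fragment_fu_a and
reassemble_fu_a and the max_payload parameter are assumptions of this
example; loss detection through RTP sequence numbers and the FU-B
variant with its DON field are left out.

```python
from typing import List

FU_A_TYPE = 28


def fragment_fu_a(nal_unit: bytes, max_payload: int) -> List[bytes]:
    """Split one NAL unit into FU-A payloads of at most max_payload octets."""
    if len(nal_unit) < 2:
        raise ValueError("need a header octet and at least one payload octet")
    chunk = max_payload - 2          # room left after FU indicator + header
    if chunk < 1:
        raise ValueError("max_payload must be at least 3")
    header = nal_unit[0]
    body = nal_unit[1:]              # the NAL header octet is not sent as such
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    if len(pieces) < 2:
        # Start and End bits must not both be set in the same FU header.
        raise ValueError("NAL unit fits in one packet; do not fragment it")
    indicator = (header & 0xE0) | FU_A_TYPE   # F and NRI copied, Type = 28
    original_type = header & 0x1F
    fragments = []
    for index, piece in enumerate(pieces):
        start = 0x80 if index == 0 else 0x00
        end = 0x40 if index == len(pieces) - 1 else 0x00
        fragments.append(bytes([indicator, start | end | original_type]) + piece)
    return fragments


def reassemble_fu_a(fragments: List[bytes]) -> bytes:
    """Rebuild the NAL unit from a complete, in-order list of FU-As."""
    if len(fragments) < 2:
        raise ValueError("a fragmented NAL unit spans at least two FUs")
    if not fragments[0][1] & 0x80 or not fragments[-1][1] & 0x40:
        raise ValueError("first or last fragment is missing")
    header = (fragments[0][0] & 0xE0) | (fragments[0][1] & 0x1F)
    return bytes([header]) + b"".join(f[2:] for f in fragments)
```

Reconstructing the header from the FU indicator (F, NRI) and the FU
header (Type) mirrors how the NAL unit type octet is conveyed
according to the text above.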

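6. Packetization Rules

The packetization modes are introduced in section 5.4. The
packetization rules common to more than one of the packetization
modes are specified in section 6.1. The packetization rules for the
single NAL unit mode, the non-interleaved mode, and the interleaved
mode are specified in sections 6.2, 6.3, and 6.4, respectively.

6.1. Common Packetization Rules

All senders MUST enforce the following packetization rules regardless
of the packetization mode in use:

o Coded slice NAL units or coded slice data partition NAL units
belonging to the same coded picture (and thus sharing the same RTP
timestamp value) MAY be sent in any order permitted by the applicable
profile defined in [1]; however, for delay-critical systems, they
SHOULD be sent in their original coding order to minimize the delay.
Note that the coding order is not necessarily the scan order, but the
order in which the NAL packets become available to the RTP stack.

o Parameter sets are handled in accordance with the rules and
recommendations given in section 8.4.

o MANEs MUST NOT duplicate any NAL unit except for sequence or
picture parameter set NAL units, as neither this memo nor the H.264
specification provides means to identify duplicated NAL units.
Sequence and picture parameter set NAL units MAY be duplicated to
make their correct reception more probable, but any such duplication
MUST NOT affect the contents of any active sequence or picture
parameter set. Duplication SHOULD be performed on the application
layer and not by duplicating RTP packets (with identical sequence
numbers).

Senders using the non-interleaved mode and the interleaved mode MUST
enforce the following packetization rule:

o MANEs MAY convert single NAL unit packets into one aggregation
packet, convert an aggregation packet into several single NAL unit
packets, or mix both concepts, in an RTP translator. The RTP
translator SHOULD take into account at least the following
parameters: path MTU size, unequal protection mechanisms (e.g.,
through packet-based FEC according to RFC 2733, especially for
sequence and picture parameter set NAL units and coded slice data
partition NAL units), bearable latency of the system, and buffering
capabilities of the receiver.

Informative note: An RTP translator is required to handle RTCP as per
RFC 3550.

6.2. Single NAL Unit Mode

This mode is in use when the value of the OPTIONAL packetization-mode
MIME parameter is equal to 0, the packetization-mode is not present,
or no other packetization mode is signaled by external means. All
receivers MUST support this mode. It is primarily intended for
low-delay applications that are compatible with systems using ITU-T
Recommendation H.241 (see section 12.1). Only single NAL unit
packets MAY be used in this mode. STAPs, MTAPs, and FUs MUST NOT be
used. The transmission order of single NAL unit packets MUST comply
with the NAL unit decoding order.

6.3. Non-Interleaved Mode

This mode is in use when the value of the OPTIONAL packetization-mode
MIME parameter is equal to 1 or the mode is turned on by external
means. This mode SHOULD be supported. It is primarily intended for
low-delay applications. Only single NAL unit packets, STAP-As, and
FU-As MAY be used in this mode. STAP-Bs, MTAPs, and FU-Bs MUST NOT
be used. The transmission order of NAL units MUST comply with the
NAL unit decoding order.

6.4. Interleaved Mode

This mode is in use when the value of the OPTIONAL packetization-mode
MIME parameter is equal to 2 or the mode is turned on by external
means. Some receivers MAY support this mode. STAP-Bs, MTAPs, FU-As,
and FU-Bs MAY be used. STAP-As and single NAL unit packets MUST NOT
be used. The transmission order of packets and NAL units is
constrained as specified in section 5.5.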

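7. De-Packetization Process (Informative)

The de-packetization process is implementation dependent. Therefore,
the following description should be seen as an example of a suitable
implementation. Other schemes may be used as well. Optimizations
relative to the described algorithms are likely possible. Section
7.1 presents the de-packetization process for the single NAL unit and
non-interleaved packetization modes, whereas section 7.2 describes
the process for the interleaved mode. Section 7.3 includes
additional decapsulation guidelines for intelligent receivers.

All normal RTP mechanisms related to buffer management apply. In
particular, duplicated or outdated RTP packets (as indicated by the
RTP sequence number and the RTP timestamp) are removed. To determine
the exact time for decoding, factors such as a possible intentional
delay to allow for proper inter-stream synchronization must be
factored in.

7.1. Single NAL Unit and Non-Interleaved Mode

The receiver includes a receiver buffer to compensate for
transmission delay jitter. The receiver stores incoming packets in
reception order into the receiver buffer. Packets are decapsulated
in RTP sequence number order. If a decapsulated packet is a single
NAL unit packet, the NAL unit contained in the packet is passed
directly to the decoder. If a decapsulated packet is an STAP-A, the
NAL units contained in the packet are passed to the decoder in the
order in which they are encapsulated in the packet. If a
decapsulated packet is an FU-A, all the fragments of the fragmented
NAL unit are concatenated and passed to the decoder.

Informative note: If the decoder supports Arbitrary Slice Order,
coded slices of a picture can be passed to the decoder in any order
regardless of their reception and transmission order.

7.2. Interleaved Mode

The general concept behind these de-packetization rules is to reorder
NAL units from transmission order to the NAL unit decoding order.

The receiver includes a receiver buffer, which is used to compensate
for transmission delay jitter and to reorder packets from
transmission order to the NAL unit decoding order. In this section,
the receiver operation is described under the assumption that there
is no transmission delay jitter. To make a difference from a
practical receiver buffer that is also used for compensation of
transmission delay jitter, the receiver buffer is hereafter called
the deinterleaving buffer in this section. Receivers SHOULD also
prepare for transmission delay jitter; i.e., either reserve separate
buffers for transmission delay jitter buffering and deinterleaving
buffering or use a receiver buffer for both transmission delay jitter
and deinterleaving. Moreover, receivers SHOULD take transmission
delay jitter into account in the buffering operation; e.g., by
additional initial buffering before starting of decoding and
playback.

This section is organized as follows: subsection 7.2.1 presents how
to calculate the size of the deinterleaving buffer. Subsection 7.2.2
specifies the receiver process for organizing received NAL units into
the NAL unit decoding order.

7.2.1. Size of the Deinterleaving Buffer

When the SDP Offer/Answer model or any other capability exchange
procedure is used in session setup, the properties of the received
stream SHOULD be such that the receiver capabilities are not
exceeded. In the SDP Offer/Answer model, the receiver can indicate
its capabilities to allocate a deinterleaving buffer with the
deint-buf-cap MIME parameter. The sender indicates the requirement
for the deinterleaving buffer size with the sprop-deint-buf-req MIME
parameter. It is therefore RECOMMENDED to set the deinterleaving
buffer size, in terms of number of bytes, equal to or greater than
the value of the sprop-deint-buf-req MIME parameter. See section 8.1
for further information on the deint-buf-cap and sprop-deint-buf-req
MIME parameters and section 8.2.2 for further information on their
use in the SDP Offer/Answer model.

When a declarative session description is used in session setup, the
sprop-deint-buf-req MIME parameter signals the requirement for the
deinterleaving buffer size. It is therefore RECOMMENDED to set the
deinterleaving buffer size, in terms of number of bytes, equal to or
greater than the value of the sprop-deint-buf-req MIME parameter.

7.2.2. Deinterleaving Process

There are two buffering states in the receiver: initial buffering and
buffering while playing. Initial buffering occurs when the RTP
session is initialized. After initial buffering, decoding and
playback are started, and the buffering-while-playing mode is used.

Regardless of the buffering state, the receiver stores incoming NAL
units, in reception order, in the deinterleaving buffer as follows.
NAL units of aggregation packets are stored in the deinterleaving
buffer individually. The value of DON is calculated and stored for
all NAL units.

The receiver operation is described below with the help of the
following functions and constants:

o Function AbsDON is specified in section 8.1.

o Function don_diff is specified in section 5.5.

o Constant N is the value of the OPTIONAL sprop-interleaving-depth
MIME type parameter (see section 8.1) incremented by 1.

Initial buffering lasts until one of the following conditions is
fulfilled:

o There are N VCL NAL units in the deinterleaving buffer.

o If sprop-max-don-diff is present, don_diff(m,n) is greater than the
value of sprop-max-don-diff, in which n corresponds to the NAL unit
having the greatest value of AbsDON among the received NAL units and
m corresponds to the NAL unit having the smallest value of AbsDON
among the received NAL units.

o Initial buffering has lasted for the duration equal to or greater
than the value of the OPTIONAL sprop-init-buf-time MIME parameter.

The NAL units to be removed from the deinterleaving buffer are
determined as follows:

o If the deinterleaving buffer contains at least N VCL NAL units, NAL
units are removed from the deinterleaving buffer and passed to the
decoder in the order specified below until the buffer contains N-1
VCL NAL units.

o If sprop-max-don-diff is present, all NAL units m for which
don_diff(m,n) is greater than sprop-max-don-diff are removed from the
deinterleaving buffer and passed to the decoder in the order
specified below. Herein, n corresponds to the NAL unit having the
greatest value of AbsDON among the received NAL units.

The order in which NAL units are passed to the decoder is specified
as follows:

o Let PDON be a variable that is initialized to 0 at the beginning of
the RTP session.

o For each NAL unit associated with a value of DON, a DON distance is
calculated as follows. If the value of DON of the NAL unit is larger
than the value of PDON, the DON distance is equal to DON - PDON.
Otherwise, the DON distance is equal to 65535 - PDON + DON + 1.

o NAL units are delivered to the decoder in ascending order of DON
distance. If several NAL units share the same value of DON distance,
they can be passed to the decoder in any order.

o When a desired number of NAL units have been passed to the decoder,
the value of PDON is set to the value of DON for the last NAL unit
passed to the decoder.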
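
The delivery order described by the last four bullets can be
sketched as follows. The function names are hypothetical, and the
sketch only orders a set of buffered DON values relative to PDON; it
does not model the buffer occupancy rules based on N or
sprop-max-don-diff.

```python
from typing import Iterable, List


def don_distance(don: int, pdon: int) -> int:
    """DON distance of a buffered NAL unit relative to PDON."""
    if don > pdon:
        return don - pdon
    # DON has wrapped around (or equals PDON).
    return 65535 - pdon + don + 1


def delivery_order(buffered_dons: Iterable[int], pdon: int) -> List[int]:
    """Sort buffered DON values by ascending DON distance from PDON.

    sorted() is stable, so units with equal distance keep their
    reception order, which is one of the permitted orders.
    """
    return sorted(buffered_dons, key=lambda don: don_distance(don, pdon))
```

After the chosen NAL units have been passed to the decoder, the
caller would set PDON to the DON of the last unit delivered, as the
final bullet above describes.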

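7.3. Additional De-Packetization Guidelines

The following additional de-packetization rules may be used to
implement an operational H.264 de-packetizer:

o Intelligent RTP receivers (e.g., in gateways) may identify lost
coded slice data partitions A (DPAs). If a lost DPA is found, a
gateway may decide not to send the corresponding coded slice data
partitions B and C, as their information is meaningless for H.264
decoders. In this way a MANE can reduce network load by discarding
useless packets without parsing a complex bitstream.

o Intelligent RTP receivers (e.g., in gateways) may identify lost
FUs. If a lost FU is found, a gateway may decide not to send the
following FUs of the same fragmented NAL unit, as their information
is meaningless for H.264 decoders. In this way a MANE can reduce
network load by discarding useless packets without parsing a complex
bitstream.

o Intelligent receivers having to discard packets or NALUs should
first discard all packets/NALUs in which the value of the NRI field
of the NAL unit type octet is equal to 0. This will minimize the
impact on user experience and keep the reference pictures intact. If
more packets have to be discarded, then packets with a numerically
lower NRI value should be discarded before packets with a numerically
higher NRI value. However, discarding any packets with an NRI bigger
than 0 very likely leads to decoder drift and SHOULD be avoided.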

8. Payload Format Parameters

This section specifies the parameters that MAY be used to select

optional features of the payload format and certain features of the

bitstream. The parameters are specified here as part of the MIME

subtype registration for the ITU-T H.264 | ISO/IEC 14496-10 codec. A

mapping of the parameters into the Session Description Protocol (SDP)

[5] is also provided for applications that use SDP. Equivalent

parameters could be defined elsewhere for use with control protocols

that do not use MIME or SDP.

Some parameters provide a receiver with the properties of the stream

that will be sent. The name of all these parameters starts with

"sprop" for stream properties. Some of these "sprop" parameters are

limited by other payload or codec configuration parameters. For

example, the sprop-parameter-sets parameter is constrained by the

profile-level-id parameter. The media sender selects all "sprop"

parameters rather than the receiver. This uncommon characteristic of

the "sprop" parameters may not be compatible with some signaling

protocol concepts, in which case the use of these parameters SHOULD

be avoided.

8.1. MIME Registration

The MIME subtype for the ITU-T H.264 | ISO/IEC 14496-10 codec is

allocated from the IETF tree.

The receiver MUST ignore any unspecified parameter.

Media Type name: video

Media subtype name: H264

Required parameters: none

Wenger, et al. Standards Track [Page 37]

RFC 3984 RTP Payload Format for H.264 Video February 2005

OPTIONAL parameters:

profile-level-id:

A base16 [6] (hexadecimal) representation of

the following three bytes in the sequence

parameter set NAL unit specified in [1]: 1)

profile_idc, 2) a byte herein referred to as

profile-iop, composed of the values of

constraint_set0_flag, constraint_set1_flag,

constraint_set2_flag, and reserved_zero_5bits

in bit-significance order, starting from the

most significant bit, and 3) level_idc. Note

that reserved_zero_5bits is required to be

equal to 0 in [1], but other values for it may

be specified in the future by ITU-T or ISO/IEC.

If the profile-level-id parameter is used to

indicate properties of a NAL unit stream, it

indicates the profile and level that a decoder

has to support in order to comply with [1] when

it decodes the stream. The profile-iop byte

indicates whether the NAL unit stream also

obeys all constraints of the indicated profiles

as follows. If bit 7 (the most significant

bit), bit 6, or bit 5 of profile-iop is equal

to 1, all constraints of the Baseline profile,

the Main profile, or the Extended profile,

respectively, are obeyed in the NAL unit

stream.

If the profile-level-id parameter is used for

capability exchange or session setup procedure,

it indicates the profile that the codec

supports and the highest level

supported for the signaled profile. The

profile-iop byte indicates whether the codec

has additional limitations whereby only the

common subset of the algorithmic features and

limitations of the profiles signaled with the

profile-iop byte and of the profile indicated

by profile_idc is supported by the codec. For

example, if a codec supports only the common

subset of the coding tools of the Baseline

profile and the Main profile at level 2.1 and

below, the profile-level-id becomes 42E015, in

which 42 stands for the Baseline profile, E0

indicates that only the common subset for all

profiles is supported, and 15 indicates level

2.1.

Informative note: Capability exchange and

session setup procedures should provide

means to list the capabilities for each

supported codec profile separately. For

example, the one-of-N codec selection

procedure of the SDP Offer/Answer model can

be used (section 10.2 of [7]).

If no profile-level-id is present, the Baseline

Profile without additional constraints at Level

1 MUST be implied.
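The byte layout of profile-level-id described above can be illustrated
with a short Python sketch. The function and field names are
hypothetical and are not defined by this memo; the sketch only splits
the six hexadecimal digits into the three bytes and exposes bits 7, 6,
and 5 of profile-iop.

```python
from typing import NamedTuple


class ProfileLevelId(NamedTuple):
    profile_idc: int       # first byte
    profile_iop: int       # second byte: constraint flags + reserved_zero_5bits
    level_idc: int         # third byte
    constraint_set0: bool  # bit 7: Baseline constraints obeyed
    constraint_set1: bool  # bit 6: Main constraints obeyed
    constraint_set2: bool  # bit 5: Extended constraints obeyed


def parse_profile_level_id(value: str) -> ProfileLevelId:
    """Split a 6-digit base16 profile-level-id into its three bytes."""
    if len(value) != 6:
        raise ValueError("profile-level-id must be exactly 6 hex digits")
    # bytes.fromhex raises ValueError on non-hexadecimal input.
    profile_idc, profile_iop, level_idc = bytes.fromhex(value)
    return ProfileLevelId(
        profile_idc,
        profile_iop,
        level_idc,
        bool(profile_iop & 0x80),
        bool(profile_iop & 0x40),
        bool(profile_iop & 0x20),
    )


example = parse_profile_level_id("42E015")
```

For the value 42E015 used in the text, this yields profile_idc 66
(0x42), all three constraint flags set, and level_idc 21 (0x15).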

max-mbps, max-fs, max-cpb, max-dpb, and max-br:

These parameters MAY be used to signal the

capabilities of a receiver implementation.

These parameters MUST NOT be used for any other

purpose. The profile-level-id parameter MUST

be present in the same receiver capability

description that contains any of these

parameters. The level conveyed in the value of

the profile-level-id parameter MUST be one

that the receiver is fully capable of

supporting. max-mbps, max-fs, max-cpb, max-dpb,

and max-br MAY be used to indicate

capabilities of the receiver that extend the

required capabilities of the signaled level, as

specified below.

When more than one parameter from the set (max-

mbps, max-fs, max-cpb, max-dpb, max-br) is

present, the receiver MUST support all signaled

capabilities simultaneously. For example, if

both max-mbps and max-br are present, the

signaled level with the extension of both the

frame rate and bit rate is supported. That is,

the receiver is able to decode NAL unit

streams in which the macroblock processing rate

is up to max-mbps (inclusive), the bit rate is

up to max-br (inclusive), the coded picture

buffer size is derived as specified in the

semantics of the max-br parameter below, and

other properties comply with the level

specified in the value of the profile-level-id

parameter.

A receiver MUST NOT signal values of max-

mbps, max-fs, max-cpb, max-dpb, and max-br that

meet the requirements of a higher level,

referred to as level A herein, compared to the

level specified in the value of the profile-

level-id parameter, if the receiver can support

all the properties of level A.

Informative note: When the OPTIONAL MIME

type parameters are used to signal the

properties of a NAL unit stream, max-mbps,

max-fs, max-cpb, max-dpb, and max-br are

not present, and the value of profile-

level-id must always be such that the NAL

unit stream complies fully with the

specified profile and level.

max-mbps: The value of max-mbps is an integer indicating

the maximum macroblock processing rate in units

of macroblocks per second. The max-mbps

parameter signals that the receiver is capable

of decoding video at a higher rate than is

required by the signaled level conveyed in the

value of the profile-level-id parameter. When

max-mbps is signaled, the receiver MUST be able

to decode NAL unit streams that conform to the

signaled level, with the exception that the

MaxMBPS value in Table A-1 of [1] for the

signaled level is replaced with the value of

max-mbps. The value of max-mbps MUST be

greater than or equal to the value of MaxMBPS

for the level given in Table A-1 of [1].

Senders MAY use this knowledge to send pictures

of a given size at a higher picture rate than

is indicated in the signaled level.

max-fs: The value of max-fs is an integer indicating

the maximum frame size in units of macroblocks.

The max-fs parameter signals that the receiver

is capable of decoding larger picture sizes

than are required by the signaled level conveyed

in the value of the profile-level-id parameter.

When max-fs is signaled, the receiver MUST be

able to decode NAL unit streams that conform to

the signaled level, with the exception that the

MaxFS value in Table A-1 of [1] for the

signaled level is replaced with the value of

max-fs. The value of max-fs MUST be greater

than or equal to the value of MaxFS for the

level given in Table A-1 of [1]. Senders MAY

use this knowledge to send larger pictures at a

proportionally lower frame rate than is

indicated in the signaled level.
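The trade-off between picture size and picture rate that max-mbps and
max-fs allow can be sketched in Python as follows. The function name
is hypothetical, and the sketch assumes that the macroblock rate and
the frame size are the only limits in play; any other limits of the
level are ignored. The numbers in the usage line are illustrative
only.

```python
def max_picture_rate(max_mbps: int, max_fs: int, frame_size_mbs: int) -> float:
    """Highest picture rate (pictures per second) for a given frame size.

    max_mbps       -- macroblocks per second the receiver can process
    max_fs         -- largest frame size, in macroblocks, the receiver accepts
    frame_size_mbs -- size of the pictures the sender wants to send
    """
    if frame_size_mbs <= 0:
        raise ValueError("frame size must be positive")
    if frame_size_mbs > max_fs:
        raise ValueError("frame size exceeds the signaled max-fs")
    return max_mbps / frame_size_mbs


# Illustrative values: 40500 macroblocks/s and frames of up to 1620 macroblocks.
rate_at_full_size = max_picture_rate(40500, 1620, 1620)
```

Halving the frame size doubles the achievable picture rate, which is
the proportional relationship the text refers to.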

max-cpb: The value of max-cpb is an integer indicating

the maximum coded picture buffer size in units

of 1000 bits for the VCL HRD parameters (see

A.3.1 item i of [1]) and in units of 1200 bits

for the NAL HRD parameters (see A.3.1 item j of

[1]). The max-cpb parameter signals that the

receiver has more memory than the minimum

amount of coded picture buffer memory required

by the signaled level conveyed in the value of

the profile-level-id parameter. When max-cpb

is signaled, the receiver MUST be able to

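decode NAL unit streams that conform to the

signaled level, with the exception that the

MaxCPB value in Table A-1 of [1] for the

signaled level is replaced with the value of

max-cpb. The value of max-cpb MUST be greater

than or equal to the value of MaxCPB for the

level given in Table A-1 of [1]. Senders MAY

use this knowledge to construct coded video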

streams with greater variation of bit rate

than can be achieved with the

MaxCPB value in Table A-1 of [1].

Informative note: The coded picture buffer

is used in the hypothetical reference

decoder (Annex C) of H.264. The use of the

hypothetical reference decoder is

recommended in H.264 encoders to verify

that the produced bitstream conforms to the

standard and to control the output bitrate.

Thus, the coded picture buffer is

conceptually independent of any other

potential buffers in the receiver,

including de-interleaving and de-jitter

buffers. The coded picture buffer need not

be implemented in decoders as specified in

Annex C of H.264, but rather standard-

compliant decoders can have any buffering

arrangements provided that they can decode

standard-compliant bitstreams. Thus, in

practice, the input buffer for video

decoder can be integrated with de-

interleaving and de-jitter buffers of the

receiver.

max-dpb: The value of max-dpb is an integer indicating

the maximum decoded picture buffer size in

units of 1024 bytes. The max-dpb parameter

signals that the receiver has more memory than

the minimum amount of decoded picture buffer

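memory required by the signaled level conveyed

in the value of the profile-level-id parameter.

When max-dpb is signaled, the receiver MUST be

able to decode NAL unit streams that conform to

the signaled level, with the exception that the

MaxDPB value in Table A-1 of [1] for the

signaled level is replaced with the value of

max-dpb. Consequently, a receiver that signals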

max-dpb MUST be capable of storing the

following number of decoded frames,

complementary field pairs, and non-paired

fields in its decoded picture buffer:

Min(1024 * max-dpb / ( PicWidthInMbs *

FrameHeightInMbs * 256 * ChromaFormatFactor ),

16)

PicWidthInMbs, FrameHeightInMbs, and

ChromaFormatFactor are defined in [1].
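The frame-count formula above can be evaluated with a small Python
sketch. The function name is hypothetical. The sketch assumes a
ChromaFormatFactor of 1.5 (4:2:0 sampling) as the default and rounds
the quotient down to a whole number of frames; both choices are
assumptions of the sketch, and [1] remains the reference for the
variables.

```python
import math


def dpb_frame_capacity(
    max_dpb: int,
    pic_width_in_mbs: int,
    frame_height_in_mbs: int,
    chroma_format_factor: float = 1.5,  # assumed default: 4:2:0 sampling
) -> int:
    """Min(1024 * max-dpb / (W * H * 256 * ChromaFormatFactor), 16)."""
    bytes_per_frame = (
        pic_width_in_mbs * frame_height_in_mbs * 256 * chroma_format_factor
    )
    # Rounding down to whole frames is an assumption of this sketch.
    return min(math.floor(1024 * max_dpb / bytes_per_frame), 16)


# Illustrative value: max-dpb=891 (units of 1024 bytes).
cif_frames = dpb_frame_capacity(891, 22, 18)   # 22x18 macroblocks
qcif_frames = dpb_frame_capacity(891, 11, 9)   # 11x9 macroblocks
```

With these illustrative inputs the 22x18 case yields 6 frames, while
the 11x9 case would yield 24 and is therefore capped at 16.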

The value of max-dpb MUST be greater than or

equal to the value of MaxDPB for the level

given in Table A-1 of [1]. Senders MAY use

this knowledge to construct coded video streams

with improved compression.

Informative note: This parameter was added

primarily to complement a similar codepoint

in the ITU-T Recommendation H.245, so as to

facilitate signaling gateway designs. The

decoded picture buffer stores reconstructed

samples and is a property of the video

decoder only. There is no relationship

between the size of the decoded picture

buffer and the buffers used in RTP,

especially de-interleaving and de-jitter

buffers.

max-br: The value of max-br is an integer indicating

the maximum video bit rate in units of 1000

bits per second for the VCL HRD parameters (see

A.3.1 item i of [1]) and in units of 1200 bits

per second for the NAL HRD parameters (see

A.3.1 item j of [1]).

The max-br parameter signals that the video

decoder of the receiver is capable of decoding

video at a higher bit rate than is required by

the signaled level conveyed in the value of the

profile-level-id parameter. The value of max-

br MUST be greater than or equal to the value

of MaxBR for the level given in Table A-1 of

[1].

When max-br is signaled, the video codec of the

receiver MUST be able to decode NAL unit

streams that conform to the signaled level,

conveyed in the profile-level-id parameter,

with the following exceptions in the limits

specified by the level:

o The value of max-br replaces the MaxBR value

of the signaled level (in Table A-1 of [1]).

o When the max-cpb parameter is not present,

the result of the following formula replaces

the value of MaxCPB in Table A-1 of [1]:

(MaxCPB of the signaled level) * max-br /

(MaxBR of the signaled level).

For example, if a receiver signals capability

for Level 1.2 with max-br equal to 1550, this

indicates a maximum video bitrate of 1550

kbits/sec for VCL HRD parameters, a maximum

video bitrate of 1860 kbits/sec for NAL HRD

parameters, and a CPB size of 4036458 bits

(1550000 / 384000 * 1000 * 1000).
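The arithmetic of this example can be reproduced with the following
Python sketch. The function name is hypothetical. The level limits
(MaxBR of 384 and MaxCPB of 1000, both in units of 1000 bits for the
VCL HRD parameters) are the Level 1.2 figures implied by the example
itself; the sketch truncates the derived CPB size to whole bits.

```python
from typing import Tuple


def max_br_limits(max_br: int, level_max_br: int, level_max_cpb: int) -> Tuple[int, int, int]:
    """Return (VCL bit rate, NAL bit rate, derived VCL CPB size in bits).

    max_br        -- signaled max-br value
    level_max_br  -- MaxBR of the signaled level (Table A-1 of [1])
    level_max_cpb -- MaxCPB of the signaled level (Table A-1 of [1])
    """
    vcl_bitrate = max_br * 1000  # units of 1000 bits/s for VCL HRD
    nal_bitrate = max_br * 1200  # units of 1200 bits/s for NAL HRD
    # (MaxCPB of the signaled level) * max-br / (MaxBR of the signaled level),
    # used only when max-cpb is not present.
    cpb_bits = level_max_cpb * 1000 * max_br // level_max_br
    return vcl_bitrate, nal_bitrate, cpb_bits


vcl, nal, cpb = max_br_limits(1550, 384, 1000)
```

The three results are 1550000 bits/sec, 1860000 bits/sec, and
4036458 bits, matching the figures given in the text.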

The value of max-br MUST be greater than or

equal to the value MaxBR for the signaled level

given in Table A-1 of [1].

Senders MAY use this knowledge to send higher

bitrate video as allowed in the level

definition of Annex A of H.264, to achieve

improved video quality.

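Informative note: This parameter was added

primarily to complement a similar codepoint

in the ITU-T Recommendation H.245, so as to

facilitate signaling gateway designs. No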

assumption can be made from the value of

this parameter that the network is capable

of handling such bit rates at any given

time. In particular, no conclusion can be

drawn that the signaled bit rate is

possible under congestion control

constraints.

redundant-pic-cap:

This parameter signals the capabilities of a

receiver implementation. When equal to 0, the

parameter indicates that the receiver makes no

attempt to use redundant coded pictures to

correct incorrectly decoded primary coded

pictures. When equal to 0, the receiver is not

capable of using redundant slices; therefore, a

sender SHOULD avoid sending redundant slices to

save bandwidth. When equal to 1, the receiver

is capable of decoding any such redundant slice

that covers a corrupted area in a primary

decoded picture (at least partly), and therefore

a sender MAY send redundant slices. When the

parameter is not present, then a value of 0

MUST be used for redundant-pic-cap. When

present, the value of redundant-pic-cap MUST be

either 0 or 1.

When the profile-level-id parameter is present

in the same capability signaling as the

redundant-pic-cap parameter, and the profile

indicated in profile-level-id is such that it

disallows the use of redundant coded pictures

(e.g., Main Profile), the value of redundant-

pic-cap MUST be equal to 0. When a receiver

indicates redundant-pic-cap equal to 0, the

received stream SHOULD NOT contain redundant

coded pictures.

Informative note: Even if redundant-pic-cap

is equal to 0, the decoder is able to

ignore redundant coded pictures provided

that the decoder supports such a profile

(Baseline, Extended) in which redundant

coded pictures are allowed.

Informative note: Even if redundant-pic-cap

is equal to 1, the receiver may also choose

other error concealment strategies to

replace or complement decoding of redundant

slices.

sprop-parameter-sets:

This parameter MAY be used to convey

any sequence and picture parameter set NAL

units (herein referred to as the initial

parameter set NAL units) that MUST precede any

other NAL units in decoding order. The

parameter MUST NOT be used to indicate codec

capability in any capability exchange

procedure. The value of the parameter is the

base64 [6] representation of the initial

parameter set NAL units as specified in

sections 7.3.2.1 and 7.3.2.2 of [1]. The

parameter sets are conveyed in decoding order,

and no framing of the parameter set NAL units

takes place. A comma is used to separate any

pair of parameter sets in the list. Note that

the number of bytes in a parameter set NAL unit

is typically less than 10, but a picture

parameter set NAL unit can contain several

hundreds of bytes.

Informative note: When several payload

types are offered in the SDP Offer/Answer

model, each with its own sprop-parameter-

sets parameter, then the receiver cannot

assume that those parameter sets do not use

conflicting storage locations (i.e.,

identical values of parameter set

identifiers). Therefore, a receiver should

double-buffer all sprop-parameter-sets and

make them available to the decoder instance

that decodes a certain payload type.
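A receiver-side decoding of sprop-parameter-sets might look like the
following Python sketch. The function names are hypothetical. The
sketch splits the value at commas, base64-decodes each element, and
reads the NAL unit type from the low five bits of the first byte
(7 for a sequence parameter set, 8 for a picture parameter set). The
sample value is the one used in the SDP examples of this memo.

```python
import base64
from typing import List


def decode_sprop_parameter_sets(value: str) -> List[bytes]:
    """Return the parameter set NAL units in the order they are listed."""
    return [
        base64.b64decode(item, validate=True)
        for item in value.split(",")
        if item
    ]


def nal_unit_type(nal: bytes) -> int:
    """Low five bits of the NAL unit header byte."""
    return nal[0] & 0x1F


parameter_sets = decode_sprop_parameter_sets("Z0IACpZTBYmI,aMljiA==")
types = [nal_unit_type(nal) for nal in parameter_sets]
```

For the sample value, the first element decodes to a 9-byte sequence
parameter set and the second to a 4-byte picture parameter set.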

parameter-add: This parameter MAY be used to signal whether

the receiver of this parameter is allowed to

add parameter sets in its signaling response

using the sprop-parameter-sets MIME parameter.

The value of this parameter is either 0 or 1.

0 is equal to false; i.e., it is not allowed to

add parameter sets. 1 is equal to true; i.e.,

it is allowed to add parameter sets. If the

parameter is not present, its value MUST be 1.

packetization-mode:

This parameter signals the properties of an

RTP payload type or the capabilities of a

receiver implementation. Only a single

configuration point can be indicated; thus,

when capabilities to support more than one

packetization-mode are declared, multiple

configuration points (RTP payload types) must

be used.

When the value of packetization-mode is equal

to 0 or packetization-mode is not present, the

single NAL mode, as defined in section 6.2 of

RFC 3984, MUST be used. This mode is in use in

standards using ITU-T Recommendation H.241 [15]

(see section 12.1). When the value of

packetization-mode is equal to 1, the non-

interleaved mode, as defined in section 6.3 of

RFC 3984, MUST be used. When the value of

packetization-mode is equal to 2, the

interleaved mode, as defined in section 6.4 of

RFC 3984, MUST be used. The value of

packetization-mode MUST be an integer in the

range of 0 to 2, inclusive.

sprop-interleaving-depth:

This parameter MUST NOT be present

when packetization-mode is not present or the

value of packetization-mode is equal to 0 or 1.

This parameter MUST be present when the value

of packetization-mode is equal to 2.

This parameter signals the properties of a NAL

unit stream. It specifies the maximum number

of VCL NAL units that precede any VCL NAL unit

in the NAL unit stream in transmission order

and follow the VCL NAL unit in decoding order.

Consequently, it is guaranteed that receivers

can reconstruct NAL unit decoding order when

the buffer size for NAL unit decoding order

recovery is at least the value of sprop-

interleaving-depth + 1 in terms of VCL NAL

units.

The value of sprop-interleaving-depth MUST be

an integer in the range of 0 to 32767,

inclusive.
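The definition of sprop-interleaving-depth can be made concrete with
a Python sketch. The function name is hypothetical. The input is
assumed to be the list of VCL NAL units in transmission order, each
represented by its position in decoding order; the sketch counts, for
every unit, how many units sent before it come after it in decoding
order, and returns the largest count.

```python
from typing import List


def interleaving_depth(decoding_positions: List[int]) -> int:
    """Largest number of VCL NAL units that precede a unit in transmission
    order while following it in decoding order."""
    depth = 0
    for index, current in enumerate(decoding_positions):
        sent_earlier_decoded_later = sum(
            1 for earlier in decoding_positions[:index] if earlier > current
        )
        depth = max(depth, sent_earlier_decoded_later)
    return depth


# Transmission order 0 2 1 3 5 4 6 8 7 (values are decoding positions).
depth = interleaving_depth([0, 2, 1, 3, 5, 4, 6, 8, 7])
buffer_units = depth + 1  # buffer size sufficient for order recovery
```

For this sequence the depth is 1, so a buffer of two VCL NAL units is
sufficient to recover the decoding order.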

sprop-deint-buf-req:

This parameter MUST NOT be present when

packetization-mode is not present or the value

of packetization-mode is equal to 0 or 1. It

MUST be present when the value of

packetization-mode is equal to 2.

sprop-deint-buf-req signals the required size

of the deinterleaving buffer for the NAL unit

stream. The value of the parameter MUST be

greater than or equal to the maximum buffer

occupancy (in units of bytes) required in such

a deinterleaving buffer that is specified in

section 7.2 of RFC 3984. It is guaranteed that

receivers can perform the deinterleaving of

interleaved NAL units into NAL unit decoding

order, when the deinterleaving buffer size is

at least the value of sprop-deint-buf-req in

terms of bytes.

The value of sprop-deint-buf-req MUST be an

integer in the range of 0 to 4294967295,

inclusive.

Informative note: sprop-deint-buf-req

indicates the required size of the

deinterleaving buffer only. When network

jitter can occur, an appropriately sized

jitter buffer has to be provisioned for

as well.
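The presence and range rules for sprop-interleaving-depth and
sprop-deint-buf-req can be checked mechanically. The following Python
sketch assumes that the fmtp parameters have already been parsed into
a dictionary of strings; the function name and the error texts are
hypothetical.

```python
from typing import Dict, List

# Allowed ranges (inclusive) as stated for each parameter.
RANGES = {
    "sprop-interleaving-depth": (0, 32767),
    "sprop-deint-buf-req": (0, 4294967295),
}


def check_interleaving_params(fmtp: Dict[str, str]) -> List[str]:
    """Return a list of rule violations; an empty list means no violation."""
    mode = int(fmtp.get("packetization-mode", "0"))  # absent is treated as 0
    if mode not in (0, 1, 2):
        return ["packetization-mode MUST be in the range 0 to 2"]
    errors: List[str] = []
    for name, (low, high) in RANGES.items():
        if name not in fmtp:
            if mode == 2:
                errors.append(name + " MUST be present when packetization-mode is 2")
            continue
        if mode != 2:
            errors.append(name + " MUST NOT be present unless packetization-mode is 2")
        elif not low <= int(fmtp[name]) <= high:
            errors.append(name + " is out of range")
    return errors


ok = check_interleaving_params({
    "packetization-mode": "2",
    "sprop-interleaving-depth": "45",
    "sprop-deint-buf-req": "64000",
})
```

A receiver could additionally compare sprop-deint-buf-req against its
own deinterleaving buffer capacity before accepting the stream.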

deint-buf-cap: This parameter signals the capabilities of a

receiver implementation and indicates the

amount of deinterleaving buffer space in units

of bytes that the receiver has available for

reconstructing the NAL unit decoding order. A

receiver is able to handle any stream for which

the value of the sprop-deint-buf-req parameter

is smaller than or equal to this parameter.

If the parameter is not present, then a value

of 0 MUST be used for deint-buf-cap. The value

of deint-buf-cap MUST be an integer in the

range of 0 to 4294967295, inclusive.

Informative note: deint-buf-cap indicates

the maximum possible size of the

deinterleaving buffer of the receiver only.

When network jitter can occur, an

appropriately sized jitter buffer has to

be provisioned for as well.

sprop-init-buf-time:

This parameter MAY be used to signal the

properties of a NAL unit stream. The parameter

MUST NOT be present, if the value of

packetization-mode is equal to 0 or 1.

The parameter signals the initial buffering

time that a receiver MUST buffer before

starting decoding to recover the NAL unit

decoding order from the transmission order.

The parameter is the maximum value of

(transmission time of a NAL unit - decoding

time of the NAL unit), assuming reliable and

instantaneous transmission, the same

timeline for transmission and decoding, and

that decoding starts when the first packet

arrives.

An example of specifying the value of sprop-

init-buf-time follows. A NAL unit stream is

sent in the following interleaved order, in

which the value corresponds to the decoding

time and the transmission order is from left to

right:

0 2 1 3 5 4 6 8 7 ...

Assuming a steady transmission rate of NAL

units, the transmission times are:

0 1 2 3 4 5 6 7 8 ...

Subtracting the decoding time from the

transmission time column-wise results in the

following series:

0 -1 1 0 -1 1 0 -1 1 ...

Thus, in terms of intervals of NAL unit

transmission times, the value of

sprop-init-buf-time in this

example is 1.
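The calculation in this example can be reproduced with a few lines of
Python. The function names are hypothetical. As in the example, the
sketch assumes a steady transmission rate, so that the transmission
time of each NAL unit equals its index. The conversion to 90-kHz
clock ticks assumes a hypothetical rate of 30 NAL units per second,
which is not part of the example.

```python
from typing import List


def init_buf_time_intervals(decoding_times: List[int]) -> int:
    """max(transmission time - decoding time), in NAL unit intervals.

    decoding_times lists the decoding time of each NAL unit in
    transmission order; the transmission time of a unit is its index.
    """
    return max(tx - dec for tx, dec in enumerate(decoding_times))


def intervals_to_ticks(intervals: int, nal_units_per_second: int) -> int:
    """Express a number of NAL unit intervals in ticks of a 90-kHz clock."""
    return intervals * 90000 // nal_units_per_second


intervals = init_buf_time_intervals([0, 2, 1, 3, 5, 4, 6, 8, 7])
ticks = intervals_to_ticks(intervals, 30)  # 30 NAL units/s is an assumption
```

The first function returns 1 for the sequence of the example; at the
assumed rate this corresponds to 3000 clock ticks.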

The parameter is coded as a non-negative base10

integer representation in clock ticks of a 90-

kHz clock. If the parameter is not present,

then no initial buffering time value is

defined. Otherwise the value of sprop-init-

buf-time MUST be an integer in the range of 0

to 4294967295, inclusive.

In addition to the signaled sprop-init-buf-

time, receivers SHOULD take into account the

transmission delay jitter buffering, including

buffering for the delay jitter caused by

mixers, translators, gateways, proxies,

traffic-shapers, and other network elements.

sprop-max-don-diff:

This parameter MAY be used to signal the

properties of a NAL unit stream. It MUST NOT

be used to signal transmitter or receiver or

codec capabilities. The parameter MUST NOT be

present if the value of packetization-mode is

equal to 0 or 1. sprop-max-don-diff is an

integer in the range of 0 to 32767, inclusive.

If sprop-max-don-diff is not present, the value

of the parameter is unspecified. sprop-max-

don-diff is calculated as follows:

sprop-max-don-diff = max{AbsDON(i) -

AbsDON(j)},

for any i and any j>i,

where i and j indicate the index of the NAL

unit in the transmission order and AbsDON

denotes a decoding order number of the NAL

unit that does not wrap around to 0 after

65535. In other words, AbsDON is calculated as

follows: Let m and n be consecutive NAL units

in transmission order. For the very first NAL

unit in transmission order (whose index is 0),

AbsDON(0) = DON(0). For other NAL units,

AbsDON is calculated as follows:

If DON(m) == DON(n), AbsDON(n) = AbsDON(m)

If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),

AbsDON(n) = AbsDON(m) + DON(n) - DON(m)

If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),

AbsDON(n) = AbsDON(m) + 65536 - DON(m) + DON(n)

If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),

AbsDON(n) = AbsDON(m) - (DON(m) + 65536 - DON(n))

If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),

AbsDON(n) = AbsDON(m) - (DON(m) - DON(n))

where DON(i) is the decoding order number of

the NAL unit having index i in the transmission

order. The decoding order number is specified

in section 5.5 of RFC 3984.

Informative note: Receivers may use sprop-

max-don-diff to trigger which NAL units in

the receiver buffer can be passed to the

decoder.
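The AbsDON rules and the sprop-max-don-diff formula of this parameter
can be sketched in Python as follows. The function names are
hypothetical. The input is the list of 16-bit DON values in
transmission order, and the threshold of 32768 is used in each
comparison to decide whether a wrap-around occurred. The sketch
returns the value of the formula as written and does not enforce the
0 to 32767 range.

```python
from typing import List


def abs_don_sequence(dons: List[int]) -> List[int]:
    """Unwrap DON values (transmission order) into AbsDON values."""
    abs_dons: List[int] = []
    for n, don_n in enumerate(dons):
        if n == 0:
            abs_dons.append(don_n)  # AbsDON(0) = DON(0)
            continue
        don_m, abs_m = dons[n - 1], abs_dons[n - 1]
        if don_m == don_n:
            abs_n = abs_m
        elif don_m < don_n and don_n - don_m < 32768:
            abs_n = abs_m + don_n - don_m
        elif don_m > don_n and don_m - don_n >= 32768:
            abs_n = abs_m + 65536 - don_m + don_n  # forward across the wrap
        elif don_m < don_n and don_n - don_m >= 32768:
            abs_n = abs_m - (don_m + 65536 - don_n)  # backward across the wrap
        else:  # don_m > don_n and don_m - don_n < 32768
            abs_n = abs_m - (don_m - don_n)
        abs_dons.append(abs_n)
    return abs_dons


def sprop_max_don_diff(dons: List[int]) -> int:
    """max{AbsDON(i) - AbsDON(j)} for any i and any j > i."""
    abs_dons = abs_don_sequence(dons)
    if len(abs_dons) < 2:
        raise ValueError("need at least two NAL units")
    best = abs_dons[0] - abs_dons[1]
    highest_so_far = abs_dons[0]
    for value in abs_dons[1:]:
        best = max(best, highest_so_far - value)
        highest_so_far = max(highest_so_far, value)
    return best


unwrapped = abs_don_sequence([65534, 65535, 0, 1])
diff = sprop_max_don_diff([0, 2, 1, 3])
```

The first call shows the unwrapping across 65535: the result is
65534, 65535, 65536, 65537. The second call returns 1.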

max-rcmd-nalu-size:

This parameter MAY be used to signal the

capabilities of a receiver. The parameter MUST

NOT be used for any other purposes. The value

of the parameter indicates the largest NALU

size in bytes that the receiver can handle

efficiently. The parameter value is a

recommendation, not a strict upper boundary.

The sender MAY create larger NALUs but must be

aware that the handling of these may come at a

higher cost than NALUs conforming to the

limitation.

The value of max-rcmd-nalu-size MUST be an

integer in the range of 0 to 4294967295,

inclusive. If this parameter is not specified,

no known limitation to the NALU size exists.

Senders still have to consider the MTU size

available between the sender and the receiver

and SHOULD run MTU discovery for this purpose.

This parameter is motivated by, for example, an

IP to H.223 video telephony gateway, where

NALUs smaller than the H.223 transport data

unit will be more efficient. A gateway may

terminate IP; thus, MTU discovery will normally

not work beyond the gateway.

Informative note: Setting this parameter to

a lower than necessary value may have a

negative impact.

Encoding considerations:

This type is only defined for transfer via RTP

(RFC 3550).

A file format of H.264/AVC video is defined in

[29]. This definition is utilized by other

file formats, such as the 3GPP multimedia file

format (MIME type video/3gpp) [30] or the MP4

file format (MIME type video/mp4).

Security considerations:

See section 9 of RFC 3984.

Public specification:

Please refer to RFC 3984 and its section 15.

Additional information:

None

File extensions: none

Macintosh file type code: none

Object identifier or OID: none

Person & email address to contact for further information:

[email protected]

Intended usage: COMMON

Author:

Change controller:

IETF Audio/Video Transport working group

delegated from the IESG.

8.2. SDP Parameters

8.2.1. Mapping of MIME Parameters to SDP

The MIME media type video/H264 string is mapped to fields in the

Session Description Protocol (SDP) [5] as follows:

o The media name in the "m=" line of SDP MUST be video.

o The encoding name in the "a=rtpmap" line of SDP MUST be H264 (the

MIME subtype).

o The clock rate in the "a=rtpmap" line MUST be 90000.

o The OPTIONAL parameters "profile-level-id", "max-mbps", "max-fs",

"max-cpb", "max-dpb", "max-br", "redundant-pic-cap", "sprop-

parameter-sets", "parameter-add", "packetization-mode", "sprop-

interleaving-depth", "deint-buf-cap", "sprop-deint-buf-req",

"sprop-init-buf-time", "sprop-max-don-diff", and "max-rcmd-nalu-

size", when present, MUST be included in the "a=fmtp" line of SDP.

These parameters are expressed as a MIME media type string, in the

form of a semicolon separated list of parameter=value pairs.

An example of media representation in SDP is as follows (Baseline

Profile, Level 3.0, some of the constraints of the Main profile may

not be obeyed):

m=video 49170 RTP/AVP 98

a=rtpmap:98 H264/90000

a=fmtp:98 profile-level-id=42A01E;

sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==
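Reading the "a=fmtp" line back into individual parameters can be
sketched in Python as follows. The function name is hypothetical, and
the sketch assumes that a wrapped line has already been joined into a
single string. Each pair is split at its first "=" only, so that the
"=" padding of base64 values is preserved.

```python
from typing import Dict, Tuple


def parse_fmtp(line: str) -> Tuple[int, Dict[str, str]]:
    """Return (payload type, parameter dictionary) for an a=fmtp line."""
    prefix, _, rest = line.partition(" ")
    if not prefix.startswith("a=fmtp:"):
        raise ValueError("not an a=fmtp line")
    payload_type = int(prefix[len("a=fmtp:"):])
    params: Dict[str, str] = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        name, _, value = item.partition("=")  # split at the first "=" only
        params[name.strip()] = value.strip()
    return payload_type, params


payload_type, params = parse_fmtp(
    "a=fmtp:98 profile-level-id=42A01E; "
    "sprop-parameter-sets=Z0IACpZTBYmI,aMljiA=="
)
```

For the example above, the payload type is 98 and the dictionary
holds the two parameters with their values unchanged.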

8.2.2. Usage with the SDP Offer/Answer Model

When H.264 is offered over RTP using SDP in an Offer/Answer model [7]

for negotiation for unicast usage, the following limitations and

rules apply:

o The parameters identifying a media format configuration for H.264

are "profile-level-id", "packetization-mode", and, if required by

"packetization-mode", "sprop-deint-buf-req". These three

parameters MUST be used symmetrically; i.e., the answerer MUST

either maintain all configuration parameters or remove the media

format (payload type) completely, if one or more of the parameter

values are not supported.

Informative note: The requirement for symmetric use applies

only for the above three parameters and not for the other

stream properties and capability parameters.

To simplify handling and matching of these configurations, the

same RTP payload type number used in the offer SHOULD also be used

in the answer, as specified in [7]. An answer MUST NOT contain a

payload type number used in the offer unless the configuration

("profile-level-id", "packetization-mode", and, if present,

"sprop-deint-buf-req") is the same as in the offer.

Informative note: An offerer, when receiving the answer, has to

compare payload types not declared in the offer based on media

type (i.e., video/h264) and the above three parameters with any

payload types it has already declared, in order to determine

whether the configuration in question is new or equivalent to a

configuration already offered.
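The comparison of media format configurations described above can be
sketched in Python. The function names are hypothetical. Offers and
answers are assumed to be dictionaries that map an RTP payload type
number to its parsed fmtp parameters. The sketch compares
profile-level-id without regard to letter case and treats an absent
packetization-mode as 0; it makes no attempt to model level
downgrading.

```python
from typing import Dict, Optional, Tuple

Fmtp = Dict[str, str]


def configuration_key(fmtp: Fmtp) -> Tuple[str, int, Optional[str]]:
    """The three parameters that identify a media format configuration."""
    return (
        fmtp.get("profile-level-id", "").upper(),
        int(fmtp.get("packetization-mode", "0")),
        fmtp.get("sprop-deint-buf-req"),
    )


def answer_payload_types_ok(offer: Dict[int, Fmtp], answer: Dict[int, Fmtp]) -> bool:
    """True if no payload type number from the offer is reused in the
    answer with a different configuration."""
    return all(
        pt not in offer or configuration_key(offer[pt]) == configuration_key(fmtp)
        for pt, fmtp in answer.items()
    )


offer = {
    98: {"profile-level-id": "42A01E", "packetization-mode": "0"},
    99: {"profile-level-id": "42A01E", "packetization-mode": "1"},
}
kept = answer_payload_types_ok(
    offer, {99: {"profile-level-id": "42a01e", "packetization-mode": "1"}}
)
```

An offerer could use the same key to decide whether a payload type
that first appears in the answer is a new configuration or one that
is equivalent to a configuration it already offered.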

o The parameters "sprop-parameter-sets", "sprop-deint-buf-req",

"sprop-interleaving-depth", "sprop-max-don-diff", and "sprop-

init-buf-time" describe the properties of the NAL unit stream that

the offerer or answerer is sending for this media format

configuration. This differs from the normal usage of the

Offer/Answer parameters: normally such parameters declare the

properties of the stream that the offerer or the answerer is able

to receive. When dealing with H.264, the offerer assumes that the

answerer will be able to receive media encoded using the

configuration being offered.

Informative note: The above parameters apply for any stream

sent by the declaring entity with the same configuration; i.e.,

they are dependent on their source. Rather than being bound to

the payload type, the values may have to be applied to another

payload type when being sent, as they apply for the

configuration.

o The capability parameters ("max-mbps", "max-fs", "max-cpb",

"max-dpb", "max-br", "redundant-pic-cap", "max-rcmd-nalu-size") MAY be

used to declare further capabilities. Their interpretation

depends on the direction attribute. When the direction attribute

is sendonly, then the parameters describe the limits of the RTP

packets and the NAL unit stream that the sender is capable of

producing. When the direction attribute is sendrecv or recvonly,

then the parameters describe the limitations of what the receiver

accepts.

o As specified above, an offerer has to include the size of the

deinterleaving buffer in the offer for an interleaved H.264

stream. To enable the offerer and answerer to inform each other

about their capabilities for deinterleaving buffering, both

parties are RECOMMENDED to include "deint-buf-cap". This

information MAY be used when the value for "sprop-deint-buf-req"

is selected in a second round of offer and answer. For

interleaved streams, it is also RECOMMENDED to consider offering

multiple payload types with different buffering requirements when

the capabilities of the receiver are unknown.

o The "sprop-parameter-sets" parameter is used as described above.

In addition, an answerer MUST maintain all parameter sets received

in the offer in its answer. Depending on the value of the

"parameter-add" parameter, different rules apply: If "parameter-

add" is false (0), the answer MUST NOT add any additional

parameter sets. If "parameter-add" is true (1), the answerer, in

its answer, MAY add additional parameter sets to the "sprop-

parameter-sets" parameter. The answerer MUST also, independent of

the value of "parameter-add", accept to receive a video stream

using the sprop-parameter-sets it declared in the answer.

Informative note: care must be taken when parameter sets are

added not to cause overwriting of already transmitted parameter

sets by using conflicting parameter set identifiers.
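The rules for "sprop-parameter-sets" and "parameter-add" can be
checked with a Python sketch such as the following. The function name
is hypothetical. The sketch compares parameter sets as base64 strings
rather than as decoded bytes, which is an assumption made for
brevity, and it treats an absent "parameter-add" as true.

```python
def answer_parameter_sets_ok(
    offered: str, answered: str, parameter_add: bool = True
) -> bool:
    """Check an answer's sprop-parameter-sets against the offer.

    offered, answered -- comma-separated sprop-parameter-sets values
    parameter_add     -- value of parameter-add in the offer (default true)
    """
    offered_sets = [item for item in offered.split(",") if item]
    answered_sets = [item for item in answered.split(",") if item]
    # The answerer MUST maintain all parameter sets received in the offer.
    if any(item not in answered_sets for item in offered_sets):
        return False
    # Additional parameter sets are allowed only if parameter-add is true.
    added = [item for item in answered_sets if item not in offered_sets]
    return parameter_add or not added


unchanged = answer_parameter_sets_ok(
    "Z0IACpZTBYmI,aMljiA==", "Z0IACpZTBYmI,aMljiA==", parameter_add=False
)
```

The sketch does not detect conflicting parameter set identifiers;
doing so would require parsing the decoded NAL units.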

For streams being delivered over multicast, the following rules apply

in addition:

o The stream properties parameters ("sprop-parameter-sets", "sprop-

deint-buf-req", "sprop-interleaving-depth", "sprop-max-don-diff",

and "sprop-init-buf-time") MUST NOT be changed by the answerer.

Thus, a payload type can either be accepted unaltered or removed.

o The receiver capability parameters "max-mbps", "max-fs", "max-

cpb", "max-dpb", "max-br", and "max-rcmd-nalu-size" MUST be

supported by the answerer for all streams declared as sendrecv or

recvonly; otherwise, one of the following actions MUST be

performed: the media format is removed, or the session rejected.

o The receiver capability parameter redundant-pic-cap SHOULD be

supported by the answerer for all streams declared as sendrecv or

recvonly as follows: The answerer SHOULD NOT include redundant

coded pictures in the transmitted stream if the offerer indicated

redundant-pic-cap equal to 0. Otherwise (when redundant-pic-cap

is equal to 1), it is beyond the scope of this memo to recommend

how the answerer should use redundant coded pictures.

Below are the complete lists of how the different parameters shall be

interpreted in the different combinations of offer or answer and

direction attribute.

o In offers and answers for which "a=sendrecv" or no direction

attribute is used, or in offers and answers for which "a=recvonly"

is used, the following interpretation of the parameters MUST be

used.

Declaring actual configuration or properties for receiving:

- profile-level-id

- packetization-mode

Declaring actual properties of the stream to be sent (applicable

only when "a=sendrecv" or no direction attribute is used):

- sprop-deint-buf-req

- sprop-interleaving-depth

- sprop-parameter-sets

- sprop-max-don-diff

- sprop-init-buf-time

Declaring receiver implementation capabilities:

- max-mbps

- max-fs

- max-cpb

- max-dpb

- max-br

- redundant-pic-cap

- deint-buf-cap

- max-rcmd-nalu-size

Declaring how Offer/Answer negotiation shall be performed:

- parameter-add

o In an offer or answer for which the direction attribute

"a=sendonly" is included for the media stream, the following

interpretation of the parameters MUST be used:

Declaring actual configuration and properties of stream proposed

to be sent:

Declaring the capabilities of the sender when it receives a

stream:

Furthermore, the following considerations are necessary:

o Parameters used for declaring receiver capabilities are in general

downgradable; i.e., they express the upper limit for a sender's

possible behavior. Thus a sender MAY select to set its encoder

using only lower/lesser or equal values of these parameters.

"sprop-parameter-sets" MUST NOT be used in a sender's declaration

of its capabilities, as the limits of the values that are carried

inside the parameter sets are implicit with the profile and level

used.

o Parameters declaring a configuration point are not downgradable,

with the exception of the level part of the "profile-level-id"

parameter. This expresses values a receiver expects to be used

and must be used verbatim on the sender side.

o When a sender's capabilities are declared, and non-downgradable

parameters are used in this declaration, then these parameters

express a configuration that is acceptable. In order to achieve

high interoperability levels, it is often advisable to offer

multiple alternative configurations; e.g., for the packetization

mode. It is impossible to offer multiple configurations in a

single payload type. Thus, when multiple configuration offers are

made, each offer requires its own RTP payload type associated with

the offer.

o A receiver SHOULD understand all MIME parameters, even if it only

supports a subset of the payload format's functionality. This

ensures that a receiver is capable of understanding when an offer

to receive media can be downgraded to what is supported by the

receiver of the offer.

o An answerer MAY extend the offer with additional media format

configurations. However, to enable their usage, in most cases a

second offer is required from the offerer to provide the stream

properties parameters that the media sender will use. This also

has the effect that the offerer has to be able to receive this

media format configuration, not only to send it.

o If an offerer wishes to have non-symmetric capabilities between

sending and receiving, the offerer has to offer different RTP

sessions; i.e., different media lines declared as "recvonly" and

"sendonly", respectively. This may have further implications on

the system.
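
The downgrade rule for the level part of "profile-level-id" can be
sketched as follows. This is a minimal illustration, assuming the
value is six hexadecimal digits that encode profile_idc, profile-iop,
and level_idc as one byte each. The function names are hypothetical,
and the comparison ignores special cases such as Level 1b, so it
approximates the rule rather than implementing it completely.

```python
def split_profile_level_id(value):
    """Split a six-hex-digit profile-level-id into its three bytes."""
    if len(value) != 6:
        raise ValueError("profile-level-id must be six hex digits")
    raw = bytes.fromhex(value)
    # Order assumed here: profile_idc, profile-iop, level_idc.
    return raw[0], raw[1], raw[2]


def is_acceptable_downgrade(offered, selected):
    """True if only the level part was kept or lowered (simplified check)."""
    o_profile, o_iop, o_level = split_profile_level_id(offered)
    s_profile, s_iop, s_level = split_profile_level_id(selected)
    # The profile part must be used verbatim; only the level may go down.
    return (o_profile, o_iop) == (s_profile, s_iop) and s_level <= o_level
```

With the value 42A01E used in the examples of section 8.3, a selection
of 42A015 would pass this check, whereas 42A01F (higher level) or
4DA01E (different profile) would not.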

8.2.3. Usage in Declarative Session Descriptions

When H.264 over RTP is offered with SDP in a declarative style, as in

RTSP [27] or SAP [28], the following considerations are necessary.

o All parameters capable of indicating the properties of both a NAL

unit stream and a receiver are used to indicate the properties of

a NAL unit stream. For example, in this case, the parameter

"profile-level-id" declares the values used by the stream, instead

of the capabilities of the sender. As a result, the following

interpretation of the parameters MUST be used:

Declaring actual configuration or properties:

Not usable:

o A receiver of the SDP is required to support all parameters and

values of the parameters provided; otherwise, the receiver MUST

reject (RTSP) or not participate in (SAP) the session. It falls

on the creator of the session to use values that are expected to

be supported by the receiving application.

8.3. Examples

A SIP Offer/Answer exchange wherein both parties are expected to both

send and receive could look like the following. Only the media codec

specific parts of the SDP are shown. Some lines are wrapped due to

text constraints.

Offerer -> Answerer SDP message:

m=video 49170 RTP/AVP 100 99 98
a=rtpmap:98 H264/90000
a=fmtp:98 profile-level-id=42A01E; packetization-mode=0;
  sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==
a=rtpmap:99 H264/90000
a=fmtp:99 profile-level-id=42A01E; packetization-mode=1;
  sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==
a=rtpmap:100 H264/90000
a=fmtp:100 profile-level-id=42A01E; packetization-mode=2;
  sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==;
  sprop-interleaving-depth=45; sprop-deint-buf-req=64000;
  sprop-init-buf-time=102478; deint-buf-cap=128000

The above offer presents the same codec configuration in three

different packetization formats. PT 98 represents single NALU mode,

PT 99 non-interleaved mode; PT 100 indicates the interleaved mode.

In the interleaved mode case, the interleaving parameters that the

offerer would use if the answer indicates support for PT 100 are also

included. In all three cases the parameter "sprop-parameter-sets"

conveys the initial parameter sets that are required for the answerer

when receiving a stream from the offerer when this configuration

(profile-level-id and packetization mode) is accepted. Note that the

value for "sprop-parameter-sets", although identical in the example

above, could be different for each payload type.
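
To make the structure of the "a=fmtp" lines concrete, the following
sketch splits one such line into a payload type and a dictionary of
parameters. It assumes that wrapped lines have already been joined
into a single line and that parameters are separated by semicolons,
as in the example above. The function name parse_fmtp is hypothetical
and is not part of any SDP library.

```python
def parse_fmtp(line):
    """Parse 'a=fmtp:<pt> k=v; k=v; ...' into (payload_type, params)."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an fmtp attribute")
    pt_text, _, rest = line[len(prefix):].partition(" ")
    params = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        # Split on the first '=' only: base64 values may end in '='.
        key, _, value = item.partition("=")
        params[key.strip()] = value.strip()
    return int(pt_text), params
```

Splitting on the first "=" only is a deliberate choice, as the base64
strings carried in "sprop-parameter-sets" can themselves end in "="
padding characters.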

Answerer -> Offerer SDP message:

m=video 49170 RTP/AVP 100 99 97
a=rtpmap:97 H264/90000
a=fmtp:97 profile-level-id=42A01E; packetization-mode=0;
  sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,
  KyzFGleR
a=rtpmap:99 H264/90000
a=fmtp:99 profile-level-id=42A01E; packetization-mode=1;
  sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,
  KyzFGleR; max-rcmd-nalu-size=3980
a=rtpmap:100 H264/90000
a=fmtp:100 profile-level-id=42A01E; packetization-mode=2;
  sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,
  KyzFGleR; sprop-interleaving-depth=60;
  sprop-deint-buf-req=86000; sprop-init-buf-time=156320;
  deint-buf-cap=128000; max-rcmd-nalu-size=3980

As the Offer/Answer negotiation covers both sending and receiving

streams, an offer indicates the exact parameters for what the offerer

is willing to receive, whereas the answer indicates the same for what

the answerer accepts to receive. In this case the offerer declared

that it is willing to receive payload type 98. The answerer accepts

this by declaring an equivalent payload type 97; i.e., it has

identical values for the three parameters "profile-level-id",

"packetization-mode", and "sprop-deint-buf-req". This has the

following implications for both the offerer and the answerer

concerning the parameters that declare properties. The offerer

initially declared a certain value of the "sprop-parameter-sets" in

the payload definition for PT=98. However, as the answerer accepted

this as PT=97, the values of "sprop-parameter-sets" in PT=98 must now

be used instead when the offerer sends PT=97. Similarly, when the

answerer sends PT=98 to the offerer, it has to use the properties

parameters it declared in PT=97.

The answerer also accepts the reception of the two configurations

that payload types 99 and 100 represent. It provides the initial

parameter sets for the answerer-to-offerer direction, and for

buffering related parameters that it will use to send the payload

types. It also provides the offerer with its memory limit for

deinterleaving operations by providing a "deint-buf-cap" parameter.

This is only useful if the offerer decides on making a second offer,

where it can take the new value into account. The "max-rcmd-nalu-

size" indicates that the answerer can efficiently process NALUs up to

the size of 3980 bytes. However, there is no guarantee that the

network supports this size.

Please note that the parameter sets in the above example do not

represent a legal operation point of an H.264 codec. The base64

strings are only used for illustration.
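
Although the example strings are illustrative only, the way such a
value is unpacked can still be sketched. The code below assumes that
each comma-separated element is the base64 encoding of one parameter
set NAL unit, and that the low five bits of the first octet carry the
NAL unit type (7 for a sequence parameter set, 8 for a picture
parameter set). The function name is hypothetical.

```python
import base64


def describe_parameter_sets(value):
    """Decode a comma-separated sprop-parameter-sets value.

    Returns a list of (nal_unit_type, length_in_bytes) tuples.
    """
    result = []
    for chunk in value.split(","):
        nalu = base64.b64decode(chunk)
        nal_unit_type = nalu[0] & 0x1F  # low five bits of the NAL header octet
        result.append((nal_unit_type, len(nalu)))
    return result
```

Applied to the value from the offer, "Z0IACpZTBYmI,aMljiA==", this
reports one NAL unit of type 7 followed by one of type 8. Strings that
are not well-formed base64 would need to be rejected by the caller.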

8.4. Parameter Set Considerations

The H.264 parameter sets are a fundamental part of the video codec

and vital to its operation; see section 1.2. Due to their

characteristics and their importance for the decoding process, lost

or erroneously transmitted parameter sets can hardly be concealed

locally at the receiver. A reference to a corrupt parameter set

normally has fatal results for the decoding process. Corruption could

occur, for example, due to the erroneous transmission or loss of a

parameter set data structure, but also due to the untimely

transmission of a parameter set update. Therefore, the following

recommendations are provided as a guideline for the implementer of

the RTP sender.

Parameter set NALUs can be transported using three different

principles:

A. Using a session control protocol (out-of-band) prior to the actual

RTP session.

B. Using a session control protocol (out-of-band) during an ongoing

RTP session.

C. Within the RTP stream in the payload (in-band) during an ongoing

RTP session.

It is necessary to implement principles A and B within a session

control protocol. SIP and SDP can be used as described in the SDP

Offer/Answer model and in the previous sections of this memo. This

section contains guidelines on how principles A and B must be

implemented within session control protocols. It is independent of

the particular protocol used. Principle C is supported by the RTP

payload format defined in this specification.

The picture and sequence parameter set NALUs SHOULD NOT be

transmitted in the RTP payload unless reliable transport is provided

for RTP, as a loss of a parameter set of either type will likely

prevent decoding of a considerable portion of the corresponding RTP

stream. Thus, the transmission of parameter sets using a reliable

session control protocol (i.e., usage of principle A or B above) is

RECOMMENDED.

In the rest of the section it is assumed that out-of-band signaling

provides reliable transport of parameter set NALUs and that in-band

transport does not. If in-band signaling of parameter sets is used,

the sender SHOULD take the error characteristics into account and use

mechanisms to provide a high probability for delivering the parameter

sets correctly. Mechanisms that increase the probability for a

correct reception include packet repetition, FEC, and retransmission.

The use of an unreliable, out-of-band control protocol has similar

disadvantages as the in-band signaling (possible loss) and, in

addition, may also lead to difficulties in the synchronization (see

below). Therefore, it is NOT RECOMMENDED.

Parameter sets MAY be added or updated during the lifetime of a

session using principles B and C. It is required that parameter sets

are present at the decoder prior to the NAL units that refer to them.

Updating or adding of parameter sets can result in further problems,

and therefore the following recommendations should be considered.

- When parameter sets are added or updated, principle C is

vulnerable to transmission errors as described above, and

therefore principle B is RECOMMENDED.

- When parameter sets are added or updated, care SHOULD be taken to

ensure that any parameter set is delivered prior to its usage. It

is common that no synchronization is present between out-of-band

signaling and in-band traffic. If out-of-band signaling is used,

it is RECOMMENDED that a sender does not start sending NALUs

requiring the updated parameter sets prior to acknowledgement of

delivery from the signaling protocol.

- When parameter sets are updated, the following synchronization

issue should be taken into account. When overwriting a parameter

set at the receiver, the sender has to ensure that the parameter

set in question is not needed by any NALU present in the network

or receiver buffers. Otherwise, decoding with a wrong parameter

set may occur. To lessen this problem, it is RECOMMENDED either

to overwrite only those parameter sets that have not been used for

a sufficiently long time (to ensure that all related NALUs have

been consumed), or to add a new parameter set instead (which may

have negative consequences for the efficiency of the video

coding).

- When new parameter sets are added, previously unused parameter set

identifiers are used. This avoids the problem identified in the

previous paragraph. However, in a multiparty session, unless a

synchronized control protocol is used, there is a risk that

multiple entities try to add different parameter sets for the same

identifier, which has to be avoided.

- Adding or modifying parameter sets by using both principles B and

C in the same RTP session may lead to inconsistencies of the

parameter sets because of the lack of synchronization between the

control and the RTP channel. Therefore, principles B and C MUST

NOT both be used in the same session unless sufficient

synchronization can be provided.

In some scenarios (e.g., when only the subset of this payload format

specification corresponding to H.241 is used), it is not possible to

employ out-of-band parameter set transmission. In this case,

parameter sets have to be transmitted in-band. Here, the

synchronization with the non-parameter-set-data in the bitstream is

implicit, but the possibility of a loss has to be taken into account.

The loss probability should be reduced using the mechanisms discussed

above.

- When parameter sets are initially provided using principle A and

then later added or updated in-band (principle C), there is a risk

associated with updating the parameter sets delivered out-of-band.

If receivers miss some in-band updates (for example, because of a

loss or a late tune-in), those receivers attempt to decode the

bitstream using out-dated parameters. It is RECOMMENDED that

parameter set IDs be partitioned between the out-of-band and in-

band parameter sets.
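
One way to picture the recommended partitioning of parameter set IDs
is sketched below. The class name and the particular split of the
picture parameter set ID space are assumptions made for illustration
only; this memo does not prescribe any specific ranges.

```python
class ParameterSetIdPlan:
    """Keeps out-of-band and in-band parameter set IDs disjoint."""

    def __init__(self, out_of_band_ids, in_band_ids):
        self.out_of_band_ids = frozenset(out_of_band_ids)
        self.in_band_ids = frozenset(in_band_ids)
        if self.out_of_band_ids & self.in_band_ids:
            raise ValueError("ID ranges must not overlap")

    def may_update_in_band(self, parameter_set_id):
        """In-band updates are allowed only for IDs reserved for in-band use."""
        return parameter_set_id in self.in_band_ids


# Hypothetical split: IDs 0-127 out-of-band, IDs 128-255 in-band.
plan = ParameterSetIdPlan(range(0, 128), range(128, 256))
```

With such a plan, a receiver that missed an in-band update never
decodes with a stale copy of a parameter set that was delivered
out-of-band, because the two sets of IDs never collide.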

To allow for maximum flexibility and best performance from the H.264

coder, it is recommended, if possible, to allow any sender to add its

own parameter sets to be used in a session. Setting the "parameter-

add" parameter to false should only be done in cases where the

session topology prevents a participant from adding its own parameter

sets.

9. Security Considerations

RTP packets using the payload format defined in this specification

are subject to the security considerations discussed in the RTP

specification [4], and in any appropriate RTP profile (for example,

[16]). This implies that confidentiality of the media streams is

achieved by encryption; for example, through the application of SRTP

[26]. Because the data compression used with this payload format is

applied end-to-end, any encryption needs to be performed after

compression.

A potential denial-of-service threat exists for data encodings using

compression techniques that have non-uniform receiver-end

computational load. The attacker can inject pathological datagrams

into the stream that are complex to decode and that cause the

receiver to be overloaded. H.264 is particularly vulnerable to such

attacks, as it is extremely simple to generate datagrams containing

NAL units that affect the decoding process of many future NAL units.

Therefore, the usage of data origin authentication and data integrity

protection of at least the RTP packet is RECOMMENDED; for example,

with SRTP [26].

Note that the appropriate mechanism to ensure confidentiality and

integrity of RTP packets and their payloads is very dependent on the

application and on the transport and signaling protocols employed.

Thus, although SRTP is given as an example above, other possible

choices exist.

Decoders MUST exercise caution with respect to the handling of user

data SEI messages, particularly if they contain active elements, and

MUST restrict their domain of applicability to the presentation

containing the stream.

End-to-end security with authentication, integrity, or

confidentiality protection will prevent a MANE from performing

media-aware operations other than discarding complete packets. In

the case of confidentiality protection, the MANE will even be

prevented from discarding packets in a media-aware way. To allow a

MANE to perform its operations, it must be a trusted entity that is

included in the security context establishment.

10. Congestion Control

Congestion control for RTP SHALL be used in accordance with RFC 3550

[4], and with any applicable RTP profile; e.g., RFC 3551 [16]. An

additional requirement if best-effort service is being used is:

users of this payload format MUST monitor packet loss to ensure that

the packet loss rate is within acceptable parameters. Packet loss is

considered acceptable if a TCP flow across the same network path, and

experiencing the same network conditions, would achieve an average

throughput, measured on a reasonable timescale, that is not less than

the RTP flow is achieving. This condition can be satisfied by

implementing congestion control mechanisms to adapt the transmission

rate (or the number of layers subscribed for a layered multicast

session), or by arranging for a receiver to leave the session if the

loss rate is unacceptably high.
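
The comparison with a TCP flow can be approximated numerically. The
sketch below is not part of this memo: it assumes the widely used
simplified steady-state TCP throughput estimate
(MSS / RTT) * sqrt(3 / (2 * p)), where p is the packet loss rate, and
the function names are hypothetical. A real implementation may prefer
a more detailed TCP model.

```python
import math


def estimated_tcp_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    """Simplified steady-state TCP throughput estimate in bits per second."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_seconds) * math.sqrt(1.5 / loss_rate)


def loss_is_acceptable(rtp_rate_bps, mss_bytes, rtt_seconds, loss_rate):
    """True if the comparable TCP flow would achieve at least the RTP rate."""
    tcp_rate = estimated_tcp_throughput_bps(mss_bytes, rtt_seconds, loss_rate)
    return tcp_rate >= rtp_rate_bps
```

Under these assumptions, a sender that finds loss_is_acceptable()
returning false would reduce its transmission rate, or a receiver
would leave the session.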

The bit rate adaptation necessary for obeying the congestion control

principle is easily achievable when real-time encoding is used.

However, when pre-encoded content is being transmitted, bandwidth

adaptation requires the availability of more than one coded

representation of the same content, at different bit rates, or the

existence of non-reference pictures or sub-sequences [22] in the

bitstream. The switching between the different representations can

normally be performed in the same RTP session; e.g., by employing a

concept known as SI/SP slices of the Extended Profile, or by

switching streams at IDR picture boundaries. Only when non-

downgradable parameters (such as the profile part of the

profile/level ID) are required to be changed does it become necessary

to terminate and re-start the media stream. This may be accomplished

by using a different RTP payload type.

MANEs MAY follow the suggestions outlined in section 7.3 and remove

certain unusable packets from the packet stream when that stream was

damaged due to previous packet losses. This can help reduce the

network load in certain special cases.

11. IANA Consideration

IANA has registered one new MIME type; see section 8.1.

12. Informative Appendix: Application Examples

This payload specification is very flexible in its use, in order to

cover the extremely wide application space anticipated for H.264.

However, this great flexibility also makes it difficult for an

implementer to decide on a reasonable packetization scheme. Some

information on how to apply this specification to real-world

scenarios is likely to appear in the form of academic publications

and a test model software and description in the near future.

However, some preliminary usage scenarios are described here as well.

12.1. Video Telephony according to ITU-T Recommendation H.241

Annex A

H.323-based video telephony systems that use H.264 as an optional

video compression scheme are required to support H.241 Annex A [15]

as a packetization scheme. The packetization mechanism defined in

this Annex is technically identical with a small subset of this

specification.

When a system operates according to H.241 Annex A, parameter set NAL

units are sent in-band. Only Single NAL unit packets are used. Many

such systems are not sending IDR pictures regularly, but only when

required by user interaction or by control protocol means; e.g., when

switching between video channels in a Multipoint Control Unit or for

error recovery requested by feedback.

12.2. Video Telephony, No Slice Data Partitioning, No NAL Unit

Aggregation

The RTP part of this scheme is implemented and tested (though not the

control-protocol part; see below).

In most real-world video telephony applications, picture parameters

such as picture size or optional modes never change during the

lifetime of a connection. Therefore, all necessary parameter sets

(usually only one) are sent as a side effect of the capability

exchange/announcement process, e.g., according to the SDP syntax

specified in section 8.2 of this document. As all necessary

parameter set information is established before the RTP session

starts, there is no need for sending any parameter set NAL units.

Slice data partitioning is not used, either. Thus, the RTP packet

stream basically consists of NAL units that carry single coded

slices.

The encoder chooses the size of coded slice NAL units so that they

offer the best performance. Often, this is done by adapting the

coded slice size to the MTU size of the IP network. For small

picture sizes, this may result in a one-picture-per-one-packet

strategy. Intra refresh algorithms clean up the loss of packets and

the resulting drift-related artifacts.

12.3. Video Telephony, Interleaved Packetization Using NAL Unit

Aggregation

This scheme allows better error concealment and is used in H.263

based designs using RFC 2429 packetization [10]. It has been

implemented, and good results were reported [12].

The VCL encoder codes the source picture so that all macroblocks

(MBs) of one MB line are assigned to one slice. All slices with even

MB row addresses are combined into one STAP, and all slices with odd

MB row addresses into another. Those STAPs are transmitted as RTP

packets. The establishment of the parameter sets is performed as

discussed above.
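
The grouping of slices into two STAPs can be sketched as follows,
assuming one slice per macroblock row, identified by its row index.
The function name is hypothetical, and the construction of the actual
STAP payload is omitted.

```python
def group_slices_by_row_parity(num_mb_rows):
    """Return (even_rows, odd_rows): the MB row indices carried in each STAP."""
    even_rows = [row for row in range(num_mb_rows) if row % 2 == 0]
    odd_rows = [row for row in range(num_mb_rows) if row % 2 == 1]
    return even_rows, odd_rows
```

If one of the two STAPs is lost, every missing macroblock row still
has received neighbors above and below it, which is what makes the
concealment effective.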

Note that the use of STAPs is essential here, as the high number of

individual slices (18 for a CIF picture) would lead to unacceptably

high IP/UDP/RTP header overhead (unless the source coding tool FMO is

used, which is not assumed in this scenario). Furthermore, some

wireless video transmission systems, such as H.324M and the IP-based

video telephony specified in 3GPP, are likely to use relatively small

transport packet size. For example, a typical MTU size of H.223 AL3

SDU is around 100 bytes [17]. Coding individual slices according to

this packetization scheme provides further advantage in communication

between wired and wireless networks, as individual slices are likely

to be smaller than the preferred maximum packet size of wireless

systems. Consequently, a gateway can convert the STAPs used in a

wired network into several RTP packets with only one NAL unit, which

are preferred in a wireless network, and vice versa.

12.4. Video Telephony with Data Partitioning

This scheme has been implemented and has been shown to offer good

performance, especially at higher packet loss rates [12].

Data Partitioning is known to be useful only when some form of

unequal error protection is available. Normally, in single-session

RTP environments, even error characteristics are assumed; i.e., the

packet loss probability of all packets of the session is the same

statistically. However, there are means to reduce the packet loss

probability of individual packets in an RTP session. A FEC packet

according to RFC 2733 [18], for example, specifies which media

packets are associated with the FEC packet.

In all cases, the incurred overhead is substantial but is in the same

order of magnitude as the number of bits that have otherwise been

spent for intra information. However, this mechanism does not add

any delay to the system.

Again, the complete parameter set establishment is performed through

control protocol means.

12.5. Video Telephony or Streaming with FUs and Forward Error

Correction

This scheme has been implemented and has been shown to provide good

performance, especially at higher packet loss rates [19].

The most efficient means to combat packet losses for scenarios where

retransmissions are not applicable is forward error correction (FEC).

Although application layer, end-to-end use of FEC is often less

efficient than an FEC-based protection of individual links

(especially when links of different characteristics are in the

transmission path), application layer, end-to-end FEC is unavoidable

in some scenarios. RFC 2733 [18] provides means to use generic,

application layer, end-to-end FEC in packet-loss environments. A

binary forward error correcting code is generated by applying the XOR

operation to the bits at the same bit position in different packets.

The binary code can be specified by the parameters (n,k) in which k

is the number of information packets used in the connection and n is

the total number of packets generated for k information packets;

i.e., n-k parity packets are generated for k information packets.

When a code is used with parameters (n,k) within the RFC 2733

framework, the following properties are well known:

a) If applied over one RTP packet, RFC 2733 provides only packet

repetition.

b) RFC 2733 is most bit rate efficient if XOR-connected packets have

equal length.

c) At the same packet loss probability p and for a fixed k, the

greater the value of n is, the smaller the residual error

probability becomes. For example, for a packet loss probability

of 10%, k=1, and n=2, the residual error probability is about 1%,

whereas for n=3, the residual error probability is about 0.1%.

d) At the same packet loss probability p and for a fixed code rate

k/n, the greater the value of n is, the smaller the residual error

probability becomes. For example, at a packet loss probability of

p=10%, k=1 and n=2, the residual error rate is about 1%, whereas

for an extended Golay code with k=12 and n=24, the residual error

rate is about 0.01%.
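
The figures given in c) for k=1 can be checked with a short
calculation. The sketch below assumes that packet losses are
independent, so that with k=1 a packet is unrecoverable only if all n
transmitted copies are lost. The function name is hypothetical, and
the calculation does not cover the Golay code mentioned in d).

```python
def residual_loss_repetition(p, n):
    """Probability that all n copies of a packet are lost (k = 1)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability")
    return p ** n
```

For p = 0.1 this gives 0.1 * 0.1 = 0.01 (1%) for n=2 and
0.1 * 0.1 * 0.1 = 0.001 (0.1%) for n=3, in line with the values
quoted above.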

For applying RFC 2733 in combination with H.264 baseline coded video

without using FUs, several options might be considered:

1) The video encoder produces NAL units for which each video frame is

coded in a single slice. Applying FEC, one could use a simple

code; e.g., (n=2, k=1). That is, each NAL unit would basically

just be repeated. The disadvantage is obviously the bad code

performance according to d), above, and the low flexibility, as

only (n, k=1) codes can be used.

2) The video encoder produces NAL units for which each video frame is

encoded in one or more consecutive slices. Applying FEC, one

could use a better code, e.g., (n=24, k=12), over a sequence of

NAL units. Depending on the number of RTP packets per frame, a

loss may introduce a significant delay, which is reduced when more

RTP packets are used per frame. Packets of completely different

length might also be connected, which decreases bit rate

efficiency according to b), above. However, with some care and

for slices of 1kb or larger, similar length (100-200 bytes

difference) may be produced, which will not lower the bit

efficiency catastrophically.

3) The video encoder produces NAL units, for which a certain frame

contains k slices of possibly almost equal length. Then, applying

FEC, a better code, e.g., (n=24, k=12), can be used over the

sequence of NAL units for each frame. The delay compared to that

of 2), above, may be reduced, but several disadvantages are

obvious. First, the coding efficiency of the encoded video is

lowered significantly, as slice-structured coding reduces intra-

frame prediction and additional slice overhead is necessary.

Second, pre-encoded content or, when operating over a gateway, the

video is usually not appropriately coded with k slices such that

FEC can be applied. Finally, the encoding of video producing k

slices of equal length is not straightforward and might require

more than one encoding pass.

Many of the mentioned disadvantages can be avoided by applying FUs in

combination with FEC. Each NAL unit can be split into any number of

FUs of basically equal length; therefore, FEC with a reasonable k and

n can be applied, even if the encoder made no effort to produce

slices of equal length. For example, a coded slice NAL unit

containing an entire frame can be split into k FUs, and a parity check

code (n=k+1, k) can be applied. However, this has the disadvantage

that unless all created fragments can be recovered, the whole slice

will be lost. Thus a larger section is lost than would be if the

frame had been split into several slices.
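
The (n=k+1, k) parity idea can be illustrated on raw byte strings.
The sketch below is a simplification under stated assumptions: it
splits a payload into k equally long fragments (zero-padding the
last), forms one XOR parity block, and rebuilds a single lost
fragment. FU headers and the RFC 2733 packet format are deliberately
left out, and all function names are hypothetical.

```python
def split_equal(payload, k):
    """Split payload into k fragments of equal length, zero-padding the last."""
    size = -(-len(payload) // k)  # ceiling division
    padded = payload.ljust(size * k, b"\x00")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def xor_bytes(blocks):
    """Bytewise XOR of equally sized byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover_missing(fragments, parity):
    """Rebuild the single fragment marked None from the others and the parity."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) != 1:
        raise ValueError("a (k+1, k) parity code repairs exactly one loss")
    present = [f for f in fragments if f is not None]
    repaired = list(fragments)
    repaired[missing[0]] = xor_bytes(present + [parity])
    return repaired
```

The ValueError for more than one missing fragment mirrors the
disadvantage noted above: if the fragments cannot all be recovered,
the whole slice is lost.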

The presented technique makes it possible to achieve good

transmission error tolerance, even if no additional source coding

layer redundancy (such as periodic intra frames) is present.

Consequently, the same coded video sequence can be used to achieve

the maximum compression efficiency and quality over error-free

transmission and for transmission over error-prone networks.

Furthermore, the technique allows the application of FEC to pre-

encoded sequences without adding delay. In this case, pre-encoded

sequences that are not encoded for error-prone networks can still be

transmitted almost reliably without adding extensive delays. In

addition, FUs of equal length result in a bit rate efficient use of

RFC 2733.

If the error probability depends on the length of the transmitted

packet (e.g., in case of mobile transmission [14]), the benefits of

applying FUs with FEC are even more obvious. Basically, the

flexibility of the size of FUs allows appropriate FEC to be applied

for each NAL unit and unequal error protection of NAL units.

When FUs and FEC are used, the incurred overhead is substantial but

is in the same order of magnitude as the number of bits that have to

be spent for intra-coded macroblocks if no FEC is applied. In [19],

it was shown that the FEC-based approach improved overall quality

at the same error rate and the same overall bit rate, including the

overhead.

12.6. Low Bit-Rate Streaming

This scheme has been implemented with H.263 and non-standard RTP

packetization and has given good results [20]. There is no technical

reason why similarly good results could not be achievable with H.264.

In today's Internet streaming, some of the offered bit rates are

relatively low in order to allow terminals with dial-up modems to

access the content. In wired IP networks, relatively large packets,

say 500 - 1500 bytes, are preferred to smaller and more frequently

occurring packets in order to reduce network congestion. Moreover,

use of large packets decreases the amount of RTP/UDP/IP header

overhead. For low bit-rate video, the use of large packets means

that sometimes up to a few pictures should be encapsulated in one

packet.

However, loss of a packet including many coded pictures would have

drastic consequences for visual quality, as there is practically no

other way to conceal a loss of an entire picture than to repeat the

previous one. One way to construct relatively large packets and

maintain possibilities for successful loss concealment is to

construct MTAPs that contain interleaved slices from several

pictures. An MTAP should not contain spatially adjacent slices from

the same picture or spatially overlapping slices from any picture.

If a packet is lost, it is likely that a lost slice is surrounded by

spatially adjacent slices of the same picture and spatially

corresponding slices of the temporally previous and succeeding

pictures. Consequently, concealment of the lost slice is likely to

be relatively successful.

12.7. Robust Packet Scheduling in Video Streaming

Robust packet scheduling has been implemented with MPEG-4 Part 2 and

simulated in a wireless streaming environment [21]. There is no

technical reason why similar or better results could not be

achievable with H.264.

Streaming clients typically have a receiver buffer that is capable of

storing a relatively large amount of data. Initially, when a

streaming session is established, a client does not start playing the

stream back immediately. Rather, it typically buffers the incoming

data for a few seconds. This buffering helps maintain continuous

playback, as, in case of occasional increased transmission delays or

network throughput drops, the client can decode and play buffered

data. Otherwise, without initial buffering, the client has to freeze

the display, stop decoding, and wait for incoming data. The

buffering is also necessary for either automatic or selective

retransmission at any protocol level. If any part of a picture is

lost, a retransmission mechanism may be used to resend the lost data.

If the retransmitted data is received before its scheduled decoding

or playback time, the loss is recovered perfectly. Coded pictures

can be ranked according to their importance in the subjective quality

of the decoded sequence. For example, non-reference pictures, such

as conventional B pictures, are subjectively least important, as

their absence does not affect decoding of any other pictures. In

addition to non-reference pictures, the ITU-T H.264 | ISO/IEC

14496-10 standard includes a temporal scalability method called sub-

sequences [22]. Subjective ranking can also be made on coded slice

data partition or slice group basis. Coded slices and coded slice

data partitions that are subjectively the most important can be sent

earlier than their decoding order indicates, whereas coded slices and

coded slice data partitions that are subjectively the least important

can be sent later than their natural coding order indicates.

Consequently, any retransmitted parts of the most important slices

and coded slice data partitions are more likely to be received before

their scheduled decoding or playback time compared to the least

important slices and slice data partitions.

13. Informative Appendix: Rationale for Decoding Order Number

13.1. Introduction

The Decoding Order Number (DON) concept was introduced mainly to

enable efficient multi-picture slice interleaving (see section 12.6)

and robust packet scheduling (see section 12.7). In both of these

applications, NAL units are transmitted out of decoding order. DON

indicates the decoding order of NAL units and should be used in the

receiver to recover the decoding order. Example use cases for

efficient multi-picture slice interleaving and for robust packet

scheduling are given in sections 13.2 and 13.3, respectively.

Section 13.4 describes the benefits of the DON concept in error

resiliency achieved by redundant coded pictures. Section 13.5

summarizes considered alternatives to DON and justifies why DON was

chosen for this RTP payload specification.
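
To illustrate how a receiver might compare two DON values, the following Python sketch restates the don_diff function of section 5.5. It assumes that DON is a 16-bit value that wraps around modulo 65536; the constant names and argument names are chosen here for illustration only.

```python
DON_MODULUS = 65536   # assumption: DON is a 16-bit counter that wraps around
HALF_RANGE = 32768    # half of the DON value space


def don_diff(don_m: int, don_n: int) -> int:
    """Return a positive value if NAL unit n follows NAL unit m in decoding order."""
    if don_m == don_n:
        return 0
    if don_m < don_n:
        gap = don_n - don_m
        # A small forward gap means n follows m; a large one means DON wrapped.
        return gap if gap < HALF_RANGE else -(don_m + DON_MODULUS - don_n)
    gap = don_m - don_n
    # A large backward gap means n is ahead of m across the wrap-around point.
    return DON_MODULUS - don_m + don_n if gap >= HALF_RANGE else -gap
```

A positive result means that the second NAL unit follows the first in decoding order, so a receiver could use the sign of don_diff as its ordering criterion.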

13.2. Example of Multi-Picture Slice Interleaving

An example of multi-picture slice interleaving follows. A subset of

a coded video sequence is depicted below in output order. R denotes

a reference picture, N denotes a non-reference picture, and the

number indicates a relative output time.

... R1 N2 R3 N4 R5 ...

The decoding order of these pictures from left to right is as

follows:

... R1 R3 N2 R5 N4 ...

The NAL units of pictures R1, R3, N2, R5, and N4 are marked with a

DON equal to 1, 2, 3, 4, and 5, respectively.

Each reference picture consists of three slice groups that are

scattered as follows (a number denotes the slice group number for

each macroblock in a QCIF frame):

0 1 2 0 1 2 0 1 2 0 1

2 0 1 2 0 1 2 0 1 2 0

1 2 0 1 2 0 1 2 0 1 2
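
The scattered pattern above can be reproduced with a simple modulo rule. The Python sketch below assumes that the pattern continues in the same way over the whole QCIF frame (11 x 9 macroblocks) and that the slice group is the macroblock address in raster-scan order modulo 3; the function names are hypothetical.

```python
QCIF_MB_COLS = 11   # a QCIF frame (176x144 luma samples) is 11 macroblocks wide
QCIF_MB_ROWS = 9    # and 9 macroblocks high


def slice_group_of(row: int, col: int, num_groups: int = 3) -> int:
    """Slice group of the macroblock at (row, col), taken as raster address mod num_groups."""
    return (row * QCIF_MB_COLS + col) % num_groups


def slice_group_map(rows: int = QCIF_MB_ROWS, cols: int = QCIF_MB_COLS):
    """Build the full map as a list of rows."""
    return [[slice_group_of(r, c) for c in range(cols)] for r in range(rows)]


# Print the first three rows, which correspond to the rows shown in the text.
for row in slice_group_map()[:3]:
    print(" ".join(str(group) for group in row))
```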

For the sake of simplicity, we assume that all the macroblocks of a

slice group are included in one slice. Three MTAPs are constructed

from three consecutive reference pictures so that each MTAP contains

three aggregation units, each of which contains all the macroblocks

from one slice group. The first MTAP contains slice group 0 of

picture R1, slice group 1 of picture R3, and slice group 2 of

picture R5. The second MTAP contains slice group 1 of picture R1,

slice group 2 of picture R3, and slice group 0 of picture R5. The

third MTAP contains slice group 2 of picture R1, slice group 0 of

picture R3, and slice group 1 of picture R5. Each non-reference

picture is encapsulated into an STAP-B.
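
The packet construction described above can be sketched in Python as follows. The sketch assumes that the k-th MTAP carries slice group (k + i) mod 3 of the i-th reference picture, which matches the three MTAPs listed in the text; the data layout and the function name are illustrative only.

```python
REFERENCE_PICTURES = [("R1", 1), ("R3", 2), ("R5", 4)]   # (picture, DON)
NON_REFERENCE_PICTURES = [("N2", 3), ("N4", 5)]          # (picture, DON)
NUM_SLICE_GROUPS = 3


def build_packets():
    """Return (packet_type, [(picture, slice_group, don), ...]) tuples in sending order."""
    packets = []
    for k in range(NUM_SLICE_GROUPS):
        # Rotate the slice group by one for each successive reference picture.
        units = [(pic, (k + i) % NUM_SLICE_GROUPS, don)
                 for i, (pic, don) in enumerate(REFERENCE_PICTURES)]
        packets.append(("MTAP", units))
    for pic, don in NON_REFERENCE_PICTURES:
        # Each non-reference picture travels alone in its own STAP-B.
        packets.append(("STAP-B", [(pic, None, don)]))
    return packets


for offset, (kind, units) in enumerate(build_packets()):
    for pic, group, don in units:
        where = f"slice group {group}, " if group is not None else ""
        print(f"{pic}, {where}DON {don}, carried in {kind}, RTP SN: N+{offset}")
```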

Consequently, the transmission order of NAL units is the following:

R1, slice group 0, DON 1, carried in MTAP, RTP SN: N

R3, slice group 1, DON 2, carried in MTAP, RTP SN: N

R5, slice group 2, DON 4, carried in MTAP, RTP SN: N

R1, slice group 1, DON 1, carried in MTAP, RTP SN: N+1

R3, slice group 2, DON 2, carried in MTAP, RTP SN: N+1

R5, slice group 0, DON 4, carried in MTAP, RTP SN: N+1

R1, slice group 2, DON 1, carried in MTAP, RTP SN: N+2

R3, slice group 0, DON 2, carried in MTAP, RTP SN: N+2

R5, slice group 1, DON 4, carried in MTAP, RTP SN: N+2

N2, DON 3, carried in STAP-B, RTP SN: N+3

N4, DON 5, carried in STAP-B, RTP SN: N+4

The receiver is able to organize the NAL units back in decoding order

based on the value of DON associated with each NAL unit.
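
A minimal Python sketch of this reordering step follows. It assumes that the DON values in the buffer do not wrap around, so a plain numeric sort is sufficient; the tuple layout and the function name are illustrative only.

```python
# Each entry: (picture, slice group or None, DON), listed in reception order.
RECEIVED = [
    ("R1", 0, 1), ("R3", 1, 2), ("R5", 2, 4),   # first MTAP
    ("R1", 1, 1), ("R3", 2, 2), ("R5", 0, 4),   # second MTAP
    ("R1", 2, 1), ("R3", 0, 2), ("R5", 1, 4),   # third MTAP
    ("N2", None, 3),                            # STAP-B
    ("N4", None, 5),                            # STAP-B
]


def recover_decoding_order(nal_units):
    """Sort NAL units by DON; the stable sort keeps arrival order among equal DONs."""
    return sorted(nal_units, key=lambda unit: unit[2])


decoding_order = recover_decoding_order(RECEIVED)
# Collapse the per-slice entries into the order in which pictures are decoded.
pictures = list(dict.fromkeys(pic for pic, _group, _don in decoding_order))
print(pictures)
```

The pictures come out in the order R1, R3, N2, R5, N4, which is the decoding order given earlier in this section.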

If one of the MTAPs is lost, the spatially adjacent and temporally

co-located macroblocks are received and can be used to conceal the

loss efficiently. If one of the STAPs is lost, the effect of the

loss does not propagate temporally.

13.3. Example of Robust Packet Scheduling

An example of robust packet scheduling follows. The communication

system used in the example consists of the following components in

the order that the video is processed from source to sink:

o camera and capturing

o pre-encoding buffer

o encoder

o encoded picture buffer

o transmitter

o transmission channel

o receiver

o receiver buffer

o decoder

o decoded picture buffer

o display

The video communication system used in the example operates as

follows. Note that processing of the video stream happens gradually

and at the same time in all components of the system. The source

video sequence is shot and captured to a pre-encoding buffer. The

pre-encoding buffer can be used to order pictures from sampling order

to encoding order or to analyze multiple uncompressed frames for bit

rate control purposes, for example. In some cases, the pre-encoding

buffer may not exist; instead, the sampled pictures are encoded right

away. The encoder encodes pictures from the pre-encoding buffer and

stores the output, i.e., the coded pictures, in the encoded picture

buffer. The transmitter encapsulates the coded pictures from the

encoded picture buffer to transmission packets and sends them to a

receiver through a transmission channel. The receiver stores the

received packets to the receiver buffer. The receiver buffering

process typically includes buffering for transmission delay jitter.

The receiver buffer can also be used to recover correct decoding

order of coded data. The decoder reads coded data from the receiver

buffer and produces decoded pictures as output into the decoded

picture buffer. The decoded picture buffer is used to recover the

output (or display) order of pictures. Finally, pictures are

displayed.

In the following example figures, I denotes an IDR picture, R denotes

a reference picture, N denotes a non-reference picture, and the

number after I, R, or N indicates the sampling time relative to the

previous IDR picture in decoding order. Values below the sequence of

pictures indicate scaled system clock timestamps. The system clock

is initialized arbitrarily in this example, and time runs from left

to right. Each I, R, and N picture is mapped into the same timeline

compared to the previous processing step, if any, assuming that

encoding, transmission, and decoding take no time. Thus, events

happening at the same time are located in the same column throughout

all example figures.

A subset of a sequence of coded pictures is depicted below in

sampling order.

...  N58 N59 I00 N01 N02 R03 N04 N05 R06 ... N58 N59 I00 N01 ...
... --|---|---|---|---|---|---|---|---|- ... -|---|---|---|- ...
...   58  59  60  61  62  63  64  65  66 ... 128 129 130 131 ...

Figure 16. Sequence of pictures in sampling order

The sampled pictures are buffered in the pre-encoding buffer to

arrange them in encoding order. In this example, we assume that the

non-reference pictures are predicted from both the previous and the

next reference picture in output order, except for the non-reference

pictures immediately preceding an IDR picture, which are predicted

only from the previous reference picture in output order. Thus, the

pre-encoding buffer has to contain at least two pictures, and the

buffering causes a delay of two picture intervals. The output of the

pre-encoding buffering process and the encoding (and decoding) order

of the pictures are as follows:

... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...
... -|---|---|---|---|---|---|---|---|- ...
...  60  61  62  63  64  65  66  67  68 ...

Figure 17. Re-ordered pictures in the pre-encoding buffer

The encoder or the transmitter can set the value of DON for each

picture to the value of DON of the previous picture in decoding order

plus one.
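
This assignment rule can be sketched in Python as shown below. The sketch assumes that DON is a 16-bit value, so the counter wraps around modulo 65536; the starting value and the function name are arbitrary choices made for illustration.

```python
DON_MODULUS = 65536  # assumption: DON is carried as a 16-bit value


def assign_don(pictures_in_decoding_order, first_don=0):
    """Pair each picture with a DON that increases by one per picture, modulo 2^16."""
    return [(pic, (first_don + i) % DON_MODULUS)
            for i, pic in enumerate(pictures_in_decoding_order)]


decoding_order = ["N58", "N59", "I00", "R03", "N01", "N02", "R06", "N04", "N05"]
# Start close to the wrap-around point to show that the counter rolls over.
for pic, don in assign_don(decoding_order, first_don=65534):
    print(pic, don)
```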

For the sake of simplicity, let us assume that:

o the frame rate of the sequence is constant,

o each picture consists of only one slice,

o each slice is encapsulated in a single NAL unit packet,

o there is no transmission delay, and

o pictures are transmitted at constant intervals (that is, 1 / frame

rate).

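When pictures are transmitted in decoding order, they are received as

follows:

... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...
... -|---|---|---|---|---|---|---|---|- ...
...  60  61  62  63  64  65  66  67  68 ...

Figure 18. Received pictures in decoding order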

The OPTIONAL sprop-interleaving-depth MIME type parameter is set to

0, as the transmission (or reception) order is identical to the

decoding order.

The decoder has to buffer for one picture interval initially in its

decoded picture buffer to organize pictures from decoding order to

output order as depicted below:

... N58 N59 I00 N01 N02 R03 N04 N05 R06 ...
... -|---|---|---|---|---|---|---|---|- ...
...  61  62  63  64  65  66  67  68  69 ...

Figure 19. Output order

The amount of required initial buffering in the decoded picture

buffer can be signaled in the buffering period SEI message or with

the num_reorder_frames syntax element of H.264 video usability

information. num_reorder_frames indicates the maximum number of

frames, complementary field pairs, or non-paired fields that precede

any frame, complementary field pair, or non-paired field in the

sequence in decoding order and that follow it in output order. For

the sake of simplicity, we assume that num_reorder_frames is used to

indicate the initial buffer in the decoded picture buffer. In this

example, num_reorder_frames is equal to 1.

It can be observed that if the IDR picture I00 is lost during

transmission and a retransmission request is issued when the value of

the system clock is 62, there is one picture interval of time (until

the system clock reaches timestamp 63) to receive the retransmitted

IDR picture I00.

Let us then assume that IDR pictures are transmitted two frame

intervals earlier than their decoding position; i.e., the pictures

are transmitted as follows:

...  I00 N58 N59 R03 N01 N02 R06 N04 N05 ...
... --|---|---|---|---|---|---|---|---|- ...
...   62  63  64  65  66  67  68  69  70 ...

Figure 20. Interleaving: Early IDR pictures in sending order

The OPTIONAL sprop-interleaving-depth MIME type parameter is set

equal to 1 according to its definition. (The value of

sprop-interleaving-depth in this example can be derived as follows:

Picture I00 is the only picture preceding picture N58 or N59 in

transmission order and following it in decoding order. Except for

pictures I00, N58, and N59, the transmission order is the same as the

decoding order of pictures. As a coded picture is encapsulated into

exactly one NAL unit, the value of sprop-interleaving-depth is equal

to the maximum number of pictures preceding any picture in

transmission order and following the picture in decoding order.)
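
The derivation above can be checked with a short Python sketch. It assumes, as in this example, that each picture is carried in exactly one NAL unit and that picture labels are unique within the examined subset; the function name is hypothetical.

```python
def interleaving_depth(transmission_order, decoding_order):
    """Maximum number of pictures that precede a picture in transmission
    order but follow it in decoding order."""
    decoding_pos = {pic: i for i, pic in enumerate(decoding_order)}
    depth = 0
    for i, pic in enumerate(transmission_order):
        # Count earlier-sent pictures that are decoded after this one.
        ahead = sum(1 for earlier in transmission_order[:i]
                    if decoding_pos[earlier] > decoding_pos[pic])
        depth = max(depth, ahead)
    return depth


decoding = ["N58", "N59", "I00", "R03", "N01", "N02", "R06", "N04", "N05"]
sending = ["I00", "N58", "N59", "R03", "N01", "N02", "R06", "N04", "N05"]
print(interleaving_depth(sending, decoding))    # early IDR transmission
print(interleaving_depth(decoding, decoding))   # transmission in decoding order
```

For the sending order of Figure 20 the function yields 1, and for transmission in decoding order it yields 0, in line with the parameter values stated in the text.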

The receiver buffering process contains two pictures at a time

according to the value of the sprop-interleaving-depth parameter and

orders pictures from the reception order to the correct decoding

order based on the value of DON associated with each picture. The

output of the receiver buffering process is as follows:

... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...
... -|---|---|---|---|---|---|---|---|- ...
...  63  64  65  66  67  68  69  70  71 ...

Figure 21. Interleaving: Receiver buffer
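
The receiver buffering step can be sketched in Python as follows. This is a simplified illustration, not the normative de-interleaving process: it assumes one NAL unit per picture, DON values that do not wrap around, and a rule that releases the NAL unit with the smallest DON whenever the buffer holds more than sprop-interleaving-depth units. The DON values are relative numbers chosen for this example.

```python
import heapq


def deinterleave(received, depth):
    """Yield pictures in decoding order from (picture, DON) pairs in reception order."""
    buffer = []
    for arrival, (pic, don) in enumerate(received):
        # The arrival index breaks ties between equal DON values.
        heapq.heappush(buffer, (don, arrival, pic))
        if len(buffer) > depth:
            yield heapq.heappop(buffer)[2]
    while buffer:   # flush what is left at the end of the stream
        yield heapq.heappop(buffer)[2]


# Reception order of Figure 20, with DON following the decoding order.
received = [("I00", 3), ("N58", 1), ("N59", 2), ("R03", 4), ("N01", 5),
            ("N02", 6), ("R06", 7), ("N04", 8), ("N05", 9)]
print(list(deinterleave(received, depth=1)))
```

With a depth of 1 the buffer briefly holds two pictures before releasing one, and the pictures leave the buffer in the decoding order shown in Figure 21.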

Again, an initial buffering delay of one picture interval is needed

to organize pictures from decoding order to output order, as depicted

below:

... N58 N59 I00 N01 N02 R03 N04 N05 ...
... -|---|---|---|---|---|---|---|- ...
...  64  65  66  67  68  69  70  71 ...

Figure 22. Interleaving: Receiver buffer after reordering

Note that the maximum delay that IDR pictures can undergo during

transmission, including possible application, transport, or link

layer retransmission, is equal to three picture intervals. Thus, the

loss resiliency of IDR pictures is improved in systems supporting

retransmission compared to the case in which pictures were

transmitted in their decoding order.

13.4. Robust Transmission Scheduling of Redundant Coded Slices

A redundant coded picture is a coded representation of a picture or a

part of a picture that is not used in the decoding process if the

corresponding primary coded picture is correctly decoded. There

should be no noticeable difference between any area of the decoded

primary picture and a corresponding area that would result from

application of the H.264 decoding process for any redundant picture

in the same access unit. A redundant coded slice is a coded slice

that is a part of a redundant coded picture.

Redundant coded pictures can be used to provide unequal error

protection in error-prone video transmission. If a primary coded

representation of a picture is decoded incorrectly, a corresponding

redundant coded picture can be decoded. Examples of applications and

coding techniques using the redundant coded picture feature include

the video redundancy coding [23] and the protection of "key pictures"

in multicast streaming [24].

One property of many error-prone video communications systems is that

transmission errors are often bursty. Therefore, they may affect

more than one consecutive transmission packet in transmission order.

In low bit-rate video communication, it is relatively common that an

entire coded picture can be encapsulated into one transmission

packet. Consequently, a primary coded picture and the corresponding

redundant coded pictures may be transmitted in consecutive packets in

transmission order. To make the transmission scheme more tolerant of

bursty transmission errors, it is beneficial to transmit the primary

coded picture and redundant coded picture separated by more than a

single packet. The DON concept enables this.

13.5. Remarks on Other Design Possibilities

The slice header syntax structure of the H.264 coding standard

contains the frame_num syntax element that can indicate the decoding

order of coded frames. However, using the frame_num syntax element

to recover the decoding order is neither feasible nor desirable, for

the following reasons:

o The receiver is required to parse at least one slice header per

coded picture (before passing the coded data to the decoder).

o Coded slices from multiple coded video sequences cannot be

interleaved, as the frame_num syntax element is reset to 0 in

each IDR picture.

o The coded fields of a complementary field pair share the same

value of the frame_num syntax element. Thus, the decoding order

of the coded fields of a complementary field pair cannot be

recovered based on the frame_num syntax element or any other

syntax element of the H.264 coding syntax.

The RTP payload format for transport of MPEG-4 elementary streams

[25] enables interleaving of access units and transmission of

multiple access units in the same RTP packet. An access unit is

specified in the H.264 coding standard to comprise all NAL units

associated with a primary coded picture according to subclause

7.4.1.2 of [1]. Consequently, slices of different pictures cannot be

interleaved, and the multi-picture slice interleaving technique (see

section 12.6) for improved error resilience cannot be used.

14. Acknowledgements

The authors thank Roni Even, Dave Lindbergh, Philippe Gentric,

Gonzalo Camarillo, Gary Sullivan, Joerg Ott, and Colin Perkins for

careful review.

15. References

15.1. Normative References

[1] ITU-T Recommendation H.264, "Advanced video coding for generic

audiovisual services", May 2003.

[2] ISO/IEC International Standard 14496-10:2003.

[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement

Levels", BCP 14, RFC 2119, March 1997.

[4] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,

"RTP: A Transport Protocol for Real-Time Applications", STD 64,

RFC 3550, July 2003.

[5] Handley, M. and V. Jacobson, "SDP: Session Description

Protocol", RFC 2327, April 1998.

[6] Josefsson, S., "The Base16, Base32, and Base64 Data Encodings",

RFC 3548, July 2003.

[7] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with

Session Description Protocol (SDP)", RFC 3264, June 2002.

15.2. Informative References

[8] "Draft ITU-T Recommendation and Final Draft International

Standard of Joint Video Specification (ITU-T Rec. H.264 |

ISO/IEC 14496-10 AVC)", available from

http://ftp3.itu.int/av-arch/jvt-site/2003_03_Pattaya/JVT-G050r1.zip,

May 2003.

[9] Luthra, A., Sullivan, G.J., and T. Wiegand (eds.), Special Issue

on H.264/AVC, IEEE Transactions on Circuits and Systems for Video

Technology, July 2003.

[10] Bormann, C., Cline, L., Deisher, G., Gardos, T., Maciocco, C.,

Newell, D., Ott, J., Sullivan, G., Wenger, S., and C. Zhu, "RTP

Payload Format for the 1998 Version of ITU-T Rec. H.263 Video

(H.263+)", RFC 2429, October 1998.

[11] ISO/IEC IS 14496-2.

[12] Wenger, S., "H.26L over IP", IEEE Transactions on Circuits and

Systems for Video Technology, Vol. 13, No. 7, July 2003.

[13] Wenger, S., "H.26L over IP: The IP Network Adaptation Layer",

Proceedings Packet Video Workshop 02, April 2002.

[14] Stockhammer, T., Hannuksela, M.M., and S. Wenger, "H.26L/JVT

Coding Network Abstraction Layer and IP-based Transport" in

Proc. ICIP 2002, Rochester, NY, September 2002.

[15] ITU-T Recommendation H.241, "Extended video procedures and

control signals for H.300 series terminals", 2004.

[16] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video

Conferences with Minimal Control", STD 65, RFC 3551, July 2003.

[17] ITU-T Recommendation H.223, "Multiplexing protocol for low bit

rate multimedia communication", July 2001.

[18] Rosenberg, J. and H. Schulzrinne, "An RTP Payload Format for

Generic Forward Error Correction", RFC 2733, December 1999.

[19] Stockhammer, T., Wiegand, T., Oelbaum, T., and F. Obermeier,

"Video Coding and Transport Layer Techniques for H.264/AVC-Based

Transmission over Packet-Lossy Networks", IEEE International

Conference on Image Processing (ICIP 2003), Barcelona, Spain,

September 2003.

[20] Varsa, V. and M. Karczewicz, "Slice interleaving in compressed

video packetization", Packet Video Workshop 2000.

[21] Kang, S.H. and A. Zakhor, "Packet scheduling algorithm for

wireless video streaming", International Packet Video Workshop

2002.

[22] Hannuksela, M.M., "Enhanced concept of GOP", JVT-B042, available

from http://ftp3.itu.int/av-arch/video-site/0201_Gen/JVT-B042.doc,

January 2002.

[23] Wenger, S., "Video Redundancy Coding in H.263+", 1997

International Workshop on Audio-Visual Services over Packet

Networks, September 1997.

[24] Wang, Y.-K., Hannuksela, M.M., and M. Gabbouj, "Error Resilient

Video Coding Using Unequally Protected Key Pictures", in Proc.

International Workshop VLBV03, September 2003.

[25] van der Meer, J., Mackie, D., Swaminathan, V., Singer, D., and

P. Gentric, "RTP Payload Format for Transport of MPEG-4

Elementary Streams", RFC 3640, November 2003.

[26] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.

Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC

3711, March 2004.

[27] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming

Protocol (RTSP)", RFC 2326, April 1998.

[28] Handley, M., Perkins, C., and E. Whelan, "Session Announcement

Protocol", RFC 2974, October 2000.

[29] ISO/IEC 14496-15: "Information technology - Coding of

audio-visual objects - Part 15: Advanced Video Coding (AVC) file

format".

[30] Castagno, R. and D. Singer, "MIME Type Registrations for 3rd

Generation Partnership Project (3GPP) Multimedia files", RFC

3839, July 2004.

Authors' Addresses

Stephan Wenger

TU Berlin / Teles AG

Franklinstr. 28-29

D-10587 Berlin

Germany

Phone: +49-172-300-0813

EMail: [email protected]

Miska M. Hannuksela

Nokia Corporation

P.O. Box 100

33721 Tampere

Finland

Phone: +358-7180-73151

EMail: [email protected]

Thomas Stockhammer

Nomor Research

D-83346 Bergen

Phone: +49-8662-419407

EMail: [email protected]

Magnus Westerlund

Multimedia Technologies

Ericsson Research EAB/TVA/A

Ericsson AB

Torshamsgatan 23

SE-164 80 Stockholm

Sweden

Phone: +46-8-7190000

EMail: [email protected]

David Singer

QuickTime Engineering

Apple

1 Infinite Loop MS 302-3MT

Cupertino

CA 95014

USA

Phone: +1 408 974-3162

EMail: [email protected]

Full Copyright Statement

This document is subject to the rights, licenses and restrictions

contained in BCP 78, and except as set forth therein, the authors

retain all their rights.

This document and the information contained herein are provided on an

"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS

OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET

ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,

INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE

INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED

WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any

Intellectual Property Rights or other rights that might be claimed to

pertain to the implementation or use of the technology described in

this document or the extent to which any license under such rights

might or might not be available; nor does it represent that it has

made any independent effort to identify any such rights. Information

on the IETF's procedures with respect to rights in IETF Documents can

be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any

assurances of licenses to be made available, or the result of an

attempt made to obtain a general license or permission for the use of

such proprietary rights by implementers or users of this

specification can be obtained from the IETF on-line IPR repository at

http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any

copyrights, patents or patent applications, or other proprietary

rights that may cover technology that may be required to implement

this standard. Please address the information to the IETF at ietf-

[email protected].

Acknowledgement

Funding for the RFC Editor function is currently provided by the

Internet Society.
