Substantial expansion of Ogg container overview document; still requires filling in of several

references by not-yet-present examples.


svn path=/trunk/ogg/; revision=16991
Monty 2010-03-20 06:32:37 +00:00
parent 92018068d3
commit 8528ebe3ea


@ -70,135 +70,397 @@ li {
<a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.org"/></a>
</div>
<h1>Ogg logical and physical bitstream overview</h1>
<h1>Ogg bitstream overview</h1>
<h2>Ogg bitstreams</h2>
This document serves as a starting point for understanding the design
and implementation of the Ogg container format. If you're new to Ogg
or merely want a high-level technical overview, start reading here.
Other documents linked from the <a href="index.html">index page</a>
give distilled technical descriptions and references of the container
mechanisms. This document is intended to aid understanding.
<p>Ogg codecs use octet vectors of raw, compressed data
(<em>packets</em>). These compressed packets do not have any
high-level structure or boundary information; strung together, they
appear to be streams of random bytes with no landmarks.</p>
<h2>Container format design points</h2>
<p>Raw packets may be used directly by transport mechanisms that provide
their own framing and packet-separation mechanisms (such as UDP
datagrams). For stream based storage (such as files) and transport
(such as TCP streams or pipes), Vorbis and other future Ogg codecs use
the Ogg bitstream format to provide framing/sync, sync recapture
after error, landmarks during seeking, and enough information to
properly separate data back into packets at the original packet
boundaries without relying on decoding to find packet boundaries.</p>
<p>Ogg is intended to be a simplest-possible container, concerned only
with framing, ordering, and interleave. It can be used as a stream delivery
mechanism, for media file storage, or as a building block toward
implementing a more complex, non-linear container (for example, see
the <a href="skeleton.html">Skeleton</a> or <a
href="http://en.wikipedia.org/wiki/Annodex">Annodex/CMML</a>).
<h2>Logical and physical bitstreams</h2>
<p>The Ogg container is not intended to be a monolithic
'kitchen-sink'. It exists only to frame and deliver in-order stream
data and as such is vastly simpler than most other containers.
Elementary and multiplexed streams are both constructed entirely from a
single building block (an Ogg page) composed of eight fields
totalling twenty-seven bytes (the page header), a list of packet lengths
(up to 255 bytes), and payload data (up to 65025 bytes). The structure
of every page is the same. There are no optional fields or alternate
encodings.
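<p>The fixed page layout described above can be sketched in a few lines.
This is a minimal illustration, not the reference implementation: the
field order and flag bits follow the framing specification, while the
names <tt>PAGE_HEADER</tt>, <tt>parse_page_header</tt> and the returned
dictionary keys are assumptions made for this example. The eight fixed
fields pack to 27 bytes.

```python
import struct

# Eight fixed header fields, little-endian, no padding:
# capture pattern, version, flags, granule position,
# stream serial number, page sequence number, CRC, segment count.
PAGE_HEADER = struct.Struct("<4sBBqIIIB")

# Largest possible page: fixed header + 255 lacing bytes + 255*255 payload bytes.
MAX_PAGE_BYTES = PAGE_HEADER.size + 255 + 255 * 255

def parse_page_header(data):
    """Decode the header and segment table of the page starting at data[0]."""
    if len(data) < PAGE_HEADER.size:
        raise ValueError("truncated page header")
    (capture, version, flags, granulepos,
     serial, sequence, crc, nsegs) = PAGE_HEADER.unpack_from(data, 0)
    if capture != b"OggS" or version != 0:
        raise ValueError("not an Ogg page")
    lacing = list(data[PAGE_HEADER.size:PAGE_HEADER.size + nsegs])
    if len(lacing) != nsegs:
        raise ValueError("truncated segment table")
    return {
        "continued": bool(flags & 0x01),  # first packet continues from previous page
        "bos": bool(flags & 0x02),        # beginning of stream
        "eos": bool(flags & 0x04),        # end of stream
        "granulepos": granulepos,
        "serial": serial,
        "sequence": sequence,
        "crc": crc,
        "lacing": lacing,
        "body_len": sum(lacing),          # payload bytes that follow the table
    }
```

<p>Because every page has this one shape, a demuxer needs no
codec-specific knowledge to walk a stream page by page.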
<p>Raw packets are grouped and encoded into contiguous pages of
structured bitstream data called <em>logical bitstreams</em>. A
logical bitstream consists of pages, in order, belonging to a single
codec instance. Each page is a self contained entity (although it is
possible that a packet may be split and encoded across one or more
pages); that is, the page decode mechanism is designed to recognize,
verify and handle single pages at a time from the overall bitstream.</p>
<p>Stream and media metadata is contained in Ogg and not built into
the Ogg container itself. Metadata is thus compartmentalized and
layered rather than part of a monolithic design, an especially good
idea as no two groups seem able to agree on what a complete or
complete-enough metadata set should be. In this way, the container and
container implementation are isolated from unnecessary design flux.
<p>Multiple logical bitstreams can be combined (with restrictions) into a
single <em>physical bitstream</em>. A physical bitstream consists of
multiple logical bitstreams multiplexed at the page level and may
include a 'meta-header' at the beginning of the multiplexed logical
stream that serves as identification magic. Whole pages are taken in
order from multiple logical bitstreams and combined into a single
physical stream of pages. The decoder reconstructs the original
logical bitstreams from the physical bitstream by taking the pages in
order from the physical bitstream and redirecting them into the
appropriate logical decoding entity. The simplest physical bitstream
is a single, unmultiplexed logical bitstream with no meta-header; this
is referred to as a 'degenerate stream'.</p>
<h3>Streaming</h3>
<p><a href="framing.html">Ogg Logical Bitstream Framing</a> discusses
<p>The Ogg container is primarily a streaming format,
encapsulating chronological, time-linear mixed media into a single
delivery stream or file. The design is such that an application can
always encode and/or decode all features of a bitstream in one pass
with no seeking and minimal buffering. Seeking to provide optimized
encoding (such as two-pass encoding) or interactive decoding (such as
scrubbing or instant replay) is not disallowed or discouraged; however,
no container feature requires nonlinear access of the bitstream.
<h3>Variable Bit Rate, Variable Payload Size</h3>
<p>Ogg is designed to contain any size data payload with bounded,
predictable efficiency. Ogg packets have no maximum size and a
zero-byte minimum size. There is no restriction on size changes from
packet to packet. Variable size packets do not require the use of any
optional or additional container features. There is no optimal
suggested packet size, though special consideration was paid to make
sure 50-200 byte packets were no less efficient than larger packet
sizes. The original design criterion was a 2% overhead at 50 byte
packets, dropping to a maximum working overhead of 1% with larger
packets, and a typical working overhead of 0.5-0.7% for most practical
uses.
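<p>The overhead figures can be approximated with simple arithmetic. The
sketch below assumes a 27-byte fixed header, one length byte per 255
bytes of packet data (plus a terminating byte), and pages that are
always filled to 255 segments. Real encoders usually flush pages
earlier, so measured overhead will differ somewhat. The function names
are hypothetical.

```python
import math

HEADER_BYTES = 27    # fixed page header
MAX_SEGMENTS = 255   # length bytes (lacing values) per page

def segments_for(packet_len):
    """Lacing values needed to record the length of one packet."""
    return packet_len // 255 + 1

def overhead_ratio(packet_len, packet_count=1000):
    """Framing bytes divided by payload bytes for equal-size packets.

    packet_len must be positive; pages are assumed to be packed full.
    """
    segments = packet_count * segments_for(packet_len)
    pages = math.ceil(segments / MAX_SEGMENTS)
    framing = pages * HEADER_BYTES + segments
    return framing / (packet_count * packet_len)

for size in (50, 200, 4096):
    print(size, f"{overhead_ratio(size):.2%}")
```

<p>Under these assumptions the result is roughly 2.2% for 50 byte
packets and about 0.5% for 200 byte and larger packets, in line with
the design criterion quoted above.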
<h3>Simple pagination</h3>
<p>Ogg is a byte-aligned container with no context-dependent, optional
or variable-length fields. Ogg requires no repacking of codec data.
The page structure is written out in-line as packet data is submitted
to the streaming abstraction. In addition, it is possible to
implement both Ogg mux and demux as MT-hot zero-copy abstractions (as
is done in the Tremor sourcebase).
<h3>Capture</h3>
<p>Ogg is designed for efficient and immediate stream capture with
high confidence. Although packets have no size limit in Ogg, pages
are a maximum of just under 64kB, meaning that any Ogg stream can be
captured with confidence after seeing 128kB of data or less [worst
case; typical figure is 6kB] from any random starting point in the
stream.
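<p>Capture can be sketched as a scan for the page capture pattern
followed by a checksum test. The sketch below assumes the page layout
and CRC parameters of the framing specification (polynomial 0x04c11db7,
zero initial value, checksum computed with the CRC field zeroed); the
function names <tt>ogg_crc</tt> and <tt>capture</tt> are made up for
this example.

```python
def _build_crc_table():
    table = []
    for i in range(256):
        r = i << 24
        for _ in range(8):
            r = ((r << 1) ^ 0x04C11DB7) if r & 0x80000000 else (r << 1)
            r &= 0xFFFFFFFF
        table.append(r)
    return table

CRC_TABLE = _build_crc_table()

def ogg_crc(data):
    """Unreflected CRC-32 with zero initial value and no final XOR."""
    crc = 0
    for byte in data:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ CRC_TABLE[((crc >> 24) ^ byte) & 0xFF]
    return crc

def capture(buf, start=0):
    """Return (offset, length) of the first verifiable page at or after
    start, or None if the buffer holds no complete, intact page."""
    pos = buf.find(b"OggS", start)
    while pos != -1:
        if pos + 27 <= len(buf):
            header_len = 27 + buf[pos + 26]            # fixed header + segment table
            if pos + header_len <= len(buf):
                end = pos + header_len + sum(buf[pos + 27:pos + header_len])
                if end <= len(buf):
                    page = bytearray(buf[pos:end])
                    stored = int.from_bytes(page[22:26], "little")
                    page[22:26] = b"\x00\x00\x00\x00"  # CRC is computed over a zeroed field
                    if ogg_crc(page) == stored:
                        return pos, end - pos
        # False sync or damaged page: keep scanning from the next byte.
        pos = buf.find(b"OggS", pos + 1)
    return None
```

<p>A chance occurrence of the capture pattern inside payload data fails
the checksum test and is skipped, which is why capture can be trusted
from any starting offset.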
<h3>Seeking</h3>
<p>Ogg implements simple coarse- and fine-grained seeking by design.
<p>Coarse seeking may be performed by simply 'moving the tone arm' to a
new position and 'dropping the needle'. Rapid capture with
accompanying timecode from any location in an Ogg file is guaranteed
by the stream design. From the acquisition of the first timecode,
all data needed to play back from that time code forward is ahead of
the stream cursor.
<p>Ogg implements full sample-granularity seeking using an
interpolated bisection search built on the capture and timecode
mechanisms used by coarse seeking. As above, once a search finds
the desired timecode, all data needed to play back from that time code
forward is ahead of the stream cursor.
<p>Both coarse and fine seeking use the page structure and sequencing
inherent to the Ogg format. All Ogg streams are fully seekable from
creation; seekability is unaffected by truncation or missing data, and
is tolerant of gross corruption. Seek operations are neither 'fuzzy' nor
heuristic.
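<p>The interpolated bisection search can be illustrated on a toy model.
In the sketch below a stream is a list of
<tt>(byte_offset, end_time)</tt> pairs, and <tt>make_capture</tt>
stands in for the real capture step by returning the first page at or
after a byte position. All names are hypothetical, and a production
seeker would add refinements such as falling back to plain bisection
when interpolation stops converging quickly.

```python
import bisect

def make_capture(pages):
    """pages: list of (byte_offset, end_time), sorted by offset."""
    offsets = [off for off, _ in pages]
    def capture(pos):
        i = bisect.bisect_left(offsets, pos)
        return pages[i] if i < len(pages) else None
    return capture

def seek(capture, stream_len, total_time, target):
    """Find the first page whose end time is >= target.

    Returns (page, probes); page is None if target lies past the end.
    """
    if target > total_time:
        return None, 0
    lo, hi = 0, stream_len            # the wanted page starts in [lo, hi) or is `best`
    lo_time, hi_time = 0.0, total_time
    best, probes = None, 0
    while lo < hi:
        span = hi_time - lo_time
        frac = (target - lo_time) / span if span > 0 else 0.5
        frac = min(max(frac, 0.0), 1.0)
        guess = min(lo + int((hi - lo) * frac), hi - 1)
        page = capture(guess)
        probes += 1
        if page is None or page[0] >= hi:
            hi = guess                # no page starts inside [guess, hi)
        elif page[1] < target:
            lo, lo_time = page[0] + 1, page[1]
        else:
            best, hi, hi_time = page, guess, page[1]
    return best, probes

# Toy stream: 1000 pages of 4000 bytes, each covering 0.1 seconds.
pages = [(i * 4000, (i + 1) * 0.1) for i in range(1000)]
page, probes = seek(make_capture(pages), 4_000_000, 100.0, 42.05)
print(page, probes)
```

<p>On evenly paced data like this toy stream the first guess already
lands within a page or two of the target; the remaining probes only
confirm that no earlier page qualifies.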
<p>Seeking without use of an index is a major point of the Ogg
design. There are several reasons why Ogg forgoes an index:
<ul>
<li>It must be possible to create an Ogg stream in a single pass, and
an index requires either two passes to create, or the index must be
tacked onto the end of a live stream after the stream is finished.
Both methods run afoul of other design constraints.
<li>An index is only marginally useful in Ogg for the complexity
added; it adds no new functionality and seldom improves performance
noticeably. Empirical testing shows that indexless interpolation
search does not require many more seeks in practice than using an
index would.
<li>'Optional' indexes encourage lazy implementations that can seek
only when indexes are present, or that implement indexless seeking
only by building an internal index after reading the entire file
beginning to end. This has been the fate of other containers that
specify optional indexing.
</ul>
<h3>Simple multiplexing</h3>
<p>Ogg multiplexes streams by interleaving pages from multiple elementary streams into a
multiplexed stream in time order. The multiplexed pages are not
altered. Muxing an Ogg AV stream out of separate audio,
video and data streams is akin to shuffling several decks of cards
together into a single deck; the cards themselves remain unchanged.
Demultiplexing is similarly simple.
<p>The goal of this design is to make the mux/demux operation as
trivial as possible to allow live streaming systems to build and
rebuild streams on the fly with minimal CPU usage and no additional
storage or latency requirements.
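<p>The card-shuffling analogy translates almost directly into code. The
sketch below models a page as a small tuple (the <tt>Page</tt> type and
its fields are assumptions for illustration, with <tt>time</tt> standing
in for the timestamp derived from the granule position) and merges
already-ordered elementary streams without touching the pages.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for a framed page; payload bytes are never altered.
Page = namedtuple("Page", "serial time payload")

def mux(*streams):
    """Interleave whole pages from several elementary streams in time order."""
    return list(heapq.merge(*streams, key=lambda page: page.time))

def demux(pages):
    """Route each page to its logical stream by serial number."""
    streams = {}
    for page in pages:
        streams.setdefault(page.serial, []).append(page)
    return streams

audio = [Page(1, t, b"a") for t in (0.1, 0.2, 0.3, 0.4)]
video = [Page(2, t, b"v") for t in (0.04, 0.25, 0.41)]
muxed = mux(audio, video)
```

<p>Demultiplexing the result yields the original streams unchanged,
page for page, which is the property the design relies on.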
<h3>Continuous and Discontinuous Media</h3>
<p>Ogg streams belong to one of two categories, "Continuous" streams and
"Discontinuous" streams.
<p>A stream that provides a gapless, time-continuous media type with a
fine-grained timebase is considered to be 'Continuous'. A continuous
stream should never be starved of data. Examples of continuous data
types include broadcast audio and video.
<p>A stream that delivers data in a potentially irregular pattern or
with widely spaced timing gaps is considered to be 'Discontinuous'. A
discontinuous stream may be best thought of as data representing
scattered events; although they happen in order, they are typically
unconnected data often located far apart. One example of a
discontinuous stream type would be captioning such as <a
href="http://wiki.xiph.org/OggKate">Ogg Kate</a>. Although it's
possible to design captions as a continuous stream type, it's most
natural to think of captions as widely spaced pieces of text with
little happening between.
<p>The fundamental reason for distinction between continuous and
discontinuous streams concerns buffering.
<h3>Buffering</h3>
<p>A continuous stream is, by definition, gapless. Ogg buffering is based
on the simple premise of never allowing an active continuous stream
to starve for data during decode; buffering works ahead until all
continuous streams in a physical stream have data ready and no further.
<p>Discontinuous stream data is not assumed to be predictable. The
buffering design takes discontinuous data 'as it comes' rather than
working ahead to look for future discontinuous data for a potentially
unbounded period. Thus, the buffering process makes no attempt to fill
discontinuous stream buffers; their pages simply 'fall out' of the
stream when continuous streams are handled properly.
<p>Buffering requirements in this design need not be explicitly
declared or managed in the encoded stream. The decoder simply reads as
much data as is necessary to keep all continuous stream types gapless
and no more, with discontinuous data processed as it arrives in the
continuous data. Buffering is implicitly optimal for the given
stream. Because all pages of all data types are stamped with absolute
timing information within the stream, inter-stream synchronization
timing is always maintained without the need for explicitly declared
buffer-ahead hinting.
<h3>Codec metadata</h3>
<p>Ogg does not replicate codec-specific metadata into the mux layer
in an attempt to make the mux and codec layer implementations 'fully
separable'. Things like specific timebase, keyframing strategy, frame
duration, etc, do not appear in the Ogg container. The mux layer is,
instead, expected to query a codec through a standardized interface,
left to the implementation, for this data when it is needed.
<p>Though modern design wisdom usually prefers to predict all possible
needs of current and future codecs then embed these dependencies and
the required metadata into the container itself, this strategy
increases container specification complexity, fragility, and rigidity.
The mux and codec implementations become more independent, but the
specifications become less independent. A codec can't do what a
container hasn't already provided for. New codecs are harder to
support, and you can do fewer useful things with the ones you've
already got (eg, try to make a good splitter without using any codecs.
You're stuck splitting at keyframes only, or building yet another new
mechanism into the container layer to mark what frames to skip
displaying).
<p>Ogg's design goes the opposite direction, where the specification
is to be as simple, easy to understand, and 'proofed' against novel
codecs as possible. When an Ogg mux layer requires codec-specific
information, it queries the codec (or a codec stub). This trades a
more complex implementation for a simpler, more flexible
specification.
<h3>Stream structure metadata</h3>
<p>The Ogg container itself does not define a metadata system for
declaring the structure and interrelations between multiple media
types in a muxed stream. That is, the Ogg container itself does not
specify data like 'which stream is the subtitle stream?' or 'which
video stream is the primary angle?'. This metadata still exists, but
is stored in the Ogg container rather than being built into the Ogg
container. Xiph specifies the 'Skeleton' metadata format for Ogg
streams, but this decoupling of container and stream structure
metadata means it is possible to use Ogg with any metadata
specification without altering the container itself, or without stream
structure metadata at all.
<h3>Frame accurate absolute position</h3>
<p>Every Ogg page is stamped with a 64 bit 'granule position' that
serves as an absolute timestamp for mux and seeking. A few nifty
little tricks are usually also embedded in the granpos state, but
we'll leave those aside for the moment (strictly speaking, they're
part of each codec's mapping, not Ogg).
<p>As mentioned above, granule positions are mapped into
absolute timestamps by the codec, rather than being a hard timestamp.
This allows maximally efficient use of the available 64 bits to
address every sample/frame position without approximation while
supporting new and previously unknown timebase encodings without
needing to extend or update the mux layer. When a codec needs a novel
timebase, it simply brings the code for that mapping along with it.
This is not a theoretical curiosity; new, wholly novel timebases were
deployed with the adoption of both Theora and Dirac. "Rolling INTRA"
(keyframeless video) also benefits from novel use of the granule
position.
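<p>Two simplified mappings illustrate how a codec, rather than the
container, gives meaning to the granule position. They are loosely
modeled on the way audio codecs such as Vorbis count samples and the
way Theora splits the field into a keyframe number and an offset; the
function names are hypothetical and version-specific details of the
real mappings are ignored.

```python
def pcm_time(granulepos, sample_rate):
    """Sample-count mapping: the granule position is the number of PCM
    samples decodable up to the end of the page."""
    return granulepos / sample_rate

def split_keyframe_granule(granulepos, shift):
    """Keyframe-style mapping: the upper bits hold the frame number of the
    last keyframe, the low `shift` bits count frames since that keyframe."""
    keyframe = granulepos >> shift
    offset = granulepos & ((1 << shift) - 1)
    return keyframe, offset

def video_time(granulepos, shift, fps_num, fps_den):
    """Seconds for a keyframe-style granule position at fps_num/fps_den."""
    keyframe, offset = split_keyframe_granule(granulepos, shift)
    return (keyframe + offset) * fps_den / fps_num
```

<p>In the second mapping a seeker learns from any page both the
presentation time and which keyframe decoding must restart from,
without the container defining either concept.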
<h2>Ogg stream arrangement</h2>
<h3>Packets, pages, and bitstreams</h3>
<p>Ogg codecs use <em>packets</em>. Packets are octet payloads of
raw, compressed data, containing the data needed for a single
decompressed unit, eg, one video frame. Packets have no maximum size
and may be zero length. They do not have any high-level structure or
boundary information; strung together, the unframed packets form a
<em>logical bitstream</em> of apparently random bytes with no internal
landmarks.
<p>Logical bitstream packets are grouped and framed into Ogg pages
along with a unique stream <em>serial number</em> to produce a
<em>physical bitstream</em>. An <em>elementary stream</em> is a
physical bitstream containing only the pages framing a single logical
bitstream. Each page is a self-contained entity, although a packet may
be split and encoded across one or more pages. The page decode
mechanism is designed to recognize, verify and handle single pages at
a time from the overall bitstream.
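<p>The way packet boundaries survive framing can be sketched with the
length-encoding scheme of the framing specification: each packet length
is written as a run of bytes equal to 255 followed by one byte below
255, and a page holds at most 255 such bytes. The helper names below
are assumptions made for this example.

```python
def lacing_values(packet_len):
    """Encode one packet length; a final value below 255 ends the packet."""
    return [255] * (packet_len // 255) + [packet_len % 255]

def paginate(packet_lengths, max_segments=255):
    """Split the lacing values of consecutive packets into per-page tables."""
    values = [v for n in packet_lengths for v in lacing_values(n)]
    return [values[i:i + max_segments]
            for i in range(0, len(values), max_segments)]

def recover_lengths(tables):
    """Rebuild the packet lengths from the per-page segment tables."""
    lengths, current = [], 0
    for table in tables:
        for value in table:
            current += value
            if value < 255:      # packet ends here
                lengths.append(current)
                current = 0
    return lengths

original = [0, 50, 255, 70000]
tables = paginate(original)
```

<p>The 70000-byte packet needs more length bytes than one page can
hold, so it continues onto a second page, yet
<tt>recover_lengths(tables)</tt> returns the original lengths without
any help from the codec.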
<p><a href="framing.html">Ogg Bitstream Framing</a> specifies
the page format of an Ogg bitstream, the packet coding process
and logical bitstreams in detail. The remainder of this document
specifies requirements for constructing finished, physical Ogg
bitstreams.</p>
and elementary bitstreams in detail.
<h2>Mapping Restrictions</h2>
<h3>Multiplexed bitstreams</h3>
<p>Logical bitstreams may not be mapped/multiplexed into physical
bitstreams without restriction. Here we discuss design restrictions
on Ogg physical bitstreams in general, mostly to introduce
design rationale. Each 'media' format defines its own (generally more
restrictive) mapping. An 'Ogg Vorbis Audio Bitstream', for example, has a
specific physical bitstream structure.
Any other codec or combination of codecs will generally also mandate a
corresponding restricted physical bitstream format.</p>
<p>Multiple logical/elementary bitstreams can be combined into a single
<em>multiplexed bitstream</em> by interleaving whole pages from each
contributing elementary stream in time order. The result is a single
physical stream that multiplexes and frames multiple logical streams.
Each logical stream is identified by the unique stream serial number
stamped in its pages. A physical stream may include a 'meta-header'
(such as the <a href="skeleton.html">Ogg Skeleton</a>) comprising its
own Ogg page at the beginning of the physical stream. A decoder
recovers the original logical/elementary bitstreams out of the
physical bitstream by taking the pages in order from the physical
bitstream and redirecting them into the appropriate logical decoding
entity.
<h3>additional end-to-end structure</h3>
<p><a href="ogg-multiplex.html">Ogg Bitstream Multiplexing</a> specifies
proper multiplexing of an Ogg bitstream in detail.
<h3>Chaining</h3>
<p>Multiple Ogg physical bitstreams may be concatenated into a single new
stream; this is <em>chaining</em>. The bitstreams do not overlap; the
final page of a given logical bitstream is immediately followed by the
initial page of the next.</p>
<p>Each logical bitstream in a chain must have a unique serial number
within the scope of the full physical bitstream, not only within a
particular <em>link</em> or <em>segment</em> of the chain.</p>
<h3>Continuous and discontinuous streams</h3>
<p>Within Ogg, each stream must be declared (by the codec) to be
continuous- or discontinuous-time. Most codecs treat all streams they
use as either inherently continuous- or discontinuous-time, although
this is not a requirement. A codec may, as part of its mapping, choose
according to data in the initial header.
<p>Continuous-time pages are stamped by end-time, discontinuous pages
are stamped by begin-time. Pages in a multiplexed stream are
interleaved in order of the time stamp regardless of stream type.
Both continuous and discontinuous logical streams are used to seek
within a physical stream; however, only continuous streams are used to
determine buffering depth; because discontinuous streams are stamped
by start time, they will always 'fall out' in time when buffering
tracks only the continuous streams. See 'Examples' for an
illustration of the buffering mechanism.
<h2>Mapping Requirements</h2>
<p>Each codec is allowed some freedom in deciding how its logical
bitstream is encapsulated into an Ogg bitstream (even if it is a
trivial mapping, eg, 'plop the packets in and go'). This is the
codec's <em>mapping</em>. Ogg imposes a few mapping requirements
on any codec.
<p>The <a href="framing.html">framing specification</a> defines
'beginning of stream' and 'end of stream' page markers via a header
flag (it is possible for a stream to consist of a single page). A
stream always consists of an integer number of pages, an easy
correct stream always consists of an integer number of pages, an easy
requirement given the variable size nature of pages.</p>
<p>In addition to the header flag marking the first and last pages of a
logical bitstream, the first page of an Ogg bitstream obeys
additional restrictions. Each individual media mapping specifies its
own implementation details regarding these restrictions.</p>
<p>The first page of an elementary Ogg bitstream consists of a single,
small 'initial header' packet that must include sufficient information
to identify the exact CODEC type. From this initial header, the codec
must also be able to determine its timebase and whether or not it is a
continuous- or discontinuous-time stream. The initial header must fit
on a single page. If a codec makes use of auxiliary headers (for
example, Vorbis uses two auxiliary headers), these headers must follow
the initial header immediately. The last header finishes its page;
data begins on a fresh page.
<p>The first page of a logical Ogg bitstream consists of a single,
small 'initial header' packet that includes sufficient information to
identify the exact CODEC type and media requirements of the logical
bitstream. The intent of this restriction is to simplify identifying
the bitstream type and content; for a given media type (or across all
Ogg media types) we can know that we only need a small, fixed
amount of data to uniquely identify the bitstream type.</p>
<p>As an example, Ogg Vorbis places the name and revision of the
Vorbis CODEC, the audio rate and the audio quality into this initial
header. Comments and detailed codec setup appear in the larger
auxiliary headers.</p>
<p>As an example, Ogg Vorbis places the name and revision of the Vorbis
CODEC, the audio rate and the audio quality into this initial header,
thus simplifying vastly the certain identification of an Ogg Vorbis
audio bitstream.</p>
<h2>Multiplexing Requirements</h2>
<h3>sequential multiplexing (chaining)</h3>
<p>Multiplexing requirements within Ogg are straightforward. When
constructing a single-link (unchained) physical bitstream consisting
of multiple elementary streams:
<p>The simplest form of logical bitstream multiplexing is concatenation
(<em>chaining</em>). Complete logical bitstreams are strung
one-after-another in order. The bitstreams do not overlap; the final
page of a given logical bitstream is immediately followed by the
initial page of the next. Chaining is the only logical->physical
mapping allowed by Ogg Vorbis.</p>
<ol>
<p>Each chained logical bitstream must have a unique serial number within
the scope of the physical bitstream.</p>
<li> The initial header for each stream appears in sequence, each
header on a single page. All initial headers must appear with no
intervening data (no auxiliary header pages or packets, no data pages
or packets). Order of the initial headers is unspecified. The
'beginning of stream' flag is set on each initial header.
<h3>concurrent multiplexing (grouping)</h3>
<li> All auxiliary headers for all streams must follow. Order
is unspecified. The final auxiliary header of each stream must flush
its page.
<p>Logical bitstreams may also be multiplexed 'in parallel'
(<em>grouped</em>). An example of grouping would be to allow
streaming of separate audio and video streams, using different codecs
and different logical bitstreams, in the same physical bitstream.
Whole pages from multiple logical bitstreams are mixed together.</p>
<li>Data pages for each stream follow, interleaved in time order.
<p>The initial pages of each logical bitstream must appear first; the
media mapping specifies the order of the initial pages. For example,
Ogg Theora describes video bitstream with audio.
The mapping specifies that the physical bitstream must begin
with the initial page of a logical video bitstream, followed by the
initial page of an audio stream. Unlike initial pages, terminal pages
for the logical bitstreams need not all occur contiguously (although a
specific media mapping may require this; it is not mandated by the
generic Ogg stream spec). Terminal pages may be 'nil' pages,
that is, pages containing no content but simply a page header with
position information and the 'last page of bitstream' flag set in the
page header.</p>
<li>The final page of each stream sets the 'end of stream' flag.
Unlike initial pages, terminal pages for the logical bitstreams need
not occur contiguously; indeed it may not be possible for them to do so.
</ol>
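<p>The ordering rules for a single-link multiplexed stream can be
expressed as a small checker. The sketch below reduces each page to a
<tt>(serial, kind)</tt> pair, where <tt>kind</tt> is one of
<tt>"bos"</tt>, <tt>"header"</tt>, <tt>"data"</tt> or <tt>"eos"</tt>.
This labelling is an assumption of the example: a real demuxer reads
the flags from the page header and asks the codec which pages carry
auxiliary headers, and a one-page stream sets both flags on the same
page.

```python
# Phases must not go backwards: initial headers, auxiliary headers, then data.
PHASE = {"bos": 0, "header": 1, "data": 2, "eos": 2}

def check_link(pages):
    """Return True if the page sequence obeys the ordering rules above."""
    phase = 0
    started, ended = set(), set()
    for serial, kind in pages:
        if PHASE[kind] < phase:
            return False                  # e.g. an initial header after data
        phase = PHASE[kind]
        if kind == "bos":
            if serial in started:
                return False              # serial numbers must be unique
            started.add(serial)
        else:
            if serial not in started or serial in ended:
                return False              # unknown or already finished stream
            if kind == "eos":
                ended.add(serial)
    return bool(started) and started == ended

good = [(1, "bos"), (2, "bos"), (1, "header"), (2, "header"),
        (1, "data"), (2, "data"), (2, "eos"), (1, "data"), (1, "eos")]
late_start = [(1, "bos"), (1, "data"), (2, "bos"), (1, "eos"), (2, "eos")]
```

<p>Here <tt>check_link(good)</tt> holds, while <tt>late_start</tt> is
rejected because a second stream begins after data pages have already
appeared.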
<p>Each grouped bitstream must have a unique serial number within the
scope of the physical bitstream.</p>
<h3>sequential and concurrent multiplexing</h3>
<h3>chaining and multiplexing</h3>
<p>Groups of concurrently multiplexed bitstreams may be chained
<p>Multiplexed and/or unmultiplexed bitstreams may be chained
consecutively. Such a physical bitstream obeys all the rules of both
grouped and chained multiplexed streams; the groups, when unchained,
must stand on their own as a valid concurrently multiplexed
bitstream.</p>
chained and multiplexed streams. Each link, when unchained, must
stand on its own as a valid physical bitstream. Chained streams do
not mix; a new segment may not begin until all streams in the
preceding segment have terminated. </p>
<h3>multiplexing example</h3>
<h2>Examples</h2>
<em>[More to come shortly; this section is currently being revised and expanded]</em>
<p>Below, we present an example of a grouped and chained bitstream:</p>
@ -227,7 +489,7 @@ where decode requires more information).</li>
The Xiph Fish Logo is a
trademark (&trade;) of Xiph.Org.<br/>
These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.
These pages &copy; 1994 - 2010 Xiph.Org. All rights reserved.
</div>
</body>