
<HTML><HEAD><TITLE>xiph.org: Ogg documentation</TITLE>
<BODY bgcolor="#ffffff" text="#202020" link="#006666" vlink="#000000">
<nobr><a href="http://www.xiph.org/ogg/index.html"><img src="white-ogg.png" border=0><img src="vorbisword2.png" border=0></a></nobr><p>

<h1><font color=#000070>
Page Multiplexing and Ordering in a Physical Ogg Stream
</font></h1>

<em>Last update to this document: May 7, 2004</em><br> 
<p>

The low-level mechanisms of an Ogg stream (as described in the Ogg
Bitstream Overview) provide means for mixing multiple logical streams
and media types into a single linear-chronological stream.  This
document specifies the high-level arrangement and use of page
structure to multiplex multiple streams of mixed media type within a
physical Ogg stream.

<h2>Design Elements</h2>

The design and arrangement of the Ogg container format is governed by
several high-level design decisions that form the reasoning behind
specific low-level design decisions.

<h3>Linear media</h3> 

The Ogg bitstream is intended to encapsulate chronological,
time-linear mixed media into a single delivery stream or file.  The
design is such that an application can always encode and/or decode a
full-featured bitstream in one pass with no seeking and minimal
buffering.  Seeking to provide optimized encoding (such as two-pass
encoding) or interactive decoding (such as scrubbing or instant
replay) is not disallowed or discouraged; however, no bitstream feature
may require nonlinear operation on the bitstream.<p>

<h3>Seeking</h3> 

Ogg is designed to use a bisection search to implement exact
positional seeking rather than building an index; an index requires
two-pass encoding and as such is not acceptable according to the original
design requirements.  <p>

<i>Even making an index optional then requires an
application to support multiple methods (bisection search for a
one-pass stream, indexing for a two-pass stream), which adds no
additional functionality as bisection search delivers the same
functionality for both stream types.</i><p>
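As a rough illustration of the bisection approach, the sketch below
models a stream as an in-memory list of (byte offset, time) pairs in
ascending time order.  The function name and the data layout are
hypothetical; a real implementation would bisect over byte offsets in
the physical stream, resynchronize to a page boundary after each probe,
and ask the codec to map each granule position to absolute time.<p>

```python
def seek_page(pages, target_time):
    """Return the index of the last page whose time is <= target_time.

    `pages` is a hypothetical model of a stream: a non-empty list of
    (byte_offset, time_seconds) tuples sorted by time.
    """
    if not pages:
        raise ValueError("cannot seek in an empty stream")
    lo, hi = 0, len(pages) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2          # round up so the loop always shrinks
        if pages[mid][1] <= target_time:
            lo = mid                      # target is at or after this page
        else:
            hi = mid - 1                  # target is before this page
    return lo                             # targets before page 0 clamp to 0


pages = [(0, 0.0), (4096, 0.5), (8192, 1.0), (12288, 1.5)]
index = seek_page(pages, 1.2)
print(pages[index])
```

No index is consulted; the search needs only the time stamps already
carried by the pages themselves.<p>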

<h3>Multiplexing</h3>

Ogg bitstreams multiplex multiple logical streams into a single
physical stream at the page level.  Each page contains an abstract
time stamp (the Granule Position) that represents an absolute time
landmark within the stream.  After the pages representing stream
headers (all logical stream headers occur at the beginning of a
physical bitstream section before any logical stream data), logical
stream data pages are arranged in strict, monotonically increasing
order of chronological absolute time as specified by the granule
position.  <p>
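The ordering rule can be sketched as a merge of per-stream page lists
by absolute time.  This is only an illustration under simplifying
assumptions: header pages and pages without a granule position are left
out, each page is a hypothetical (time, stream name, payload) tuple,
and the granule position is taken to be already converted to seconds.<p>

```python
import heapq


def multiplex(*streams):
    """Merge per-stream page lists into one list ordered by absolute time.

    Each stream is a list of (time_seconds, stream_name, payload) tuples
    that is already sorted by time.
    """
    return list(heapq.merge(*streams, key=lambda page: page[0]))


audio = [(0.0, "audio", "a0"), (0.5, "audio", "a1"), (1.0, "audio", "a2")]
video = [(0.0, "video", "v0"), (0.4, "video", "v1"), (0.8, "video", "v2")]

muxed = multiplex(audio, video)
for page in muxed:
    print(page)
```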

The only exception to arranging pages in strictly ascending time order
by granule position is those pages that do not set the granule
position value.  This is a special case when exceptionally large
packets span multiple pages; the specifics of handling this special
case are described later under 'Continuous and Discontinuous
Streams'.<p>

<h3>Buffering</h3>

Ogg's multiplexing design minimizes extraneous buffering required to
maintain audio/video sync by arranging audio, video and other data in
chronological order.  Thus, a normally streamed file delivers all
data for decode 'just in time'; pages arrive in the order they must
be consumed.<p>

Buffering requirements need not be explicitly declared or managed for
the encoded stream; the decoder simply reads as much data as is
necessary to keep all continuous stream types gapless (also ensuring
discontinuous data arrives in time) and no more, resulting in optimum
buffer usage for free.  Because all pages of all data types are
stamped with absolute timing information within the stream,
inter-stream synchronization timing is always explicitly maintained
without the need for explicitly declared buffer-ahead hinting.<p>

<h3>Whole-stream navigation</h3>

Ogg is designed so that the simplest navigation operations are those
that treat the physical Ogg stream as a whole summary of its
streams, rather than navigating each interleaved stream as a separate
entity.  <p>

Example: the simplest method of seeking to a desired
position in a multiplexed (or unmultiplexed) Ogg stream is to
bisection search by time position (as encoded in the granule
position).  <p>

Example: A bitstream section may consist of three multiplexed streams
of differing lengths.  The result of multiplexing these streams should
be thought of as a single mixed stream with a length that happens to
equal the longest of the three component streams.  Although it is also
possible to think of the multiplexed results as three concurrent
streams of different lengths and it is possible to recover the three
original streams, it will also become obvious that once multiplexed,
it isn't possible to find the internal lengths of the component
streams without a linear search of the whole bitstream section.
However, it is possible to find the length of the whole bitstream
section easily (in near-constant time per section) just as it is for a
single-media unmultiplexed stream.<p>

<h2>Granule Position</h2>

<h3>Description</h3>

The Granule Position is a signed 64 bit field appearing in the header
of every Ogg page.  Although the granule position represents absolute
time within a logical stream, its value does not necessarily directly
encode a simple timestamp.  It may represent frames elapsed (as in
Vorbis), a simple timestamp, or a more complex bit-division encoding
(such as in Theora).  The exact meaning of the granule position is up
to a specific codec.<p>

The granule position is governed by the following rules:
<ul>

<li>Granule Position must always increase forward or remain equal from
page to page, be unset, or be zero for a header page. The absolute
time to which any correct sequence of granule position maps must
similarly always increase forward or remain equal. <i>(A codec may
make use of data, such as a control sequence, that only affects codec
working state without producing data and thus without advancing granule
position and time.  Although the packet sequence number increases in
this case, the granule position, and thus the time position, does
not.)</i><br>

<li>Granule position may only be unset if there is no packet defining a
time boundary on the page (that is, if no packet in a continuous
stream ends on the page, or no packet in a discontinuous stream begins
on the page).  This will be discussed in more detail under Continuous
and Discontinuous Streams.<br>

<li>A codec must be able to translate a given granule position value
to a unique, deterministic absolute time value through direct
calculation.  A codec is not required to be able to translate an
absolute time value into a unique granule position value.<br>

<li>Codecs shall choose a granule position definition that allows the
codec to seek as directly as possible to an immediately
decodable point; for example, the bit-divided granule position encoding of
Theora allows the codec to seek efficiently to keyframes without using
an index.  That is, additional information other than absolute time
may be encoded into a granule position value so long as the granule
position obeys the above points.
</ul>
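The first two rules can be checked mechanically.  The sketch below is
a minimal illustration, assuming that an unset granule position is
represented by the sentinel value -1; the sentinel and the function
name are assumptions made for this example and are not defined by the
text above.<p>

```python
UNSET = -1  # assumption: sentinel meaning "no granule position on this page"


def check_granule_order(granules):
    """Return True if the set granule positions never decrease page to page."""
    last = None
    for granule in granules:
        if granule == UNSET:
            continue              # page defines no time boundary; skip it
        if last is not None and granule < last:
            return False          # time moved backward: ordering rule violated
        last = granule
    return True


print(check_granule_order([0, 0, 100, UNSET, 100, 250]))
print(check_granule_order([0, 100, 50]))
```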

<h4>Example: timestamp</h4>

In general, a codec/stream type should choose the simplest granule
position encoding that addresses its requirements.  The examples here
are by no means exhaustive of the possibilities within Ogg.<p>

A simple granule position could encode a timestamp directly. For
example, a granule position that encoded milliseconds from beginning
of stream would allow a logical stream length of over 100,000,000,000
days before beginning a new logical stream (to avoid the granule
position wrapping).<p>
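The figure quoted above can be reproduced with a few lines of
arithmetic, assuming the full positive range of a signed 64-bit field
is available for the millisecond count.<p>

```python
MAX_GRANULE = 2**63 - 1            # largest value of a signed 64-bit field
MS_PER_DAY = 24 * 60 * 60 * 1000   # milliseconds in one day

days = MAX_GRANULE // MS_PER_DAY
print(days)                        # comfortably above 100,000,000,000
```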

<h4>Example: framestamp</h4>

A simple millisecond timestamp granule encoding might suit many stream
types, but a millisecond resolution is inappropriate for, e.g., most
audio encodings where exact single-sample resolution is generally a
requirement.  A millisecond is both too large a granule and often does
not represent an integer number of samples.<p>

In the event that audio frames always encode the same number of
samples, the granule position could simply be a linear count of frames
since beginning of stream. This has the advantages of being exact and
efficient.  Position in time would simply be <tt>[granule_position] *
[samples_per_frame] / [samples_per_second]</tt>.

<h4>Example: samplestamp (Vorbis)</h4>

Frame counting is insufficient in codecs such as Vorbis where an audio
frame [packet] encodes a variable number of samples.  In Vorbis's
case, the granule position is a count of the number of raw samples
from the beginning of stream; the absolute time of
a granule position is <tt>[granule_position] /
[samples_per_second]</tt>.
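<p>
The framestamp and samplestamp mappings can be written out directly.
The sketch below uses exact rational arithmetic so that no rounding is
introduced; the function names and the sample values (1024 samples per
frame, 44100 samples per second) are illustrative assumptions.<p>

```python
from fractions import Fraction


def framestamp_to_seconds(granule_position, samples_per_frame, samples_per_second):
    """Fixed-size frames: the granule position counts frames."""
    return Fraction(granule_position * samples_per_frame, samples_per_second)


def samplestamp_to_seconds(granule_position, samples_per_second):
    """Variable-size frames: the granule position counts raw samples."""
    return Fraction(granule_position, samples_per_second)


# 100 frames of 1024 samples, and the same instant as a raw sample count.
t_frames = framestamp_to_seconds(100, 1024, 44100)
t_samples = samplestamp_to_seconds(102400, 44100)
print(float(t_frames), float(t_samples))
```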
 
<h4>Example: bit-divided framestamp (Theora)</h4>

Some video codecs may be able to use the simple framestamp scheme for
granule position.  However, most modern video codecs introduce at
least the following complications:<p>
<ul>

<li>video frames are relatively far apart compared to audio samples;
for this reason, the point at which a video frame changes to the next
frame is usually a strictly defined offset within the frame 'period'.
That is, video at 50fps could just as easily define frame transitions
&lt;.015, .035, .055...&gt; as at &lt;.00, .02, .04...&gt;.

<li>frame rates often include drop-frames, leap-frames or other
rational-but-non-integer timings.

<li>Decode must begin at a 'keyframe' or 'I frame'.  Keyframes usually
occur relatively seldom.
</ul>

The first two points can be handled straightforwardly via the fact
that the codec has complete control over mapping granule position to
absolute time; non-integer frame rates and offsets can be set in the
codec's initial header, and the rest is just arithmetic.<p>
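A sketch of that arithmetic follows, assuming a frame rate stored as a
numerator/denominator pair and an optional transition offset in
seconds.  The parameter names are hypothetical and do not correspond to
the header fields of any particular codec.<p>

```python
from fractions import Fraction


def frame_time(frame_index, fps_num, fps_den, offset_seconds=Fraction(0)):
    """Absolute time of a frame transition for a rational frame rate."""
    return offset_seconds + Fraction(frame_index * fps_den, fps_num)


# 50 fps with transitions at .015, .035, .055 ...
times_50 = [frame_time(n, 50, 1, Fraction(15, 1000)) for n in range(3)]

# A rational-but-non-integer rate: 30000/1001 frames per second.
t_ntsc = frame_time(30000, 30000, 1001)

print([float(t) for t in times_50], t_ntsc)
```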

The third point appears trickier at first glance, but it too can be
handled through the granule position mapping mechanism.  Here we
arrange the granule position in such a way that granule positions of
keyframes are easy to find.  Divide the granule position <p>


<ul>
<li>Can seek quickly to any keyframe without an index
<li>Naive seeking algorithm still available; just lower performance
<li>Bisection seeking used anyway
</ul>
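<p>
The text above does not spell out how the granule position is divided.
One possible reading, offered here as an assumption rather than as the
specified layout, is to place the frame number of the most recent
keyframe in the high bits and the count of frames since that keyframe
in the low bits.  The shift width and all names below are hypothetical;
a real codec would declare the split in its initial header.<p>

```python
KEYFRAME_SHIFT = 6  # hypothetical width of the low "frames since keyframe" field


def pack_granule(keyframe_frame, frames_since_keyframe, shift=KEYFRAME_SHIFT):
    """Combine the last keyframe's frame number and an offset into one value."""
    if not 0 <= frames_since_keyframe < (1 << shift):
        raise ValueError("offset does not fit in the low bits")
    return (keyframe_frame << shift) | frames_since_keyframe


def unpack_granule(granule, shift=KEYFRAME_SHIFT):
    """Split a granule position back into (keyframe_frame, frames_since_keyframe)."""
    return granule >> shift, granule & ((1 << shift) - 1)


def frame_number(granule, shift=KEYFRAME_SHIFT):
    """Absolute frame number: keyframe position plus offset from it."""
    keyframe_frame, offset = unpack_granule(granule, shift)
    return keyframe_frame + offset


g = pack_granule(120, 7)            # frame 127, last keyframe at frame 120
print(unpack_granule(g), frame_number(g))
```

Under this layout the granule position still never decreases as frames
advance, and the keyframe needed to decode any frame can be read
directly from the high bits.<p>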

<h3>granule position, packets and pages</h3>

Although each packet of data in a logical stream theoretically has a
specific granule position, only one granule position is encoded
per page.  It is possible to encode a logical stream such that each
page contains only a single packet (so that granule positions are
preserved for each packet), however a one-to-one packet/page mapping
is not intended to be the general case.<p>

Because Ogg functions at the page, not packet, level, this
once-per-page time information provides Ogg with the finest-grained
time information it can use.  Ogg passes this granule positioning data
to the codec (along with the packets extracted from a page); it is
intended to be the responsibility of codecs to track timing
information at granularities finer than a single page.<p>

<h3>start-time and end-time positioning</h3>

A granule position represents the <em>instantaneous time location
between two pages</em>. In an "end-time" encoded page, the granulepos
represents the point in time immediately after the last data decoded
from a page.  In a "start-time" encoded page, it represents the point
in time immediately before the first data decoded from the page.<p>
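The relationship between the two conventions can be illustrated with a
short sketch.  It assumes each page covers a known duration expressed
in granule units (for example, samples); the function name and the
durations are made up for the example.<p>

```python
def page_granules(page_durations, start=0):
    """Return (start_time_values, end_time_values) for consecutive pages."""
    starts, ends = [], []
    position = start
    for duration in page_durations:
        starts.append(position)   # instant just before the page's first data
        position += duration
        ends.append(position)     # instant just after the page's last data
    return starts, ends


starts, ends = page_granules([1024, 1024, 512])
print(starts)
print(ends)
```

The end-time value of one page equals the start-time value of the next,
which is the sense in which a granule position marks the instant
between two pages.<p>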

Start-time or end-time positioning is flagged in bit 3 of byte 5 in the
Ogg page header. A set bit indicates start-time positioning.  Version 0
Ogg streams are restricted to using end-time positioning; version 1 may
use either or both start-time and end-time positioning. A single logical stream
within the multiplexed physical Ogg version 1 stream may also mix
start-time and end-time positioning.<p>
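A minimal sketch of testing the flag follows.  It assumes bytes are
numbered from zero, that "bit 3" is counted from the least significant
bit (mask 0x08), and that the stream version is stored in byte 4 of the
page header; these numbering conventions are assumptions, since the
text above does not state them.<p>

```python
START_TIME_FLAG = 0x08  # assumption: "bit 3" counted from the least significant bit


def uses_start_time(page_header):
    """Return True if the page header flags start-time positioning."""
    if len(page_header) < 6:
        raise ValueError("need at least 6 header bytes")
    version = page_header[4]      # assumption: stream version lives in byte 4
    flags = page_header[5]
    if version == 0:
        return False              # version 0 is restricted to end-time positioning
    return bool(flags & START_TIME_FLAG)


print(uses_start_time(b"OggS\x01\x08"))
print(uses_start_time(b"OggS\x01\x02"))
```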

Start- and end-time do not affect multiplexing sort-order; pages are
still sorted by the absolute time a given granulepos maps to
regardless of whether that granulepos represents start- or
end-time.<p>

<h4>use of end-time positioning</h4>

End-time positioning is most useful in unmultiplexed streams.  It allows
two useful features relatively more easily:
<ol>
<li>"short" beginning-of-stream and end-of-stream packets can be represented entirely using granulepos; the codec does not need to store auxiliary sizing information in the codec's data packets.<br>
<li>Retrieving the exact end-time of a stream is the trivial operation of inspecting the granule position of the last page.<br>
</ol>

However, end-time coding results in slightly less efficient buffering
usage in a multiplexed stream.

<h4>use of start-time positioning</h4>

Multiplexed streams of start-time encoded pages yield optimal
buffering behavior; such streams require the minimum theoretical buffer
space of any possible arrangement of pages.  This is the primary benefit of
start-time positioning.<p>

The drawbacks of start-time positioning mirror the benefits attributed to
end-time positioning.  Namely:<p>

<ol>
<li>

Codecs that generate short packets can no longer infer the presence of
a short packet from granulepos context; the 'shortness' of the packet
must be encoded in the packet itself.  This drawback is minor; however,
it does mean that codecs like Vorbis (which relies on granulepos context
to detect short packets) absolutely must use end-time positioning to
handle short packets.<br>
<li>
Determining ending time position of a stream requires slightly more
work than in an end-time encoded stream; the packets of the final
stream page must be counted forward to find ending time.
<br>
</ol>
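The "count forward" step in the second drawback amounts to a simple
sum.  The sketch below assumes the codec can report the duration of
each packet on the final page in granule units; the function name and
the values are hypothetical.<p>

```python
def stream_end_position(last_page_granule, packet_durations):
    """End position of a start-time encoded stream.

    `last_page_granule` is the start-time granule position of the final
    page; `packet_durations` are the codec-reported durations of the
    packets on that page, in the same units.
    """
    return last_page_granule + sum(packet_durations)


end = stream_end_position(2048, [256, 256, 100])
print(end)
```

In an end-time encoded stream the same answer is read directly from the
granule position of the last page, with no per-packet work.<p>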

Despite these minor drawbacks, the additional buffer efficiency of
start-time positioning strongly recommends its use in both multiplexed
and unmultiplexed streams.  Use of end-time positioning should largely be
treated as a legacy means of supporting codecs that use
granulepos-context to determine short packets (such as Vorbis I).<p>

<h4>mixed start-time and end-time positioning</h4>

Mixed positioning may refer to either multiplexing two or more streams
that use different time positionings, or using more than one time
positioning within a logical stream. <p>

Mixed positioning mostly affects only buffer efficiency; although
end-time positioning is less efficient than start-time, mixed-time
positioning will often be less efficient than both.  The inefficiency is
relative however; buffer efficiency can still be excellent in all
three cases.<p>

One possible use of mixed-time positioning is to combine the benefits of
end-time and start-time positioning, for example, by using start-time positioning
for all but the last page of a stream, which is then coded in end-time
format.  This way, a short packet can be flagged using granulepos
context and the end-time position of the stream is immediately obvious
from inspecting the last granule position.<p>

[POINT OF DISCUSSION: the above suggestion looks like it may be worth
considering as the suggested way of positioning the stream, thus doing
away entirely with the need to 'count time forward through packets' on
the last page of a start-time encoded stream to find final stream
length.  However, a truncated stream will be missing the end-time last
page.

1) We could say 'mixed time is the way to go' and just let a
damaged/truncated stream suffer.

2) We could say 'counting time forward through packets is just the way
it has to be done' and do away with the possibility of mixed coding
entirely]

<h2>Multiplex/Demultiplex Division of Labor</h2>

The Ogg multiplex/demultiplex layer provides mechanisms for encoding
raw packets into Ogg pages, decoding Ogg pages back into the original
codec packets, determining the logical structure of an Ogg stream, and
navigating through and synchronizing with an Ogg stream at a desired
stream location.  Strict multiplex/demultiplex operations are entirely
in the Ogg domain and require no intervention from codecs.<p>

Implementation of more complex operations does require codec
knowledge, however.  Unlike other framing systems, Ogg maintains
strict separation between framing and the framed bitstream data; Ogg
does not replicate codec-specific information in the page/framing
data, nor does Ogg blur the line between framing and stream
data/metadata.  Because Ogg is fully data agnostic toward the data it
frames, operations which require specifics of bitstream data (such as
'seek to keyframe') also require interaction with the codec layer
(because, in this example, the Ogg layer is not aware of the concept
of keyframes).  This is different from systems that blur the
separation between framing and stream data in order to simplify the
separation of code.  The Ogg system purposely keeps the distinction in
data simple so that later codec innovations are not constrained by
framing design.<p>

For this reason, however, complex seeking operations require
interaction with the codecs in order to decode the granule position of
a given stream type back to absolute time or in order to find
'decodable points' such as keyframes in video.

<h2>Continuous and Discontinuous Streams</h2>

<h3>continuous description</h3>
A stream that provides a gapless, time-continuous media type is
considered to be 'Continuous'.  Clear examples of continuous data
types include broadcast audio and video. Such a stream should never
allow a playback buffer to starve, and Ogg implementations must buffer
ahead sufficient pages such that all continuous streams in a physical
stream have data ready to decode on demand.<p>

<h3>discontinuous description</h3>
A stream that delivers data in a potentially irregular pattern or with
widely spaced timing gaps is considered to be 'Discontinuous'.  An
example of a discontinuous stream type would be captioning.
Although captions still occur on a regular basis, the timing of a
specific caption is impossible to predict with certainty in most
captioning systems.<p>

<h3>declaration</h3> An Ogg stream type is defined to be continuous or
discontinuous by its codec.  A given codec may support both continuous
and discontinuous operation so long as any given logical stream is
continuous or discontinuous for its entirety and the codec is able to
ascertain which (and inform the Ogg layer) after decoding the
initial stream header.  The majority of codecs will always be
continuous (such as Vorbis) or discontinuous (such as Writ).


<h2>Unsorted Discussion Points</h2>

flushes around keyframes?  RFC suggestion: repaginating or building a
stream this way is nice but not required


<h2>Appendix A: multiplexing examples</h2>