Updated doc to reflect current proposal...

Not as much a proposal at this point actually; this is the way I'm now
implementing it.  Although we're still in the 'RFC'/'look for horrible
lossage' stage, this is close to being set in stone unless we find
something horribly wrong with it.

Doc is still very light on detailed rationale and examples; I'd like
to subcontract that part of the writing and get on with code.


svn path=/trunk/ogg/; revision=6719
This commit is contained in:
Monty 2004-05-18 06:04:53 +00:00
parent 712dbb9b65
commit 5a42681ccc


@ -6,7 +6,7 @@
Page Multiplexing and Ordering in a Physical Ogg Stream
</font></h1>
<em>Last update to this document: May 17, 2004</em><br>
<p>
The low-level mechanisms of an Ogg stream (as described in the Ogg
@ -33,19 +33,6 @@ encoding) or interactive decoding (such as scrubbing or instant
replay) is not disallowed or discouraged, however, no bitstream feature
may require nonlinear operation on the bitstream.<p>
<h3>Multiplexing</h3>
Ogg bitstreams multiplex multiple logical streams into a single
@ -65,22 +52,93 @@ packets span multiple pages; the specifics of handling this special
case are described later under 'Continuous and Discontinuous
Streams'.<p>
<h3>Seeking</h3>
Ogg is designed to use a bisection search to implement exact
positional seeking rather than building an index; an index requires
two-pass encoding and as such is not acceptable given the requirement
for full-featured linear encoding.<p>
<i>Even making an index optional then requires an
application to support multiple methods (bisection search for a
one-pass stream, indexing for a two-pass stream), which adds no
additional functionality as bisection search delivers the same
functionality for both stream types.</i><p>
Seek operations are by absolute time; a direct bisection search must
find the exact time position requested. Information in the Ogg
bitstream is arranged such that all information to be presented for
playback from the desired seek point will occur at or after the
desired seek point. Seek operations are neither 'fuzzy' nor
heuristic.<p>
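For illustration, such a bisection might look like the following C
sketch; the 'stream' handle and the time_of_page_after() helper (capture
the first complete page at or after a byte offset and map its granulepos
to absolute time with the codec's help) are hypothetical application-side
pieces rather than part of the libogg API.<p>
<pre>
/* Sketch of a granulepos-driven bisection seek.  The 'stream' handle and
   time_of_page_after() are hypothetical: the helper captures the first
   complete page at or after a byte offset and maps its granulepos to
   absolute time via the codec. */
typedef struct stream stream;
extern double time_of_page_after(stream *s, ogg_int64_t byte_offset);

ogg_int64_t bisect_to_time(stream *s, ogg_int64_t begin, ogg_int64_t end,
                           double target_time){
  while(end - begin > 1){
    ogg_int64_t mid = begin + (end - begin) / 2;
    if(time_of_page_after(s, mid) < target_time)
      begin = mid;          /* target lies after the page found here   */
    else
      end = mid;            /* target lies at or before this point     */
  }
  return begin;             /* byte offset from which to resume decode */
}
</pre>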
<i>Although keyframe handling in video appears to be an exception to
"all needed playback information lies ahead of a given seek",
keyframes can still be handled directly within this indexless
framework. Seeking to a keyframe in video (as well as seeking in other
media types with analogous constraints) is handled as two seeks; first
a seek to the desired time which extracts state information that
decodes to the time of the last keyframe, followed by a second seek
directly to the keyframe. The location of the previous keyframe is
embedded as state information in the granulepos; this mechanism is
described in more detail later.</i>
<h3>Continuous and Discontinuous Streams</h3>
Logical streams within a physical Ogg stream belong to one of two
categories, "Continuous" streams and "Discontinuous" streams.
Although these are discussed in more detail later, the distinction is
important to a high-level understanding of how to buffer an Ogg
stream.<p>
A stream that provides a gapless, time-continuous media type with a
fine-grained timebase is considered to be 'Continuous'. A continuous
stream should never be starved of data. Clear examples of continuous
data types include broadcast audio and video.<p>
A stream that delivers data in a potentially irregular pattern or with
widely spaced timing gaps is considered to be 'Discontinuous'. A
discontinuous stream may be best thought of as data representing
scattered events; although they happen in order, they are typically
unconnected data often located far apart. One possible example of a
discontinuous stream type would be captioning. Although it's
possible to design captions as a continuous stream type, it's most
natural to think of captions as widely spaced pieces of text with
little happening in between.<p>
The fundamental design distinction between continuous and
discontinuous streams concerns buffering.<p>
<h3>Buffering</h3>
Ogg's multiplexing design minimizes extraneous buffering required to
maintain audio/video sync by arranging audio, video and other data in
chronological order. Thus, a normally streamed file delivers all
data for decode 'just in time'; pages arrive in the order they must
be consumed.<p>
Because a continuous stream is, by definition, gapless, Ogg buffering
is based on the simple premise of never allowing any active continuous
stream to starve for data during decode; buffering proceeds ahead
until all continuous streams in a physical stream have data ready to
decode on demand. <p>
Discontinuous stream data may occur on a fairly regular basis, but the
timing of, for example, a specific caption is impossible to predict
with certainty in most captioning systems. Thus the buffering system
should take discontinuous data 'as it comes' rather than working ahead
(for a potentially unbounded period) to look for future discontinuous
data. As such, discontinuous streams are ignored when managing
buffering; their pages simply 'fall out' of the stream when continuous
streams are handled properly.<p>
Buffering requirements need not be explicitly declared or managed for
the encoded stream; the decoder simply reads as much data as is
necessary to keep all continuous stream types gapless (also ensuring
discontinuous data arrives in time) and no more, resulting in
implicit buffer usage for a given stream. Because all pages of all
data types are stamped with absolute timing information within the
stream, inter-stream synchronization timing is always explicitly
maintained without the need for explicitly declared buffer-ahead
hinting.<p>
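For illustration only, the buffer-ahead rule might be sketched as below;
every name here (the mux state, all_continuous_ready(),
next_page_in_stream_order(), queue_page_for_decode()) is a hypothetical
application-side stand-in, with only ogg_page and ogg_page_serialno()
taken from libogg.<p>
<pre>
/* Sketch of the buffer-ahead rule described above: keep reading pages, in
   the order they appear in the physical stream, until every active
   continuous stream has data ready to decode.  All names below except
   ogg_page/ogg_page_serialno() are hypothetical application-side pieces. */
typedef struct mux_state mux_state;
typedef struct logical_stream logical_stream;

extern int  all_continuous_ready(mux_state *m);
extern int  next_page_in_stream_order(mux_state *m, ogg_page *og);
extern logical_stream *stream_for_serialno(mux_state *m, int serialno);
extern void queue_page_for_decode(logical_stream *ls, ogg_page *og);

void buffer_ahead(mux_state *m){
  ogg_page og;
  while(!all_continuous_ready(m)){
    if(!next_page_in_stream_order(m, &og)) break;   /* end of stream */
    /* Discontinuous pages need no special handling; they simply 'fall
       out' of the stream here while the continuous streams are filled. */
    queue_page_for_decode(stream_for_serialno(m, ogg_page_serialno(&og)), &og);
  }
}
</pre>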
Further details, mechanisms and reasons for the differing arrangement
and behavior of continuous and discontinuous streams are discussed
later.<p>
<h3>Whole-stream navigation</h3>
@ -90,19 +148,19 @@ navigating each interleaved stream as a separate entity. <p>
First Example: seeking to a desired time position in a multiplexed (or
unmultiplexed) Ogg stream can be accomplished through a bisection
search on time position of all pages in the stream (as encoded in the
granule position). More powerful searches (such as a keyframe-aware
seek within video) are also possible with additional search
complexity, but similar computational complexity.<p>
Second Example: A bitstream section may consist of three multiplexed
streams of differing lengths. The result of multiplexing these
streams should be thought of as a single mixed stream with a length
equal to the longest of the three component streams. Although it is
also possible to think of the multiplexed results as three concurrent
streams of different lengths and it is possible to recover the three
original streams, it will also become obvious that once multiplexed,
it isn't possible to find the internal lengths of the component
streams without a linear search of the whole bitstream section.
However, it is possible to find the length of the whole bitstream
section easily (in near-constant time per section) just as it is for a
@ -117,7 +175,7 @@ of every Ogg page. Although the granule position represents absolute
time within a logical stream, its value does not necessarily directly
encode a simple timestamp. It may represent frames elapsed (as in
Vorbis), a simple timestamp, or a more complex bit-division encoding
(such as in Theora). The exact encoding of the granule position is up
to a specific codec.<p>
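In Vorbis, for example, the granule position simply counts audio frames
(PCM samples per channel) elapsed since the start of the logical stream,
so mapping it to seconds needs nothing more than the rate declared in the
codec's identification header; a minimal sketch:<p>
<pre>
/* Vorbis-style mapping: the granule position counts audio frames (PCM
   samples per channel) elapsed, so time only needs the sample rate from
   the codec's identification header.  A granulepos of -1 means the page
   carries no position (no packet ends on it). */
double vorbis_granule_seconds(ogg_int64_t granulepos, long rate){
  if(granulepos < 0) return -1.;
  return (double)granulepos / (double)rate;
}
</pre>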
The granule position is governed by the following rules:
@ -216,17 +274,22 @@ codec's initial header, and the rest is just arithmetic.<p>
The third point appears trickier at first glance, but it too can be
handled through the granule position mapping mechanism. Here we
arrange the granule position in such a way that granule positions of
keyframes are easy to find. Divide the granule position into two
fields; the most-significant bits are an absolute frame counter, but
it's only updated at each keyframe. The least significant bits encode
the number of frames since the last keyframe. In this way, each
granule position both encodes the absolute time of the current frame
as well as the absolute time of the last keyframe.<p>
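This is the arrangement Theora uses. A sketch of the arithmetic, assuming
the number of low-order bits (the 'granule shift') is declared in the
codec's setup headers:<p>
<pre>
/* Two-field granule position as described above (the scheme Theora uses).
   'shift' is the number of low-order bits holding the frames-since-keyframe
   count, as declared in the codec's setup headers. */
ogg_int64_t granpos_pack(ogg_int64_t keyframe, ogg_int64_t delta, int shift){
  return (keyframe << shift) | delta;
}
ogg_int64_t granpos_keyframe(ogg_int64_t gp, int shift){
  return gp >> shift;                      /* frame number of last keyframe */
}
ogg_int64_t granpos_frame(ogg_int64_t gp, int shift){
  /* absolute frame number encoded by this granulepos */
  return (gp >> shift) + (gp & (((ogg_int64_t)1 << shift) - 1));
}
</pre>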
[FINISH DESCRIBING "THE GRANPOS HACK" HERE. ELOQUENCE IS CURRENTLY
ELUDING ME, BUT FOR NOW THE CORE TEAM UNDERSTANDS THIS ONE. Do be
sure to fill me in before this doc is public :-]
<pre>
Can seek quickly to any keyframe without index
Naive seeking algorithm still available; just lower performance
Bisection seeking used anyway
</pre>
Seeking to the most recent preceding keyframe is then accomplished by
first seeking to the original desired point, inspecting the granulepos
of the resulting video page, extracting from that granulepos the
absolute time of the desired keyframe, and then seeking directly to
that keyframe's page. Of course, it's still possible for an
application to ignore keyframes and use a simpler seeking algorithm
(decode would be unable to present decoded video until the next
keyframe). Surprisingly many player applications do choose the
simpler approach.<p>
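As a sketch, the two seeks might be wired together as below;
bisect_to_time() is the bisection sketched earlier, while granulepos_at()
(granulepos of the first video page at or after a byte offset) and the
per-frame duration are hypothetical or codec-supplied.<p>
<pre>
/* Sketch of the two-seek keyframe procedure.  bisect_to_time() is the
   bisection sketched earlier; granulepos_at() and frame_duration are
   hypothetical / codec-supplied. */
typedef struct stream stream;
extern ogg_int64_t bisect_to_time(stream *s, ogg_int64_t b, ogg_int64_t e,
                                  double target_time);
extern ogg_int64_t granulepos_at(stream *s, ogg_int64_t byte_offset);
extern ogg_int64_t granpos_keyframe(ogg_int64_t gp, int shift);

ogg_int64_t seek_keyframe(stream *s, ogg_int64_t length, double target_time,
                          int shift, double frame_duration){
  /* first seek: straight to the requested time */
  ogg_int64_t off = bisect_to_time(s, 0, length, target_time);
  /* the granulepos found there also encodes the last keyframe's frame no. */
  double key_time = (double)granpos_keyframe(granulepos_at(s, off), shift)
                    * frame_duration;
  /* second seek: directly to that keyframe */
  return bisect_to_time(s, 0, length, key_time);
}
</pre>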
<h3>granule position, packets and pages</h3>
@ -240,116 +303,34 @@ is not intended to be the general case.<p>
Because Ogg functions at the page, not packet, level, this
once-per-page time information provides Ogg with the finest-grained
time information it can use. Ogg passes this granule positioning data
to the codec (along with the packets extracted from a page); it is the
responsibility of codecs to track timing information at granularities
finer than a single page.<p>
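As an example of what tracking time at finer-than-page granularity can
mean for an end-time encoded audio stream, a codec can assign an ending
position to every packet on a page by walking backwards from the page
granulepos; granules_in_packet() below stands in for the codec-specific
knowledge of each packet's duration.<p>
<pre>
/* Sketch: per-packet end positions within one end-time encoded page,
   recovered by walking the page's packets backwards from the page
   granulepos.  granules_in_packet() is codec-specific knowledge and is
   assumed here. */
extern ogg_int64_t granules_in_packet(const ogg_packet *op);

void packet_end_positions(ogg_int64_t page_granulepos,
                          const ogg_packet *packets, int count,
                          ogg_int64_t *ends){
  ogg_int64_t g = page_granulepos;
  int i;
  for(i = count - 1; i >= 0; i--){
    ends[i] = g;                       /* end position of packet i */
    g -= granules_in_packet(&packets[i]);
  }
}
</pre>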
<h3>start-time and end-time positioning</h3>
A granule position represents the <em>instantaneous time location
between two pages</em>. However, continuous streams and discontinuous
streams differ on whether the granulepos represents the end-time of
the data on a page or the start-time. Continuous streams are
'end-time' encoded; the granulepos represents the point in time
immediately after the last data decoded from a page. Discontinuous
streams are 'start-time' encoded; the granulepos represents the point
in time of the first data decoded from the page.<p>
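In code terms the difference only appears when a granulepos is mapped to
a particular instant, for example the time of the first data on a page; a
sketch, in which granule_to_time() is the codec-supplied mapping and the
stream_info record is hypothetical:<p>
<pre>
/* Sketch: time of the first data on a page under either convention.
   granule_to_time() is the codec-supplied mapping; stream_info, the
   'discontinuous' flag and page_duration are hypothetical/codec-derived. */
typedef struct {
  int    discontinuous;                      /* start-time encoded?     */
  void  *codec_state;
  double (*granule_to_time)(void *state, ogg_int64_t gp);
} stream_info;

double page_first_data_time(const stream_info *si, ogg_int64_t gp,
                            double page_duration){
  double t = si->granule_to_time(si->codec_state, gp);
  /* start-time: granulepos already marks the first data; end-time: it
     marks the point just after the last data, so back up by the page's
     duration. */
  return si->discontinuous ? t : t - page_duration;
}
</pre>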
Start-time or end-time positioning is flagged in bit 3 of byte 5 in
the Ogg page header. A set bit indicates start-time positioning.
Version 0 Ogg streams are restricted to using end-time positioning;
version 1 may use either or both start-time and end-time
positioning. A single logical stream within the multiplexed physical
Ogg version 1 stream may also mix start-time and end-time
positioning.<p>
An Ogg stream type is declared continuous or discontinuous by its
codec. A given codec may support both continuous and discontinuous
operation so long as any given logical stream is continuous or
discontinuous for its entirety and the codec is able to ascertain (and
inform the Ogg layer) as to which after decoding the initial stream
header. The majority of codecs will always be continuous (such as
Vorbis) or discontinuous (such as Writ).<p>
[POINT OF DISCUSSION: this flag can be added without upping the
bitstream revision. However, old software is unaware of start-time
ordering; the result is as harmless as seeking inaccuracies or as
serious as crashing poorly designed code. Upping the Ogg bitstream
revision would force old code to reject these new streams; although
old code generally doesn't verify that any reserved flags are zero as
the spec mandates, they do check the bitstream revision number]<p>
Start- and end-time encoding do not affect multiplexing sort-order;
pages are still sorted by the absolute time a given granulepos maps to
regardless of whether that granulepos represents start- or
end-time.<p>
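A multiplexer's ordering rule can therefore be expressed as a single
comparison on mapped absolute time; in the sketch below, queued_page and
page_abs_time() (the codec-assisted granulepos-to-seconds mapping) are
hypothetical muxer-side pieces.<p>
<pre>
/* Sketch: pages are emitted in the order of the absolute time their
   granulepos maps to, whether that granulepos is start-time or end-time
   encoded.  queued_page and page_abs_time() are hypothetical muxer-side
   bookkeeping, not libogg calls. */
typedef struct queued_page queued_page;
extern double page_abs_time(const queued_page *p);

int page_order_cmp(const queued_page *a, const queued_page *b){
  double ta = page_abs_time(a);
  double tb = page_abs_time(b);
  return (ta < tb) ? -1 : (ta > tb) ? 1 : 0;
}
</pre>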
<h4>use of end-time positioning</h4>
End-time positioning is most useful in unmultiplexed streams. It makes
two useful features relatively easy to provide:
<ol>
<li>"short" beginning-of-stream and end-of-stream packets can be represented entirely using granulepos; the codec does not need to store auxiliary sizing information in the codec's data packets.<br>
<li>Retrieving the exact end-time of a stream is the trivial operation of inspecting the granule position of the last page.<br>
</ol>
However, end-time coding results in slightly less efficient buffering
usage in a multiplexed stream.
<h4>use of start-time positioning</h4>
Multiplexed streams of start-time encoded pages yield optimal
buffering behavior; this arrangement requires the minimum theoretical buffer space
of any possible arrangement of pages. This is the primary benefit of
start-time positioning.<p>
The drawbacks of start-time positioning mirror the benefits attributed to
end-time positioning. Namely:<p>
<ol>
<li>
Codecs that generate short packets can no longer infer the presence of
a short packet from granulepos context; the 'shortness' of the packet
must be encoded in the packet itself. This drawback is minor, however
it does mean that codecs like Vorbis (which rely on granpos context
to detect short packets) absolutely must use end-time positioning to
handle short packets.<br>
<li>
Determining the ending time position of a stream requires slightly more
work than in an end-time encoded stream; the packets of the final
stream page must be counted forward to find ending time.
<br>
</ol>
Despite these minor drawbacks, the additional buffer efficiency of
start-time positioning strongly recommends its use in both multiplexed
and unmultiplexed streams. Use of end-time positioning should largely be
treated as a legacy means of supporting codecs that use
granulepos-context to determine short packets (such as Vorbis I).<p>
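The 'counting forward' mentioned in the second drawback above amounts to
adding each packet's duration to the final page's start-time granulepos;
a sketch, with granules_in_packet() again standing in for codec-specific
knowledge:<p>
<pre>
/* Sketch: ending position of a start-time encoded stream.  Start from the
   final page's granulepos (the time of its first data) and add the duration
   of every packet on that page.  granules_in_packet() is codec-specific
   knowledge, assumed here as before. */
extern ogg_int64_t granules_in_packet(const ogg_packet *op);

ogg_int64_t stream_end_position(ogg_int64_t last_page_granulepos,
                                const ogg_packet *packets, int count){
  ogg_int64_t g = last_page_granulepos;
  int i;
  for(i = 0; i < count; i++)
    g += granules_in_packet(&packets[i]);
  return g;
}
</pre>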
<h4>mixed start-time and end-time positioning</h4>
Mixed positioning may refer to either multiplexing two or more streams
that use different time positionings, or using more than one time
positioning within a logical stream. <p>
Mixed positioning mostly affects only buffer efficiency; although
end-time positioning is less efficient than start-time, mixed-time
positioning will often be less efficient than both. The inefficiency is
relative, however; buffer efficiency can still be excellent in all
three cases.<p>
One possible use of mixed-time positioning is to combine the benefits of
end-time and start-time positioning: for example, using start-time positioning
for all but the last page of a stream, which is then coded in end-time
format. This way, a short packet can be flagged using granulepos
context and the end-time position of the stream is immediately obvious
from inspecting the last granule position.<p>
[POINT OF DISCUSSION: the above suggestion looks like it may be worth
considering as the suggested way of positioning the stream, thus doing
away entirely with the need to 'count time forward through packets' on
the last page of a start-time encoded stream to find final steam
length. However, a truncated stream will be missing the end-time last
page.
1) We could say 'mixed time is the way to go' and just let a
damaged/truncated stream suffer.
2) We could say 'counting time forward through packets is just the way
it has to be done' and do away with the possibility of mixed coding
entirely]
<h2>Multiplex/Demultiplex Division of Labor</h2>
The Ogg multiplex/demultiplex layer provides mechanisms for encoding
@ -364,7 +345,7 @@ knowledge, however. Unlike other framing systems, Ogg maintains
strict separation between framing and the framed bitstream data; Ogg
does not replicate codec-specific information in the page/framing
data, nor does Ogg blur the line between framing and stream
data/metadata. Because Ogg is fully data-agnostic toward the data it
frames, operations which require specifics of bitstream data (such as
'seek to keyframe') also require interaction with the codec layer
(because, in this example, the Ogg layer is not aware of the concept
@ -379,33 +360,6 @@ interaction with the codecs in order to decode the granule position of
a given stream type back to absolute time or in order to find
'decodable points' such as keyframes in video.
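One way to picture this division of labor is a small per-codec callback
table that an application registers with its demux layer, so the
data-agnostic Ogg side can ask the codec the few questions navigation
requires; the structure and names below are illustrative, not an existing
libogg interface.<p>
<pre>
/* Illustrative division of labor: the demux side stays data-agnostic and
   defers every codec-specific question to callbacks registered per logical
   stream.  This struct is a sketch, not an existing libogg interface. */
typedef struct {
  void  *codec_state;
  /* map a granulepos of this stream to absolute time in seconds */
  double (*granule_to_time)(void *state, ogg_int64_t granulepos);
  /* is this stream continuous or discontinuous?  Known only after the
     codec has parsed its initial header. */
  int    (*is_continuous)(void *state);
  /* e.g. map a granulepos to the granulepos of the preceding keyframe,
     for 'seek to decodable point' operations */
  ogg_int64_t (*keyframe_granule)(void *state, ogg_int64_t granulepos);
} codec_hooks;
</pre>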
<h2>Continuous and Discontinuous Streams</h2>
<h3>continuous description</h3>
A stream that provides a gapless, time-continuous media type is
considered to be 'Continuous'. Clear examples of continuous data
types include broadcast audio and video. Such a stream should never
allow a playback buffer to starve, and Ogg implementations must buffer
ahead sufficient pages such that all continuous streams in a physical
stream have data ready to decode on demand.<p>
<h3>discontinuous description</h3>
A stream that delivers data in a potentially irregular pattern or with
widely spaced timing gaps is considered to be 'Discontinuous'. An
example of a discontinuous stream type would be captioning.
Although captions still occur on a regular basis, the timing of a
specific caption is impossible to predict with certainty in most
captioning systems.<p>
<h3>declaration</h3> An Ogg stream type is defined to be continuous or
discontinuous by its codec. A given codec may support both continuous
and discontinuous operation so long as any given logical stream is
continuous or discontinuous for its entirety and the codec is able to
ascertain (and inform the Ogg layer) as to which after decoding the
initial stream header. The majority of codecs will always be
continuous (such as Vorbis) or discontinuous (such as Writ).
<h2>Unsorted Discussion Points</h2>
flushes around keyframes? RFC suggestion: repaginating or building a