The Annodex exchange format for time-continuous bitstreams, Version 3.0 Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4180 Silvia.Pfeiffer@csiro.au http://www.ict.csiro.au/
Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4222 Conrad.Parker@csiro.au http://www.ict.csiro.au/
Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4222 Andre.Pang@csiro.au http://www.ict.csiro.au/
This specification defines "Annodex", an exchange format for annotated and indexed time-continuous bitstreams. Annodex provides a bitstream format for exchanging multitrack interleaved time-continuous bitstreams and textual meta information attached to temporal fragments of the binary bitstreams. The meta information is given in the Continuous Media Markup Language (CMML). Annodex enables integration of time-continuous bitstreams into the browsing and searching functionality of the World Wide Web. The specification is not encumbered by patents. The Annodex format is protected by a trade mark to prevent the use of the term "Annodex" for any related but non-conformant and therefore non-interoperable technology. Conformant technology is encouraged to use the term "Annodex" when referring to the exchange format. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
When searching the World Wide Web, time-continuous data such as audio and video files are currently treated as "dark matter" outside the existing infrastructure of the World Wide Web: It is not possible to look inside such files, search for their content through common text-based search engines, or directly hyperlink to points of interest inside them. The file can generally only be consumed in its entirety. In addition, such files are "dead ends" in that by consuming their content the hyperlinking functionality of the Web is left behind. Text documents were enabled for the Web through definition of a markup language (HTML) for text documents to enable description of the structure of a document, and thus allow for the separation of content from presentation. This specification takes the same approach for time-continuous documents. The markup language for time-continuous documents is called CMML, short for Continuous Media Markup Language. It describes the structure of time-continuous documents and allows for a clean separation of content from presentation. To turn text documents into a Web resource that can be exchanged between different applications, HTML markup is added. Such an exchange format where CMML is merged with the time-continuous document(s) it describes is also necessary to turn the time-continuous document(s) into a Web resource and provide a standard exchange format between applications. This format is called "Annodex" for annotated and indexed documents and is defined here.
Annodex uses a container format that allows transport and storage of interleaved time-synchronous bitstreams. In a clean layering approach, as is familiar from Internet protocols, the functionality of the container format and CMML is explicitly separated. Each layer solves a specific problem without being dependent on layers that are further up in functionality. The container format of Annodex is the Ogg encapsulation format version 0. Annodex is an Ogg bitstream containing a "skeleton" and a CMML logical bitstream, in addition to other temporally interleaved data bitstreams. Ogg skeleton is a logical bitstream that describes all the other logical bitstreams contained in the Ogg physical bitstream (see section 4). Its purpose is to remove codec-specific information requirements from the multiplexing/demultiplexing process. Only an Annodex bitstream that contains a CMML bitstream can be regarded as a Web resource and as part of the Web, because it can be searched and browsed. An Ogg bitstream without a CMML bitstream is not an Annodex bitstream, but only an Ogg bitstream with a "skeleton" logical bitstream. Such a bitstream is still valuable as a multitrack media format that can be addressed through temporal hyperlinks. However, it is not a first class citizen on the Web because Web search engines cannot index and crawl it. The file extension of Annodex files is ".anx". This document also applies for registration of the MIME type "application/annodex" for Annodex format bitstreams. In the meantime, "text/x-annodex" will be used. Further MIME types that this document applies for are "video/annodex" for Annodex format (possibly multitrack) video and "audio/annodex" for Annodex format (possibly multitrack) audio. Please note that this document assumes that the reader understands the Ogg encapsulation format version 0.
Also, knowledge of the network protocols HTTP and RTP/RTSP as well as of the extension of URIs to address temporal offsets into Web resources is a prerequisite to understanding this document. To find out more about the use of Annodex for creating searchable and surfable Web resources, refer to the specification of the Continuous Media Markup Language (CMML Version 2.0).
Annodex contains interleaved bitstreams of time-related data. It is designed to be used both as a persistent file format and as a streaming format to exchange temporally addressable bitstreams. It enables encapsulation of any type of time-continuous bitstream as long as it is streamable and is based on a regular data sampling rate (called granulerate). For variable sampling rate bitstreams, a least common multiple of the used sampling rates must be known. Using this container format, Annodex is designed to accommodate any current or future compression format for time-continuous bitstreams. The container format that Annodex is based on is designed to allow several tracks of temporally synchronous time-continuous data. Each track represents codec data for one type of time-continuous data stream. Here is an example Annodex bitstream with data bitstreams D1-D3 (for example, a video track and two audio tracks) and an annotation track A1 (a CMML bitstream).
       _____________________________________________________________
   D1  |     |     |     |     |     |     |     |     |     |     |
       _____________________________________________________________
   D2  |         |         |         |         |         |         |
       _____________________________________________________________
   D3  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
       _____________________________________________________________
   A1  |    clip 1     |   --   |    clip 2     |      clip 3      |
       _____________________________________________________________

       The time axis t
       |------------------------------------------------------------>
Bitstreams of time-continuous data are regarded as a sequence of data packets that each have a timestamp representing the time at which the packet data ends. Each packet contains all the data required to cover the interval since the previous packet. If a packet does not cover the full interval, it MUST cover the end part of the interval.

Bitstreams that represent data that is to be presented at one single time instant are called time-instantaneous bitstreams. Their timestamp represents the time at which the packet's data starts and ends. The CMML track A1 above is one such bitstream. Its clips represent time-instantaneous data that is displayed at the given timestamp. The subsequent data packet replaces the information of the previous one. To insert a gap in a data bitstream (as in A1 above), a data packet MUST be inserted which explicitly annuls the data.

Data bitstreams generally contain the following information:

   o  setup information for a codec

   o  content data

The setup information is inserted at the start of a data bitstream before any content data.

Distribution of Annodex format bitstreams is performed using a network protocol such as HTTP or RTP/RTSP. The basic process is the following:

   1. The client dispatches a download or streaming request to the server with a certain URI.

   2. The server resolves the URI and starts delivering Annodex format bitstreams, taking into account potential URI addressed offsets.

Currently the distribution with HTTP is clear and discussed in this document, while the details of a distribution via RTP/RTSP have not yet been examined and are thus unspecified. In particular, an RTP payload needs to be defined for Annodex. The following figure explains the protocol stack:
   ________   _________   _________   __________   \
   | CMML |   | Video |   | Audio |   |  ...   |    |
   ________   _________   _________   __________    |
   |                  skeleton                 |     > Annodex
   _____________________________________________    |
   |                    Ogg                    |    |
   _____________________________________________   /
   |         HTTP         |        RTSP        |
   |                      _______________________
   |                      |        RTP         |
   _____________________________________________
   |         TCP          |        UDP         |
   _____________________________________________
   |                    IP                     |
   _____________________________________________
The Annodex format has been designed to accommodate both reliable and unreliable transport. In case of packet loss due to an unreliable transport, data may get lost; this may or may not be important to the application and thus may need to be addressed. All data, including CMML data, is treated with the same importance. For instantaneous data tracks, the loss of one packet implies that the next packet will restore the proper state. We envisage, however, that a client may require the current state information, so there should be a protocol request for re-sending the current state. The server will satisfy such a request by inserting another copy of the instantaneous data into the Annodex bitstream. For example, clips within an annotation bitstream can be repeated in the Annodex bitstream by having the same "track" attribute and the same page_sequence_number as the previous "clip" element. This handling of unreliable transport relates mostly to the use of Annodex over RTP/RTSP and UDP and needs further elaboration.

In short, the Annodex bitstream specific features make it possible to:

   o  index clips of Annodex content for retrieval, e.g. with a Web search engine.

   o  crawl Webs of Annodex and other Web resources, e.g. during an indexing operation of a Web search engine.

   o  directly address and retrieve temporal intervals inside the Annodex bitstream without a need to decode logical bitstreams aside from skeleton.

   o  directly address and retrieve named clips inside the Annodex bitstream without a need to decode any more than the skeleton and CMML logical bitstreams.

   o  extract, cache, and reuse temporal intervals or named clips while retaining the annotation and index information.

   o  browse through Webs of Annodex and other Web resources in an integrated manner, making time-continuous content a first class citizen on the World Wide Web.
For authoring of Annodex bitstream information, CMML is defined. CMML's "stream" tag has been designed to author the skeleton bitstream and describe the data bitstreams to be interleaved into an Ogg bitstream. All other tags of a CMML file provide for authoring of the CMML bitstream. Use of a CMML bitstream without skeleton is strongly discouraged, as the time referencing and clip recomposition functionality of Annodexing will be lost.

An Annodex physical bitstream has the following mandatory order of Ogg pages:

   1. skeleton bos page.

   2. CMML bos page.

   3. bos pages of the other logical bitstreams.

   4. secondary header pages of all logical bitstreams, including fisbone.

   5. skeleton eos page.

   6. data and eos pages of logical bitstreams, excluding skeleton, multiplexed in a time-synchronous fashion.

Such an Annodex bitstream is identified by the CMML bitstream's magic number, which can be found at Byte position 120 for this version of the "skeleton" specification. This is calculated through the size of the skeleton bos page, which is fixed because the skeleton ident header is of fixed size and the Ogg page encapsulation header is also of fixed size. The Ogg page header has 28 Bytes (including a one Byte segment table, as this page always has less than 255 Bytes packet content), and the skeleton ident header has 64 Bytes (8 + 2 + 2 + 4 x 8 + 20 for the fields described further down). The Byte position then amounts to 28+64+28 = 120. The CMML bos page MUST thus also have less than 255 Bytes packet content, which is a sensible restriction.

The CMML media mapping is defined in the CMML specification. However, for identification of an Annodex bitstream, the bos page of the CMML logical bitstream needs to be identifiable. This is provided through the first 12 Bytes of the CMML ident packet, which contain the magic numbers and the version information. Other fields exist and are described in the CMML specification. The first 12 Bytes are:

   o  Identifier: an 8 Byte field that identifies this bitstream as a CMML logical bitstream. It contains the magic numbers:

         0x43 'C'
         0x4d 'M'
         0x4d 'M'
         0x4c 'L'
         0x00 '\0'
         0x00 '\0'
         0x00 '\0'
         0x00 '\0'

   o  Version major: 2 Byte unsigned integer signifying the major version number of the CMML format bitstream.

   o  Version minor: 2 Byte unsigned integer signifying the minor version number of the CMML format bitstream.
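As an illustration of the identification step, the following minimal Python sketch checks the first 12 Bytes of a packet for the CMML identifier and reads the two version fields. The function name `parse_cmml_ident_prefix` is hypothetical, and the sketch assumes that the version fields are stored LSB first, as is stated for the skeleton headers; the CMML specification remains the authority on this point.

```python
import struct

CMML_MAGIC = b"CMML\x00\x00\x00\x00"  # the 8 Byte identifier


def parse_cmml_ident_prefix(packet: bytes):
    """Return (major, minor) if packet starts like a CMML ident packet, else None."""
    if len(packet) < 12 or packet[:8] != CMML_MAGIC:
        return None
    # Assumption: the two 2 Byte version fields are stored LSB first.
    major, minor = struct.unpack_from("<HH", packet, 8)
    return major, minor


# Hypothetical packet: identifier, version 2.0, then placeholder Bytes
# standing in for the remaining fields of the ident packet.
example = CMML_MAGIC + struct.pack("<HH", 2, 0) + b"\x00" * 8
print(parse_cmml_ident_prefix(example))                        # (2, 0)
print(parse_cmml_ident_prefix(b"fishead\x00" + b"\x00" * 8))   # None
```

The sketch operates on an already extracted packet and leaves locating that packet within the Ogg page structure to the caller.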
For the rest of the CMML media mapping refer to the specification of the CMML version that is being used (must be larger than 2.0).
The purpose of Ogg skeleton is to provide codec-specific knowledge that allows parsing, demultiplexing and remultiplexing of Ogg bitstreams without having to decode them. While the Ogg encapsulation format by itself is capable of interleaving an unlimited number of time-continuous bitstreams, it is not possible to identify the type of bitstreams (e.g. audio or video) and their encoding format (e.g. Vorbis or Speex or Theora) without decoding at least the bos page of the logical bitstreams. Also, further general media type information such as the image dimensions of a frame in a video bitstream or the language of a speech bitstream may be provided in skeleton. Another limitation of Ogg is that each logical bitstream defines its own mapping of granule_position to time, which is therefore also given in the skeleton. This situation is not acceptable for Annodex, because an Annodex server must be able to return media format information for an Annodex resource without having to understand the codecs involved. It must also be able to return temporal subparts of an Annodex resource without needing to decode. An addition to the Ogg format is thus necessary, which describes all the logical bitstreams included in the Ogg stream. This is defined via a logical bitstream called the "skeleton". For Annodex bitstreams, use of a skeleton bitstream is mandatory. This section specifies the content of the "skeleton" logical bitstream and how it is mapped into Ogg. Knowledge of the Ogg bitstream format as specified in the Ogg RFC is presumed. Please also refer to that document for descriptions of the terms used in this document. The skeleton bitstream has the ability to generically describe Ogg bitstreams that consist of one or more time-continuous data bitstreams and one or more time-instantaneous data bitstreams concurrently interleaved (in Ogg terms: multiplexed).
It does not describe sequentially multiplexed Ogg bitstreams, but rather expects that a sequentially multiplexed bitstream has its own skeleton logical bitstream.

The skeleton logical bitstream provides the following functionality on top of Ogg:

   o  allows for the identification of the codec format and the content type of encapsulated logical bitstreams without the need to decode that bitstream's headers or data.

   o  allows for extraction of a temporal interval of the Ogg physical bitstream while retaining the original start time offset of that interval.

   o  allows for attachment of a real-world wall-clock time and a date to the Ogg physical bitstream, thus e.g. retaining creation date/time or first broadcast date/time.

   o  allows for temporal offset operations into an Ogg physical bitstream without a need to decode any data.

   o  allows generally for handling of content without a need to decode it, such as is necessary in a caching Web proxy.

   o  allows for attachment of message header fields given as name-value pairs that contain some sort of protocol messages about the logical bitstream, e.g. the screen size for a video bitstream or the number of channels for an audio bitstream.

For authoring of the skeleton bitstream information, CMML can be used. CMML's "stream" tag has been designed with that purpose in mind. However, it is not mandatory to use CMML for authoring of skeleton information; that information may well originate from a different source and be written directly into the skeleton bitstream. See the CMML Internet-Draft for more details.
The skeleton logical bitstream starts with an ident header containing information for the complete Ogg physical bitstream. The ident header has the following format:
Fields with more than one Byte length are encoded LSB (least significant Byte) first. The fields in the skeleton ident header have the following meaning:

   o  Identifier: an 8 Byte field that identifies this bitstream as a skeleton. It contains the magic numbers:

         0x66 'f'
         0x69 'i'
         0x73 's'
         0x68 'h'
         0x65 'e'
         0x61 'a'
         0x64 'd'
         0x00 '\0'

   o  Version major: 2 Byte unsigned integer signifying the major version number of the skeleton bitstream. This document specifies the major version 3.

   o  Version minor: 2 Byte unsigned integer signifying the minor version number of the skeleton bitstream. This document specifies the minor version 0.

   o  Presentationtime numerator & denominator: 8 Byte signed integer each. Together they represent the time at which to start presenting the Ogg physical bitstream, given as a rational number. The denominator represents the temporal resolution at which the presentationtime is given. E.g. 5 on 1000 results in a presentationtime of 0.005 sec. This enables a very high temporal resolution without having to store floating point numbers. In a newly created physical bitstream, presentationtime and basetime are the same. When remultiplexing a subpart of the stream, this number MUST be adapted to the requested start time offset of the newly created stream.

   o  Basetime numerator & denominator: 8 Byte signed integer each. Together they represent the basetime of the Ogg physical bitstream, given as a rational number like the presentationtime. This number is fixed once the physical bitstream is created and provides a mapping to time for the beginning of the physical bitstream when it starts with a granule position of 0.

   o  UTC: a 20 Byte string containing a UTC time in the form of YYYYMMDDTHHMMSS.sssZ. It associates a calendar date and a wall-clock time with the basetime. It is a sequence of 20 NUL Bytes if not in use, making this ident packet and thus the bos page of the skeleton bitstream constant length.
Please note: The possible temporal resolution of the presentation- and basetime is on the order of 2^-64. For example, the time formats in use for media that are described in this document range from 1/24 to 1/60 for the different smpte formats. This resolution is enough for any one of these. It is also expected to accommodate any future needs of time resolution for any other time format and time-continuously sampled data.
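The field list above can be read as a fixed binary layout. The following Python sketch decodes such a packet with the standard `struct` module; the names `FISHEAD`, `Fishead` and `parse_fishead` are hypothetical, the treatment of a zero denominator is an assumption made for the sketch, and the UTC value in the example is invented.

```python
import struct
from fractions import Fraction
from typing import NamedTuple

# Little-endian (LSB first), no padding: identifier, version major/minor,
# presentationtime num/den, basetime num/den, UTC string
# (8 + 2 + 2 + 4*8 + 20 Bytes).
FISHEAD = struct.Struct("<8sHHqqqq20s")


class Fishead(NamedTuple):
    version: tuple
    presentationtime: Fraction
    basetime: Fraction
    utc: str


def _rational(num: int, den: int) -> Fraction:
    # Assumption: a zero denominator is read as "time 0" instead of failing.
    return Fraction(num, den) if den else Fraction(0)


def parse_fishead(packet: bytes) -> Fishead:
    if len(packet) < FISHEAD.size:
        raise ValueError("packet too short for a skeleton ident header")
    magic, major, minor, p_num, p_den, b_num, b_den, utc = FISHEAD.unpack_from(packet)
    if magic != b"fishead\x00":
        raise ValueError("not a skeleton ident header")
    return Fishead(
        version=(major, minor),
        presentationtime=_rational(p_num, p_den),
        basetime=_rational(b_num, b_den),
        # 20 NUL Bytes mean "not in use" and decode to an empty string.
        utc=utc.rstrip(b"\x00").decode("ascii"),
    )


# Hypothetical header: version 3.0, presentationtime = basetime = 5/1000 sec.
packet = FISHEAD.pack(b"fishead\x00", 3, 0, 5, 1000, 5, 1000, b"20060101T120000.000Z")
head = parse_fishead(packet)
print(head.version, float(head.presentationtime), head.utc)
# (3, 0) 0.005 20060101T120000.000Z
```

Keeping the times as `Fraction` values mirrors the rational representation of the header and avoids floating point rounding until a value is displayed.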
The skeleton secondary headers are a sequence of packets that each contain information about one of the time-continuous or time-instantaneous other logical bitstreams contained within the Ogg physical bitstream. A skeleton secondary header packet has the following format:
Fields with more than one Byte length are encoded LSB (least significant Byte) first. The fields in a skeleton secondary header packet have the following meaning:

   o  Identifier: an 8 Byte field that identifies this packet as a skeleton secondary header for identifying other logical bitstreams. It contains the magic numbers:

         0x66 'f'
         0x69 'i'
         0x73 's'
         0x62 'b'
         0x6f 'o'
         0x6e 'n'
         0x65 'e'
         0x00 '\0'

   o  Offset to message header fields: 4 Byte unsigned integer that contains the number of Bytes used in this packet before the message header fields. For the version of the skeleton bitstream described in this document this number is fixed to 44. This field accommodates future changes to the skeleton bitstream, allowing message header fields to be parsed even if more fields get inserted before them.

   o  Serial number: 4 Byte unsigned integer containing the bitstream_serial_number of the Ogg logical bitstream described by this skeleton secondary header packet and thus connecting it to the logical bitstream.

   o  Number of header packets: a 4 Byte unsigned integer that contains the number of header packets of that particular logical bitstream, consisting of the bos page and the secondary header pages.

   o  Granulerate numerator & denominator: 8 Byte signed integer each. They represent the temporal resolution of the logical bitstream in Hz, given as a rational number in the same way as the basetime attribute above.

   o  Startgranule: 8 Byte signed integer that represents the granule number with which this logical bitstream starts, which is originally 0, but will be a positive offset when only a subpart of the stream is requested.

   o  Preroll: 4 Byte unsigned integer that contains the number of packets to pre-roll in order to decode a current packet correctly. This is for example the case with Ogg Vorbis, which requires a pre-roll of 2 packets.

   o  Granuleshift: a 1 Byte unsigned integer describing whether to partition the granule_position into two for that logical bitstream, and how many of the lower bits to use for the partitioning. The upper bits then still signify a time-continuous granule position for a directly decodable and presentable data granule. The lower bits allow for specification of a finer resolution such that for example predicted frames of a video can be addressed as well, though not decoded without tracing back to the last fully decodable data granule. This is e.g. the case with Ogg Theora.

   o  Padding/future use: 3 Bytes of padding data that may be used for future requirements and are mandated to be zero in this revision.

   o  Message header fields: header fields, following the generic Internet Message Format defined in RFC 2822. Each header field consists of a name followed by a colon (":") and the field value. Field names are case-insensitive. The field value MAY be preceded by any amount of LWS, though a single SP is preferred. Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT.

There is one mandatory message header field for all of the logical bitstreams: the "Content-type" header field. For an application that is parsing the Annodex bitstream, this field contains the MIME type and the character encoding of the data in the logical bitstream. E.g. for the annotation bitstream, this field will contain the value "Content-type: text/x-cmml; UTF-8" if the character set used in the CMML bitstream is UTF-8. E.g. for a bitstream containing Ogg Vorbis data the value is "Content-type: audio/x-vorbis". The Content-type message header field MUST come first among the message header fields such that it can be found at a fixed location in the skeleton fisbone packet.

As per RFC 2277, message header fields are considered protocol data, i.e. they are not expected to contain human readable text, and they MUST be entirely encoded in UTF-8. In addition, the mandatory header fields MUST be encoded in US-ASCII and it is recommended to also use US-ASCII code points as much as possible for the optional header fields. User defined optional message header fields MUST follow the naming standard given in RFC 2822.
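A secondary header packet can be decoded along the same lines as the field list above. The Python sketch below is one possible reading, not a reference implementation: `FISBONE_FIXED` and `parse_fisbone` are hypothetical names, the serial number in the example is invented, and the sketch assumes that the offset value is counted from the position of the offset field itself, which is the reading under which the fixed value 44 agrees with the listed field sizes (8 + 44 = 52 Bytes before the message header fields).

```python
import struct
from fractions import Fraction

# Fixed part, LSB first, no padding: identifier, offset, serial number,
# number of header packets, granulerate num/den, startgranule, preroll,
# granuleshift, 3 Bytes padding.
FISBONE_FIXED = struct.Struct("<8sIIIqqqIB3s")


def parse_fisbone(packet: bytes) -> dict:
    (magic, offset, serialno, num_headers, rate_num, rate_den,
     startgranule, preroll, granuleshift, _padding) = FISBONE_FIXED.unpack_from(packet)
    if magic != b"fisbone\x00":
        raise ValueError("not a skeleton secondary header packet")
    # Assumption: the offset is relative to the offset field (Byte 8).
    raw = packet[8 + offset:].decode("utf-8")
    headers = {}
    name = None
    for line in raw.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and name is not None:
            # Continuation line: unfold it onto the previous field value.
            headers[name] += " " + line.strip()
        else:
            name, _, value = line.partition(":")
            name = name.strip().lower()  # field names are case-insensitive
            headers[name] = value.strip()
    return {
        "serialno": serialno,
        "num_headers": num_headers,
        "granulerate": Fraction(rate_num, rate_den) if rate_den else None,
        "startgranule": startgranule,
        "preroll": preroll,
        "granuleshift": granuleshift,
        "headers": headers,
    }


# Hypothetical description of a Vorbis track with serial number 1234.
packet = FISBONE_FIXED.pack(b"fisbone\x00", 44, 1234, 3, 44100, 1, 0, 2, 0, b"\x00" * 3)
packet += b"Content-type: audio/x-vorbis\r\n"
info = parse_fisbone(packet)
print(info["granulerate"], info["preroll"], info["headers"]["content-type"])
# 44100 2 audio/x-vorbis
```

Reading the message header fields through the stored offset rather than a hard-coded position is what lets a parser of this kind survive later revisions that insert further fixed fields.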
The media mapping for skeleton into Ogg is as follows:

   o  The skeleton ident (fishead) header is mapped into the skeleton bos page.

   o  The secondary header pages of a skeleton logical bitstream consist of the fisbone header packets that each describe one particular logical data bitstream within the Ogg physical bitstream.

   o  There are no content pages or data packets. As the skeleton eos page is included before the first data page of any logical bitstream, there actually cannot be any content data packets.

   o  The skeleton eos page contains one packet of length zero.

When using a skeleton logical bitstream in Ogg, a further restriction on the order in which Ogg pages appear is introduced to allow for easier identification:

   o  The skeleton bos page is the very first bos page. This allows its differentiation from other Ogg bitstreams that don't contain a skeleton logical bitstream.

   o  The bos pages of the other logical bitstreams come next, as is a requirement of the Ogg bitstream format.

   o  The secondary header pages of all the logical bitstreams in the Ogg physical bitstream come next, as is also a requirement of Ogg. The skeleton secondary header pages are also included here.

   o  Before any data pages of any of the logical bitstreams appear in the Ogg physical bitstream, the skeleton eos page MUST end the skeleton logical bitstream. This is necessary to end the control section of the bitstream. If an Ogg stream parser reaches the skeleton eos page, it knows that it has received all the bos and secondary header pages and can start setting up its decoding or parsing environment.
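The ordering restrictions above lend themselves to a simple check over the sequence of pages in a physical bitstream. The Python sketch below works on an abstract list of (stream name, page kind) pairs; that representation, the stream names and the function name `check_skeleton_page_order` are assumptions of the sketch and are not defined by this specification.

```python
def check_skeleton_page_order(pages):
    """Return a list of ordering problems for pages given as (stream, kind) pairs.

    kind is one of "bos", "header", "data" or "eos".
    """
    problems = []
    if not pages or pages[0] != ("skeleton", "bos"):
        problems.append("skeleton bos page must be the very first page")
    seen_non_bos = False
    skeleton_ended = False
    for stream, kind in pages:
        if kind == "bos":
            if seen_non_bos:
                problems.append(f"bos page of {stream} appears after non-bos pages")
        else:
            seen_non_bos = True
        if stream == "skeleton" and kind == "eos":
            skeleton_ended = True
        elif kind in ("data", "eos") and not skeleton_ended:
            problems.append(f"{kind} page of {stream} appears before the skeleton eos page")
        elif kind == "header" and skeleton_ended:
            problems.append(f"header page of {stream} appears after the skeleton eos page")
    return problems


good = [
    ("skeleton", "bos"), ("cmml", "bos"), ("vorbis", "bos"),
    ("skeleton", "header"), ("cmml", "header"), ("vorbis", "header"),
    ("skeleton", "eos"),
    ("vorbis", "data"), ("cmml", "data"), ("vorbis", "eos"), ("cmml", "eos"),
]
bad = [
    ("skeleton", "bos"), ("vorbis", "bos"), ("vorbis", "header"),
    ("vorbis", "data"), ("skeleton", "eos"),
]
print(check_skeleton_page_order(good))  # []
print(check_skeleton_page_order(bad))
# ['data page of vorbis appears before the skeleton eos page']
```

A real implementation would derive the page kinds from the Ogg page header flags and the header packet counts given in the secondary header packets.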
With time-continuous data like Annodex, one needs to handle data at four different levels:

   o  at the Bytes level, upon seeking.

   o  at the packets level, upon encapsulating.

   o  at the granules level, upon recomposing.

   o  at the time level, upon displaying and addressing.

This section explains how they all fit together.
Annodex bitstreams inherently represent one timeline only, where the different logical bitstreams can be thought of as content tracks on that timeline. All of these tracks relate to the same timeline which starts at a certain time point and ends when the last bitstream ends. An example bitstream can be seen in the following figure. It consists of an Annodex bitstream that contains 4 media bitstreams and one CMML bitstream. The picture is a conceptual representation of the time intervals covered by the different logical bitstreams and the Ogg pages used to encapsulate the data. In the flat representation these are multiplexed such that the data packets of each of these bitstreams occur at the correct time.
   ----------------------------------------------------------------------
   |clip1 |    clip 2      |/clip 3///////////////|       clip 4        |
   ----------------------------------------------------------------------
   CMML bitstream

   ----------------------------------------------
   |  |  |  |  |  |  |  |  |  |  |//|  |  |  |  |
   ----------------------------------------------
   audio bitstream 1

   -------------------------------------------------------------
   |     |     |     |/////|     |     |     |     |     |     |
   -------------------------------------------------------------
   video bitstream 1

   ----------------------------------------------------
   |  |  |  |  |//|  |  |  |  |  |  |  |  |  |  |  |  |
   ----------------------------------------------------
   audio bitstream 2

   -------------------------------
   |     |/////|     |     |     |
   -------------------------------
   video bitstream 2
The time point at which an Annodex bitstream starts (t_0 in the above diagram) is called the "basetime" and represents the time in seconds associated with the granule position of 0 on all logical bitstreams. Typically, a newly created Annodex file starts all its logical bitstreams at granule position 0, and a typical extract of an Annodex bitstream, such as the one starting at t_uri in the image above, starts each of its logical bitstreams at a different granule position. These granule positions are stored in the "startgranule" field of the skeleton secondary header packets. The "basetime" of an Annodex bitstream may be 0, but it can also be any positive time. For example, in professional video production, the first frame of video of a program normally refers to a SMPTE basetime of 01:00:00:00, not 00:00:00:00 (see also the temporal URI addressing specification). Associating such a practice with a digital video resource requires a way to store that basetime with the resource and to interpret it correctly when addressing offsets such as t_uri. Annodex provides such a mapping through the basetime field in the skeleton ident header. Also associated with the basetime is a calendar date and wall-clock time (a "UTC base"), which represents a real-world time giving some meaningful calendar date association to the content, such as the creation time or the first presentation time. The UTC base is specified in the UTC field of the skeleton ident header.
Each of the encapsulated data bitstreams and the CMML bitstream has its own temporal resolution at which it provides data to cover the given timeline. This temporal resolution is usually given through the sampling rate of the particular bitstream. For example, a raw audio bitstream at CD quality is sampled with a sampling rate of 44100 Hz. A video bitstream may be sampled with a frame rate of 25 frames per second. This temporal resolution is called the "granulerate". A granule is a data element that is based on a regular data rate specific to the content type, such as the frame rate for video or the sampling rate for audio. It even exists for bitstreams that are not sampled at a regular rate; then it is the highest resolution of any of the used sampling rates. The granulerate is specified in the skeleton secondary header packets for each logical bitstream. Each of the bitstreams inserts data into the Ogg bitstream through packets which have an associated temporal duration based on the encoder packaging. Packets are packaged into Ogg pages, which have a granule position associated with them. Not taking the special case of a granuleshift into account, the granule position specifies the number of granules that have been encapsulated since the implicit start of the original bitstream up to and including the given Ogg page. The granule position together with the granulerate and granuleshift information of the skeleton secondary header packets for the particular logical bitstream is used for the calculation of the time position up to which a data packet of the logical bitstream completes data. A granule position of -1 indicates a special case and MUST NOT be used for calculation of a mapping to time. In principle, the granule position of an Ogg page divided by the granulerate of this page's logical bitstream provides the time position that is reached in that bitstream after decoding all data packets finished on this page.
However, the granule_position field in an Ogg page allows for a more fine-grained description of the temporal position. The following image explains the composition of the granule_position field in an Ogg page:
The granuleshift field of the skeleton secondary header packets describes how many of the granule_position's 64 bits are being used for the keyoffset. The keyoffset part of the granule_position is commonly used when the logical bitstream consists of packets that can only be fully decoded when referring back to a previous packet. For example, video streams often consist of inter and intra coded frames, where the intra frames are fully decodable and the inter frames are intermediate frames that require backtracking to the last intra frame for accurate decoding. Another example is a logical bitstream that is mapped as instantaneous information (i.e. its granule_position represents the start time and the end time of the packet data), but actually has a duration associated with it, which is provided through a subsequent packet. CMML is such an example. The keyindex part of the granule_position is then used to provide the temporal position of the reference packet and the keyoffset part provides a counter for the data in between. The calculation of the temporal position of an Ogg page in Annodex is thus specified through the following algorithm:

   keyindex  = granule_position >> granuleshift
   keyoffset = granule_position - (keyindex << granuleshift)
   time      = basetime + (keyindex + keyoffset) / granulerate

The basetime provides the time offset used at the beginning of the logical bitstream for the first data packet and thus MUST be added for a correct calculation of the temporal position. As an example, consider an audio bitstream that has a granulerate of 44100 (i.e. 44100 samples per 1 sec), a granuleshift of 0, and starts at 4 sec. When reaching a granule_position of 88200, this maps to a time position of 6 seconds:

   time = 4 + 88200 / 44100 = 6

This signifies that the bitstream has reached the 2 sec mark of the audio bitstream after the end of decoding this page's packets, but maps to 6 seconds because of the basetime. As another example consider a video bitstream that has a granulerate of 25 (i.e. 25 frames per 1 second), a granuleshift of 3 (because it encodes - say - 7 partial frames between each fully encoded frame), and starts at 0 sec.
When reaching a granule_position of 997, i.e. a keyindex of 124 and a keyoffset of 5, this maps to a fully decodable time position of 5.16 seconds:
   time = 0 + (124 + 5) / 25 = 5.16 sec
The granulerate of a time-instantaneous bitstream such as the CMML bitstream can be chosen arbitrarily by the bitstream multiplexer. By default, a granulerate of 1000 is used, which is the resolution of npt. The resolution of all the time schemes is given as:
   npt: 1000 (milliseconds)
   smpte-24: 24 (24 fps)
   smpte-24-drop: 24/1.001 = 23.976 (approx. as per SMPTE)
   smpte-25: 25
   smpte-30: 30
   smpte-30-drop: 30/1.001 = 29.970 (approx. as per SMPTE)
   smpte-50: 50
   smpte-60: 60
   smpte-60-drop: 60/1.001 = 59.940 (approx. as per SMPTE)
The granule_position of the page that completes a packet of a time-instantaneous bitstream MUST signify the start time of that packet. For example, a CMML bitstream with a granulerate of 1000, a basetime of 0, and a clip that lasts from npt=12.020 till npt=15.0 will get a granule_position of 12020. In contrast, the granule_position of the page that completes the data of e.g. an audio bitstream with granulerate 44100, basetime 0 and containing data from npt=12.020 to npt=15.0 will be 661500.
A note about field overflows: an overflow of the granule_position field can destroy the temporal integrity of the Annodex physical bitstream. In this case, a multiplexer MUST end the Annodex physical bitstream and start a new one, resetting the counter to 0 and adjusting the basetime appropriately. This is also called sequential multiplexing in Ogg. The same measure MUST be taken in case of an overflow of the page_sequence_number on one of the logical bitstreams.
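The mapping from granule_position to time can be sketched in a few lines of Python. This is a minimal illustration, not a normative implementation: it assumes that the lower granuleshift bits hold the keyoffset, the remaining upper bits hold the keyindex, and the temporal position is basetime + (keyindex + keyoffset) / granulerate. The function names and the video granule_position of 501 are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def split_granule_position(granule_position, granuleshift):
    """Split a granule_position into (keyindex, keyoffset).

    Assumption: the lower `granuleshift` bits carry the keyoffset and
    the remaining upper bits carry the keyindex.
    """
    keyindex = granule_position >> granuleshift
    keyoffset = granule_position - (keyindex << granuleshift)
    return keyindex, keyoffset


def granule_to_seconds(granule_position, granulerate, granuleshift=0, basetime=0):
    """Return the temporal position of a page in seconds as an exact Fraction."""
    keyindex, keyoffset = split_granule_position(granule_position, granuleshift)
    return Fraction(basetime) + Fraction(keyindex + keyoffset) / Fraction(granulerate)


# Audio: granulerate 44100, granuleshift 0, basetime 4 sec.
audio_time = granule_to_seconds(88200, 44100, granuleshift=0, basetime=4)

# CMML: granulerate 1000, basetime 0, clip starting at npt=12.020.
cmml_time = granule_to_seconds(12020, 1000)

# Video (hypothetical granule_position): granulerate 25, granuleshift 3.
video_parts = split_granule_position(501, 3)
video_time = granule_to_seconds(501, 25, granuleshift=3)

print(float(audio_time), float(cmml_time), video_parts, float(video_time))
```

Fractions are used instead of floats so that drop-frame rates such as 30/1.001 can be passed as Fraction(30000, 1001) without rounding drift.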
Addressing into an Annodex bitstream is possible with the temporal URI addressing scheme. Time is specified as a temporal offset from the "beginning" of the stream, making use of the basetime field. Time offsets can also be specified as calendar dates and times. The UTC base is then used as a basis for offsetting. The basetime makes it possible to map a temporal offset, such as one given in a temporal URI, correctly to a Byte position in the stream. In the above figure take t_uri=npt:14.0 as the temporal offset addressed on a stream with t_0=npt:5.0 as the basetime: this requires a stream offset of only 9 sec to reach the appropriate granule position in each of the bitstreams, marked in the figure through patterned pages. The seeking action is performed on the interleaved bitstream, in which the data packets occur in a temporally consecutive order based on the time at which their data ends. These times are represented in the granule positions of the Ogg pages, which are only allowed to increase monotonically within one logical bitstream. This implies that once an Ogg page has been found whose granule position maps to the given seek time (i.e. covers the time or ends at it), the seek has found the right location. This applies over all logical bitstreams. In the above example, this means that the Byte position of the first occurring page of the patterned pages has been found. There is a complication to the seeking: some logical bitstreams have backwards dependencies in their data packets and these have to be taken into account for seeking. For example, a logical bitstream may require several of its previous packets to allow a correct and complete decoding of the actual packet that occurs at the seek time. This is the case for Theora, which requires going back to the previous keyframe when decoding from a time offset. 
It is also the case for Vorbis, which requires the previous 2 packets for accurate setup of the frequency transform; Speex needs approximately 2 packets for similar reasons. Even instantaneous bitstreams such as CMML may require going back to a previous packet to recover the last state information, which is the currently active clip in the case of CMML. Therefore, once seeking has located the correct Byte position that refers to the given temporal offset, it MUST seek back. For logical bitstreams that have a non-zero "granuleshift" in the skeleton, it MUST seek back to the Ogg page that has a "keyindex" granule position. For logical bitstreams that have a non-zero "preroll" in the skeleton, it MUST seek back that many packets. The earliest Byte position that satisfies all these requirements is the correct seek position. A player that presents from an offset MUST take into account that the bitstream may contain some packets that are only there to allow accurate decoding at the seek time. When the backwards dependencies have been resolved for a specific logical bitstream, several non-relevant Ogg pages of other logical bitstreams may also have ended up in between. These have to be skipped by a player. The time that a player MUST start presenting from is given in the "presentationtime" in the skeleton ident header.
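The seek-back rule can be illustrated with a small Python sketch. It is not a reference implementation and rests on several assumptions that are not part of this specification: each logical bitstream is described by a hypothetical dictionary holding its skeleton fields and a list of (Byte position, granule_position) pairs, every page carries exactly one packet, and the time mapping is basetime + (keyindex + keyoffset) / granulerate.

```python
from fractions import Fraction


def page_time(granule_position, track):
    """Temporal position of a page, using the assumed keyindex/keyoffset split."""
    shift = track.get("granuleshift", 0)
    keyindex = granule_position >> shift
    keyoffset = granule_position - (keyindex << shift)
    rate = Fraction(track["granulerate"])
    return Fraction(track.get("basetime", 0)) + Fraction(keyindex + keyoffset) / rate


def track_seek_position(track, seek_time):
    """Byte position from which one logical bitstream must be read."""
    pages = track["pages"]  # list of (byte_position, granule_position)
    # First page whose granule position covers or ends at the seek time.
    target = next(
        (i for i, (_, gp) in enumerate(pages) if page_time(gp, track) >= seek_time),
        None,
    )
    if target is None:
        return None
    start = target
    shift = track.get("granuleshift", 0)
    if shift:
        # Walk back to the page that carries the keyindex granule position.
        key_granule = (pages[target][1] >> shift) << shift
        while start > 0 and pages[start][1] > key_granule:
            start -= 1
    # Walk back "preroll" packets (one packet per page in this sketch).
    start = max(0, start - track.get("preroll", 0))
    return pages[start][0]


def seek_position(tracks, seek_time):
    """Earliest Byte position that satisfies every logical bitstream."""
    candidates = [
        position
        for position in (track_seek_position(t, seek_time) for t in tracks)
        if position is not None
    ]
    return min(candidates) if candidates else None
```

A player using such a function would still have to skip the pages that precede the presentation time once decoding has been set up.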
When a subpart of an Annodex bitstream is requested, such as through a temporal URI query request from a Web server, the bitstream MUST be recomposed and a remultiplexed bitstream served out. There are several aims for performing the remultiplexing with as little effort and therefore as little delay as possible:
   no decoding of the logical bitstreams is performed.
   no changes to the pages, in particular to the granule positions, are made.
   changes occur only to the control section.
The fields of the skeleton track make it possible to achieve all these aims. Remultiplexing is essentially achieved by seeking to the position as described above and then including from each logical bitstream only the relevant Ogg pages in the new stream. Changes to fields in the bitstream are restricted to the control section:
   the "presentationtime" MUST be adjusted to the requested start time.
   the "startgranule" for each logical bitstream MUST be adjusted to the granule position at which each logical bitstream starts. This is not the first granule position of the Ogg pages included in the bitstream, but rather the last one that did not get included, as it represents the start time of the bitstream.
Everything else, and in particular the Ogg pages, stays the same. This is also important to allow caching of such files, as is required for Web proxies and described in temporal URI addressing.
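The two control-section adjustments can be sketched as follows. The data layout is hypothetical: each logical bitstream is represented by its original "startgranule", the granule positions of its pages in stream order, and the index of the first page that is copied into the new stream. The sketch only computes the new field values; it does not write a skeleton track.

```python
def remux_control_fields(requested_start, tracks):
    """Compute the adjusted control-section fields for a remultiplexed stream.

    tracks maps a serial number to a dict with the keys
    "startgranule", "granules" and "first_included" (all hypothetical names).
    """
    startgranules = {}
    for serialno, track in tracks.items():
        first = track["first_included"]
        if first > 0:
            # Granule position of the last page that was NOT included.
            startgranules[serialno] = track["granules"][first - 1]
        else:
            # Nothing was cut off, so the original value still applies.
            startgranules[serialno] = track["startgranule"]
    return {"presentationtime": requested_start, "startgranule": startgranules}


fields = remux_control_fields(
    "npt:0.25",
    {
        1: {"startgranule": 0, "granules": [4410, 8820, 13230], "first_included": 2},
        2: {"startgranule": 0, "granules": [1000, 2000], "first_included": 0},
    },
)
print(fields)
```

Because the Ogg pages themselves are copied unchanged, only these few values differ between the original and the served bitstream.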
This section contains the registration information for the "application/annodex" media type. While this media type is not approved by the IANA, "application/x-annodex" may be used.
   To: ietf-types@iana.org
   Subject: Registration of MIME media type application/annodex
   MIME media type name: application
   MIME subtype name: annodex
   Required parameters: none
   Optional parameters: none
   Encoding Considerations: Annodex is an exchange format for any type of encoded time-continuously sampled data stream. The authoring software MUST provide for the encoders, providing the MIME type (and potentially the charset for text-based formats) in the "Content-type" Message header field of each bitstream. The client software can select an appropriate decoder based on this information.
   Security considerations: see next section.
   Interoperability considerations: the Annodex bitstream format is a free specification that is independent of any media encoding format. It is designed to provide interoperability with the existing World Wide Web.
   Additional information:
      Magic numbers: "OggS" identifies an Ogg page at Byte position 0, "fishead\0" identifies a skeleton logical bitstream at Byte position 28. In the second Ogg page at Byte position 28 the magic number "CMML\0\0\0\0" can be found, identifying this as an Annodex bitstream.
      File extension: .anx
      Macintosh File Type Code: "ANDX"
   Intended usage: COMMON
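The magic numbers can be checked with a short sniffing routine. The following Python sketch assumes the standard Ogg page layout (a 27 Byte fixed header followed by a segment table whose length is stored in Byte 26) to locate the second page; the function names and the helper that builds synthetic pages are hypothetical and exist only to exercise the check.

```python
OGG_CAPTURE = b"OggS"
SKELETON_MAGIC = b"fishead\x00"
CMML_MAGIC = b"CMML\x00\x00\x00\x00"


def page_length(data, offset):
    """Total length of the Ogg page starting at offset, or None if malformed."""
    if data[offset:offset + 4] != OGG_CAPTURE or len(data) < offset + 27:
        return None
    n_segments = data[offset + 26]
    lacing = data[offset + 27:offset + 27 + n_segments]
    if len(lacing) < n_segments:
        return None
    return 27 + n_segments + sum(lacing)


def looks_like_annodex(data):
    """Check the magic numbers of the first two Ogg pages."""
    if data[0:4] != OGG_CAPTURE or data[28:36] != SKELETON_MAGIC:
        return False
    second = page_length(data, 0)
    if second is None:
        return False
    return (
        data[second:second + 4] == OGG_CAPTURE
        and data[second + 28:second + 36] == CMML_MAGIC
    )


def make_test_page(payload):
    """Build a synthetic single-segment page (zeroed header fields, no CRC)."""
    assert len(payload) < 255
    return OGG_CAPTURE + bytes(22) + bytes([1, len(payload)]) + payload


sample = make_test_page(SKELETON_MAGIC + bytes(56)) + make_test_page(CMML_MAGIC + bytes(21))
print(looks_like_annodex(sample))
```

The check at Byte position 28 relies on each of the first two pages having a single-entry segment table, which is what the stated magic number positions imply.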
As Annodex bitstreams are time-continuous Web resources, hyperlinking into Annodex bitstreams via URIs is possible with the temporal URI query and fragment specification. For the query case, an Annodex server must support the "X-Accept-TimeURI" HTTP header field (see the temporal URI query specification for more details). The "X-Accept-Range-Redirect" and "X-Range-Redirect" HTTP header fields MAY also be supported by an Annodex server and user agent. As Annodex bitstreams contain CMML logical bitstreams, URI addressing of clips via their name given in the "id" attribute is also supported. The same mechanisms as specified in the CMML specification apply to Annodex analogously. In particular, the id addressing is also regarded as an alias for a time offset, and an Annodex conformant server that supports Annodex temporal URI addressing MUST also support named URI addressing (see the CMML specification for more details). Examples of valid URI addresses:
   http://example.com/sample.anx?t=npt:4 -- relates to an Annodex bitstream composed by the server from sample.anx by starting it at an offset of 4 seconds.
   http://example.com/sample.anx?id=dolphin -- relates to the clip whose id attribute value is "dolphin" and all further clips after that.
   http://example.com/sample.anx?id="dolphin/" -- relates only to the clip whose id attribute value is "dolphin".
   http://example.com/sample.anx?id="intro/goldfish" -- relates to all the clips from the "intro" clip to the "goldfish" clip.
   http://example.com/sample.anx#t=npt:4 -- start using the Annodex bitstream from a 4 second offset.
   http://example.com/sample.anx#dolphin -- use the clip with id="dolphin" only.
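How a server or user agent might take such URIs apart can be sketched with Python's standard urllib.parse module. The function name and the shape of the returned dictionary are hypothetical; in particular, representing a clip range as a "from"/"to" pair, with "to" set to None for an open-ended range, is only a convention of this sketch and not something the temporal URI or CMML specifications prescribe.

```python
from urllib.parse import urlsplit, parse_qs


def parse_clip_range(value):
    """Interpret an id value such as dolphin, "dolphin/" or "intro/goldfish"."""
    value = value.strip('"')
    if "/" in value:
        first, last = value.split("/", 1)
        # "dolphin/" addresses the single clip; "intro/goldfish" a closed range.
        return {"from": first, "to": last or first}
    # A bare id addresses that clip and all further clips after it.
    return {"from": value, "to": None}


def describe_annodex_uri(uri):
    """Split an Annodex URI into its query (server side) and fragment (client side) parts."""
    parts = urlsplit(uri)
    query = parse_qs(parts.query)
    result = {"path": parts.path, "query_time": None, "query_clips": None,
              "fragment_time": None, "fragment_clip": None}
    if "t" in query:
        result["query_time"] = query["t"][0]
    if "id" in query:
        result["query_clips"] = parse_clip_range(query["id"][0])
    if parts.fragment.startswith("t="):
        result["fragment_time"] = parts.fragment[2:]
    elif parts.fragment:
        result["fragment_clip"] = parts.fragment
    return result


for example in (
    "http://example.com/sample.anx?t=npt:4",
    'http://example.com/sample.anx?id="intro/goldfish"',
    "http://example.com/sample.anx#dolphin",
):
    print(describe_annodex_uri(example))
```

Query parts are resolved by the server, which composes a new bitstream, while fragment parts never leave the user agent.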
The Annodex file and the CMML file that can be extracted from it are very tightly related to each other: the CMML file contains all annotation and indexing information about the Annodex file, including basetime and UTC time. Therefore, receiving the CMML file instead of the Annodex file is like receiving all information about the bitstreams in the Annodex file except for the data bitstreams themselves. This situation can be taken advantage of with the "Accept" header of HTTP. When an Annodex file is requested from an HTTP server and the acceptable content types given in the "Accept" message header field contain "text/x-cmml" with a higher priority than "application/x-annodex", then the HTTP server SHOULD return the CMML file instead of the requested Annodex file itself. As is standard, the HTTP response will contain a "Content-type" field indicating what content was actually returned. A Web crawler of a search engine, for example, can thus avoid extra network load and retrieve information that is easier to parse. It SHOULD set the "Accept" HTTP header to "Accept: text/x-cmml" for every requested Annodex URI. For example, a crawler requesting http://example.com/sample.anx would send the header field "Accept: text/x-cmml" with that request.
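The server-side decision can be sketched as a comparison of the quality values in the "Accept" header field. The function below is a hypothetical helper, not part of any HTTP library: it parses only the media types and their q parameters, treats a missing q as 1.0, and deliberately ignores wildcards such as "*/*", which a complete implementation of HTTP content negotiation would have to handle.

```python
def prefers_cmml(accept_header):
    """Return True if text/x-cmml outranks application/x-annodex."""
    qualities = {}
    for item in accept_header.split(","):
        fields = [field.strip() for field in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default quality when no q parameter is given
        for parameter in fields[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unparsable q values are treated as "not acceptable"
        qualities[media_type] = quality
    return qualities.get("text/x-cmml", 0.0) > qualities.get("application/x-annodex", 0.0)


# A crawler that only wants the annotations:
print(prefers_cmml("text/x-cmml"))
# A player that prefers the media but could fall back to CMML:
print(prefers_cmml("application/x-annodex, text/x-cmml;q=0.5"))
```

A server would return the CMML file with "Content-type: text/x-cmml" when the function yields True, and the Annodex file otherwise.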
This section contains the registration information for the "video/annodex" media type. While this media type is not approved by the IANA, "video/x-annodex" may be used.
   To: ietf-types@iana.org
   Subject: Registration of MIME media type "video/annodex"
   MIME media type name: video
   MIME subtype name: annodex
   Required parameters: none
   Optional parameters: none
   Encoding Considerations: Annodex video is a subclass of Annodex data where there is at least one video track encapsulated together with the skeleton and CMML tracks, and a potentially unlimited number of other audio and video tracks.
   Security considerations: as in "application/annodex" MIME application.
   Interoperability considerations: as in "application/annodex" MIME application.
   Additional information:
      Magic numbers: as in "application/annodex" MIME application.
      File extension: .axv
      Macintosh File Type Code: "ANXV"
   Intended usage: COMMON
URI addressing and HTTP header field use of "application/annodex" type content apply analogously to "video/annodex".
This section contains the registration information for the "audio/annodex" media type. While this media type is not approved by the IANA, "audio/x-annodex" may be used.
   To: ietf-types@iana.org
   Subject: Registration of MIME media type "audio/annodex"
   MIME media type name: audio
   MIME subtype name: annodex
   Required parameters: none
   Optional parameters: none
   Encoding Considerations: Annodex audio is a subclass of Annodex data where there is at least one audio track encapsulated together with the skeleton and CMML tracks, and a potentially unlimited number of other audio tracks.
   Security considerations: as in "application/annodex" MIME application.
   Interoperability considerations: as in "application/annodex" MIME application.
   Additional information:
      Magic numbers: as in "application/annodex" MIME application.
      File extension: .axa
      Macintosh File Type Code: "ANXA"
   Intended usage: COMMON
URI addressing and HTTP header field use of "application/annodex" type content apply analogously to "audio/annodex".
Annodex format bitstreams contain several multiplexed binary media bitstreams and one XML annotation bitstream. There is no generic encryption or signing mechanism provided for the complete bitstream or any one of its parts. As the format of the encapsulated media bitstreams is not prescribed and is identified through the "Content-type" Message header field in that bitstream's skeleton secondary header packet, it is possible to encrypt or sign a media bitstream and then mark it accordingly with a MIME type that signifies the encryption. It is up to the applications that use this bitstream to provide an appropriate codec to handle such bitstreams. As Annodex format bitstreams contain binary media bitstreams, it is possible to include executable content in them. This can be an issue with applications that decode these bitstreams, especially when they are used in a network scenario. Such applications MUST ensure correct handling of manipulated bitstreams, of buffer overflows and the like.
draft-pfeiffer-annodex-01: Annodex version 2.0: changes because of renamings of CMML tags and changes to the temporal and named URI addressing.
draft-pfeiffer-annodex-02: Annodex version 3.0: The changes pertain to the bitstream format to allow for a stronger decoupling of Annodex and CMML. The Annodex format is now using the Ogg format with a "skeleton" and a "CMML" logical bitstream. This change has reinforced a layered approach that fits better with existing practice in Internet protocols, where each layer solves a specific problem without being dependent on other layers further up.
Key words for use in RFCs to Indicate Requirements Levels Harvard University
29 Oxford Street Cambridge MA 02138 US +1 617 495 3864 sob@harvard.edu
Extensible Markup Language (XML) 1.0 World Wide Web Consortium
MIT Laboratory for Computer Science 545 Technology Square Cambridge MA 02139 US + 1 617 253 2613 + 1 617 258 5999 timbl@w3.org http://www.w3c.org
HTML 4.01 Specification World Wide Web Consortium
MIT Laboratory for Computer Science 545 Technology Square Cambridge MA 02139 US + 1 617 253 2613 + 1 617 258 5999 timbl@w3.org http://www.w3c.org
XHTML(TM) 1.0 The Extensible Hyper Text Markup Language World Wide Web Consortium
MIT Laboratory for Computer Science 545 Technology Square Cambridge MA 02139 US + 1 617 253 2613 + 1 617 258 5999 timbl@w3.org http://www.w3c.org
Uniform Resource Identifiers (URI): Generic Syntax World Wide Web Consortium
Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge MA 02139 US +1 617 253 5702 +1 617 258 5999 timbl@w3.org
Day Software
5251 California Ave., Suite 110 Irvine CA 92617 US +1 949 679 2960 +1 949 679 2927 fielding@gbiv.com
Adobe Systems Incorporated
345 Park Ave San Jose CA 95110 US +1 408 536 3024 LMM@acm.org
Hypertext Transfer Protocol -- HTTP/1.1 University of California, Irvine
Department of Information and Computer Science University of California, Irvine Irvine CA 92697-3425 US +1 949 824 7403 +1 949 824 1715 fielding@ics.uci.edu
World Wide Web Consortium
MIT Laboratory for Computer Science 545 Technology Square Cambridge MA 02139 US +1 617 258 8682 jg@w3.org
Western Research Laboratory
Compaq Computer Corporation 250 University Avenue Palo Alto CA 94305 US mogul@wrl.dec.com
World Wide Web Consortium
MIT Laboratory for Computer Science 545 Technology Square Cambridge MA 02139 US +1 617 258 8682 frystyk@w3.org
Xerox Corporation
3333 Coyote Hill Road Palo Alto CA 94034 US +1 650 812 4365 +1 650 812 4333 masinter@parc.xerox.com
Microsoft Corporation
1 Microsoft Way Redmond WA 98052 US paulle@microsoft.com
World Wide Web Consortium
MIT Laboratory for Computer Science 545 Technology Square Cambridge MA 02139 US +1 617 253 5702 +1 617 258 8682 timbl@w3.org
Internet Message Format QUALCOMM Incorporated
5775 Morehouse Drive San Diego CA 92121-1714 USA +1 858 651 4478 +1 858 651 1102 presnick@qualcomm.com
IETF Policy on Character Sets and Languages UNINETT
P.O.Box 6883 Elgeseter Trondheim 7002 Norway +47 73 59 70 94 Harald.T.Alvestrand@uninett.no
Real Time Streaming Protocol (RTSP) Columbia University
Dept. of Computer Science 1214 Amsterdam Avenue New York NY 10027 US schulzrinne@cs.columbia.edu
Netscape Communications Corp.
501 E. Middlefield Road Mountain View CA 94043 US anup@netscape.com
RealNetworks
1111 Third Avenue Suite 2900 Seattle WA 98101 US robla@real.com
Tags for the Identification of Languages UNINETT
Pb. 6883 Elgeseter Trondheim 7002 Norway Harald.T.Alvestrand@uninett.no
Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types Innosoft International, Inc.
1050 East Garvey Avenue South West Covina CA 91790 USA ned@innosoft.com
First Virtual Holdings
25 Washington Avenue Morristown NJ 07960 USA nsb@nsb.fv.com
XML Media Types University of California, Irvine
Department of Information and Computer Science Irvine CA 92697-3425 USA ejw@ics.uci.edu
Fuji Xerox Information Systems
KSP 9A7, 2-1, Sakado 3-chome, Takatsu-ku Kawasaki-shi Kanagawa-ken 213 Japan murata@fxis.fujixerox.co.jp
The Ogg encapsulation format version 0 Commonwealth Scientific and Industrial Research Organisation
Locked Bag 17 North Ryde NSW 2113 Australia + 61 2 9325 3100 + 61 2 9325 3200 Silvia.Pfeiffer@csiro.au http://www.annodex.net/
SMPTE STANDARD for Television, Audio and Film - Time and Control Code The Society of Motion Picture and Television Engineers
595 W. Hartsdale Ave. White Plains NY 10607 USA smpte@smpte.org
Data elements and interchange formats -- Information interchange -- Representation of dates and times International Organization for Standardization
1 rue de Varembre Case Postale 56 Geneva 20 1211 CH central@iso.org
Specifying time intervals in URI queries and fragments of time-based Web resources (work in progress) Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4180 Silvia.Pfeiffer@csiro.au http://www.ict.csiro.au/
Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4222 Conrad.Parker@csiro.au http://www.ict.csiro.au/
Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4222 Andre.Pang@csiro.au http://www.ict.csiro.au/
The Continuous Media Markup Language (CMML), Version 2.0 (work in progress) Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4180 Silvia.Pfeiffer@csiro.au http://www.ict.csiro.au/
Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4222 Conrad.Parker@csiro.au http://www.ict.csiro.au/
Commonwealth Scientific and Industrial Research Organisation CSIRO, Australia
PO Box 76 Epping NSW 1710 Australia +61 2 9372 4222 Andre.Pang@csiro.au http://www.ict.csiro.au/
any sequence of binary data that represents an analog-time signal sampled in discrete time steps. In contrast to actual discrete-time signals as known from signal processing, time-continuously sampled data may also come in compressed form, such that a block of numbers represents an interval of time.
a time-continuously sampled data stream where the components provide information for a specific time-instant.
a time-continuously sampled data stream where the components provide ongoing information as time goes by.
a temporal section of a time-continuous data stream.
a free-text, unstructured description of a clip.
a name-value pair that provides a structured, database-like description of the content.
a Uniform Resource Identifier (URI).
collection of information about a data stream, which may include annotations, hyperlinks, and metadata.
a subpart of a media document covering some temporal interval.
XML tags and their content used to describe a media document.
encapsulated time-continuous bitstream with head and clip elements.
the task of giving textual descriptions to fragments of media documents.
the task of identifying index points for media documents or fragments thereof.
the task of linking from one Web resource to another. If a link has an offset into the resource, this is sometimes called deep hyperlinking.
CMML data containing information on an Annodexed media file.
a block of digital data that represents a temporal subpart of a stream of continuous media. Media packets of one continuous media file do not overlap in time.
a sequence of time-continuous data.
CMML: Continuous Media Markup Language.
DTD: Document Type Declaration.
XML: eXtensible Markup Language.
CMWeb: Continuous Media Web.
WWW: World Wide Web.
URI: Uniform Resource Identifier.
The authors gratefully acknowledge the contributions of Rob Collins, Zentaro Kavanagh, Andrew Nesbit and Simon Lai in developing this specification.