HK1038625A1 - Creating animation from a video
- Publication number
- HK1038625A1 (application number HK02100130.9A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- video
- image
- animation
- background
- sequence
- Prior art date
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
- G11B27/034—Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/02—Non-photorealistic rendering
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/34—Indicating arrangements
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Computer Graphics (AREA)
- Processing Or Creating Images (AREA)
Abstract
An apparatus and method for creating and storing an animation and for linking the animation with a video. A sequence of video images is inspected to identify a first transformation of a scene depicted in the sequence of video images. A first image and a second image are obtained from the sequence of video images, the first image representing the scene before the first transformation and the second image representing the scene after the first transformation. Information is generated that indicates the first transformation and that can be used to interpolate between the first image and the second image to produce a video effect that approximates display of the sequence of video images. Regarding the storing of an animation, a set of keyframes created from a video is stored in an animation object. One or more values that indicate a first sequence of selected keyframes from the set of keyframes is stored in the animation object along with information for interpolating between the keyframes of the first sequence. One or more values that indicate a second sequence of selected keyframes from the set of keyframes is also stored in the animation object along with information for interpolating between the keyframes of the second sequence. The number of keyframes in the second sequence is fewer than the number of keyframes in the first sequence. Regarding the linking of a video and an animation, a data structure containing elements that correspond to respective frames of a first video is generated. Information that indicates an image in an animation that has been created from a second video is stored in one or more of the elements of the data structure.
Description
Technical Field
The present invention relates to the field of image animation, and more particularly to automatically generating an animation from video.
Background
The internet has become an increasingly popular medium for delivering full-motion video to end users. Due to bandwidth limitations, however, most users cannot download and view high-quality video on demand. For example, to transmit a compressed 640 x 480 pixel video at thirty frames per second, the image data must be delivered at roughly 8 Mb/s (megabits per second), a bandwidth requirement approximately three hundred times greater than the 28.8 Kb/s modem speed available to most internet users today. Even with industry-standard compression techniques (e.g., MPEG, Moving Picture Experts Group), video delivered over the internet today often resembles a low-quality slide show more than a television display.
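As a rough check of the figures above, the ratio between the compressed video bit rate and the modem rate quoted in the text can be computed directly; the uncompressed rate shown assumes 24-bit color, which the text does not specify.

```python
# Bandwidth arithmetic for the figures quoted above (illustrative only).
width, height, fps, bits_per_pixel = 640, 480, 30, 24   # 24-bit color is an assumption
uncompressed_bps = width * height * fps * bits_per_pixel
compressed_bps = 8_000_000          # ~8 Mb/s compressed rate cited in the text
modem_bps = 28_800                  # 28.8 Kb/s modem

print(uncompressed_bps / 1e6)       # ~221 Mb/s before compression
print(compressed_bps / modem_bps)   # ~278, i.e. "roughly three hundred times"
```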
Animation that uses keyframes and interpolation to produce video effects may require far less transmission bandwidth than video. As the performance of personal computers improves, television-quality video effects can be synthesized in real time from relatively few keyframes received over low-bandwidth modems. An animation sequence that requires a keyframe to be sent only every few seconds consumes a small fraction of the bandwidth of the corresponding video while also providing superior image quality.
In addition to requiring little bandwidth, animations scale better than video, both in image quality and in frame rate. Because the video effects are composited on the fly during playback, the frame rate and image quality may be adjusted dynamically based on a number of factors, such as the playback processor speed, the network bandwidth, and user preferences.
Animation is also much easier than video to edit and to augment with features for user interaction. For example, adjusting a camera panning path or an object's motion speed may only require changing the motion parameters associated with a few keyframes in the animation, whereas editing a video clip to achieve the same effect may require modifying hundreds of frames. Similarly, attaching a hot spot that tracks a moving object over time is far easier to accomplish in an animation than in a video.
Animation has its drawbacks, however. Because skilled animators are traditionally required to render high-quality animations, the animation process is often slow and expensive. Moreover, because animators often sketch keyframes by hand, animations tend to have a sketched appearance and often lack the realistic imagery needed to depict real-world scenes. In some cases, animations are built from basic two-dimensional and three-dimensional graphical objects as building blocks. Such animations also tend to look like composited, artificial products rather than natural scenes and are often limited to presenting graphical information.
Summary of the Invention
A method and apparatus for generating an animation are disclosed. A sequence of video images is examined to identify a first transformation of a scene depicted in the sequence of video images. A first image and a second image are obtained from the sequence of video images. The first image represents the scene before the first transformation and the second image represents the scene after the first transformation. Information indicative of the first transformation is generated and is available for interpolation between the first image and the second image to produce a video effect approximating a display of the sequence of video images.
A method and apparatus for storing an animation are also disclosed. A set of keyframes generated from a video is stored in an animation object. One or more values indicating a first sequence of keyframes selected from the set of keyframes are stored in the animation object along with information for interpolating between the keyframes in the first sequence. One or more values indicating a second sequence of keyframes selected from the set of keyframes are likewise stored in the animation object along with information for interpolating between the keyframes in the second sequence. The number of keyframes in the second sequence is less than the number of keyframes in the first sequence.
A method and apparatus for linking a video and an animation are also disclosed. A data structure is generated that includes elements corresponding to respective frames of a first video, and information indicating an image in an animation that has been produced from a second video is stored in one or more of the elements of the data structure.
Other features and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description given below. In the drawings:
FIG. 1 illustrates the generation and transmission of an animation;
FIG. 2 is a block diagram of an animation authoring system according to an embodiment;
FIG. 3 is a block diagram of a background track generator according to an embodiment;
FIG. 4A illustrates a video segment that has been identified by a scene change estimator within a background track generator;
FIG. 4B is a flow chart describing the operation of the background motion estimator, background frame constructor and background blending estimator shown in FIG. 3;
FIG. 5 illustrates a background image set generated by the background frame constructor shown in FIG. 3;
FIG. 6 is a block diagram of an object track generator according to one embodiment;
FIG. 7A illustrates a video segment identified by the scene change estimator shown in FIG. 3;
FIG. 7B is a flowchart 100 of the operation of an object track generator according to one embodiment;
FIG. 8 is a schematic diagram of an animation object, according to an embodiment;
FIG. 9A illustrates an exemplary embodiment of a background frame blending data structure that may be used to perform background blending;
FIG. 9B illustrates a discontinuous blending function;
FIG. 10 illustrates the manner in which a background track and an object track of an exemplary animation object are used to synthesize an interpolated frame during animation play-out;
FIG. 11 illustrates a technique for providing multiple temporal resolutions of animation keyframes;
FIG. 12 illustrates a technique for providing multiple spatial resolutions of animation keyframes;
FIG. 13 illustrates the use of a server system to control the content of the animation data stream that is transmitted to the playback system;
FIG. 14A illustrates the use of a cross-link generator to establish a cross-link between a video source and an animation generated from the video source;
FIG. 14B illustrates the use of a cross-link generator to establish a cross-link between a first video and an animation generated from a second video;
FIG. 15 illustrates a cross-linking data structure according to one embodiment;
FIG. 16 is a diagram of a cross-linking relationship between a sequence of video frames in a video source and background images from an animation;
FIG. 17 illustrates a display generated by a playback system;
FIG. 18 illustrates an alternative display generated by the playback of an animation in a playback system.
Detailed Description
According to embodiments described herein, a video is parsed to automatically generate an animation that includes keyframes and information used to interpolate between the keyframes. The keyframes and interpolation information can be used to synthesize images during animation playback. When displayed, these synthesized images produce a video effect that approximates the original video. Because video effects such as image motion and color changes can be represented with significantly less information in an animation than in a video, an animation tends to consume much less bandwidth when transmitted over a communication network such as the internet. For example, using the methods and apparatus described herein, a video containing hundreds of image frames may be used to generate an animation containing only a few keyframes and the information used to interpolate between them. When the animation is received in a playback system, such as a desktop computer with animation playback capability, the playback system may use the keyframes and interpolation information provided in the animation to synthesize and display images while the animation is still being received. Thus, one advantage of the embodiments disclosed herein is that an animation that is more compact than its source video is generated automatically, so that a playback system that lacks the bandwidth to simultaneously receive and display the video can nevertheless simultaneously receive and display the animation. Another advantage of the embodiments disclosed herein is that an animation and a video can be cross-linked to allow a user to switch between viewing the animation and viewing the video during playback. Yet another advantage is that animations are provided with selectable temporal and spatial resolutions, and a server system is provided to select and transmit to a playback system an animation with a temporal and spatial resolution appropriate to the characteristics of the playback system.
These and other advantages are described below.
Terminology
As used herein, the term "video" refers to a sequence of images captured by a camera at a predetermined rate, or generated by an image generator for playback at a predetermined rate. Each image in the sequence is contained within a frame of the video, and the real-world subject matter represented by the images is referred to as a scene. Video data is typically stored so that, for each frame, there is data representing the image in that frame. The data may be in a compressed form or may be an uncompressed bitmap. In principle any capture rate can be used, but the rate is usually fast enough to capture human-perceptible motion in a scene (e.g., 10 frames per second or more).
A video may be provided from any source including, but not limited to, film, NTSC (National Television System Committee) video, or any other existing or recorded video format. A video may be displayed on a variety of different displays including, but not limited to, a cathode ray tube (CRT) display, a liquid crystal display, a plasma display, and so forth.
The term "animation" refers to a data construct that includes key frames and information used to interpolate between the key frames. Keyframes are images that delineate or can be used to delineate incremental transformations within a scene. In one embodiment, a new keyframe is provided for each incremental transformation in the scene and the criteria used to determine what constitutes an incremental transformation can be adjusted according to system needs and user preferences. The more sensitive the criterion (i.e., the smaller the scene change), the more keyframes there are in the animation.
According to one embodiment, an animation may contain two types of keyframes: background frames and object frames (also referred to herein as target frames). Background frames are keyframes resulting from background motion or color changes. Background motion is usually caused by changes in the configuration of the camera used to record the scene; typical configuration changes include, but are not limited to, translation, rotation, tilt, or zoom of the camera. Color changes are usually caused by changes in scene illumination (which may likewise result from camera configuration changes, e.g., aperture changes), but may also be caused by color changes over large areas of the scene.
Object frames are keyframes caused by motion or color changes of objects within a scene, rather than by changes in the configuration of the camera used to record the scene. Objects in a scene that move or change color independently of camera motion are referred to herein as dynamic objects. It will be appreciated that whether a given object is a dynamic object or part of the background of a scene depends on how large the object is compared to the rest of the scene. When an object becomes large enough (e.g., because it is physically or optically close to the camera), the dynamic object effectively becomes the background of the scene.
According to the disclosed embodiments, a sequence of background frames and information for interpolating between the background frames are stored in a data structure referred to as a background track. Similarly, a sequence of object frames and information for interpolating between those object frames are stored in a data structure referred to as an object track. An animation produced using the methods and apparatus disclosed herein includes at least a background track and zero or more object tracks. These background and object tracks are stored in a data structure called an animation object. An animation may be represented either by an animation object stored in a memory or by an animation data stream transmitted point-to-point over a communication network or between subsystems within a device.
Animation generation and delivery
FIG. 1 illustrates the generation of an animation 14 and the transmission of the animation 14 to a playback system 18. An animation authoring system 12 generates the animation 14 from a video source 10. Either during or after its generation, the animation 14 is converted into an animation data stream 15 and transmitted to a playback system 18 via a communication network 20. Alternatively, the animation 14 may be delivered to the playback system 18 on a distributable storage medium 21 that is read by a subsystem in the playback system 18 to display the animation. Examples of distributable storage media include, but are not limited to, magnetic tape, magnetic disk, compact disc read-only memory (CD-ROM), digital video disk (DVD), and the like. The playback system 18 may be a device designed specifically for animation playback (e.g., a DVD or cassette tape player) or a general-purpose computer system programmed to acquire the animation 14 (e.g., via a communication network or a distribution medium) and to execute animation playback software to display the animation 14. For example, a web browsing application may be executed on any number of different types of computers (e.g., an Apple Macintosh computer, an IBM-compatible personal computer, a workstation, etc.) to implement an animation playback system. The program code for playing the animation 14 may be included in the web browsing application itself or in an extension to the web browsing application that is downloaded into the working memory of the computer when the web browsing application determines that an animation data stream 15 is being received.
A server system 16 may be used to control the delivery of the animation to playback systems over the network 20, as indicated by the dashed arrow 19 and the dashed transmission path 17. For example, the server system 16 may be used to give priority to animation download requests from playback systems belonging to a particular class of users, or to limit access to available animations according to a service arrangement or other criteria. As a more specific example, assume there is a web site (i.e., a server computer) that provides instructional animations on home-improvement topics (e.g., laying tile, hanging doors, installing ceiling fans, etc.). The site provider may wish to make at least one animation available for free so that interested visitors can learn the usefulness of the service. Other animations may be made available for download on a pay-per-view basis. The site provider may also sell subscriptions to the site so that users who pay a periodic fee are given full download access to all animations. The server system 16 may be used to distinguish download requests from these different categories of requesters and respond accordingly.
Another use of server system 16 is to provide the animation 14 to the playback system 18 in one of a number of different animation formats. The specific format used may be determined by the transmission network bandwidth and the capabilities of the playback system. For example, a given playback system 18 may require the animation 14 to be described in a particular format or language that it can interpret (e.g., Java, dynamic Hypertext Markup Language (D-HTML), Virtual Reality Modeling Language (VRML), the Macromedia Flash format, etc.). Moreover, the background and object frames in the animation 14 may be sent at a particular spatial and temporal resolution to avoid exceeding the bandwidth of the transport network, which is typically limited by the download rate (e.g., modem speed) of the playback system 18. In one embodiment, to accommodate the many possible permutations of animation language and network bandwidth, the animation 14 is stored in a language- and bandwidth-independent format. The server system 16 may then be used to dynamically generate an animation data stream based on the format and bandwidth requirements of the playback system 18. The operation of the server system is described in more detail below.
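As a loose illustration of the server-side selection just described, the sketch below picks, from a hypothetical table of pre-encoded variants, the highest-bit-rate animation stream that a given playback system can both decode and download in real time. The variant table, field names, and selection rule are assumptions for illustration, not details taken from the patent.

```python
# Hypothetical sketch: choose an animation variant whose estimated bit rate
# fits the playback system's bandwidth and whose format it can decode.
VARIANTS = [
    # (format, keyframe_interval_s, width, height, est_bits_per_second)
    ("flash",  2.0, 640, 480, 200_000),
    ("flash",  4.0, 320, 240,  60_000),
    ("d-html", 8.0, 160, 120,  20_000),
]

def pick_variant(supported_formats, bandwidth_bps):
    candidates = [v for v in VARIANTS
                  if v[0] in supported_formats and v[4] <= bandwidth_bps]
    if not candidates:
        raise ValueError("no variant fits this client")
    # Prefer the highest bit rate (best quality) that still fits the link.
    return max(candidates, key=lambda v: v[4])

print(pick_variant({"flash", "d-html"}, 28_800))   # -> ('d-html', 8.0, 160, 120, 20000)
```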
Referring again to FIG. 1, the playback system 18 may obtain an animation data stream either from the communication network 20 or by reading an animation object stored on a locally accessible storage medium (e.g., DVD, CD-ROM, cassette tape, etc.). In one embodiment, the playback system 18 provides time-based controls that include play, pause, fast-forward, fast-rewind, and stop functions. In another embodiment, the playback system 18 can be switched between animation and video playback modes to present either the animation or the video on the display. The playback system 18 may also include an interactive, non-time-based playback mode that allows a user to click on hot spots within an animation, pan and zoom within an animation frame, or download animation still frames. Other embodiments of the playback system are described below.
Animation authoring system
FIG. 2 is a block diagram of an animation authoring system 12 according to one embodiment. The animation authoring system 12 includes a background track generator 25, an object track generator 27, and an animation object generator 29. Video source 10 is initially received in background track generator 25, and the sequence of frames in video source 10 is parsed to generate a background track 33. The background track 33 comprises a sequence of background frames and information that can be used to interpolate between these background frames. In one embodiment, after the background track is completed, the background track generator 25 outputs the background track 33 to the object track generator 27 and animation object generator 29. In an alternative embodiment, the background track generator 25 outputs the background track 33 to the object track generator 27 and the animation object generator 29 after each new background frame within the background track 33 is completed.
As shown in FIG. 2, the object track generator 27 receives the background track 33 from the background track generator 25 and the video source 10. The object track generator 27 generates zero or more object tracks 35 from the background track 33 and the video source 10 and passes these object tracks 35 to the animation object generator 29. Each object track 35 includes a sequence of object frames and transformation information that can be used to interpolate between these object frames.
The animation object generator 29 receives the background track 33 from the background track generator 25 and zero or more object tracks 35 from the object track generator 27 and writes these tracks to an animation object 30. As discussed below, the animation object 30 may be formatted to include multiple temporal and spatial resolutions of the background tracks and the object tracks.
FIG. 3 is a block diagram of the background track generator 25 according to one embodiment. The background track generator 25 comprises a scene change estimator 41, a background frame constructor 43, a background motion estimator 45, and a background blending estimator 47.
The scene change estimator 41 compares successive frames of the video source 10 to one another to determine when the transformation of the scene depicted in the video frames exceeds a threshold. When applied to the entire video source 10, the scene change estimator 41 segments the sequence of frames in the video source 10 into one or more sub-sequences of video frames (i.e., video segments), each exhibiting a scene transformation less than the predetermined threshold. The background motion estimator 45, background frame constructor 43, and background blending estimator 47 then process each video segment identified by the scene change estimator 41 to generate a background frame and interpolation information for that video segment. Thus, the threshold applied by the scene change estimator 41 defines the incremental transformation of a scene that results in construction of a new background frame. In one embodiment, background frames correspond approximately to the beginning and end of each video segment, with the background frame that ends one video segment also serving as the background frame that begins the next video segment. Consequently, video segments are delimited by background frames, and one background frame is constructed for each video segment in the video source 10, except for the first video segment, for which both a starting and an ending background frame are constructed.
Fig. 4A shows a video segment 54 that has been identified by the scene change estimator 41 of fig. 3. According to one embodiment, the scene change estimator 41 determines a transform vector for each pair of adjacent video frames within the video segment 54. Here, a first frame is considered to be adjacent to a second frame if the first frame immediately precedes or follows the second frame in a time series of frames.
In FIG. 4A, the transform vector for each pair of adjacent video frames is represented by a corresponding delta symbol. According to one embodiment, the transform vector comprises a plurality of scalar components, each indicating a measure of scene change from one video frame to the next in the video segment 54. For example, the scalar components of a transform vector may include measures of the following changes in the scene: translation, zoom, rotation, pan, tilt, skew, color change, and elapsed time.
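One way to hold such a transform vector in code is a small record with one scalar per component. This is only an illustrative sketch; the field names mirror the components listed above rather than any structure defined in the patent.

```python
from dataclasses import dataclass, astuple

@dataclass
class TransformDelta:
    """Measured scene change between two adjacent video frames."""
    translate_x: float = 0.0
    translate_y: float = 0.0
    zoom: float = 0.0
    rotation: float = 0.0
    pan: float = 0.0
    tilt: float = 0.0
    skew: float = 0.0
    color_change: float = 0.0
    elapsed_time: float = 0.0

    def __add__(self, other: "TransformDelta") -> "TransformDelta":
        # Component-wise sum, used to accumulate deltas across a video segment.
        return TransformDelta(*(a + b for a, b in zip(astuple(self), astuple(other))))
```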
According to one embodiment, the scene change estimator 41 applies a spatial low-pass filter to the video segment 54 to increase the blockiness of the images in the video segment 54 before computing the transform deltas between adjacent frames. After low-pass filtering, the individual images in the video segment 54 contain less information than before filtering, so fewer computations are required to determine the transform deltas. In one implementation, the transform delta calculated for each pair of adjacent frames in the video segment 54 is added to the transform deltas calculated for the preceding pairs of adjacent frames to accumulate a running sum of transform deltas. In effect, this sum represents the transformation between the first video frame 54A in the video segment 54 and the most recently compared video frame in the video segment 54. In one embodiment, the sum of the transform deltas is compared against a predetermined transform threshold to determine whether the most recently compared video frame has caused the threshold to be exceeded. It will be appreciated that the transform threshold may be a vector comprising a plurality of scalar thresholds, including thresholds for color change, translation, scaling, rotation, pan, tilt, skew, and elapsed time. In an alternative embodiment, the transform threshold is adjusted dynamically to achieve a desired ratio of video segments to frames in the video source 10. In another alternative embodiment, the transform threshold is adjusted dynamically to achieve a desired average video segment size (i.e., a desired number of video frames per video segment). In yet another alternative embodiment, the transform threshold is adjusted dynamically to achieve a desired average elapsed time per video segment. In general, any technique for dynamically adjusting the transform threshold may be used without departing from the spirit and scope of the present invention.
In one embodiment, if the most recently compared video frame 54C has caused the transform threshold to be exceeded, the scene is deemed to have changed, and the video frame 54B preceding the most recently compared video frame 54C is designated as the ending frame of the video segment 54. Thus, if a predetermined transform threshold is used, every video segment of the video source 10 is guaranteed to have an overall transformation smaller than the transform threshold. If a variable transform threshold is used, large variations in the overall transform deltas of the respective video segments can result, and the scene change estimator may need to be applied iteratively to reduce the variation in these transform deltas.
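The segmentation behaviour described above might be sketched as follows. Here `estimate_delta` stands in for whatever frame-to-frame motion and color estimator is used, and the per-component thresholds play the role of the transform threshold vector; this is a sketch of the described loop, not the patented implementation.

```python
def segment_video(frames, estimate_delta, thresholds):
    """Split `frames` into segments whose accumulated transform deltas stay
    under the per-component `thresholds`; adjacent segments share their
    boundary frame, mirroring the behaviour described above."""
    segments, start = [], 0
    accum = {k: 0.0 for k in thresholds}
    for i in range(1, len(frames)):
        delta = estimate_delta(frames[i - 1], frames[i])  # dict of scalar changes
        for k in accum:
            accum[k] += delta.get(k, 0.0)
        if any(abs(accum[k]) > thresholds[k] for k in thresholds):
            # Frame i pushed the accumulated change over the threshold, so the
            # preceding frame ends this segment and also begins the next one.
            segments.append((start, i - 1))
            start = i - 1
            # Re-seed the accumulator with the change from the new segment's
            # first frame (i - 1) to the current frame i.
            accum = {k: delta.get(k, 0.0) for k in thresholds}
    segments.append((start, len(frames) - 1))  # the last frame always ends a segment
    return segments
```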
According to the embodiment of fig. 3, the background track generator 25 invokes the background motion estimator 45, the background frame constructor 43 and the background blending estimator 47 when new video segments are defined (i.e., when new scene changes are detected). In an alternative embodiment the scene change estimator 41 is used to completely decompose the video into sub-sequences before any sub-sequences are processed by the background frame constructor 43, the background motion estimator 45 or the background blending estimator 47.
As shown in FIG. 4A and described above, video frames in a given video segment continue to be selected and compared until the accumulated transform delta exceeds the transform threshold. In one embodiment, when the last frame of a video is reached, that frame automatically ends the current video segment. Also, the sum of the transform deltas is cleared after each new video segment is processed by the background frame constructor 43. In an embodiment in which the scene change estimator 41 parses the entire video before any video segments are processed, the transform deltas associated with each video segment are recorded for later use by the background motion estimator 45 and the background frame constructor 43.
FIG. 4B is a flow chart 57 describing the operation of the background motion estimator 45, the background frame constructor 43, and the background blending estimator 47 shown in FIG. 3. Beginning at block 59, the background motion estimator examines the video segment 54 indicated by the scene change estimator (i.e., the sub-sequence of video frames bounded by BFi and BFi+1 in FIG. 4A) to identify the dominant motion of the scene depicted in those frames. The dominant motion is treated as the background motion.
There are a variety of techniques that can be used to identify background motion in a video segment. One technique, referred to as feature tracking, involves identifying features in the video frames (e.g., using edge-detection techniques) and tracking the motion of those features from one video frame to the next. Features that statistically exhibit anomalous motion relative to the other features are considered to belong to dynamic objects and are disregarded. The motion shared by a large number of features (or by large features) is typically caused by a change in the configuration of the camera used to record the video and is treated as the background motion.
Another technique for identifying background motion in a video segment is to correlate the frames of the video segment according to a common region and then determine the frame-to-frame offsets for the regions. The frame-to-frame offset can then be used to determine a background motion for the video segment.
Other techniques for identifying background motion in a video segment include, but are not limited to: coarse-to-fine search methods that use a spatial hierarchical decomposition of the frames in the video segment; measuring changes in video frame histogram characteristics over time to identify scene changes; filtering to emphasize features in the video segment that can be used for motion recognition; optical flow measurement and analysis; pixel format conversion to alternate color representations (including grayscale) to achieve greater processing speed, greater reliability, or both; and robust estimation techniques, such as M-estimation, that discount elements of the video frames that do not conform to the estimated dominant motion.
Referring again to the flow chart 57 of FIG. 4B, at block 61 the background frame constructor receives background motion information from the background motion estimator and uses it to register the frames of the video segment relative to one another. Registration refers to correlating the video frames in a manner that accounts for changes caused by background motion. When the video frames are registered according to the background motion information, regions of the frames that exhibit motion different from the background motion (i.e., dynamic objects) do not remain at fixed positions; instead, they move from frame to frame relative to the static background. At block 63, the background frame constructor removes the dynamic objects from the video segment to produce a processed sequence of video frames. At block 65, the background frame constructor generates a background frame based on the processed sequence of video frames and the background motion information. Depending on the nature of the transformation, constructing the background frame may involve combining two or more processed video frames into a single background image or selecting one of the processed video frames as the background frame. In one embodiment, a combined background frame may be a panoramic image or a high-resolution still image. A panoramic image is generated by stitching together two or more processed video frames and may be used to represent a background scene that has been captured by panning, tilting, or translating the camera. A high-resolution still image is appropriate when the subject of the processed sequence of video frames is a relatively static background scene (i.e., the configuration of the camera used to record the video source has not changed significantly). One technique for generating a high-resolution image is to parse the processed sequence of video frames to identify sub-pixel motion between the frames. Sub-pixel motion is caused by slight motion of the camera and can be used to produce a combined image having higher resolution than any of the individual frames captured by the camera. High-resolution still images are particularly useful because they can be displayed to provide details that cannot be presented by the video source 10, as described below. Moreover, when multiple high-resolution still images of the same subject are constructed, they can be combined to form a still image having several regions of varying resolution, referred to herein as a multi-resolution still image. As described below, the user may pause animation playback to zoom in and out on different regions of such a still image. Similarly, the user may pause animation playback to pan across a panoramic image; a combination of panning and zooming is also possible. Furthermore, an animation may be cross-linked with its video source so that, during playback of the video source, the user may be prompted to pause video playback to view a high-resolution still image, a pannable panorama, or a zoomable still image. Cross-linking is described in detail below.
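A much-simplified sketch of blocks 61-65 is shown below, under two assumptions not made in the text: the background motion has already been reduced to integer pixel translations per frame, and a per-pixel temporal median is used to suppress dynamic objects (one common way to remove moving regions; the patent does not prescribe a particular operator).

```python
import numpy as np

def build_background_frame(frames, offsets):
    """Register translation-only frames on a shared canvas and take a per-pixel
    temporal median, which suppresses dynamic objects that move between frames.
    `frames` are (h, w, 3) arrays; `offsets` are integer (dx, dy) background-motion
    estimates per frame, relative to the first frame."""
    h, w = frames[0].shape[:2]
    xs = [dx for dx, _ in offsets]
    ys = [dy for _, dy in offsets]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    canvas_h = h + (max_y - min_y)
    canvas_w = w + (max_x - min_x)
    # Stack every registered frame into a canvas-sized volume; NaN marks "no data".
    stack = np.full((len(frames), canvas_h, canvas_w, 3), np.nan)
    for k, (frame, (dx, dy)) in enumerate(zip(frames, offsets)):
        y0, x0 = dy - min_y, dx - min_x
        stack[k, y0:y0 + h, x0:x0 + w] = frame
    # The per-pixel median over time ignores transient (moving) foreground regions.
    return np.nanmedian(stack, axis=0)
```

Depending on the camera motion, the resulting canvas is either a single registered frame, a panorama, or (with sub-pixel registration, not shown) a higher-resolution still.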
FIG. 5 shows a background image set 70 generated by the background frame constructor 43 of FIG. 3. Background frame BFi refers to a background image 71 that is a single processed video frame rather than a combined image. Such background images typically result from scaling (i.e., zooming in or out) or from abrupt cuts between successive video frames. Background frame BFi+1 refers to a high-resolution still image 73 composited from multiple processed video frames of a virtually identical scene. As described above, such images are particularly useful for providing details that are not perceptible in the video source. Background frames BFi+2, BFi+3, and BFi+4 each refer to a different region of a panoramic background image 75. As shown, the panoramic image 75 is generated by stitching portions 76 of one or more processed video frames onto another processed video frame. In this example, the camera has been panned down and to the left, or panned to the right and tilted down, to incrementally capture more aspects of the scene. Other shapes of combined background image may result from different types of camera motion.
Returning to the final block of the flow chart 57 in FIG. 4B, at block 67 the background blending estimator (e.g., element 47 in FIG. 3) generates background blending information based on the background motion information and the newly constructed background frame. The operation of the blending estimator is described in detail below.
FIG. 6 is a block diagram of the object track generator 27 according to one embodiment. The object track generator 27 receives as inputs the background track 33 generated by the background track generator (e.g., element 25 in FIG. 2) and the video source 10. The object track generator 27 identifies dynamic objects based on differences between the background track 33 and the video source 10 and records object frames (OF) containing the dynamic objects, together with object motion (OM) and object blending (OB) information, in an object track 35.
In one embodiment, the object track generator 27 includes an object frame constructor 81, an object motion estimator 83, and an object blending estimator 85. The object frame constructor 81 compares the video frames in the video source 10 with the background frames in the background track 33 to construct object frames (OF). Each object frame formed by the object frame constructor 81 contains a dynamic object, as described below. In one embodiment, at least one object frame is generated for each dynamic object detected in a given video segment (i.e., each dynamic object detected in a sequence of video frames identified by the scene change estimator 41 of FIG. 3). The object motion estimator 83 tracks the motion of dynamic objects within a video segment to generate object motion information (OM), and the object blending estimator 85 generates object blending information (OB) from the object frames and object motion information generated by the object frame constructor 81 and the object motion estimator 83, respectively.
FIGS. 7A and 7B illustrate the operation of the object track generator 27 of FIG. 6 in detail. FIG. 7A shows a video segment 54 identified by the scene change estimator 41 of FIG. 3. The video segment 54 is bounded by background frames BFi and BFi+1 and contains a dynamic object 56. FIG. 7B is a flowchart 100 of the operation of the object track generator 27.
Beginning at block 101 of flowchart 100, the object frame constructor (e.g., element 81 of FIG. 6) compares background frame BFi with a video frame VFj of the video segment 54 to generate a difference frame 91. As shown in FIG. 7A, small differences between BFi and VFj produce somewhat random differences (noise) in the difference frame 91. A relatively concentrated difference region 92 between BFi and VFj occurs, however, where a dynamic object was removed from background frame BFi by the background frame constructor (e.g., element 43 of FIG. 3). At block 103 of flowchart 100, a spatial low-pass filter is applied to the difference frame 91 to produce a filtered difference frame 93. In the filtered difference frame 93, the random differences (i.e., high-frequency components) have disappeared and the concentrated difference region 92 exhibits increased blockiness; as a result, the contour of the concentrated difference region 92 can be discerned more easily. Accordingly, at block 105 of flowchart 100, the object frame constructor performs a feature search (e.g., using edge-detection techniques) to identify the concentrated difference region 92 in the filtered difference frame 93. At block 107, the object frame constructor selects a region of video frame VFj corresponding to the concentrated difference region 92 in the filtered difference frame 93 as an object frame 56. In one embodiment, the object frame constructor selects the object frame 56 to be a rectangular region (e.g., with similar x, y offsets) corresponding to a rectangular region in the filtered difference frame 93 that contains the concentrated difference region 92, although other object frame shapes may be used. It will be appreciated that if there is no concentrated difference region 92 in the filtered difference frame 93, the object frame constructor will not select an object frame; conversely, if there are multiple concentrated difference regions 92 in the filtered difference frame 93, multiple object frames will be selected. Each concentrated difference region 92 in the filtered difference frame 93 is considered to correspond to a dynamic object in the video segment 54.
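Blocks 101-107 might be sketched as below. A box blur stands in for the spatial low-pass filter, and a single bounding box around the above-threshold pixels stands in for the feature search (a connected-component pass would be needed to separate multiple concentrated difference regions); the threshold value and filter size are illustrative assumptions.

```python
import numpy as np

def box_blur(img, k=9):
    # Simple box filter used here as the spatial low-pass filter.
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            out += padded[pad + dy:pad + dy + img.shape[0],
                          pad + dx:pad + dx + img.shape[1]]
    return out / (k * k)

def find_object_frame(background, video_frame, threshold=30.0):
    """Return the bounding box (x0, y0, x1, y1) of the concentrated difference
    region between a background frame and a video frame, or None if no region
    survives filtering (i.e., no dynamic object is present)."""
    diff = np.abs(video_frame.astype(float) - background.astype(float))
    if diff.ndim == 3:
        diff = diff.mean(axis=2)             # collapse color channels
    smooth = box_blur(diff)                  # random pixel noise averages away
    ys, xs = np.nonzero(smooth > threshold)  # pixels belonging to the region
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
```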
After a dynamic object in an object frame 56 has been identified and framed by the object frame constructor, the motion of the dynamic object is determined by tracking the change in position of the object frame 56 as the frames of the video segment 54 progress. Thus, at block 109 of flowchart 100, the object motion estimator (e.g., element 83 of FIG. 6) tracks the motion of the dynamic object identified and framed by the object frame constructor from one video frame to the next in the video segment 54. According to one embodiment, object motion tracking is performed by a feature search within successive video frames of the video segment 54 to determine the new location of the dynamic object of interest. Using the frame-to-frame motion of the dynamic object, the object motion estimator generates motion information that can be used to interpolate between successive object frames to approximate the motion of the dynamic object. At block 111 of flowchart 100, the object blending estimator (e.g., element 85 of FIG. 6) generates object blending information based on the object motion information and the object frames. In one embodiment, the operation of the object blending estimator is the same as the operation of the background blending estimator, although other techniques for generating information for blending successive object frames may be used without departing from the spirit and scope of the present invention.
As described above, in one embodiment of the object track generator 27 of FIG. 6, at least one object frame is generated for each dynamic object identified by the object frame constructor 81 in a video segment. If the object motion estimator 83 determines that the motion of a dynamic object in a video segment is too complex to be adequately represented by interpolation between the object frames bounding the video segment, the object motion estimator 83 can indicate that one or more additional object frames need to be constructed for the video segment. Using the techniques described above, the object frame constructor then generates the additional object frames at the points within the video segment indicated by the object motion estimator. As described above in connection with background frame construction, an object frame may include image data extracted from a region of a combined image. If one or more additional object frames are constructed to represent a dynamic object that is undergoing complex motion, the additional frames may be organized within the animation object so that the dynamic object overlays other features in the scene during animation playback.
Within a scene, dynamic objects occasionally occlude one another. According to one embodiment of the object track generator 27, when dynamic objects represented by separate object tracks occlude one another, the object track for the occluded dynamic object is ended, and a new object track is begun if the occluded object is later revealed again. Consequently, if dynamic objects repeatedly occlude one another, a large number of discrete object tracks may be generated. In an alternative embodiment of the object track generator, information may be associated with an object track to indicate which of two dynamic objects is to be displayed on top of the other if their screen positions converge.
As with background images, the image of a dynamic object (i.e., the object image) may be combined from multiple video frames. Combined object images include, but are not limited to, panoramic object images, high-resolution still object images, and multi-resolution still object images. In general, any type of combined image that can be used as a combined background image can also be used as a combined object image.
FIG. 8 is a schematic diagram of an animation object 30 according to one embodiment. The animation object 30 includes a background track 33 and a plurality of object tracks 35A, 35B, 35C. As described above, the number of object tracks depends on the number of dynamic objects identified in the scene depicted in the video source; if no dynamic objects are identified, the animation object 30 may contain no object tracks.
In one embodiment, the animation object 30 is implemented as a linked list 121 of a background track and object tracks. The background track itself is implemented by a linked list formed by a background track element BT and a sequence of background frames BF1-BFN. Each object track is similarly implemented by a linked list of an object track element (OT1, OT2, ..., OTR) and a corresponding sequence of object frames (OF11-OF1M, OF21-OF2K, ..., OFR1-OFRJ). In one embodiment, the background track element BT and the object track elements OT1, OT2, ..., OTR also include the pointers that implement the animation object linked list 121. That is, the background track element BT includes a pointer to the first object track element OT1, the first object track element OT1 includes a pointer to the next object track element OT2, and so on until the last object track element OTR is reached. In one embodiment, the ends of the animation object linked list 121 and of the individual background and object track linked lists are indicated by a null pointer in their last element. Other techniques for indicating the end of the linked lists may be used in alternative embodiments. For example, the animation object 30 may include a data structure having a head pointer to the background track 33 and a tail pointer to the last object track 35C in the animation object linked list 121. Similarly, the background track element BT and each object track element OT1, OT2, ..., OTR may include a tail pointer that indicates the end of its respective linked list. In another embodiment, a flag in an element of a linked list may be used to indicate the end of the list.
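The linked-list organization just described might look like the following sketch, with ordinary object references standing in for the NEXT/PREV and track pointers; the field names are illustrative rather than taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Keyframe:
    """Background frame or object frame node (a BF or OF in FIG. 8)."""
    image_ref: object                        # stands in for IMAGE PTR
    interp: object = None                    # stands in for INTERP PTR
    timestamp: float = 0.0
    next: Optional["Keyframe"] = None        # NEXT PTR (None plays the role of a null pointer)
    prev: Optional["Keyframe"] = None        # PREV PTR

@dataclass
class Track:
    """Background track element BT or an object track element OTn."""
    first_frame: Optional[Keyframe] = None   # head of this track's keyframe list
    next_track: Optional["Track"] = None     # link to the next object track

@dataclass
class AnimationObject:
    """Animation object 30: a background track followed by linked object tracks."""
    background_track: Track = field(default_factory=Track)
```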
Referring again to FIG. 8, according to one embodiment a data structure 123 is used to implement a background frame. The background frame data structure 123 includes a next pointer (NEXT PTR) to the next background frame in the background track 33, a previous pointer (PREV PTR) to the previous background frame in the background track 33, an image pointer (IMAGE PTR) to the location of the image data for the background frame, an interpolation pointer (INTERP PTR) to an interpolation data structure, and a timestamp (TIMESTAMP) indicating a relative play time for the background frame. As described below, the background frame data structure 123 may also include one or more members for cross-linking to frames of a video source.
Since the image to be displayed for a given background frame may be derived from either a non-composite background image or a composite background image, the image pointer in the background frame data structure 123 may itself be a data structure indicating the location of the background image in a memory, the offset (e.g., rows and columns) in the background image from which to obtain image data for the background frame, and a pointer to a video segment used to generate the background frame. The pointer to the video segment is used to link an animation to a video source, as described below. In one implementation, the pointer to a video segment is a pointer to at least a first video frame in the video segment. Other techniques for linking the background frame to the video segment can be used without departing from the spirit and scope of the present invention.
In one embodiment, the background interpolation data structure 125 includes data for interpolating between a given background frame and its neighboring background frames. The information for interpolation between a given background frame and its neighboring subsequent background frame (i.e., the next background frame) includes FORWARD background motion information (BM FORWARD) and FORWARD background blending information (BB FORWARD). Similarly, the information used for interpolation between a given background frame and its neighboring previous background frames includes backward background motion information (BM REVERSE) and backward background blending information (BB REVERSE). The background motion information in a given direction (i.e., forward or backward) may itself be a data structure containing a plurality of constituent items. In the exemplary embodiment shown in fig. 8, the FORWARD background motion information (BM FORWARD) includes components indicating the translation of the background scene in the X and Y directions (i.e., horizontally and vertically in the image plane) to the next background frame, a zoom factor in the X and Y directions (i.e., to indicate zoom in and zoom out and aspect ratio of the camera), a rotation factor, a pan factor, a tilt factor, and a skew factor. It is understood that more or fewer motion parameters may be used in alternative embodiments. The backward background motion information (BM REVERSE) may be indicated by a set of similar motion parameters.
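The forward motion parameters could, for instance, be folded into a single homogeneous 2-D transform as sketched below. Treating pan and tilt as extra translation is a simplification made in this sketch; the patent leaves the exact parameterization open, and a full camera model would be projective.

```python
import numpy as np

def motion_to_matrix(tx, ty, zoom_x, zoom_y, rotation, skew):
    """Fold (a subset of) the BM FORWARD parameters into one 3x3 homogeneous
    transform. Pan and tilt are assumed to have been folded into tx, ty."""
    c, s = np.cos(rotation), np.sin(rotation)
    scale = np.array([[zoom_x, skew,   0.0],
                      [0.0,    zoom_y, 0.0],
                      [0.0,    0.0,    1.0]])
    rotate = np.array([[c,  -s,  0.0],
                       [s,   c,  0.0],
                       [0.0, 0.0, 1.0]])
    translate = np.array([[1.0, 0.0, tx],
                          [0.0, 1.0, ty],
                          [0.0, 0.0, 1.0]])
    return translate @ rotate @ scale

def warp_point(matrix, x, y):
    # Map one background-frame coordinate toward the next background frame.
    px, py, pw = matrix @ np.array([x, y, 1.0])
    return px / pw, py / pw
```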
In one embodiment, each individual object frame is implemented by an object frame data structure 127 similar to the background frame data structure 123 described above. For example, the object frame data structure 127 includes a pointer to the next object frame in the object track (NEXT PTR), a pointer to the previous object frame in the object track (PREV PTR), an image pointer (IMAGE PTR), an interpolation pointer (INTERP PTR), and a timestamp (TIMESTAMP), each of which performs a function similar to that of the corresponding member of the background frame data structure 123. Naturally, the image pointer in the object frame data structure 127 points to object image data rather than background image data, and the interpolation pointer points to object interpolation data rather than background interpolation data. As shown in FIG. 8, an exemplary object interpolation data structure includes members indicating both forward and backward object motion information (OM FORWARD and OM REVERSE, respectively) and both forward and backward object blending information (OB FORWARD and OB REVERSE, respectively).
FIG. 9A illustrates an exemplary embodiment of background frame blending data structures 135A, 137A that may be used to perform background blending. It is to be understood that object blending data may be organized similarly. In one embodiment, each blending data structure 135A, 137A includes a blending operator in the form of polynomial coefficients (A, B, C, D), an interval fraction (INTV) indicating the portion of the interval between two successive background frames over which the blending operator is to be applied, and a pointer to the next blending data structure, which allows the interval between two successive background frames to be represented by multiple blending operators.
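In code, one blending data structure might be sketched as a record of the four polynomial coefficients, the interval fraction, and a link to the next piece; the linear cross-dissolve coefficients at the bottom are the ones given in the text, while everything else is an illustrative assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlendPiece:
    """One piece of a (possibly piecewise) blending function:
    weight(T) = A*T**3 + B*T**2 + C*T + D over a fraction `intv` of the blend interval."""
    a: float
    b: float
    c: float
    d: float
    intv: float                          # fraction of the blend interval covered
    next: Optional["BlendPiece"] = None  # link to the next piece, if any

    def weight(self, t: float) -> float:
        # t is normalized to [0, 1] within this piece.
        return self.a * t**3 + self.b * t**2 + self.c * t + self.d

# Linear cross-dissolve over the whole interval (the example in the text):
fade_out = BlendPiece(0.0, 0.0, -1.0, 1.0, intv=1.0)   # weight for BFi:   1 - T
fade_in  = BlendPiece(0.0, 0.0,  1.0, 0.0, intv=1.0)   # weight for BFi+1: T
```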
In FIG. 9A, the forward background blending data 135A for background frame BFi and the backward background blending data 137A for background frame BFi+1 are depicted along with a graph 139 showing how the blending data are applied to blend background frames BFi and BFi+1. The blending operation depicted in the graph is referred to as a cross-dissolve operation because, over the blending interval (i.e., the time between the two background frames), background frame BFi is effectively dissolved into background frame BFi+1. To generate an interpolated frame at a time tINT, background frame BFi is transformed in the forward direction according to its forward background motion information, and background frame BFi+1 is transformed in the backward direction according to its backward background motion information. Respective weights (i.e., multipliers) for the two frames are then calculated from the blending information for frames BFi and BFi+1: the weight for frame BFi is based on the forward blending information for frame BFi, and the weight for frame BFi+1 is based on the backward blending information for frame BFi+1. The weights for frames BFi and BFi+1 are then applied to the transformed versions of background frames BFi and BFi+1, respectively, and the resulting transformed, weighted images are combined (e.g., using pixel-by-pixel addition) to generate the interpolated frame.
As described above, in one embodiment the blending operator is implemented by storing the coefficients of a polynomial together with the portion of the blending interval to which the polynomial is to be applied. For example, the forward blending data 135A for frame BFi includes an interval fraction of one (INTV = 1), indicating that the blending operator defined by the coefficients A, B, C, D of blending data 135A is to be applied over the entire blending interval (in this case, the interval between tBFi and tBFi+1). In general, interval fractions smaller than one are used when the overall blending function includes discontinuities that cannot be adequately represented by a finite-order polynomial. In the blending operation depicted in graph 139, however, a continuous, first-order blending operation is indicated, so that the coefficients A, B, C, and D specified in blending data structure 135A are applied to the polynomial weight(T) = A*T^3 + B*T^2 + C*T + D to implement the weight weightBFi(T) = 1 - T. According to one embodiment, the value of T is normalized to the range 0 to 1 over the fraction of the blending interval in question, so that the coefficients A = 0, B = 0, C = -1, D = 1 yield a multiplier that decreases linearly over the entire blending interval: the weight for BFi starts at 1 and decreases linearly to 0 at the end of the blending interval. Similarly, the coefficients A = 0, B = 0, C = 1, and D = 0 specified in blending data structure 137A implement the weight weightBFi+1(T) = T for frame BFi+1, so that the weight for frame BFi+1 starts at 0 and increases linearly over the blending interval.
FIG. 9B shows a discontinuous blending function 141. In this case, the blending interval between background frames BFi and BFi+1 is divided into three interval segments 146, 147, and 148. During the first segment 146 of the blending interval, the weight applied to background frame BFi is held steady at one and the weight applied to background frame BFi+1 is held steady at zero. During the second segment 147 of the blending interval, a linear cross-dissolve occurs, and during the third segment 148 of the blending interval, the multipliers for background frames BFi and BFi+1 are again held steady, but at values opposite to those of the first segment 146. In one embodiment, the discontinuous blending function 141 is indicated by a linked list of blending data structures 135B, 135C, 135D, each of which indicates, through its INTV parameter, the segment of the blending interval over which it is to be applied. Thus, the first forward blending data structure 135B for background frame BFi contains the interval fraction INTV = 0.25 and a blending operator indicating that a unity multiplier, weightBFi(T) = 1, is to be applied to the transformed version of frame BFi during the first 25% of the blending interval, i.e., segment 146. The second forward blending data structure 135C for background frame BFi contains the interval fraction INTV = 0.5 and a blending operator, weightBFi(T) = 1 - T, indicating that during the middle 50% of the blending interval, i.e., segment 147, the weight applied to frame BFi decreases linearly from 1 to 0. Note that, for ease of illustration, the value of T is assumed to be normalized to the range 0 to 1 within each interval segment; other representations are naturally possible and are considered to be within the scope of the present invention. The third forward blending data structure 135D for background frame BFi contains the interval fraction INTV = 0.25 and a blending operator, weightBFi(T) = 0, indicating that during the last 25% of the blending interval (i.e., segment 148) frame BFi makes no contribution to the interpolated background frame.
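Evaluating a piecewise (possibly discontinuous) blending function of this kind might be sketched as below, reusing the `BlendPiece` record from the previous sketch; the 25% / 50% / 25% example mirrors the segments 146-148 described above.

```python
def piecewise_weight(first_piece, t):
    """Evaluate a linked list of BlendPiece records (see earlier sketch) at a
    global time t in [0, 1]; t is re-normalized to [0, 1] within the active piece."""
    start, piece = 0.0, first_piece
    while piece is not None:
        end = start + piece.intv
        if t <= end or piece.next is None:
            local_t = (t - start) / piece.intv if piece.intv > 0 else 0.0
            return piece.weight(min(max(local_t, 0.0), 1.0))
        start, piece = end, piece.next
    return 0.0

# The discontinuous example: hold at 1 for 25%, ramp 1 -> 0 for 50%, hold at 0 for 25%.
hold_one  = BlendPiece(0.0, 0.0,  0.0, 1.0, intv=0.25)
ramp      = BlendPiece(0.0, 0.0, -1.0, 1.0, intv=0.50)
hold_zero = BlendPiece(0.0, 0.0,  0.0, 0.0, intv=0.25)
hold_one.next, ramp.next = ramp, hold_zero

print(piecewise_weight(hold_one, 0.1),
      piecewise_weight(hold_one, 0.5),
      piecewise_weight(hold_one, 0.9))   # -> 1.0 0.5 0.0
```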
Referring still to FIG. 9B, the linked list of blending data structures 137B, 137C, 137D for background frame BFi+1 indicates a blending function that is the inverse of the blending function for background frame BFi. That is, a weight of 0 is applied to frame BFi+1 during the first 25% of the blending interval (indicating that frame BFi+1 makes no contribution to the interpolated background frame during that time), the weight applied to the transformed version of frame BFi+1 increases linearly from 0 to 1 during the middle 50% of the blending interval, and during the last 25% of the blending interval a unity multiplier (i.e., weight = 1) is applied to frame BFi+1 to produce the interpolated background frame.
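The discontinuous blending function of FIG. 9B can be modeled the same way, as an ordered list of interval fractions with their polynomial coefficients. The sketch below is a minimal illustration under assumed names and record layout (the patent does not prescribe this format): it locates the segment containing a normalized time t, re-normalizes T within that segment, and evaluates the segment's cubic.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlendSegment:
    intv: float        # fraction of the blending interval (e.g., 0.25)
    coeffs: tuple      # polynomial coefficients (A, B, C, D)

def piecewise_weight(segments: List[BlendSegment], t: float) -> float:
    """Return the blending weight at time t (0..1 over the whole interval).

    Each segment covers `intv` of the interval; T is re-normalized to 0..1
    within the segment before the cubic polynomial is evaluated.
    """
    start = 0.0
    for seg in segments:
        end = start + seg.intv
        if t <= end or seg is segments[-1]:
            t_local = (t - start) / seg.intv if seg.intv > 0 else 0.0
            a, b, c, d = seg.coeffs
            return a * t_local ** 3 + b * t_local ** 2 + c * t_local + d
        start = end
    return 0.0

# Forward blending for BFi per FIG. 9B: hold at 1, ramp 1 -> 0, hold at 0.
forward_bfi = [
    BlendSegment(0.25, (0, 0, 0, 1)),    # first 25%: weight = 1
    BlendSegment(0.50, (0, 0, -1, 1)),   # middle 50%: weight = 1 - T
    BlendSegment(0.25, (0, 0, 0, 0)),    # last 25%: weight = 0
]
print([round(piecewise_weight(forward_bfi, t), 2) for t in (0.1, 0.3, 0.5, 0.7, 0.9)])
```

The backward list for BFi+1 would simply hold the complementary coefficients, mirroring the inverse function described above.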
The purpose of applying a discontinuous blending function of the type shown in FIG. 9B is to reduce the distortion associated with blending successive keyframes. By holding the influence of a given keyframe steady for a fraction of the blending interval, distortion caused by differences between the forward- and backward-transformed versions of frames BFi and BFi+1 can be reduced. In one embodiment, operator input is received at an animation authoring system (e.g., element 12 of FIG. 1) to select the fraction of the blending interval over which the influence of a given keyframe is to be held steady. In an alternative embodiment, a measure of image sharpness (e.g., image gradient) may be computed for blended and unblended images to automatically determine the interval segments over which the influence of one image or the other should be held steady. Moreover, although a linear cross-dissolve is described above, other types of cross-dissolve may be specified by different polynomials. Also, instead of using polynomial coefficients to indicate the type of blending operation, other indicators may be used. For example, a value indicating whether to apply a linear, quadratic, transcendental, logarithmic or other blending operation may be stored in the blending data structure. Finally, although background blending is described primarily in terms of cross-dissolve operations, other blending effects may also be used to transition from one background frame to another, including but not limited to fades and various screen wipes.
FIG. 10 illustrates the manner in which the background track 33 and object tracks 35A, 35B of an exemplary animation object 30 may be used to synthesize an interpolated frame IFt during animation playback.
At a given time t, an interpolated frame IFt is generated from each pair of adjacent frames in the background track 33 and the object tracks 35A, 35B. Using the background motion and background blending information associated with the adjacent pair of background frames BFi and BFi+1, respective transformations and weightings are applied to the pair of background frames. Background frame BFi is transformed based on the forward background motion information (BM) associated with frame BFi and is then weighted based on the forward background blending information (BB) associated with frame BFi. The effect is to move the pixels of background frame BFi to their corresponding locations according to the forward motion information (e.g., translation, rotation, scaling, panning, tilting or skew) and then to scale the intensity of each pixel value by weighting the pixel values according to the blending operator. The pixels of background frame BFi+1 are likewise transformed and weighted using the backward motion and blending information for frame BFi+1. The resulting transformed images are then combined to produce an interpolated background frame 151A representing the background scene at time t. The object frames OF1i and OF1i+1 are transformed using the forward and backward object motion information (OM), respectively, weighted using the forward and backward object blending information (OB), respectively, and then combined. The resulting interpolated object frame is overlaid on the interpolated background frame 151A to generate an interpolated frame 151B that includes an interpolated background and an interpolated dynamic object. The object frames OF2i and OF2i+1 are similarly transformed, weighted and combined using their associated object motion and blending information (OM, OB) and then overlaid on the interpolated background. The result is a completed interpolated frame 151C. Subsequent interpolated frames are generated in the same manner, using the time-varying blending operators and progressively different transformations of the background and object frames based on the motion information. The net effect of animation playback is a video effect that approximates the original video used to produce the animation object 30. A soundtrack derived from the original video may also be played with the animation.
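The per-frame synthesis just described reduces to: warp each keyframe of an adjacent pair toward time t, scale each by its blending weight, sum the results, and overlay the interpolated object layers. The numpy sketch below illustrates the shape of that computation under simplifying assumptions (pure translation as the motion transform, precomputed blend weights, alpha masks supplied with the object layers); none of the names or signatures come from the patent.

```python
import numpy as np

def translate(image: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift an image by whole pixels, padding with zeros. This stands in for
    the general motion transform (translation, rotation, scale, pan, tilt, skew)."""
    out = np.zeros_like(image)
    h, w = image.shape[:2]
    out[max(dy, 0):min(h + dy, h), max(dx, 0):min(w + dx, w)] = \
        image[max(-dy, 0):min(h - dy, h), max(-dx, 0):min(w - dx, w)]
    return out

def interpolate_pair(frame_a, motion_a, weight_a, frame_b, motion_b, weight_b):
    """Warp two adjacent keyframes toward time t and blend them."""
    return (translate(frame_a, *motion_a) * weight_a
            + translate(frame_b, *motion_b) * weight_b)

def composite(background, object_layers):
    """Overlay interpolated object frames on the interpolated background.
    alpha may be a scalar or an array broadcastable against the image."""
    frame = background.astype(float).copy()
    for image, alpha in object_layers:
        frame = alpha * image + (1.0 - alpha) * frame
    return frame

# Example: halfway through a blending interval with equal weights.
bg_i, bg_i1 = np.zeros((120, 160, 3)), np.ones((120, 160, 3))
interp_bg = interpolate_pair(bg_i, (2, 0), 0.5, bg_i1, (-2, 0), 0.5)
```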
FIGS. 11 and 12 illustrate techniques for providing multiple resolutions of animation in an animation object. FIG. 11 illustrates a technique for providing multiple temporal resolutions of animation keyframes and FIG. 12 illustrates a technique for providing multiple spatial resolutions of animation keyframes. In one embodiment, an animation object is constructed to provide both types of multi-resolution playback, spatial and temporal. This gives the playback system user the option to increase or decrease the resolution of the animation sequence in the spatial domain, the temporal domain, or both. If the playback system has sufficient download bandwidth and processing power, the maximum temporal and spatial resolutions can be selected to present the highest resolution animation playback. If the playback system does not have sufficient download bandwidth or processing power to handle the maximum spatial and temporal resolutions, the playback system can automatically reduce the spatial or temporal resolution of the animation being played back based on user-selected criteria. For example, if the user has indicated a desire to view the maximum spatial resolution image (i.e., a larger, higher resolution image) even if it means fewer keyframes and more interpolated frames, a maximum or near-maximum spatial resolution keyframe may be selected for display while a keyframe track (i.e., a background track or object track) with fewer keyframes per unit time is selected. Conversely, if a user desires greater temporal resolution (i.e., more keyframes per unit time) even if the spatial resolution must be reduced, a maximum or near-maximum temporal resolution keyframe track may be selected, with the keyframes displayed at reduced spatial resolution.
Another contemplated use of reduced temporal resolution is fast forward and backward scanning within the animation. During animation playback, a user may signal a time multiplier (e.g., 2X, 5X, 10X, etc.) in a request to view the animation at a faster rate. In one embodiment, the request for a fast scan is satisfied by using the time multiplier, along with the bandwidth capabilities of the playback system, to select an appropriate temporal resolution for the animation. At very fast playback rates, the spatial resolution of the animation may also be reduced. A time multiplier may similarly be used to slow animation playback below its natural rate to achieve a slow-motion effect.
FIG. 11 shows a multi-level temporal background track 161. An object track may be similarly configured. In the first level background track 35A, a maximum number of background frames (each labeled "BF") are provided, along with background motion and blending information for interpolation between successive pairs of background frames. The number of background frames per unit time can range from the video frame rate (in which case the motion and blending information would indicate no interpolation, only a switch to the next frame) down to a small fraction of the video frame rate. The second level background track 35B has fewer background frames than the first level background track 35A, the third level background track 35C has fewer background frames than the second level background track 35B, and so on to the Nth level background track 35D. Although FIG. 11 shows the number of background frames in the second level background track 35B being half the number of background frames in the first level background track 35A, other ratios may be used. The motion and blending information (BM2, BB2) used for interpolation between successive pairs of background frames in the second level background track 35B is different from the motion and blending information (BM1, BB1) used for the first level background track 35A, because the frame-to-frame transformations at different track levels are different. Likewise, the third level background track 35C has fewer background frames than the second level background track 35B and thus different frame-to-frame motion and blending information (BM3, BB3). The temporal resolution of each subsequent background track level is incrementally reduced until the minimum resolution background track at level N is reached.
In one embodiment, the background track levels above the first level 35A do not actually contain separate sequences of background frames. Instead, pointers to background frames in the first level background track 35A are provided. For example, the first background frame 62B in the second level background track 35B is indicated by a pointer to the first background frame 62A in the first level background track 35A, the second background frame 63B in the second level background track 35B is indicated by a pointer to the third background frame 63A in the first level background track 35A, and so on. Each such pointer to a background frame in the first level background track 35A may be combined in a data structure with motion and blending information indicating the transformation to the next background frame and motion and blending information indicating the transformation to the previous background frame. A linked list of these data structures may then be used to indicate the sequence of background frames. Other data structures and techniques for indicating the sequence of background frames may be used without departing from the spirit and scope of the present invention.
In an alternative embodiment, each background track level 35A, 35B, 35C, 35D is formed by selecting a plurality of reference values for background frames from a set (or pool) of background frames. In this embodiment, the reference values used to form a background track of a given level effectively define a sequence of keyframes having a temporal resolution determined by the number of reference values. A reference value used to select a background frame may be a pointer to the background frame, an index indicating the location of the background frame in a table, or any other value that may be used to identify a background frame.
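Both embodiments amount to the same idea: higher track levels store references into a shared pool of keyframes rather than duplicate images. A minimal sketch, assuming plain Python lists of indices as the reference values (the pool contents, index scheme, and names are illustrative assumptions):

```python
# Shared pool of background keyframes (e.g., decoded images or file offsets).
keyframe_pool = ["BF_0.png", "BF_1.png", "BF_2.png", "BF_3.png",
                 "BF_4.png", "BF_5.png", "BF_6.png", "BF_7.png", "BF_8.png"]

# Each track level is just a sequence of reference values into the pool.
# Level 1 uses every keyframe; level 2 every other one; level 3 every fourth.
track_levels = {
    1: list(range(0, len(keyframe_pool), 1)),
    2: list(range(0, len(keyframe_pool), 2)),
    3: list(range(0, len(keyframe_pool), 4)),
}

def keyframes_for_level(level: int):
    """Resolve a track level's reference values back to pool entries."""
    return [keyframe_pool[i] for i in track_levels[level]]

print(keyframes_for_level(3))   # ['BF_0.png', 'BF_4.png', 'BF_8.png']
```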
In one embodiment, the motion and blending information for a higher level background track in the multi-level background track 161 may be obtained by combining multiple sets of motion and blending information from a lower level background track. For example, the background motion and blending information (BM2, BB2) used for transitioning between background frames 62B and 63B in the second level background track may be generated by combining the background motion and blending information for transitioning between background frames 62A and 64 with the background motion and blending information for transitioning between background frames 64 and 63A. In an alternative embodiment, background motion and blending information for higher level background tracks may be generated from the background frames in the track themselves, without using motion and blending information from a lower level background track.
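If each per-step motion is expressed as a 3x3 homogeneous transform (an assumption made here for illustration; the patent does not require a matrix representation), the motion for a higher-level transition that skips a keyframe is simply the composition of the lower-level motions, i.e., a matrix product:

```python
import numpy as np

def translation(dx, dy):
    return np.array([[1.0, 0.0, dx],
                     [0.0, 1.0, dy],
                     [0.0, 0.0, 1.0]])

def scale(s):
    return np.array([[s,   0.0, 0.0],
                     [0.0, s,   0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical level-1 motions: 62A -> 64 is a small pan, 64 -> 63A a slight zoom.
bm_62a_to_64 = translation(4.0, 0.0)
bm_64_to_63a = scale(1.05)

# The level-2 motion 62B -> 63B is the composition of the two level-1 steps
# (applied right to left when points are transformed as p' = M p).
bm_62b_to_63b = bm_64_to_63a @ bm_62a_to_64
print(bm_62b_to_63b)
```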
FIG. 12 shows a multi-spatial-resolution background frame. Each background frame in a background track may be represented at multiple spatial resolutions, BF1, BF2, ..., BFN. Background frame BF1 is the maximum spatial resolution background frame and, in one embodiment, includes the same number of pixels as an original video frame. Background frame BF2 has a lower spatial resolution than BF1, meaning that BF2 either has fewer pixels than BF1 (i.e., a smaller image) or a larger block size. Block size refers to the unit of visual information, usually a group of pixels, used to represent an image. A smaller block size yields a higher spatial resolution image because finer units are used to characterize the image. A larger block size yields lower spatial resolution but requires less overall information because a single pixel value is applied to the group of pixels within a block.
Fig. 13 illustrates the use of a server system to control the content of animation data transmitted to the playback systems 18A, 18B, 18C. According to one embodiment, the server system 16 receives requests to download different animation objects 14A, 14B, 14C stored in a computer readable storage device 170. Before downloading an animation object, the server system 16 first queries the playback systems 18A, 18B, 18C to determine the capabilities of those systems. For example, in response to a request from playback system 18A to download an animation object 30C, server system 16 may request that playback system 18A provide a set of playback system characteristics that may be used by server system 16 to generate an appropriate animation data stream. As shown in fig. 13, the set of playback system characteristics associated with the playback systems 18A, 18B, 18C may include, but is not limited to, the download bandwidth of the playback system or its network access medium, the processing capabilities of the playback system (e.g., number of processors, speed of processors, etc.), the graphics capabilities of the playback system, the software applications used by the playback system (e.g., type of web browser), the operating system executing the software applications, and a set of user preferences. User preferences may include a preference to sacrifice temporal resolution in favor of spatial resolution and vice versa. Moreover, user preferences may be dynamically adjusted by a user of the playback system during animation download and display.
In one embodiment, animation objects 14A, 14B, 14C are stored in a multi-temporal resolution and multi-spatial resolution format and server system 16 selects the background and object tracks from an animation object (e.g., animation object 30C) having a temporal and spatial resolution best suited to the characteristics provided by the playback system. Thus, as shown in graph 172, server system 16 may select different temporal/spatial resolution versions 174A, 174B, 174C of the same animation object 30C to download to playback systems 18A, 18B, 18C according to their respective characteristics. Moreover, the server system may dynamically adjust the temporal/spatial resolution of the animation provided to a given playback system 18A, 18B, 18C based on the characteristics of the playback system 18A, 18B, 18C.
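A server following this scheme only needs a policy that maps the reported playback characteristics to one of the stored temporal/spatial variants. The sketch below shows one hypothetical policy; the thresholds, field names, and variant labels are invented for illustration and are not specified by the patent.

```python
from dataclasses import dataclass

@dataclass
class PlaybackCharacteristics:
    bandwidth_kbps: int
    cpu_score: int              # abstract processing-capability score
    prefer_spatial: bool        # user preference: spatial over temporal detail

def select_variant(chars: PlaybackCharacteristics) -> str:
    """Pick a temporal/spatial resolution variant of an animation object."""
    if chars.bandwidth_kbps >= 2000 and chars.cpu_score >= 80:
        return "full-temporal/full-spatial"
    if chars.prefer_spatial:
        # Keep large keyframes, but send fewer of them per unit time.
        return "reduced-temporal/full-spatial"
    return "full-temporal/reduced-spatial"

print(select_variant(PlaybackCharacteristics(500, 60, prefer_spatial=True)))
```

Because the user preferences may change during download, the same selection function could be re-evaluated periodically to adjust the stream, matching the dynamic adjustment described above.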
Although FIG. 13 illustrates the use of a server system to control the content of an animation data stream delivered over a communication network, similar techniques may be employed within a playback system to dynamically select between multiple temporal and spatial resolution animation tracks. For example, selection logic within a playback system may provide to display logic in the playback system an animation data stream having a temporal/spatial resolution appropriate to the characteristics of the playback system. As one example, a DVD player may be designed to reduce the temporal or spatial resolution of an animation playback depending on whether one or more other videos or animations are being displayed concurrently (e.g., in another area of a display).
As described above, an advantage of embodiments of the present invention is that keyframes of an animation are associated with video frames of a video source so that a user can switch between viewing the animation and viewing the video source during playback. This association between keyframes and video frames is called "cross-linking" and is particularly useful when one representation, animation or video, has advantages over the other. For example, in one embodiment of an animation playback system described below, the user is notified during video playback when a sequence of video frames is linked to a still image that forms part of the animation. As described below, the still image may have a higher or more variable resolution, a wider field of view (e.g., a panoramic image), a higher dynamic range, or a different aspect ratio than the video frames. The still image may also contain stereoscopic disparity information or other depth information to allow stereoscopic three-dimensional (3D) display. When such a still image is known to be available, the user may provide input to switch from the video display to an animation display during playback to obtain the attendant advantages of the animation (e.g., higher resolution images). Alternatively, the user may pause the video display to navigate within a panoramic image of the animation or to zoom in or out on a still image of the animation. In another embodiment, the user may play an animation and a video in picture-in-picture mode or switch from an animation display to a cross-linked video.
In one embodiment, cross-linking comprises generating still images from a video and then creating cross-links between the still images and the frames of the video. In an alternative embodiment, the still images may be generated from a video other than the video to which they are cross-linked. Techniques for generating still images from a video are described below. It is to be understood that other similar techniques may be used to generate still images without departing from the spirit and scope of the present invention.
By integrating multiple video frames over time, a still image having a higher spatial resolution than the frames of the video source may be obtained. Images in video frames that are close together in time often exhibit small displacements (e.g., sub-pixel motion) as a result of panning, zooming, or other motion of the camera. The displacement allows multiple video frames to be spatially registered to produce a higher resolution image. A high resolution still image may then be generated by interpolating between adjacent pixels in these spatially registered video frames.
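As a rough illustration of this idea (not the patented method), the sketch below assumes the inter-frame displacements are already known in sub-pixel units, places each low-resolution frame onto a 2x-upsampled grid at its registered position, and averages the accumulated samples. A real implementation would also estimate the displacements and interpolate rather than accumulate nearest samples.

```python
import numpy as np

def fuse_frames(frames, offsets, factor=2):
    """Accumulate spatially registered grayscale frames onto an upsampled grid.

    frames  -- list of 2-D arrays of identical shape
    offsets -- list of (dx, dy) displacements in source pixels (sub-pixel ok)
    """
    h, w = frames[0].shape
    acc = np.zeros((h * factor, w * factor))
    cnt = np.zeros_like(acc)
    ys, xs = np.mgrid[0:h, 0:w]
    for frame, (dx, dy) in zip(frames, offsets):
        # Map each source pixel to its registered position on the fine grid.
        gy = np.clip(np.round((ys + dy) * factor).astype(int), 0, h * factor - 1)
        gx = np.clip(np.round((xs + dx) * factor).astype(int), 0, w * factor - 1)
        np.add.at(acc, (gy, gx), frame)
        np.add.at(cnt, (gy, gx), 1.0)
    cnt[cnt == 0] = 1.0          # grid cells with no sample remain zero here
    return acc / cnt
```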
Alternatively, still images may be extracted from a second video source that presents a higher resolution than the video to which the still images are linked. Motion pictures, for example, are typically recorded on film, which has many times higher resolution than the NTSC video format typically used for videotape.
A still image may also have a wider dynamic range than a video frame to which it is cross-linked. Dynamic range refers to the range of discernible intensity levels of each color component of a pixel in an image. Because the exposure settings of a camera can change from frame to frame to accommodate changing illumination conditions (e.g., an auto iris), a sequence of video frames may exhibit subtle variations in intensity that can be combined into a still image with increased dynamic range relative to the individual video frames. Alternatively, a still image can be generated from a video source with a wide dynamic range (e.g., film) and then cross-linked with a video having a narrower dynamic range.
A still image may also have a different aspect ratio than a video frame to which it is cross-linked. Aspect ratio refers to the ratio of the width to the height of an image. For example, a still image may be produced from a video source having a relatively wide aspect ratio, such as film, and then cross-linked with a different video source having a narrower aspect ratio, such as NTSC video. A typical aspect ratio for film is 2.2:1; NTSC video, by contrast, has an aspect ratio of 4:3.
Video frames resulting from camera panning may be recorded and combined to produce a panorama. Video frames resulting from camera zoom may be recorded and combined to produce a large still image having different resolutions in different areas (i.e., a multi-resolution image). The areas of a multi-resolution image where camera zooming occurred will contain higher resolution than other areas of the multi-resolution image. Panoramic and multi-resolution still images are types of images referred to herein as navigable images. In general, a navigable image is any image that can be panned or zoomed to provide different views, or that contains three-dimensional information that can be navigated. Although panoramic and multi-resolution still images may be represented by a single composite image, a panoramic or multi-resolution still image may also be represented by discrete still images that are spatially registered.
Stereoscopic image pairs may be obtained from a sequence of video frames that exhibit horizontal camera tracking motion. Two video frames recorded from separate viewpoints (e.g., viewpoints separated by an interpupillary distance) may be selected from the video sequence as a stereo image pair. Stereoscopic images may be presented using a number of different stereoscopic viewing devices, such as stereoscopic 3D displays, stereoscopic glasses, and the like.
Additionally, the stereo images may be analyzed using, for example, image correlation or feature matching techniques to identify corresponding pixels or image features in a given stereo image pair. These corresponding pixels or image features can then be used to establish the depth of the pixels and thus produce a 3D range image. The range image may be used in a variety of applications, including constructing 3D models and generating new views of a scene by extrapolating from and interpolating between images.
Fig. 14A illustrates the use of a cross-link generator 203 to establish cross-links between a video source 10 and an animation 14 generated from the video source 10 by an animation authoring system 12. The video source may be compressed by a video encoder 201 (e.g., a vector quantizer) before being received by the cross-link generator 203. According to one embodiment, the cross-link generator 203 generates a cross-link data structure that includes respective pointers to keyframes in the animation that correspond to frames in the video source.
Fig. 14B illustrates the use of the cross-link generator 203 to establish cross-links between a video source 10 and an animation 205 generated from a separate video source 204. The separate video source 204 may have been used to generate the video source 10, or the two video sources 10, 204 may not be correlated. If the two video sources 10, 204 are not correlated, operator assistance may be required to identify which images in the animation 205 are to be cross-linked with the frames of the video source 10. If two video sources 10, 204 are correlated (e.g., one is film and the other is NTSC formatted video), then temporal correlation or scene correlation can be used by the cross-link generator to automatically cross-link the images in the animation 205 and the frames in the video source 10.
FIG. 15 illustrates a cross-linking data structure 212 according to one embodiment. The data elements in the cross-linking data structure 212 are referred to as video frame elements (VFEs), and each corresponds to a respective frame of a video source. Thus, elements VFE1, VFE2, VFE3, VFEi and VFEi+1 correspond to frames VF1, VF2, VF3, VFi and VFi+1 of a video source (not shown). As shown, the cross-linking data structure 212 is implemented as a linked list, in which each video frame element includes a pointer to the next video frame element and a pointer to a background frame 215, 216 in an animation. In an alternative embodiment, the cross-linking data structure 212 is implemented as an array of video frame elements rather than a linked list. In another alternative embodiment, the cross-linking data structure 212 is implemented as a tree data structure rather than a linked list. A tree data structure is useful for establishing associations between non-adjacent video segments and for searching to find specific video frames. In general, the cross-linking data structure 212 may be represented by any type of data construct without departing from the spirit and scope of the present invention.
In one embodiment, the background frames in an animation are represented by background frame data structures 215, 216, each of which includes a pointer to the next background frame data structure (NEXT PTR), a pointer to the previous background frame data structure (PREV PTR), an image pointer (IMAGE PTR), a pointer to interpolation information (INTERP PTR), a timestamp, and a pointer to one or more elements in the cross-linking data structure 212 (VF PTR). The NEXT PTR, PREV PTR, IMAGE PTR and INTERP PTR are described above with reference to FIG. 8.
The VF PTR in a particular background frame data structure 215, 216 and the pointer to the background frame data structure in a corresponding element of the cross-link data structure 212 form a cross-link 217. That is, the background frame data structure and the video frame elements include respective mutual references. The reference may be a uniform resource locator, a memory address, an array index, or any other value used to correlate a background frame data structure and a video frame element.
Referring to the background frame data structure 215, although the VF PTR is shown in FIG. 15 as referring only to a single video frame element (VFE1) in the cross-linking data structure 212, the VF PTR may include separate pointers to each of several video frame elements. For example, the VF PTR may be a data structure that includes separate pointers to video frame elements VFE1, VFE2 and VFE3. Alternatively, the VF PTR may be a data structure that includes a pointer to a video frame element (e.g., VFE1) and a value indicating the total number of video frame elements to which the background frame data structure 215 is linked. Other data constructs for cross-linking a background frame data structure and a sequence of video frame elements may be used in alternative embodiments.
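Putting the pieces of FIG. 15 together, a cross-link is simply a pair of mutual references between a video frame element and a background frame record. The Python rendering below follows the figure loosely (field names and types are illustrative choices, not a stored binary layout):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BackgroundFrame:
    timestamp: float
    image_ref: str                                   # IMAGE PTR: where the background image lives
    image_type: str                                  # e.g., "still", "panorama", "high-res", "zoomable"
    next: Optional["BackgroundFrame"] = None         # NEXT PTR
    prev: Optional["BackgroundFrame"] = None         # PREV PTR
    video_frame_elements: List["VideoFrameElement"] = field(default_factory=list)  # VF PTR

@dataclass
class VideoFrameElement:
    frame_number: int
    background_frame: Optional[BackgroundFrame] = None   # pointer into the animation
    next: Optional["VideoFrameElement"] = None            # linked-list link

def cross_link(vfe: VideoFrameElement, bf: BackgroundFrame) -> None:
    """Create the mutual references that form a cross-link (217 in FIG. 15)."""
    vfe.background_frame = bf
    bf.video_frame_elements.append(vfe)
```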
In one embodiment, the image pointer (IMAGE PTR) in each background frame data structure 215, 216 includes an image type member indicating whether the background image from which the image data for the background frame was obtained is, for example, a non-composite still image (i.e., a video frame from which dynamic objects, if any, have been removed), a high resolution still image, a panorama, or another type of composite image. The image pointer also includes members indicating the location of the background image in memory and the offset within the background image at which the image data for the background frame is located.
A text descriptor may also be included in each background frame data structure 215, 216. In one embodiment, the text descriptor is a pointer to a textual description (e.g., a string) that describes the portion of the animation spanned by the background frame. The textual description may be displayed as an overlay (e.g., in a control bar) on the animation or elsewhere on the display. During cross-linking, appropriate default values may be assigned to the respective textual descriptions depending on the type of motion identified. Referring to FIG. 16, for example, the default textual descriptions for the three depicted animation segments 221, 223, 225 might be "camera still", "camera pan" and "camera zoom", respectively. These default values may be edited by the user during cross-linking or later during video or animation playback. In an alternative embodiment, the text descriptor in the background frame data structures 215, 216 is not a pointer but an index that can be used to select a textual description from a table of textual descriptions.
Using the cross-linking arrangement described above, when a video frame is being displayed, the corresponding video frame element of the cross-linking data structure 212 may be dereferenced to identify a cross-linked background frame data structure 215, 216 in the animation. The image pointer in the background frame data structure 215, 216 may then be inspected to determine whether the background frame is drawn from a composite image or a non-composite image. In the case of a composite image, the user may be notified (e.g., by a visual or audio prompt) that a composite image is available during the video playback. The user may then choose to play the animation or to view and navigate within the background image. For example, in the case of a panorama, the user may view the panorama using a panorama viewing tool (i.e., program code which, when executed by a general purpose computer, renders a user-selected portion of a composite image onto a display). Similarly, in the case of a high resolution still image, the user may wish to view the image as a still frame to make out details that are unavailable or difficult to discern in the video source. In the case of a zoomable still image, the user may zoom in or out on the still frame. Other functions enabled by the animation may also be performed, such as selecting a designated hot spot in the animation, isolating a dynamic object in the animation, directing object or background motion, and so forth.
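Continuing the sketch above, the playback-time check reduces to a couple of reference lookups: follow the cross-link from the displayed frame's video frame element, inspect the image type, and decide which prompt, if any, to present. The icon names follow FIG. 17; everything else is a hypothetical illustration.

```python
from typing import Optional

def prompt_for_frame(vfe) -> Optional[str]:
    """Return the control-bar icon to activate for the video frame being shown."""
    bf = vfe.background_frame          # follow the cross-link, if any
    if bf is None:
        return None                    # frame has no cross-linked animation keyframe
    icons = {"panorama": "PAN", "high-res": "STILL", "zoomable": "ZOOM"}
    return icons.get(bf.image_type)    # None for ordinary, non-composite stills
```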
FIG. 16 is a schematic representation of the cross-linking relationship between a sequence of video frames 230 in a video source and background images 231 from an animation that has been generated using the animation authoring techniques described above. As shown, the sequence of video frames includes four video segments 222, 224, 226, 228, each of which is associated with a corresponding background image 221, 223, 225, 227 via cross-links 217. Video segment 222 depicts a stationary scene (i.e., stationary within some motion threshold) and is cross-linked to a corresponding stationary background image 221. Video segment 224 depicts a scene resulting from camera panning and is cross-linked to a corresponding panorama 223 generated by processing and stitching together two or more frames from video segment 224. Video segment 226 depicts a scene resulting from camera zoom and is cross-linked to a high resolution, variable resolution still image 225. Video segment 228 depicts a scene resulting from motion around one or more 3D objects and is cross-linked to a 3D object image 227. As described above, high resolution still images and 3D object images are generated by processing and combining frames from the corresponding video segments (e.g., video segments 222, 224, 226, 228).
FIG. 17 depicts a display 241 generated by a playback system. According to one embodiment, the playback system may render either a video or an animation on the display 241. As shown in FIG. 17, the playback system is presenting a video on the display 241. At the bottom of the display 241, a control bar is presented that includes rewind, play, pause and stop buttons. According to one embodiment, as each video frame is rendered, the cross-link between the corresponding video frame element and a background frame in an animation is followed to determine whether the background frame is drawn from a high resolution still image, a panoramic image or a zoomable image. If, for example, the background frame is drawn from a panoramic image, the icon labeled PAN in FIG. 17 is displayed, highlighted or otherwise indicated as active. A tone may also be generated to indicate that a panoramic image is available. In response to an indication that a panoramic image is available, the user may tap or select the PAN icon (e.g., using a cursor control device such as a mouse or other handheld control device) to cause the display of the video to be paused and the panoramic image to be displayed. When the panoramic image is displayed, program code for navigating within the panoramic image may be automatically loaded into the working memory of the playback system, if not already resident, and executed to allow the user to pan, tilt and zoom the viewing perspective of the animation. As with the PAN icon, when the STILL or ZOOM icon becomes active, the user can tap the appropriate icon to view a high resolution still image or a zoomable image.
The video may also be linked to one or more three-dimensional objects or scenes associated with the video. When a link to a three-dimensional object is referenced during playback of the video in a manner similar to that described above, a particular view of the three-dimensional object is displayed. Program code is executed to allow a user to change the orientation and position of a virtual camera in a three-dimensional coordinate system to generate different perspective views of the object.
In one embodiment, the control bar further includes an ANIM/VIDEO icon that can be used to toggle between the display of a video and the display of an animation that is cross-linked to the video. When the user taps the ANIM/VIDEO icon, the video frame element corresponding to the currently displayed video frame is examined to identify a cross-linked frame in the animation. The timestamp of the cross-linked frame in the animation is used to determine a relative start time within the background and object tracks of the animation, and the playback system begins rendering the animation. If, during playback of the animation, the user taps the ANIM/VIDEO icon again, the current background frame data structure is examined to identify a cross-linked frame in the video, and video playback is then resumed at that cross-linked frame.
FIG. 18 illustrates an alternative display 261 generated by the playback of an animation in a playback system. In one embodiment, a control bar 262 in the display 261 includes icons (REWIND, PLAY, PAUSE and STOP) for rewinding, playing, pausing and stopping the animation. The control bar also includes a resolution selector in the form of a slider bar 264 that allows the playback system user to indicate a relative preference for temporal versus spatial resolution in the playback of the animation. By selecting the slider 265 within the slider bar 264 with a cursor control device and moving the slider 265 left or right, the user can adjust the preference for spatial and temporal resolution. For example, when the slider 265 is at the leftmost position within the slider bar 264, a preference for maximum spatial resolution is indicated, and when the slider 265 is moved to the rightmost position within the slider bar 264, a preference for maximum temporal resolution is indicated.
An ANIM/VIDEO icon is presented in control bar 262 to allow the user to toggle between a video display and a cross-linked animation display. According to the embodiment shown in FIG. 18, when the animation has been selected for display, the cross-linked video is simultaneously displayed in a sub-window 268 in a picture-in-picture format. When the user taps the ANIM/VIDEO icon, the video is presented in the main viewing area of the display 261 and the animation is presented in the sub-window 268. The picture-in-picture function may be enabled or disabled from a menu (not shown) presented on the display 261.
The cross-linking between an animation and a video can be used to provide a variety of useful effects. For example, by cross-linking a navigable image of a store with frames of a video that depicts the store's street frontage, a user viewing the video can be prompted to switch to the animation to shop for goods and services offered in the store. Transactions for goods and services may be effected electronically via a communications network. This sort of cross-linking is particularly effective when the navigable image is a panoramic or other composite image of a location within a scene of the video, for example, when the video includes a navigable environment (e.g., an airplane, space shuttle, submarine, ocean vessel, building, etc.). Imagine, for example, a video scene in which a character on an ocean vessel walks past a souvenir shop. The viewer can stop the video and browse the souvenir shop in a natural and intuitive way.
Another useful application of cross-linking is to allow a user to edit a video. The user may link an animation sequence to the video such that the animation sequence is automatically referenced when a cross-linked frame of the video is reached. When the end of the animation sequence is reached, display of the video may resume at another cross-linked video frame. In this way the user may selectively add out-takes to certain scenes of a video or replace certain portions of the video with animation sequences.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made to these specific exemplary embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims (84)
1. A computer-implemented method of generating an animation, the method comprising the steps of:
examining a sequence of video images to identify a first transformation of a scene depicted in the sequence of video images;
obtaining a first image and a second image from the sequence of video images, the first image representing the scene prior to the first transformation and the second image representing the scene after the first transformation; and
generating information that indicates the first transformation and that can be used to interpolate between the first image and the second image to produce a video effect that approximates display of the sequence of video images.
2. The method of claim 1, wherein the step of examining a sequence of video images to identify a first transformation of a scene comprises determining when a difference between a selected one of the video images and a subsequent one of the video images exceeds a threshold, the selected one of the video images and the subsequent one of the video images indicating a beginning image and an ending image, respectively, of a segment of the video images.
3. The method of claim 2, wherein the beginning image of the segment of the video images indicates an ending image of a previous segment of the video images.
4. The method of claim 2, wherein the step of determining when the difference between the selected one of the video images and a subsequent one of the video images exceeds a threshold comprises:
selecting a video image subsequent to the beginning image from the video images of the sequence;
comparing the video image subsequent to the start image with an adjacent previous video image from the sequence of video images to generate an incremental difference value;
adding the incremental difference to a sum of the incremental differences; and
repeating the acts of selecting, comparing, and adding until the sum of the incremental differences exceeds the threshold.
5. The method of claim 4, wherein the subsequent one of the video images is a video image used to generate an incremental difference value that when added to the sum of the incremental difference values causes the sum of the incremental difference values to exceed the threshold.
6. The method of claim 5, wherein the ending image of the segment of the video images is adjacent to the subsequent one of the video images.
7. The method of claim 2, wherein the difference between the selected one of the video images and a subsequent one of the video images comprises a difference caused by a change in configuration of a camera used to record the sequence of video images.
8. The method of claim 2, wherein the difference between the selected one of the video images and a subsequent one of the video images comprises a color difference.
9. The method of claim 2, wherein the difference between the selected one of the video images and a subsequent one of the video images comprises an elapsed time difference between the selected video image and the subsequent one of the video images.
10. The method of claim 2, wherein the step of obtaining the first image and the second image from the sequence of video images comprises selecting the beginning image and the ending image of the segment of the video images as the first image and the second image, respectively.
11. The method of claim 2, wherein the step of obtaining a second image from the sequence of video images comprises identifying one or more dynamic objects in the ending image; and removing the one or more dynamic objects to generate the second image.
12. The method of claim 11, wherein the step of identifying one or more dynamic objects in the end image comprises identifying one or more features in the set of video images that are subject to a second transformation in the set of video images that is not indicated by the first transformation.
13. The method of claim 12, wherein the second transformation comprises a change in the configuration of the one or more dynamic objects not caused by a change in the configuration of a camera used to record the sequence of video images.
14. The method of claim 1, wherein the step of generating information indicative of the first transformation and usable for interpolation between the first image and the second image comprises:
generating a value indicative of a measure of variation between the first image and the second image; and
generating a value that indicates a time elapsed between display of the first image and display of the second image.
15. The method of claim 14, wherein the step of generating a value indicative of a measure of the change comprises generating a value indicative of a measure of the change caused by a change in configuration of a camera used to record the sequence of video images.
16. The method of claim 14, wherein the step of generating a value indicative of a measure of change comprises generating a value indicative of a measure of color change.
17. A computer-implemented method of generating an animation, the method comprising the steps of:
identifying a first transformation of a scene depicted in a sequence of video images, the first transformation indicating a change in configuration of a camera used to record the sequence of video images;
identifying a second transformation of a scene depicted in the sequence of video images, the second transformation indicating a change in configuration of an object in the scene;
removing respective regions including the object from first and second images of the sequence of video images to generate first and second background images; and
generating background information indicative of the first transformation and usable for interpolation between the first background image and the second background image to produce an interpolated background image, the interpolated background image being displayable to approximate the first transformation of the scene.
18. The method of claim 17, further comprising the steps of:
generating first and second object images containing respective areas removed from first and second images of the sequence of video images, the first object image representing the dynamic object prior to a second transformation and the second object image representing the dynamic object after the second transformation; and
generating object information indicating the second transformation and usable for interpolation between the first object image and the second object image to produce an interpolated object image, the interpolated object image being displayable to approximate the change in configuration of the object in the scene.
19. The method of claim 18, further comprising the steps of:
storing the first and second background images and the background information in a background track in an animation object; and
storing the first and second object images and the object information in an object track in the animation object.
20. The method of claim 19, further comprising the step of sending the animation object over a computer network in response to a request from an animation playback device.
21. An animation authoring system comprising:
a background track generator for examining a sequence of video images and generating therefrom a background track comprising a sequence of background frames and transformation information which can be used to interpolate between the background frames to synthesize additional images;
an object track generator for examining a sequence of video images and generating therefrom an object track comprising a sequence of object frames and transformation information which can be used to interpolate between the object frames to synthesize additional object images.
22. The animation authoring system of claim 21 further comprising an animation object generator for storing the background track and the object track in an animation object for later recall.
23. An animation delivery system comprising the animation authoring system of claim 22 and further comprising a communication device for receiving a request from one or more client devices to download the animation object and responsively transmitting the animation object to the one or more client devices.
24. The animation authoring system of claim 22 wherein playback timing information is stored in the animation object to indicate relative playback times for the object track and the background track.
25. The animation authoring system of claim 21 wherein at least one of the background track generator and the object track generator is implemented by a programmed processor.
26. The animation authoring system of claim 21 wherein the background track generator comprises:
a scene change estimator for decomposing the sequence of video images into one or more video segments;
a background motion estimator for generating the transform information based on corresponding transforms in the one or more video segments; and
a background frame constructor for generating the sequence of background frames according to the corresponding transform in the one or more video segments.
27. The animation authoring system of claim 26 wherein the background track generator further comprises: a blending estimator for generating blending information for combining background frames in the sequence of background frames.
28. The animation authoring system of claim 27 wherein the blending information indicates a cross-dissolve operation.
29. The animation authoring system of claim 26 wherein the background frame constructor generates at least one background frame of the sequence of background frames by combining one or more images from the one or more video segments.
30. The animation authoring system of claim 29 wherein the background frame constructor assembles the one or more images by stitching the one or more images into a panoramic image.
31. The animation authoring system of claim 29 wherein the background frame constructor combines the one or more images into a high resolution image.
32. A computer-readable medium having stored thereon instructions that, when executed by a processor, cause the processor to:
examining a sequence of video images to identify a first transformation of a scene depicted in the sequence of video images;
obtaining a first image and a second image from the sequence of video images, the first image representing the scene prior to the first transformation and the second image representing the scene after the first transformation; and
generating information that indicates the first transformation and that can be used to interpolate between the first image and the second image to produce a video effect that approximates display of the sequence of video images.
33. The computer-readable medium of claim 32, wherein the computer-readable medium comprises one or more mass storage disks.
34. The computer readable medium of claim 33, wherein the computer readable medium is a computer data signal encoded in a carrier wave.
35. The computer readable medium of claim 33 wherein the instructions that cause the processor to examine a sequence of video images to identify a first transformation of a scene comprise instructions that, when executed, cause the processor to determine when a difference between a selected one of the video images and a subsequent one of the video images exceeds a threshold, the selected one of the video images and the subsequent one of the video images indicating a beginning image and an ending image, respectively, of a group of video images.
36. The computer readable medium of claim 35 wherein the instructions that cause the processor to determine when a difference between a selected one of the video images and a subsequent one of the video images exceeds a threshold comprise instructions that, when executed, cause the processor to:
selecting a video image subsequent to the beginning image from the video images of the sequence;
comparing the video image subsequent to the start image with an adjacent previous video image from the sequence of video images to generate an incremental difference value;
adding the incremental difference to a sum of the incremental differences; and
repeating the acts of selecting, comparing, and adding until the sum of the incremental differences exceeds the threshold.
37. A computer readable medium having stored thereon data for displaying a sequence of images from an animation, wherein the animation is generated by:
examining a sequence of video images to identify a first transformation of a scene depicted in the sequence of video images;
obtaining a first image and a second image from the sequence of video images, the first image representing the scene prior to the first transformation and the second image representing the scene after the first transformation; and
generating information that indicates the first transformation and that can be used to interpolate between the first image and the second image to produce a video effect that approximates display of the sequence of video images.
38. A method of linking a video and an animation, comprising:
generating a data structure containing elements corresponding to respective frames of a first video; and
storing, in one or more of the elements of the data structure, information indicating an image in an animation that has been generated from a second video.
39. The method of claim 38, wherein the step of generating a data structure comprises generating a data structure containing a corresponding element for each frame of the first video.
40. The method of claim 38 wherein the step of storing information indicative of an image in an animation includes storing a reference value indicative of a key frame of the animation.
41. The method of claim 40, wherein the step of storing a reference to a key frame of the animation comprises storing a reference value indicative of a background frame in an animation object.
42. The method of claim 41, wherein the step of storing a reference value indicating a background frame comprises storing an address of a background frame data structure including information indicating a background image and information indicating whether the background image is a combined image.
43. The method of claim 42, wherein the information indicating whether the background image is a combined image includes information indicating whether the background image is a panoramic image.
44. The method of claim 38, wherein the data structure is an array of the elements.
45. The method of claim 38, wherein the data structure is a linked list of the elements.
46. The method of claim 38, wherein the first video and the second video are the same video.
47. The method of claim 38, wherein the first video has been generated using the second video.
48. The method of claim 38 wherein the animation comprises a high resolution still image.
49. The method of claim 38 wherein the animation comprises a multi-resolution still image having first and second regions, the first region having a higher pixel resolution than the second region.
50. The method of claim 38 wherein the animation comprises a still image having a field of view wider than a frame of the first video.
51. The method of claim 38 wherein the animation comprises a still image having a wider dynamic range than a frame of the first video.
52. The method of claim 38, wherein the animation comprises a still image having an aspect ratio different from an aspect ratio of a frame of the first video.
53. The method of claim 38 wherein the animation includes a pair of still images forming a stereoscopic image pair.
54. The method of claim 38 wherein the animation comprises an image, the image comprising depth information.
55. The method of claim 38, wherein the animation includes an object having three-dimensional geometric characteristics.
56. The method of claim 38 wherein a textual description is associated with at least one image in the animation.
57. The method of claim 38 wherein the animation includes an animation object having a plurality of elements corresponding to a plurality of images in the animation, and wherein the method further comprises the step of storing information indicative of one or more frames in the first video in one or more of the plurality of elements in the animation object.
58. The method of claim 38 wherein the animation includes an animation object having a plurality of elements corresponding to a plurality of images in the animation, and wherein the method further comprises the step of storing information indicative of a sequence of frames in one or more of the plurality of elements in the animation object.
59. A method of displaying video on a playback system, the method comprising the steps of:
displaying a frame of the video on a display of the playback system;
examining a data element associated with the frame of the video to identify an animation key frame corresponding to the frame of the video, the animation key frame having been automatically generated using the frame of the video; and
prompting a user of the playback system to initiate display of an image associated with the animation key frame.
60. The method of claim 59, further comprising the step of:
determining whether the image associated with the animation keyframe is a combined image; and
if the image associated with the animation keyframe is a combined image, signaling to the user that a combined image is available for viewing.
61. The method of claim 60 wherein the step of determining whether the image associated with the animation keyframe is a combined image comprises determining whether the image associated with the animation keyframe is a panoramic image.
62. The method of claim 61, further comprising the steps of:
receiving a request of the user to view the panoramic image; and
in response to the request from the user, program code is executed to render a view of the panoramic image in response to a navigation input from the user.
63. The method of claim 62, wherein the navigational input from the user includes a command to pan a perspective of a scene depicted in the panoramic image in a horizontal direction.
64. The method of claim 62, wherein the navigational input from the user includes a command to tilt a perspective view of a scene depicted in the panoramic image.
65. The method of claim 59 wherein the step of determining whether the image associated with the animation keyframe is a combined image comprises determining whether the image associated with the animation keyframe is a high resolution still image.
66. The method of claim 65, further comprising the steps of:
receiving a request from the user to view the high resolution still image; and
in response to the request from the user, program code is executed to zoom the view of the high resolution still image in response to a zoom input from the user.
67. The method of claim 59 wherein the step of prompting the user of the playback system to initiate display of an image associated with the animation keyframe comprises displaying an indicator on a display of the playback system to signal to the user that the image associated with the animation keyframe is available for viewing.
68. The method of claim 59 wherein the step of prompting a user of the playback system to initiate display of an image associated with the animation keyframe comprises actuating an indicator on the playback system to signal to the user that the image associated with the animation keyframe is available for viewing.
69. The method according to claim 68, wherein the step of actuating an indicator on the playback system comprises actuating an indicator on a handheld controller of the playback system.
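Claims 59 through 69 describe a playback flow in which a data element associated with the current video frame is examined to find a linked animation keyframe, and the user is then alerted that an associated composite image (for example a panoramic or a high resolution still image) is available. The sketch below is only an illustration of that general idea under assumed data structures, a dictionary keyed by frame index and a hypothetical Keyframe record; it is not the claimed implementation.

```python
# Minimal sketch (not the patent's implementation): a playback callback that
# consults a hypothetical per-frame lookup table to find an animation keyframe
# linked to the displayed video frame, then alerts the user when a composite
# image (panoramic or high-resolution still) is available for viewing.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Keyframe:
    image_id: str
    kind: str  # "panoramic", "still", or "plain" -- illustrative categories only

# Hypothetical frame-index -> keyframe mapping, standing in for the per-frame
# data elements described in the claims.
frame_to_keyframe = {
    120: Keyframe(image_id="pan_01", kind="panoramic"),
    480: Keyframe(image_id="still_07", kind="still"),
}

def on_frame_displayed(frame_index: int) -> Optional[Keyframe]:
    """Examine the data element for this frame and, if a keyframe is linked,
    signal the user that an associated image can be viewed."""
    keyframe = frame_to_keyframe.get(frame_index)
    if keyframe is None:
        return None
    if keyframe.kind == "panoramic":
        print(f"Frame {frame_index}: panoramic image {keyframe.image_id} available, press VIEW to explore")
    elif keyframe.kind == "still":
        print(f"Frame {frame_index}: high-resolution still {keyframe.image_id} available, press VIEW to zoom")
    else:
        print(f"Frame {frame_index}: image {keyframe.image_id} available")
    return keyframe

# Example: stepping through a few frame indices during playback.
for idx in (119, 120, 480):
    on_frame_displayed(idx)
```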
70. A method of displaying video on a playback system, the method comprising the steps of:
displaying a frame of the video on a display of the playback system;
examining a data element associated with the frame of the video to identify an animation keyframe corresponding to the frame of the video, the animation keyframe having been automatically generated using the frame of the video; and
displaying an image associated with the animation keyframe within a window on the display concurrently with the display of the frame of the video.
71. A playback system, comprising:
a processor;
a display coupled to the processor;
a media reader connected to the processor; and
a memory coupled to the processor, the memory including program code that, when executed, causes the processor to:
signal the media reader to provide video data from a machine-readable medium, the video data including a sequence of video frames and a data structure having elements associated with the video frames;
display the sequence of video frames on the display;
examine the data structure elements associated with the video frames to identify an animation keyframe corresponding to one or more of the video frames, the animation keyframe having been automatically generated using the one or more of the video frames; and
alert a user of the playback system to initiate display of an image associated with the animation keyframe.
72. A method comprising the steps of:
displaying a frame of a video on a display of a playback system;
receiving input from a user requesting a switch from displaying the video to displaying a navigable image associated with the frame of the video; and
displaying the navigable image.
73. The method of claim 72, further comprising the step of: panning a perspective of the navigable image in response to input from the user.
74. The method of claim 72, further comprising the step of: processing, in response to input from the user, a sale of an item depicted in the navigable image.
75. The method of claim 72, further comprising the step of: processing, in response to input from the user, a protocol to execute a service indicated by one or more features in the navigable image.
76. The method of claim 72, further comprising the step of: zooming a perspective of the navigable image in response to input from the user.
77. The method according to claim 72, wherein the navigable image is a panoramic image of a market that includes items that can be purchased in an electronic transaction.
78. The method of claim 72, wherein the navigable image comprises one or more three-dimensional objects.
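Claims 72 through 78 concern switching from video playback to a navigable image and panning, tilting or zooming its perspective. The following sketch illustrates one plausible way such commands could update the rendered view; the PanoramaView record, its field names and its limits are assumptions made for the example, not taken from the claims.

```python
# Illustrative sketch only: a trivial viewer state for a navigable panoramic
# image, showing how pan, tilt and zoom commands might update the portion of
# the panorama that gets rendered. All names and limits are assumptions.

from dataclasses import dataclass

@dataclass
class PanoramaView:
    pan_deg: float = 0.0    # horizontal viewing direction in degrees
    tilt_deg: float = 0.0   # vertical viewing direction in degrees
    zoom: float = 1.0       # magnification factor

    def pan(self, delta_deg: float) -> None:
        # Wrap the pan angle so the view can rotate continuously around the scene.
        self.pan_deg = (self.pan_deg + delta_deg) % 360.0

    def tilt(self, delta_deg: float) -> None:
        # Clamp tilt so the view cannot look past the top or bottom of the panorama.
        self.tilt_deg = max(-90.0, min(90.0, self.tilt_deg + delta_deg))

    def zoom_by(self, factor: float) -> None:
        # Keep the magnification within an assumed 1x-8x range.
        self.zoom = max(1.0, min(8.0, self.zoom * factor))

view = PanoramaView()
view.pan(30.0)      # user pans the perspective horizontally
view.tilt(-10.0)    # user tilts the perspective
view.zoom_by(2.0)   # user zooms in
print(view)
```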
79. A method comprising the steps of:
displaying a frame of a video on a display of a playback system;
receiving an input from a user requesting a switch from displaying the video to displaying a three-dimensional object associated with the frame of the video.
80. The method of claim 79, further comprising the step of: changing, in response to the user's input, a viewpoint from which the three-dimensional object is displayed.
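Claims 79 and 80 cover switching to a three-dimensional object and changing the viewpoint from which it is displayed. As a purely illustrative sketch, an orbit-style camera (assumed here, not described in the claims) is one common way such a viewpoint change could be modelled.

```python
# Purely illustrative: a tiny orbit-camera state showing how a playback system
# might change the viewpoint from which a three-dimensional object is shown in
# response to user input. Names and conventions are assumptions for the sketch.

import math
from dataclasses import dataclass

@dataclass
class OrbitCamera:
    azimuth_deg: float = 0.0
    elevation_deg: float = 20.0
    distance: float = 3.0

    def orbit(self, d_azimuth: float, d_elevation: float) -> None:
        # Rotate around the object; clamp elevation to keep the camera above/below limits.
        self.azimuth_deg = (self.azimuth_deg + d_azimuth) % 360.0
        self.elevation_deg = max(-89.0, min(89.0, self.elevation_deg + d_elevation))

    def position(self) -> tuple:
        """Camera position on a sphere around an object placed at the origin."""
        az = math.radians(self.azimuth_deg)
        el = math.radians(self.elevation_deg)
        return (self.distance * math.cos(el) * math.cos(az),
                self.distance * math.sin(el),
                self.distance * math.cos(el) * math.sin(az))

cam = OrbitCamera()
cam.orbit(45.0, -5.0)   # user drags to rotate the viewpoint
print(cam.position())
```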
81. A computer readable medium having stored thereon data for displaying a sequence of images from an animation, wherein the animation has been linked to a video by:
generating a data structure comprising a plurality of elements corresponding to respective frames of a first video; and
storing, in one or more of the elements of the data structure, information indicative of an image in an animation generated from a second video.
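Claim 81 recites a data structure with one element per frame of a first video, where some elements store a reference to an image in an animation created from a second video. A minimal sketch of such a per-frame link table, with an assumed string-based image reference, might look as follows; it is one possible reading of the claimed structure, not a definitive implementation.

```python
# Minimal sketch of a per-frame linking data structure: one element per frame
# of a first video, some elements carrying a reference to an image in an
# animation created from a second video. The string reference format is an
# assumption made for illustration.

from typing import List, Optional

def build_link_table(frame_count: int) -> List[Optional[str]]:
    """Create one (initially empty) element per video frame."""
    return [None] * frame_count

def link_animation_image(table: List[Optional[str]], frame_index: int, image_ref: str) -> None:
    """Store a reference to an animation image in the element for a frame."""
    table[frame_index] = image_ref

links = build_link_table(frame_count=600)
link_animation_image(links, frame_index=120, image_ref="animation_b/keyframe_03")
print(links[119], links[120])  # None animation_b/keyframe_03
```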
82. A method of storing an animation, comprising the steps of:
storing a set of keyframes generated from a video in an animation object;
storing, in the animation object, one or more values indicative of a first sequence of keyframes selected from the set of keyframes and information for interpolating between the keyframes of the first sequence; and
storing, in the animation object, one or more values indicative of a second sequence of keyframes selected from the set of keyframes and information for interpolating between the keyframes of the second sequence, the number of keyframes in the second sequence being less than the number of keyframes in the first sequence.
83. The method of claim 82, wherein the step of storing the set of keyframes comprises the step of: storing first and second sub-groups of keyframes in the animation object, the second sub-group of keyframes including reduced-resolution versions of images included in the first sub-group of keyframes.
84. The method of claim 82, wherein each of the one or more values indicative of the selected keyframes of the first sequence is a reference value identifying a corresponding keyframe in the set of keyframes.
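Claims 82 through 84 describe an animation object holding a pool of keyframes plus two sequences of keyframe references, the second sequence containing fewer keyframes than the first, each sequence carrying its own interpolation information. The sketch below models that layout with assumed field names and a placeholder interpolation payload; it is only one possible reading of the claimed structure.

```python
# Hedged sketch of an animation object along the lines of claims 82-84: a pool
# of keyframes plus named sequences of references into that pool, the reduced
# sequence using fewer keyframes. All names and the interpolation payload are
# illustrative assumptions.

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class AnimationObject:
    keyframes: List[bytes] = field(default_factory=list)              # encoded keyframe images
    sequences: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def add_keyframe(self, image_data: bytes) -> int:
        """Store a keyframe and return its index (a reference value in the sense of claim 84)."""
        self.keyframes.append(image_data)
        return len(self.keyframes) - 1

    def add_sequence(self, name: str, keyframe_refs: List[int], interpolation: Dict[str, Any]) -> None:
        """Record a sequence of selected keyframes plus its interpolation parameters."""
        self.sequences[name] = {"refs": keyframe_refs, "interpolation": interpolation}

anim = AnimationObject()
refs = [anim.add_keyframe(b"...") for _ in range(6)]

# Full-detail sequence uses all six keyframes; the reduced sequence uses three,
# so it has fewer keyframes than the first, as in claim 82.
anim.add_sequence("full", refs, {"type": "pan", "duration_s": 4.0})
anim.add_sequence("reduced", refs[::2], {"type": "pan", "duration_s": 4.0})
print(len(anim.sequences["full"]["refs"]), len(anim.sequences["reduced"]["refs"]))  # 6 3
```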
Applications Claiming Priority (7)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US09/096,487 US6268864B1 (en) | 1998-06-11 | 1998-06-11 | Linking a video and an animation |
| US09/096,720 US6278466B1 (en) | 1998-06-11 | 1998-06-11 | Creating animation from a video |
| US09/096,726 | 1998-06-11 | ||
| US09/096,487 | 1998-06-11 | ||
| US09/096,726 US6081278A (en) | 1998-06-11 | 1998-06-11 | Animation object having multiple resolution format |
| US09/096,720 | 1998-06-11 | ||
| PCT/US1999/013081 WO1999065224A2 (en) | 1998-06-11 | 1999-06-09 | Creating animation from a video |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1038625A1 true HK1038625A1 (en) | 2002-03-22 |
Family
ID=27378194
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| HK02100130.9A HK1038625A1 (en) | 1998-06-11 | 1999-06-09 | Creating animation from a video |
Country Status (6)
| Country | Link |
|---|---|
| EP (1) | EP1097568A2 (en) |
| JP (1) | JP2002518723A (en) |
| CN (1) | CN1305620A (en) |
| AU (1) | AU4558899A (en) |
| HK (1) | HK1038625A1 (en) |
| WO (1) | WO1999065224A2 (en) |
Families Citing this family (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP4050472B2 (en) | 2001-02-06 | 2008-02-20 | 株式会社モノリス | Image generation method, apparatus and system |
| US7039643B2 (en) * | 2001-04-10 | 2006-05-02 | Adobe Systems Incorporated | System, method and apparatus for converting and integrating media files |
| US7593015B2 (en) * | 2003-11-14 | 2009-09-22 | Kyocera Wireless Corp. | System and method for sequencing media objects |
| US8862987B2 (en) * | 2009-03-31 | 2014-10-14 | Intel Corporation | Capture and display of digital images based on related metadata |
| CN102063923B (en) * | 2009-11-18 | 2015-05-27 | 新奥特(北京)视频技术有限公司 | Adaptive playing method and device of animation |
| JP5576781B2 (en) * | 2010-12-16 | 2014-08-20 | 株式会社メガチップス | Image processing system, image processing system operation method, host device, program, and program creation method |
| WO2013113985A1 (en) * | 2012-01-31 | 2013-08-08 | Nokia Corporation | Method, apparatus and computer program product for generation of motion images |
| US10474921B2 (en) | 2013-06-14 | 2019-11-12 | Qualcomm Incorporated | Tracker assisted image capture |
| EP3029942B1 (en) * | 2014-12-04 | 2017-08-23 | Axis AB | Method and device for inserting a graphical overlay in a video stream |
| CN106600665B (en) * | 2016-12-01 | 2019-11-22 | 北京像素软件科技股份有限公司 | A kind of camera animation path generating method and device |
| GB2566930B (en) | 2017-09-06 | 2021-05-19 | Fovo Tech Limited | A method for preserving perceptual constancy of objects in images |
| CN109101895A (en) * | 2018-07-19 | 2018-12-28 | 张小勇 | A kind of data processing method and server |
| CN111147955B (en) * | 2019-12-31 | 2022-10-18 | 咪咕视讯科技有限公司 | Video playing method, server and computer readable storage medium |
| CN111429341B (en) * | 2020-03-27 | 2023-08-18 | 咪咕文化科技有限公司 | A video processing method, device and computer-readable storage medium |
| US12094028B2 (en) * | 2020-04-08 | 2024-09-17 | Qualcomm Incorporated | High dynamic range (HDR) video rotation animation |
| CN111640173B (en) * | 2020-05-09 | 2023-04-21 | 杭州群核信息技术有限公司 | Cloud rendering method and system for home roaming animation based on specific path |
| CN113496537B (en) * | 2021-07-07 | 2023-06-30 | 网易(杭州)网络有限公司 | Animation playing method, device and server |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH0766446B2 (en) * | 1985-11-27 | 1995-07-19 | 株式会社日立製作所 | Method of extracting moving object image |
| US4698682A (en) * | 1986-03-05 | 1987-10-06 | Rca Corporation | Video apparatus and method for producing the illusion of motion from a sequence of still images |
| US5261041A (en) * | 1990-12-28 | 1993-11-09 | Apple Computer, Inc. | Computer controlled animation system based on definitional animated objects and methods of manipulating same |
| JP2677312B2 (en) * | 1991-03-11 | 1997-11-17 | 工業技術院長 | Camera work detection method |
| GB2255466B (en) * | 1991-04-30 | 1995-01-25 | Sony Broadcast & Communication | Digital video effects system for producing moving effects |
| US5592228A (en) * | 1993-03-04 | 1997-01-07 | Kabushiki Kaisha Toshiba | Video encoder using global motion estimation and polygonal patch motion estimation |
| GB2277847B (en) * | 1993-05-03 | 1997-08-20 | Grass Valley Group | Method of creating video effects by use of keyframes |
| US5751281A (en) * | 1995-12-11 | 1998-05-12 | Apple Computer, Inc. | Apparatus and method for storing a movie within a movie |
- 1999
- 1999-06-09 AU AU45588/99A patent/AU4558899A/en not_active Abandoned
- 1999-06-09 EP EP99928542A patent/EP1097568A2/en not_active Withdrawn
- 1999-06-09 WO PCT/US1999/013081 patent/WO1999065224A2/en not_active Ceased
- 1999-06-09 CN CN99807212A patent/CN1305620A/en active Pending
- 1999-06-09 JP JP2000554125A patent/JP2002518723A/en active Pending
- 1999-06-09 HK HK02100130.9A patent/HK1038625A1/en unknown
Also Published As
| Publication number | Publication date |
|---|---|
| AU4558899A (en) | 1999-12-30 |
| WO1999065224A3 (en) | 2000-04-27 |
| CN1305620A (en) | 2001-07-25 |
| EP1097568A2 (en) | 2001-05-09 |
| WO1999065224A2 (en) | 1999-12-16 |
| JP2002518723A (en) | 2002-06-25 |
Similar Documents
| Publication | Title |
|---|---|
| US6278466B1 (en) | Creating animation from a video |
| US6268864B1 (en) | Linking a video and an animation |
| US6081278A (en) | Animation object having multiple resolution format |
| CN113301439B (en) | Apparatus for processing video image |
| HK1038625A1 (en) | Creating animation from a video |
| Liu et al. | Video retargeting: automating pan and scan |
| US7444016B2 (en) | Interactive images |
| US7084875B2 (en) | Processing scene objects |
| US8041155B2 (en) | Image display apparatus and computer program product |
| CN102905167B (en) | Method and device for handling multiple video streams by using metadata |
| Barnes et al. | Video tapestries with continuous temporal zoom |
| CN104272377B (en) | Moving picture project management system |
| US8395660B2 (en) | Three-dimensional movie browser or editor |
| HK1052772A1 (en) | Marking of moving objects in video streams |
| US20170249719A1 (en) | Dynamically cropping digital content for display in any aspect ratio |
| CN104394422A (en) | Video segmentation point acquisition method and device |
| EP4165874A1 (en) | Producing and adapting video images for presentation on displays with different aspect ratios |
| Demiris et al. | Enhanced sports broadcasting by means of augmented reality in MPEG-4 |
| US20240388681A1 (en) | Presentation of multi-view video data |
| Megino et al. | Virtual camera tools for an image2video application |