By Mike Knee, Snell
Running a multi-channel TV broadcast installation brings new headaches when 3D is involved. Live monitoring of dozens of TV channels is difficult enough. Over the years several manufacturers have developed automated monitoring solutions covering a whole range of tasks of increasing complexity. With the advent of 3D, there is literally a new dimension of monitoring tasks, because we have to check not only the integrity of individual video signals but also the correct relationship between the left and right video signals in a stereo pair. In addition, manual monitoring of 3D is more difficult than 2D because the operator would need either to wear glasses or accept the limitations of autostereoscopic displays. For these reasons, there is a burgeoning interest and market in automatic monitoring of 3D television.
Overview of 3D Monitoring
Analysis and correction
One of the purposes of automatic analysis is to provide information to enable correction of any problems encountered. The techniques for correction are beyond the scope of this paper, though it is important to point out that correction of an upstream problem may be necessary before monitoring of further aspects can be carried out.
The correct use of metadata, for example to identify left and right channels or to signal how they are packed into a single container, can in theory remove the need for some analysis. However, metadata for 3D is not yet fully standardized, and even when it is, there will still be cases of incorrect usage, so there will always be a place for techniques that do not rely on metadata. Of course, the results of measurements performed at any point in the processing chain may in their turn be passed on downstream as metadata.
Format detection
The first task when faced with a single video signal carrying a stereoscopic pair is to identify the format by which the two channels are packed into one container. For some formats this is an easy task, but there are some problems when the granularity of the packing is finer.
Matching left and right images
Having unpacked the signal into left and right channels, the next task is to check whether the two channels are correctly matched, particularly as regards timing, grey scale and colour balance. Grey scale and colour balance can be aligned using histogramming techniques. Relative timing can be measured using fingerprinting techniques similar to those used for lip-sync measurement. It is important to note that a timing mismatch will not only be detrimental to the 3D viewing experience but will also have an adverse effect on downstream analysis, particularly of 3D depth. Relative timing is thus a good example of the need to correct a problem before further analysis can reliably be performed.
Depth or disparity analysis
A more algorithmically challenging analysis task is to measure the 3D depth across the picture, which is directly related, via screen size and resolution, to disparity, that is, the relative displacement between the left and right representations of objects in the scene. Horizontal disparity outside a certain range and undue vertical disparity are both known to cause significant eye strain for some viewers. Disparity analysis is also important for checking the overall relative geometric alignment of the two images.
Higher level analysis
Finally, we shall look at two examples of detection tasks which require a higher level of analysis. The first is deceptively simple to state: can we tell whether the left and right channels have been inadvertently swapped? The second is: can we tell whether the 3D pair has come from a simple 2D to 3D converter? Ultimately, 3D analysis can extend to detecting or measuring any process that has been carried out on 3D signals, either with a view to improving, modifying or reversing the process, or simply in order to report or record what has been done.
Format Detection
There are many ways in which left and right signals may be packed into a single video channel. These include left-right or top-bottom juxtaposition (with or without reflection of one of the channels), line interleaved, column interleaved, checkerboard and frame interleaved formats. For the purposes of automatic detection, these formats may be classified into two groups. Left/right and top/bottom formats are “loose packed” because the two pictures are physically quite separate. The remaining formats are “close packed” because corresponding left and right channel pixels are close together in space or time.
Loose packed formats are quite easy to detect. One way is to carry out a trial unpacking with an assumed format and then detect whether the two resulting images are sufficiently similar to be a stereoscopic pair. And if the two images turn out to be identical, we may conclude that a 2D image is being transported in a 3D container; this is a simple case of disparity estimation in which we look for zero disparity across the image. Figure 1 shows the left-right differences for a small area of a picture when each of four possible trial formats is used to unpack each of four possible actual formats. Where the correct format has been used, the left-right difference contains only edge information arising from disparity.
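The trial-unpacking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are plain lists of rows of luminance values, only the side-by-side and top-bottom formats are tried (the reflected variants are omitted), and the mean absolute difference is used as a simple stand-in for whatever similarity measure a real monitor would use. All function names are hypothetical.

```python
def split_side_by_side(frame):
    """Trial unpacking: left half of each row is one view, right half the other."""
    half = len(frame[0]) // 2
    return [row[:half] for row in frame], [row[half:] for row in frame]


def split_top_bottom(frame):
    """Trial unpacking: top half of the frame is one view, bottom half the other."""
    half = len(frame) // 2
    return frame[:half], frame[half:]


def mean_abs_difference(view_a, view_b):
    """Average absolute pixel difference between two equally sized views."""
    total = sum(abs(a - b)
                for row_a, row_b in zip(view_a, view_b)
                for a, b in zip(row_a, row_b))
    return total / (len(view_a) * len(view_a[0]))


def detect_loose_packing(frame):
    """Try each assumed format and return the one whose two views differ least."""
    candidates = {
        "side_by_side": split_side_by_side,
        "top_bottom": split_top_bottom,
    }
    scores = {name: mean_abs_difference(*unpack(frame))
              for name, unpack in candidates.items()}
    return min(scores, key=scores.get), scores
```

A score of exactly zero for the winning format would suggest a 2D image in a 3D container. A practical detector would also need a threshold (not specified in this article) to decide whether even the best score is "sufficiently similar" to indicate a stereoscopic pair at all.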
We can summarise the detection of loose-packed formats by saying that we exploit the relative similarity of the left and right images when compared with unrelated, distant parts of the picture.
Close packed formats present more of a problem because the packed image looks increasingly like a single 2D image as the amount of 3D content in the scene decreases. So simply carrying out trial unpackings will often give a positive result, even if the wrong format is being tried. If there is significant 3D content, the detection becomes easier because a picture wrongly unpacked will look increasingly less like a pair of plausible images. The left half of Figure 2 shows a small part of the left image for some different combinations of packing and unpacking formats, and the right half shows the combined energy of horizontal and vertical high pass filtered versions of those outputs. The energy is clearly significantly lower when the correct unpacking format has been used.
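As a rough sketch of the high-pass energy comparison, the following Python tries two close-packed hypotheses and picks the one whose unpacked view has the least detail energy. It rests on several assumptions: frames are lists of rows of luminance values, first differences stand in for the high pass filters, only line and column interleaving are tried, and the function names are hypothetical. Note that the two hypotheses subsample the picture in different directions, which biases the comparison; a real detector would need to compensate for that.

```python
def unpack_line_interleaved(frame):
    """Keep the even lines, assumed here to carry one of the two views."""
    return frame[0::2]


def unpack_column_interleaved(frame):
    """Keep the even columns, assumed here to carry one of the two views."""
    return [row[0::2] for row in frame]


def highpass_energy(view):
    """Mean squared horizontal plus vertical first difference per pixel."""
    rows, cols = len(view), len(view[0])
    horizontal = sum((view[y][x + 1] - view[y][x]) ** 2
                     for y in range(rows) for x in range(cols - 1))
    vertical = sum((view[y + 1][x] - view[y][x]) ** 2
                   for y in range(rows - 1) for x in range(cols))
    return (horizontal + vertical) / (rows * cols)


def detect_close_packing(frame):
    """Return the trial format whose unpacked view has the lowest energy."""
    unpackers = {
        "line_interleaved": unpack_line_interleaved,
        "column_interleaved": unpack_column_interleaved,
    }
    energies = {name: highpass_energy(unpack(frame))
                for name, unpack in unpackers.items()}
    return min(energies, key=energies.get), energies
```

When the wrong format is tried, neighbouring lines or columns of the unpacked view come from different eyes, so any disparity shows up as extra high-frequency energy.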
We can summarise the detection of these close-packed formats by saying that we exploit the relative difference of the left and right images when compared with adjacent pixels or lines.
Temporal interleaving presents further difficulties because there is a higher chance that motion can be confused with left-right disparity. This could be overcome using motion compensated high-pass filtering, though care would have to be taken to use information from a single channel (albeit subsampled) for the purposes of motion estimation.
Depth or Disparity Analysis
One of the most important monitoring or analysis tasks in stereoscopic 3D is to measure the perceived depth of the various objects in the scene. Perceived depth is a function of disparity (the horizontal distance between left and right representations of the object, measured in pixels), display size and resolution, and viewing distance. In the context of signal monitoring, we can only measure disparity and then relate it to perceived depth for different display configurations.
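The relationship between pixel disparity and perceived depth follows from similar triangles. The sketch below is illustrative only: it assumes disparity is positive for objects behind the screen, an eye separation of 65 mm (a commonly quoted typical value, not taken from this article), and a viewer seated centrally. The function name and parameters are hypothetical.

```python
def perceived_depth_m(disparity_px, screen_width_m, horizontal_resolution_px,
                      viewing_distance_m, eye_separation_m=0.065):
    """Distance from the viewer at which a point appears, in metres.

    disparity_px is right-image position minus left-image position, so it is
    positive for points behind the screen and negative for points in front.
    """
    # Convert pixel disparity into a physical offset on the screen.
    parallax_m = disparity_px * screen_width_m / horizontal_resolution_px
    if parallax_m >= eye_separation_m:
        # The eyes would have to be parallel or diverge: no finite depth.
        return float("inf")
    return viewing_distance_m * eye_separation_m / (eye_separation_m - parallax_m)
```

The same pixel disparity gives a larger physical offset on a wider screen, which is why a given signal can be comfortable on one display and cause divergence on another.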
Disparity measurement is useful for many monitoring purposes, the most important being to provide a warning if the viewer is likely to suffer eye strain. Other reasons for measuring disparity are to verify that the sequence really is 3D rather than just being 2D in a 3D container, to detect and correct for global geometric distortions between the two channels, and to assist in the insertion of captions or subtitles at suitable depths.
Eye strain can occur in 3D viewing when disparity exceeds certain limits – particularly if the eyes are being encouraged to diverge, an unnatural action. The limits depend on display size but it is also useful to measure how often and for how long extreme disparity values are observed, and possibly to identify where in the scene the extremes are occurring.
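One simple way to report how often and for how long the limits are exceeded is sketched below. It assumes that an upstream analyser has already produced the minimum and maximum disparity for each frame, and that the limits for the target display are supplied by the caller; the names and the limit values are hypothetical.

```python
def extreme_disparity_stats(per_frame_extremes, min_allowed, max_allowed):
    """Summarise limit violations over a sequence.

    per_frame_extremes is a list of (min_disparity, max_disparity) per frame.
    Returns (fraction of frames outside the limits, longest run of such frames).
    """
    flagged = [low < min_allowed or high > max_allowed
               for low, high in per_frame_extremes]
    longest = current = 0
    for is_flagged in flagged:
        # Extend the current run, or reset it when a frame is within limits.
        current = current + 1 if is_flagged else 0
        longest = max(longest, current)
    fraction = sum(flagged) / len(flagged) if flagged else 0.0
    return fraction, longest
```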
One class of disparity measurement methods involves performing a local correlation between the left and right images to generate a sparse disparity map. This approach is ideal for looking at the behaviour of different objects in the scene and for determining to what extent limits have been exceeded. Other methods seek to generate a dense disparity map, in which every pixel has an associated disparity value, or possibly an occlusion indicator if there is no corresponding point in the other picture. This approach would be necessary if the measurement were being used to drive post-processing, for example to change the effective camera spacing. Finally, for some applications an approximate, region-based approach to disparity measurement might be sufficient, for example to gather statistics about typical depth ranges used across a programme, or to drive a global spatial transform to correct for camera misalignment.
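A minimal sketch of the local matching approach is given below. It uses the sum of absolute differences as the matching cost rather than a true correlation, works on a single row, and searches horizontal shifts only. The function name, the block size and the search range are all assumptions made for illustration.

```python
def block_disparity(left_row, right_row, centre, half_block=2, max_disparity=4):
    """Estimate disparity at one position by matching a small block.

    Returns (disparity, cost), where disparity is the shift d that minimises
    the sum of absolute differences between the left block around `centre`
    and the right block around `centre + d`. The caller is assumed to choose
    `centre` so that the left block lies inside the row.
    """
    start, stop = centre - half_block, centre + half_block + 1
    block = left_row[start:stop]
    best_d, best_cost = None, None
    for d in range(-max_disparity, max_disparity + 1):
        if start + d < 0 or stop + d > len(right_row):
            continue  # candidate block would fall outside the picture
        candidate = right_row[start + d:stop + d]
        cost = sum(abs(a - b) for a, b in zip(block, candidate))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost
```

Running this at a sparse set of well-textured positions gives a sparse disparity map; the returned cost can serve as a crude confidence value.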
The impression of depth is conveyed by introducing horizontal disparity. If there is any vertical disparity present, it should be detected and corrected, both because it can be very disturbing to the eyes, and because it can interfere with correct measurement of horizontal disparity. Of course, horizontal and vertical disparity can be measured jointly using conventional motion estimation methods. However, it would be preferable to exploit the constraints arising from stereoscopy. For example, we would expect vertical disparity to be a combination of two components: one directly related to horizontal disparity, such as might arise from a vertical displacement between the cameras, and one which fits a simple global model, such as might arise from different zoom factors or axis directions between the cameras.
Disparity monitoring display
Figure 3 shows an example of a monitoring display that provides information about the distribution of disparity in various ways, including a left-right difference, a disparity histogram, an indication of vertical disparity and a colour coded warning of the possibility of eye strain from near and far objects for different display sizes. Such a tool makes good use of automatic analysis coupled with an operator’s skill in interpreting the results.
Dense disparity maps
Because of the difficulty and the usefulness of measuring dense disparity maps, there is some interest in standardising a format for dense disparity map metadata. For example, SMPTE has recently begun such an activity.
Higher Level Analysis
Left-right swap detection
Many people viewing 3D demonstrations have encountered the situation where the left and right images have been inadvertently swapped over. The result is very disturbing, but it is not always obvious even to a human observer what is wrong. It would be useful to be able to detect the swap automatically, but this turns out to be quite a difficult problem. Measurement of a disparity map is a good starting point, but a correctly arranged 3D pair will typically exhibit both negative disparity values for objects intended to be seen in front of the screen and positive values for objects behind the screen. So a simple analysis of the histogram of disparity values, for example, will not be enough.
One approach that works with reasonable reliability is based on the spatial distribution of disparity values. We observe that for most scenes objects at the centre and bottom of the screen are generally nearer than objects at the top and sides. Figure 4 shows the spatial disparity distribution measured over a set of varied clips comprising 6000 frames.
A possible left-right detection algorithm is to correlate measured disparity with the above template. A positive correlation indicates that the assumed left-right configuration is correct, while a negative correlation indicates that it is reversed.
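The template correlation can be sketched as follows. The example assumes that disparity is negative for near objects, that the measured disparity map and the template are sampled on the same coarse grid, and that the template values used in the usage note are invented for illustration rather than taken from Figure 4. The function names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat map carries no evidence either way
    return cov / sqrt(var_a * var_b)


def looks_swapped(disparity_map, template):
    """True if measured disparity correlates negatively with the template."""
    flat_map = [v for row in disparity_map for v in row]
    flat_template = [v for row in template for v in row]
    return pearson(flat_map, flat_template) < 0
```

For example, with a hypothetical 3 x 3 template such as [[2, 1, 2], [1, -1, 1], [0, -2, 0]] (centre and bottom nearer than top and sides), a map proportional to the template is reported as correctly ordered, while its negation, which is what swapping the two channels produces, is reported as swapped. In practice the per-frame decision would be averaged over many frames, as in Figure 5.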
Figure 5 shows the results of such an algorithm on 38,000 frames of (correctly ordered) 3D material. The blue line shows a 10-frame rolling average and the red line a 1000-frame rolling average of correlation coefficients between measured disparity and the template.
Whenever the graph is positive, the algorithm is giving a correct result. The last third of the material is professionally produced, well-behaved 3D material whereas the first two-thirds consists of test sequences of varying quality. Clearly, there is always some material that will defeat the algorithm, but on “normal” material it is quite reliable.
A potentially more reliable method of left-right detection is based on the observation that closer objects are expected to occlude objects that are further away. A dense disparity estimator will usually have some kind of confidence output which indicates whether a pixel or region in one view has no equivalent in the other view and is therefore an occluded background region.
As shown in Figure 6, we would expect occluded regions to extend to the left of transitions in the left-eye view and to the right in the right-eye view. The bottom part of the diagram shows where the transitions between foreground (green) and background (blue) are observed to be in relation to occlusions (red) in the two views. This observation allows us to determine automatically, on a statistical basis, which view is the left-eye view and which is the right-eye view. This approach is potentially more reliable than the method based on spatial disparity distribution, but it does depend on accurate dense disparity estimation including reliable location of occlusions.
Reliable analysis of the local relationship between depth and occlusions may be employed for other high-level monitoring tasks, for example to provide a warning that captions might have been inserted at an inappropriate location or depth relative to the other objects in the scene.
2D to 3D conversion detection
Our final example concerns the automatic detection of automatic 2D to 3D conversion.
One common technique in simple 2D to 3D conversion is the use of a fixed spatial disparity profile; for example the bottom and centre of the picture are made to appear closer than the top and sides, much as shown by Figure 4 above. Another technique is to introduce delay between two versions of the same moving sequence to give an impression of depth. This can work because a 3D camera rig tracking across a static scene will in fact generate two streams separated by a delay which corresponds to the time taken for the camera to move by the eye spacing distance.
The algorithm illustrated in Figure 7 detects the use of either or both of these techniques, to give a warning that a 2D to 3D converter might have been used.
Fingerprints are calculated separately on the left and right input picture signals. These could be as simple as the average luminance value over each frame, an average over each of a few regions, or any measure which when applied to correctly co-timed left and right signals would be expected to be similar to each other.
A correlation process is then applied to the two fingerprint signals to produce an estimated temporal offset between the input channels. This estimated offset is applied to a temporal low pass filter, which may for example be designed to detect piecewise constant inputs. The filtered temporal offset value is used to control a temporal alignment process on the left and right images; this would be done by applying a delay to one or other of the two inputs.
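The fingerprinting and offset estimation steps might look like the sketch below. It makes several assumptions: the fingerprint is the average luminance of each frame, the match is found by minimising the mean squared difference between the two fingerprint sequences (a simple substitute for a correlation process), and a running median is used as one possible filter that favours piecewise constant offsets. All names and the search range are hypothetical.

```python
from statistics import median


def frame_fingerprint(frame):
    """Average luminance over one frame (a list of rows of pixel values)."""
    return sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))


def estimate_offset(left_fp, right_fp, max_lag=5):
    """Return the lag such that right_fp[t + lag] best matches left_fp[t]."""
    best_lag, best_err = 0, None
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(left_fp[t], right_fp[t + lag])
                 for t in range(len(left_fp))
                 if 0 <= t + lag < len(right_fp)]
        if not pairs:
            continue
        err = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
        if best_err is None or err < best_err:
            best_lag, best_err = lag, err
    return best_lag


def smooth_offsets(offsets, window=5):
    """Running median over recent estimates to reject isolated outliers."""
    return [median(offsets[max(0, i - window + 1):i + 1])
            for i in range(len(offsets))]
```

A positive lag here means the right channel is delayed relative to the left, so the alignment step would delay the left channel by the same number of frames.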
A disparity map between the temporally aligned left and right images is then calculated, producing a number of disparity values across the picture. A temporal high pass filter is applied to the disparity values, thereby looking for variation in time of the disparity observed in each part of the picture. The mean square value, or other average energy value, of the high pass filter output is calculated. In parallel, a spatial regression process is applied to the disparity map to see if the map fits a fixed spatial model. Either a low mean square output from the temporal high pass filter or a close fit to a fixed spatial model provides evidence for a final decision that simple 2D to 3D conversion might have been performed.
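The final decision stage could be sketched as below. This is not the algorithm of Figure 7 itself but a simplified illustration: the temporal high pass filter is a frame-to-frame difference, the spatial regression is replaced by a correlation between the time-averaged disparity map and a fixed model, and both thresholds are arbitrary placeholder values. Each disparity map is assumed to be a flat list of region values in a fixed order.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b) if var_a and var_b else 0.0


def temporal_highpass_energy(maps):
    """Mean squared frame-to-frame change in disparity over all regions."""
    if len(maps) < 2:
        raise ValueError("need at least two disparity maps")
    diffs = [(cur[i] - prev[i]) ** 2
             for prev, cur in zip(maps, maps[1:])
             for i in range(len(cur))]
    return sum(diffs) / len(diffs)


def model_fit(maps, model):
    """Correlation between the time-averaged disparity map and a fixed model."""
    mean_map = [sum(m[i] for m in maps) / len(maps) for i in range(len(model))]
    return pearson(mean_map, model)


def conversion_suspected(maps, model, energy_threshold=0.01, fit_threshold=0.95):
    """Warn if disparity barely changes in time or closely follows the model."""
    return (temporal_highpass_energy(maps) < energy_threshold
            or model_fit(maps, model) > fit_threshold)
```

Either condition alone raises the warning, matching the idea that a fixed disparity profile and a fixed inter-channel delay are independent signs of simple conversion.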
With automatic detection such as this, one can envisage a game of “cat and mouse” whereby detection algorithms have to become ever more sophisticated in order to keep up with the increasing complexity of automatic 2D to 3D converters.
This paper describes several techniques for the automatic monitoring of stereoscopic 3D video signals: format detection, disparity monitoring, left-right swap detection and automatic detection of the use of 2D to 3D conversion. There is a great deal of scope to “get machines to watch 3D for us” so that humans can concentrate on delivering and watching 3D content. Snell Ltd. is active in developing and implementing the algorithms described here for monitoring and correction of 3D video across its product range.
Mike Knee is Consultant Engineer for Research and Development at Snell. Snell is a leading innovator in digital media technology and provides broadcasters and global media companies with a comprehensive range of solutions to create, better manage, and streamline the distribution of content for today’s multi-screen world. Snell provides the tools necessary to transition seamlessly and cost-effectively to HDTV, stereoscopic 3D, and 3Gbps operations.
This is an edited version of the paper “Getting Machines To Watch 3D For You” presented at IBC 2011, the leading global tradeshow for professionals engaged in the creation, management and delivery of broadcasting media and entertainment.