Video analytics is enabling a rapidly growing number of embedded video products, such as smart cameras and intelligent digital video recorders (DVRs), with automated capabilities that just a few years ago would have required human monitoring. Broadly, video analytics is the extraction of meaningful and relevant information from digital video. Unlike video compression, which exploits the redundancy in digital video to reduce data size, analytics is concerned with understanding the content of the video. Video analytics builds upon research in computer vision, pattern analysis and machine intelligence, and spans several industry segments, including surveillance, retail and transportation. It is also called video content analysis (VCA) or intelligent video.
Similar to human vision, which has perceptual and cognitive aspects, video analytics uses computer vision algorithms to perceive, or see, the scene and machine intelligence to interpret, learn and draw inferences. The goal of video analytics is scene understanding, which goes beyond motion detection. In addition to detecting motion, analytics qualifies the motion as an object, understands the context around the object and is able to track the object through the scene.
For instance, video analytics is used for automated surveillance. Smart cameras with analytics continuously analyze video and can detect the presence of people and vehicles and interpret their activities. Suspicious activities such as loitering or moving into an unauthorized area are automatically flagged and forwarded to security personnel. In transportation, cameras can capture and recognize license plate numbers for enforcement and toll collection purposes. In retail, video analytics can count the number of people waiting in line or passing through an aisle. These applications are currently in commercial use, with more sophisticated analysis techniques and broader applications expected in the coming years.
This article provides an overview of video analytics, the underlying techniques and processing steps, and considerations for embedded implementations of video analytics.
The video analytics processing pipeline
Most video analytics applications comprise a series of processing steps that provide increasingly detailed information about the activities in the scene. Fundamentally, an analytics application needs to detect changes occurring over successive frames of video, qualify those changes in each frame, correlate the qualified changes over multiple frames and, finally, interpret the correlated changes.
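The skeleton below sketches how these steps can be composed into a per-frame loop. It is purely illustrative: the stage functions are passed in as placeholders for the techniques described in the sections that follow, and the names are hypothetical rather than part of any particular product or library.

def run_pipeline(frames, segment, classify, update_tracks, recognize_activity):
    # tracks holds state carried across frames, e.g. a mapping from track id
    # to the positions observed so far.
    tracks = {}
    for frame in frames:
        blobs = segment(frame)                        # detect and group changed pixels
        labeled = [(blob, classify(blob)) for blob in blobs]
        tracks = update_tracks(tracks, labeled)       # associate blobs with existing tracks
        for event in recognize_activity(tracks):      # correlate tracks into activities
            yield event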
Segmentation is the process of detecting changes in the scene and extracting the relevant changes for further analysis and qualification. Pixels that have changed are referred to as foreground pixels; those that do not change are called background pixels. For this reason, segmentation is also called background subtraction: the pixels remaining after the background has been subtracted are the foreground pixels. The degree of change used to identify foreground pixels is a key factor in segmentation and can vary depending on the application. The result of segmentation is one or more foreground blobs, a blob being a collection of connected foreground pixels.
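A minimal segmentation sketch in Python appears below. It assumes a fixed reference background frame and a single global threshold, which are simplifications for illustration; deployed systems typically maintain adaptive, per-pixel background models.

import numpy as np
from scipy import ndimage

def segment(frame, background, threshold=25):
    # Simplifying assumptions: a fixed reference background image and one
    # global threshold, both supplied by the caller.
    # Mark pixels whose difference from the background exceeds the threshold.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    foreground = diff > threshold
    # Group connected foreground pixels into labeled blobs.
    labels, count = ndimage.label(foreground)
    blobs = []
    for blob_id in range(1, count + 1):
        mask = labels == blob_id
        ys, xs = np.nonzero(mask)
        blobs.append((mask, (xs.min(), ys.min(), xs.max(), ys.max())))
    return blobs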
Classification is the process of qualifying each blob and assigning a class label to it. This results in a broad categorization of each blob into sufficiently distinct classes such as person, vehicle or animal. Classification may be done on a single frame or may use information gathered over multiple frames. Some combination of properties, or features, of each blob is used to assign the class label. These features must be selected so that they provide sufficient discrimination between the valid classes.
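The hypothetical rule-based classifier below illustrates the idea. The features used (blob area and aspect ratio) and the thresholds are examples of the kind of discriminative properties described above, not values drawn from any particular system.

def classify(bounding_box, min_area=200):
    # Illustrative features and thresholds; real classifiers use richer feature
    # sets and often statistical or machine-learned models.
    x0, y0, x1, y1 = bounding_box
    width, height = x1 - x0 + 1, y1 - y0 + 1
    area = width * height
    if area < min_area:
        return "noise"        # too small to be a valid object
    aspect = height / width
    if aspect > 1.5:
        return "person"       # upright blobs are taller than they are wide
    if aspect < 0.8:
        return "vehicle"      # vehicles are typically wider than they are tall
    return "unknown"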
For some applications, classification may not be sufficient. Recognition is the process of identifying a specific instance, such as a particular license plate or the face of a specific individual. Recognition requires further analysis and prior knowledge of the object being recognized.
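One common way to frame recognition is nearest-neighbor matching of a feature vector, extracted from the blob, against a gallery of known instances, as in the hypothetical sketch below. How the feature vector is computed is application specific and is deliberately left abstract here.

import numpy as np

def recognize(feature, gallery, max_distance=0.5):
    # gallery: dict mapping identity -> reference feature vector (the prior
    # knowledge). The distance metric and acceptance threshold are
    # illustrative assumptions.
    best_id, best_dist = None, float("inf")
    for identity, reference in gallery.items():
        dist = np.linalg.norm(feature - reference)
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id if best_dist <= max_distance else None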
Tracking of classified foreground blobs takes place over multiple frames as objects move through the field of view. Tracking is a problem of blob association: for each blob in a starting frame, the position of that blob in successive frames must be identified. A trajectory can then be calculated for the object by connecting its positions over multiple frames.
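A simple association strategy is to match each existing track to the nearest blob centroid in the new frame, as in the sketch below. Practical trackers usually add motion prediction (Kalman filtering, for example) and appearance cues, which this illustration omits; the maximum-displacement threshold is an illustrative value.

import math

def update_tracks(tracks, centroids, max_distance=50.0):
    # tracks: dict of track id -> list of (x, y) positions; centroids: blob
    # centers detected in the current frame. The matching rule is a
    # deliberately simple nearest-neighbor association.
    unmatched = list(centroids)
    for track_id, history in tracks.items():
        if not unmatched:
            break
        last = history[-1]
        nearest = min(unmatched, key=lambda c: math.dist(c, last))
        if math.dist(nearest, last) <= max_distance:
            history.append(nearest)        # extend this object's trajectory
            unmatched.remove(nearest)
    next_id = max(tracks, default=-1) + 1
    for centroid in unmatched:             # any leftover blob starts a new track
        tracks[next_id] = [centroid]
        next_id += 1
    return tracks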
Activity recognition is the final step. It combines the results of classification and tracking, correlating the tracks of multiple blobs to infer the activity occurring in the video. For instance, if two blobs corresponding to people progressively come closer, this could be interpreted as people converging. If two blobs, one corresponding to a vehicle and the other to a person, happen to merge, this could be interpreted as a person getting into a vehicle.
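The toy rule below captures the converging-people example: it flags any pair of tracks whose mutual distance has decreased steadily over the last few frames. The window length and the event label are illustrative choices, not part of any standard.

import math
from itertools import combinations

def find_converging_pairs(tracks, window=5):
    # tracks: dict of track id -> list of (x, y) positions, as built by the tracker.
    events = []
    for (id_a, path_a), (id_b, path_b) in combinations(tracks.items(), 2):
        n = min(len(path_a), len(path_b), window)
        if n < 2:
            continue
        # Distance between the two objects over their last n observed positions.
        dists = [math.dist(path_a[i - n], path_b[i - n]) for i in range(n)]
        if all(d2 < d1 for d1, d2 in zip(dists, dists[1:])):
            events.append(("converging", id_a, id_b))
    return events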
Figure 1: Video Analytics Processing Pipeline
A typical video analytics processing pipeline is shown in Figure 1. Processing steps are shown as rectangular blocks, which include segmentation, classification, tracking and activity recognition. These processing blocks depend on models that could include a background model, a camera model, one or more appearance models, motion models and shape models. These models are generally updated over time as learning and adaptation take place.
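As a concrete example of such adaptation, a background model is often maintained as an exponential running average of recent frames, so that gradual scene changes such as lighting shifts or newly parked objects are slowly absorbed into the background. The sketch below uses an illustrative learning rate.

import numpy as np

def update_background(background, frame, learning_rate=0.02):
    # Blend the new frame into the background model; both are float arrays.
    # The learning rate controls how quickly the model adapts and is an
    # illustrative value, not a recommendation.
    return (1.0 - learning_rate) * background + learning_rate * frame.astype(np.float64)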
As frames progress through these processing steps, intermediate output results are produced. These are shown in the bubbles in the top row of Figure 1. Analytics applications in specific domains may not employ all these steps, or may not apply them strictly in the order specified. Multiple trackers or classifiers may run in parallel. These steps are described in detail next.