In median filtering, the previous N frames of video are buffered, and the background is calculated as the median of buffered frames. Then (as with frame difference), the background is subtracted from the current frame and thresholded to determine the foreground pixels.
Median filtering has been shown to be very robust, with performance comparable to higher-complexity methods. However, storing and processing many frames of video (as is often required to track slow-moving objects) requires a prohibitively large amount of memory. This can be alleviated somewhat by storing and processing frames at a rate lower than the frame rate, thereby lowering storage and computation requirements at the expense of a slower-adapting background.
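The buffered-median scheme above can be sketched in a few lines. This is a minimal NumPy version, not the article's own implementation (which is a MATLAB m-file); the buffer size N and threshold T are illustrative values, not ones from the article:

```python
import numpy as np

N = 20   # number of buffered frames (illustrative choice)
T = 30   # foreground threshold on intensity difference (illustrative choice)

def median_background(frame_buffer):
    """Background = per-pixel median of the buffered frames."""
    return np.median(np.stack(frame_buffer), axis=0)

def foreground_mask(frame, background, threshold=T):
    """Subtract the background from the current frame and threshold
    the absolute difference to label foreground pixels."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold
```

In a real pipeline the buffer would be a circular list of the last N frames (possibly subsampled in time, as noted above), updated as each new frame arrives.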
A more efficient compromise was devised back in 1995 by UK researchers N.J.B. McFarlane and C.P. Schofield. While doing government-funded research on tracking piglets in large commercial farms, they came up with an efficient recursive approximation of the median filter. Their 'approximate median' method, presented in the paper 'Segmentation and tracking of piglets in images', has since been widely implemented and applied to a broad range of background subtraction scenarios.
The approximate median method works as follows: if a pixel in the current frame has a value larger than the corresponding background pixel, the background pixel is incremented by one. Likewise, if the current pixel is less than the background pixel, the background pixel is decremented by one. In this way, the background eventually converges to an estimate where half the input pixels are greater than the background and half are less than it, which is approximately the median (convergence time will vary with the frame rate and the amount of movement in the scene). Figure 4 shows the approximate median method at work on the test video.
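The update rule above needs only the current frame and the running background estimate, so memory cost drops to a single frame. Here is a hedged NumPy sketch of that rule (again, not the authors' original MATLAB code); the clip to [0, 255] assumes 8-bit grayscale frames:

```python
import numpy as np

def update_background(background, frame):
    """Approximate-median update: increment background pixels that are
    below the current frame, decrement those that are above it."""
    bg = background.astype(np.int16)
    # Boolean comparisons promote to 0/1, giving the +/- 1 nudge per pixel.
    bg = bg + (frame > bg) - (frame < bg)
    return np.clip(bg, 0, 255).astype(np.uint8)
```

Because each pixel moves by at most one intensity level per frame, the background adapts slowly, which is exactly the long-history behavior described in the next paragraph.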
Figure 4. Approximate median output
As you can see, the approximate median method does a much better job of separating the entire object from the background. This is because the slowly adapting background incorporates a longer history of the visual scene, achieving roughly the same result as if we had buffered and processed N frames.
We do see some trails behind the larger objects (the cars). This is due to updating the background at a relatively high rate (30 fps). In a real application, the frame rate would likely be lower (say, 15 fps). If you'd like to tinker with the update rate and eliminate the trails, the m-file can be downloaded here.
To get a feel for how the background model works, sometimes it's useful to visualize it. Below is a video of the background model. Rather ghostlike if you ask me.
Figure 5. Approximate median background
This method is a very good compromise. It offers performance near that of higher-complexity methods (according to my research and the academic literature), while costing little more in computation and storage than frame differencing.