TIP: Please read the paper once first, then use the following notes to understand the algorithm better.
The main idea of the paper is to model the appearance of the target object from frame to frame by constantly updating a correlation filter trained on example images, and to use the trained filter to update the location of the target in the current frame. The correlation filter is trained in the Fourier domain, which makes it computationally efficient.
An important aspect of this algorithm is that, despite its simplicity, it performs well under changes in rotation, scale, and lighting, and under partial occlusion of the target object.
We go through the code available at the link and work through the implementation. Glancing through the code, a good place to start is the MOSSE class.
class MOSSE:
    def __init__(self, frame, rect):
        x1, y1, x2, y2 = rect
        w, h = map(cv2.getOptimalDFTSize, [x2-x1, y2-y1])
        x1, y1 = (x1+x2-w)//2, (y1+y2-h)//2
        self.pos = x, y = x1+0.5*(w-1), y1+0.5*(h-1)
        self.size = w, h
        img = cv2.getRectSubPix(frame, (w, h), (x, y))
The constructor __init__ of the MOSSE class takes frame and rect as inputs, where frame is the input frame and rect is the initial bounding box marked by the user, enclosing the object of interest to be tracked in the rest of the frames. The bounding box coordinates x1, y1, x2, y2 are with respect to the frame.
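For context, here is a minimal usage sketch. The frame source, the rectangle values, and the variable names are illustrative and are not taken from the sample's App class; only the MOSSE class and its update method come from the code above and below.

import cv2

# Illustrative usage: build a tracker from a user-drawn box, then track.
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

rect = (100, 80, 220, 200)         # hypothetical (x1, y1, x2, y2) drawn by the user
tracker = MOSSE(frame_gray, rect)  # trains the initial filter on this patch

ok, frame = cap.read()
frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
tracker.update(frame_gray)         # re-locate the target and refine the filter
print(tracker.pos, tracker.good)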
# map function: map(f, iterable) <==> [f(x) for x in iterable]
w, h = map(cv2.getOptimalDFTSize, [x2-x1, y2-y1])
x1, y1 = (x1+x2-w)//2, (y1+y2-h)//2
cv2.getOptimalDFTSize(vecsize) (doc) returns the minimum number N greater than or equal to vecsize such that the DFT of a vector of size N can be computed efficiently. Here w, h are the updated width and height of the bounding box. Once we have the new w, h, we update the coordinates x1, y1. Rewriting (x1+x2-w)//2 as (x1+x2)/2 - w/2 makes the math easier to see: the new top-left corner is chosen so that the resized box stays centred on the centre of the original box.
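As a quick sanity check on this arithmetic (the numbers below are made up for illustration):

import cv2

# Illustrative numbers: a user-drawn box 65 px wide starting at x1 = 10.
x1, x2 = 10, 75
w = cv2.getOptimalDFTSize(x2 - x1)   # 65 -> 72 (next size whose only prime factors are 2, 3, 5)
x1_new = (x1 + x2 - w) // 2          # integer form of (x1 + x2)/2 - w/2, keeps the box centred -> 6
print(w, x1_new)                     # 72 6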
self.pos = x, y = x1+0.5*(w-1), y1+0.5*(h-1)
self.size = w, h
img = cv2.getRectSubPix(frame, (w, h), (x, y))
We find the centre (x, y) of the bounding box using the updated x1, y1; these coordinates are with respect to the frame. We then extract the part of frame that is centred at (x, y), with width w and height h, using cv2.getRectSubPix (doc).
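A small, self-contained sketch of what this extraction looks like (the frame here is a random stand-in array):

import numpy as np
import cv2

# Extract a (w, h) patch centred at a possibly sub-pixel location.
frame = np.random.randint(0, 255, (100, 100), np.uint8)   # stand-in for a video frame
w, h = 32, 16
x, y = 40.5, 30.25                                         # centre coordinates in the frame
patch = cv2.getRectSubPix(frame, (w, h), (x, y))
print(patch.shape)                                         # (16, 32): rows x cols = h x w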
The next step in the algorithm is to generate the ground truth for each input image img, and also to preprocess the input.
self.win = cv2.createHanningWindow((w, h), cv2.CV_32F)
g = np.zeros((h, w), np.float32)
g[h//2, w//2] = 1
g = cv2.GaussianBlur(g, (-1, -1), 2.0)
g /= g.max()
self.G = cv2.dft(g, flags=cv2.DFT_COMPLEX_OUTPUT)
self.win = cv2.createHanningWindow((w, h), cv2.CV_32F)
Here cv2.createHanningWindow (doc) generates the Hanning window coefficients of size w x h. The input image signal is modulated by this window in order to reduce edge effects (high-frequency artifacts).
(Figure: Hanning window)
NOTE: To understand the role of the Hanning window, you need to understand the concept of DFT leakage. In short, when we deal with digital signals, if the input signal contains frequencies that are not integer multiples of the analysis frequencies, DFT leakage occurs. This can be reduced by modulating the input signal with a tapered window function such as a Hamming or Hanning window (rather than the implicit rectangular window) before taking the Fourier transform.
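Here is a tiny NumPy sketch of the leakage idea (separate from the tracker code): 4.5 cycles of a sine do not fit a 64-sample window exactly, so without a window the energy spreads across many DFT bins, while a Hanning window keeps most of it near the peak.

import numpy as np

# DFT leakage sketch: 4.5 cycles do not line up with the 64-point DFT bins.
n = np.arange(64)
x = np.sin(2 * np.pi * 4.5 * n / 64)

X_rect = np.abs(np.fft.rfft(x))                    # rectangular window (no windowing)
X_hann = np.abs(np.fft.rfft(x * np.hanning(64)))   # Hanning-windowed

def leaked(X):
    # Fraction of spectral energy outside the 3 bins around the peak.
    k = int(np.argmax(X))
    main = X[max(k - 1, 0):k + 2]
    return 1.0 - (main ** 2).sum() / (X ** 2).sum()

print(leaked(X_rect), leaked(X_hann))   # leakage drops sharply with the window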
g = np.zeros((h, w), np.float32)
g[h//2, w//2] = 1
g = cv2.GaussianBlur(g, (-1, -1), 2.0)
g /= g.max()
The basic idea of the algorithm is to train a correlation filter H* which, when correlated with the current input frame, produces a strong peak at the new location of the target. The filter is trained in the Fourier domain because it is computationally efficient.
To train, we need input/ground-truth pairs. For each input image img, the ground truth g is a Gaussian-shaped peak centred on the target. The code above generates a normalized Gaussian with standard deviation 2.0 as the ground truth g for each input image img.
(Figure: ground truth, a Gaussian peak)
self.G = cv2.dft(g, flags=cv2.DFT_COMPLEX_OUTPUT)
Since we train the filter in the Fourier domain, we take the Fourier transform of the Gaussian generated in the previous step.
Once we have generated the ground truth g, we generate more training examples, say num_images = 128, in order to get a good initial estimate of H*. This is done by randomly warping the input image img using the cv2.warpAffine (doc) function.
for i in xrange(128):
    a = self.preprocess(rnd_warp(img))
    A = cv2.dft(a, flags=cv2.DFT_COMPLEX_OUTPUT)
    self.H1 += cv2.mulSpectrums(self.G, A, 0, conjB=True)
    self.H2 += cv2.mulSpectrums( A, A, 0, conjB=True)
self.update_kernel()
self.update(frame)
cv2.warpAffine needs a 2 x 3 transformation matrix T, where T[:2, :2] is a 2 x 2 rotation matrix (doc) and T[:, 2] is a 2 x 1 translation vector. An example of assembling such a matrix is sketched below.
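As a small illustration (a pure 10-degree rotation about the image centre, on a made-up test image, built the same way rnd_warp builds T):

import numpy as np
import cv2

img = np.random.randint(0, 255, (240, 320), np.uint8)   # stand-in test image
h, w = img.shape[:2]

ang = np.deg2rad(10)
c, s = np.cos(ang), np.sin(ang)
T = np.zeros((2, 3))
T[:2, :2] = [[c, -s], [s, c]]                # 2 x 2 rotation block
centre = np.array([w / 2, h / 2])
T[:, 2] = centre - T[:2, :2].dot(centre)     # translation so the centre maps to itself

rotated = cv2.warpAffine(img, T, (w, h), borderMode=cv2.BORDER_REFLECT)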
a = self.preprocess(rnd_warp(img))
def rnd_warp(a):
    h, w = a.shape[:2]
    T = np.zeros((2, 3))
    coef = 0.2
    ang = (np.random.rand()-0.5)*coef
    c, s = np.cos(ang), np.sin(ang)
    T[:2, :2] = [[c,-s], [s, c]]
    T[:2, :2] += (np.random.rand(2, 2) - 0.5)*coef
    c = (w/2, h/2)
    T[:,2] = c - np.dot(T[:2, :2], c)
    return cv2.warpAffine(a, T, (w, h), borderMode = cv2.BORDER_REFLECT)
The output of warping is shown below.
(Figures: input image | warped image)
The next step is to preprocess the warped image.
def preprocess(self, img):
    img = np.log(np.float32(img)+1.0)
    img = (img-img.mean()) / (img.std()+eps)
    return img*self.win
During preprocessing, the input is transformed using a log function, which helps in low-contrast lighting situations. We then normalize the input by subtracting the mean and dividing by the standard deviation. The input is also modulated by the Hanning window, as discussed before, to reduce edge effects. The output of the above code is shown below.
(Figures: input image | log-transformed with reduced edge effects; image values are normalized to 0-255 for display purposes)
The next step implements the MOSSE filter equation

$$H^{*} = \frac{\sum_i G_i \odot F_i^{*}}{\sum_i F_i \odot F_i^{*}}$$

where G is the Fourier transform of the ground truth g, the products are element-wise, and the division is also element-wise.
self.H1 += cv2.mulSpectrums(self.G, A, 0, conjB=True)
self.H2 += cv2.mulSpectrums( A, A, 0, conjB=True)
Here F is the Fourier transform of the input a (in the code, F = A), * indicates the complex conjugate, and i indexes the training images, which range from 1 to 128 in our case.
self.H1 and self.H2 accumulate the numerator and denominator respectively, using cv2.mulSpectrums (doc). The conjB=True option conjugates the second input before multiplication.
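The same accumulation written with NumPy's complex FFT may make the element-wise products easier to see. This is only a sketch, not the sample's code (cv2.dft / cv2.mulSpectrums operate on 2-channel real/imaginary arrays rather than complex dtypes):

import numpy as np

def train_filter(patches, g):
    # patches: the 128 preprocessed random warps, g: the Gaussian target.
    G = np.fft.fft2(g)
    H1 = np.zeros_like(G)          # numerator accumulator
    H2 = np.zeros_like(G)          # denominator accumulator
    for a in patches:
        F = np.fft.fft2(a)
        H1 += G * np.conj(F)       # sum_i G . F_i*
        H2 += F * np.conj(F)       # sum_i F_i . F_i*
    return H1, H2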
self.update_kernel()
def update_kernel(self):
    self.H = divSpec(self.H1, self.H2)
    self.H[...,1] *= -1

def divSpec(A, B):
    Ar, Ai = A[...,0], A[...,1]
    Br, Bi = B[...,0], B[...,1]
    C = (Ar+1j*Ai)/(Br+1j*Bi)
    C = np.dstack([np.real(C), np.imag(C)]).copy()
    return C
As the next step, we get the initial estimate of H* by calling self.update_kernel. This function calls divSpec, which performs element-wise complex division by grouping the real and imaginary channels of the data. The extra line self.H[...,1] *= -1 stores the complex conjugate of the result; the conjB=True flag in correlate (shown later) conjugates it back, so the response ends up being computed as F multiplied element-wise by H*.
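In complex-number terms, update_kernel boils down to the following sketch (the eps term is added here only as a safeguard; divSpec divides directly):

import numpy as np

def update_kernel_np(H1, H2, eps=1e-5):
    Hstar = H1 / (H2 + eps)    # element-wise complex division -> H*, the trained filter
    return np.conj(Hstar)      # stored conjugated; correlate() conjugates it back via conjB=True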
self.update(frame)
def update(self, frame, rate = 0.125):
    (x, y), (w, h) = self.pos, self.size
    self.last_img = img = cv2.getRectSubPix(frame, (w, h), (x, y))
    img = self.preprocess(img)
    self.last_resp, (dx, dy), self.psr = self.correlate(img)
    self.good = self.psr > 8.0
    if not self.good:
        return
    self.pos = x+dx, y+dy
    self.last_img = img = cv2.getRectSubPix(frame, (w, h), self.pos)
    img = self.preprocess(img)
    A = cv2.dft(img, flags=cv2.DFT_COMPLEX_OUTPUT)
    H1 = cv2.mulSpectrums(self.G, A, 0, conjB=True)
    H2 = cv2.mulSpectrums( A, A, 0, conjB=True)
    self.H1 = self.H1 * (1.0-rate) + H1 * rate
    self.H2 = self.H2 * (1.0-rate) + H2 * rate
    self.update_kernel()
def correlate(self, img):
    C = cv2.mulSpectrums(cv2.dft(img, flags=cv2.DFT_COMPLEX_OUTPUT), self.H, 0, conjB=True)
    resp = cv2.idft(C, flags=cv2.DFT_SCALE | cv2.DFT_REAL_OUTPUT)
    h, w = resp.shape
    _, mval, _, (mx, my) = cv2.minMaxLoc(resp)
    side_resp = resp.copy()
    cv2.rectangle(side_resp, (mx-5, my-5), (mx+5, my+5), 0, -1)
    smean, sstd = side_resp.mean(), side_resp.std()
    psr = (mval-smean) / (sstd+eps)
    return resp, (mx-w//2, my-h//2), psr
So far, we have the initial estimate of H*. With this H*, we update the location of the target in the current frame, which is done by calling the function correlate.
The line C = cv2.mulSpectrums(cv2.dft(img, flags=cv2.DFT_COMPLEX_OUTPUT), self.H, 0, conjB=True) correlates the input image with the filter; this is again done in the Fourier domain. resp = cv2.idft(C, flags=cv2.DFT_SCALE | cv2.DFT_REAL_OUTPUT) then takes the inverse DFT of that product to get the response in the spatial domain.
Then we find the maximum response and its location using cv2.minMaxLoc (doc) in the line _, mval, _, (mx, my) = cv2.minMaxLoc(resp).
We then calculate the Peak-to-Sidelobe Ratio (PSR), given by PSR = (mval - mean_sidelobe) / std_sidelobe, where mval is the maximum response. The sidelobe statistics are computed over the response with an 11x11 window around the peak (mx, my) masked out: psr = (mval-smean) / (sstd+eps), where eps = 1e-5 is used for numerical stability. If psr <= 8.0, the object is considered to be occluded or tracking is considered to have failed.
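Put together, correlate amounts to the following NumPy sketch (same assumption as before: complex arrays instead of OpenCV's 2-channel layout; Hstar is the H1/H2 ratio from the earlier sketches):

import numpy as np

def correlate_np(a, Hstar, eps=1e-5):
    F = np.fft.fft2(a)
    resp = np.real(np.fft.ifft2(F * Hstar))    # response F . H*, back in the spatial domain
    h, w = resp.shape
    my, mx = np.unravel_index(np.argmax(resp), resp.shape)

    side = resp.copy()
    side[max(my - 5, 0):my + 6, max(mx - 5, 0):mx + 6] = 0   # mask the 11x11 window around the peak
    psr = (resp[my, mx] - side.mean()) / (side.std() + eps)
    return (mx - w // 2, my - h // 2), psr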
self.H1 = self.H1 * (1.0-rate) + H1 * rate
self.H2 = self.H2 * (1.0-rate) + H2 * rate
self.update_kernel()
Once we have the location self.pos of the target in the current frame, we update our estimate of H* by computing H1 and H2 for the new patch and blending them into self.H1 and self.H2 with a learning rate of 0.125. This continues for each frame: the filter and the location of the target are updated hand in hand.
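Written out in the code's variable names, this is the running-average update (the paper writes the same recursion with A_i, B_i and learning rate η, here η = rate = 0.125):

$$H1_i = \eta \,(G \odot F_i^{*}) + (1 - \eta)\, H1_{i-1}$$
$$H2_i = \eta \,(F_i \odot F_i^{*}) + (1 - \eta)\, H2_{i-1}$$
$$H_i^{*} = \frac{H1_i}{H2_i}$$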
That's how tracking with a correlation filter is done! I hope these notes helped you understand the algorithm better. Thank you for reading.
Summary of the paper: