This article walks through the inference details of MTCNN, following the official example code.
### Building and Loading the MTCNN Model
The model consists of three cascaded networks: P-Net, R-Net, and O-Net.

Building the model:
```python
with tf.Graph().as_default():
    sess = tf.Session()
    with sess.as_default():
        pnet, rnet, onet = align.detect_face.create_mtcnn(sess, None)
```
Here, create_mtcnn loads the pre-trained model weights and returns callable entry points for running pnet, rnet, and onet:
```python
def create_mtcnn(sess, model_path):
    if not model_path:
        model_path, _ = os.path.split(os.path.realpath(__file__))

    with tf.variable_scope('pnet'):
        data = tf.placeholder(tf.float32, (None, None, None, 3), 'input')
        pnet = PNet({'data': data})
        pnet.load(os.path.join(model_path, 'det1.npy'), sess)
    with tf.variable_scope('rnet'):
        data = tf.placeholder(tf.float32, (None, 24, 24, 3), 'input')
        rnet = RNet({'data': data})
        rnet.load(os.path.join(model_path, 'det2.npy'), sess)
    with tf.variable_scope('onet'):
        data = tf.placeholder(tf.float32, (None, 48, 48, 3), 'input')
        onet = ONet({'data': data})
        onet.load(os.path.join(model_path, 'det3.npy'), sess)

    pnet_fun = lambda img: sess.run(('pnet/conv4-2/BiasAdd:0', 'pnet/prob1:0'), feed_dict={'pnet/input:0': img})
    rnet_fun = lambda img: sess.run(('rnet/conv5-2/conv5-2:0', 'rnet/prob1:0'), feed_dict={'rnet/input:0': img})
    onet_fun = lambda img: sess.run(('onet/conv6-2/conv6-2:0', 'onet/conv6-3/conv6-3:0', 'onet/prob1:0'), feed_dict={'onet/input:0': img})
    return pnet_fun, rnet_fun, onet_fun
```
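In the official example, the three returned callables are then passed to detect_face, which runs the full pipeline described below. A typical call, using the pnet, rnet, onet returned above, looks roughly like this (the image path is a placeholder):

```python
from scipy import misc
import align.detect_face

minsize = 20                 # smallest face size (in pixels) to search for
threshold = [0.6, 0.7, 0.7]  # per-stage score thresholds for P-Net, R-Net, O-Net
factor = 0.709               # pyramid scale factor

img = misc.imread('example.jpg')  # placeholder image path
# bounding_boxes: (num_faces, 5) -> x1, y1, x2, y2, score
# points: facial landmark coordinates predicted by O-Net
bounding_boxes, points = align.detect_face.detect_face(
    img, minsize, pnet, rnet, onet, threshold, factor)
```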
### MTCNN Detection
The detection process consists of four parts: building the image pyramid, followed by the P-Net, R-Net, and O-Net stages.
#### Building the Image Pyramid
```python
factor_count = 0
total_boxes = np.empty((0, 9))
points = np.empty(0)
h = img.shape[0]
w = img.shape[1]
minl = np.amin([h, w])
m = 12.0 / minsize
minl = minl * m
# create scale pyramid
scales = []
while minl >= 12:
    scales += [m * np.power(factor, factor_count)]
    minl = minl * factor
    factor_count += 1
```
Here is how the pyramid is built.

We take minsize to be the smallest face we expect to find in the image, and represent the image size by its shorter side, picsize. For a face of size minl in the image we then have minsize ≤ minl ≤ picsize.

Since MTCNN was trained on 12×12 patches, we want the scaled face to end up close to that size:

- when minl = minsize, the scale factor is at its largest, 12/minsize;
- when minl = picsize, the scale factor is at its smallest, 12/picsize.

Within the interval [12/picsize, 12/minsize], additional scales are added to cover intermediate face sizes. The granularity is controlled by factor: the closer factor is to 1, the more pyramid levels there are and the finer the coverage of face sizes. In practice factor is set to 0.709 (≈ 1/√2), which means each step roughly halves the image area.
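As a quick sanity check, here is a small worked sketch with hypothetical values (a 480×640 image, minsize=20, factor=0.709) that reproduces the loop above and lists the resulting scales:

```python
import numpy as np

# hypothetical example values, not from the original post
h, w = 480, 640
minsize, factor = 20, 0.709

m = 12.0 / minsize        # largest scale: 0.6
minl = min(h, w) * m      # 288.0
scales = []
factor_count = 0
while minl >= 12:
    scales.append(m * np.power(factor, factor_count))
    minl *= factor
    factor_count += 1

print(len(scales))          # 10 pyramid levels
print(np.round(scales, 3))
# [0.6   0.425 0.302 0.214 0.152 0.107 0.076 0.054 0.038 0.027]
```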
#### P-Net
The P-Net stage is the most involved; it works much like an RPN.

- Each image produced by the pyramid is fed into pnet. Since pnet is fully convolutional, the input size is unconstrained; for each input it outputs a heatmap of face probabilities and the corresponding bbox regression offsets.
- Candidate boxes are selected by thresholding the heatmap probabilities and then mapped back to the original image coordinates:
```python
# imap is the heatmap, t is the preset threshold
y, x = np.where(imap >= t)
if y.shape[0] == 1:
    dx1 = np.flipud(dx1)
    dy1 = np.flipud(dy1)
    dx2 = np.flipud(dx2)
    dy2 = np.flipud(dy2)
score = imap[(y, x)]
reg = np.transpose(np.vstack([dx1[(y, x)], dy1[(y, x)], dx2[(y, x)], dy2[(y, x)]]))
if reg.size == 0:
    reg = np.empty((0, 3))
bb = np.transpose(np.vstack([y, x]))
q1 = np.fix((stride * bb + 1) / scale)
q2 = np.fix((stride * bb + cellsize - 1 + 1) / scale)
boundingbox = np.hstack([q1, q2, np.expand_dims(score, 1), reg])
```
- The boxes from each scale are first filtered with NMS, then merged with the boxes generated at the other pyramid levels; after merging, NMS is applied once more (see the NMS sketch after this list).
- Bounding-box regression is then applied to fine-tune the box positions:
```python
# bbox regression
regw = total_boxes[:, 2] - total_boxes[:, 0]
regh = total_boxes[:, 3] - total_boxes[:, 1]
qq1 = total_boxes[:, 0] + total_boxes[:, 5] * regw
qq2 = total_boxes[:, 1] + total_boxes[:, 6] * regh
qq3 = total_boxes[:, 2] + total_boxes[:, 7] * regw
qq4 = total_boxes[:, 3] + total_boxes[:, 8] * regh
total_boxes = np.transpose(np.vstack([qq1, qq2, qq3, qq4, total_boxes[:, 4]]))
total_boxes = rerec(total_boxes.copy())
total_boxes[:, 0:4] = np.fix(total_boxes[:, 0:4]).astype(np.int32)
dy, edy, dx, edx, y, ey, x, ex, tmpw, tmph = pad(total_boxes.copy(), w, h)
```
- rerec: converts the regressed boxes into square boxes (a sketch of the idea also follows this list).
- pad: the next two networks need the image region covered by each bbox (the src) cropped out of the original image into a new patch (the target). pad returns the src coordinates (x, ex, y, ey) and the target coordinates (dx, edx, dy, edy) for each bbox.
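For reference, here is a minimal sketch of greedy NMS in 'Union' (IoU) mode, assuming boxes is an (N, 5) array of x1, y1, x2, y2, score. It illustrates the idea; it is not the exact nms implementation from the official code:

```python
import numpy as np

def nms_sketch(boxes, thresh):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    x1, y1, x2, y2, s = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3], boxes[:, 4]
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = np.argsort(s)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
        iou = inter / (area[i] + area[order[1:]] - inter)   # 'Union' mode
        order = order[1:][iou <= thresh]
    return keep
```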
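And a minimal sketch of what rerec does, assuming the first four columns of each box are x1, y1, x2, y2: each box is grown into a square of side max(w, h) centered on the original box.

```python
import numpy as np

def rerec_sketch(bboxes):
    """Expand each box to a square with side max(w, h), keeping the center fixed."""
    w = bboxes[:, 2] - bboxes[:, 0]
    h = bboxes[:, 3] - bboxes[:, 1]
    side = np.maximum(w, h)
    bboxes[:, 0] = bboxes[:, 0] + w * 0.5 - side * 0.5
    bboxes[:, 1] = bboxes[:, 1] + h * 0.5 - side * 0.5
    bboxes[:, 2] = bboxes[:, 0] + side
    bboxes[:, 3] = bboxes[:, 1] + side
    return bboxes
```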
#### R-Net
The input to R-Net is the set of ROIs produced by P-Net, similar to the head of a classic two-stage detector.
```python
numbox = total_boxes.shape[0]
if numbox > 0:
    # second stage
    tempimg = np.zeros((24, 24, 3, numbox))
    for k in range(0, numbox):
        tmp = np.zeros((int(tmph[k]), int(tmpw[k]), 3))
        tmp[dy[k]-1:edy[k], dx[k]-1:edx[k], :] = img[y[k]-1:ey[k], x[k]-1:ex[k], :]
        if tmp.shape[0] > 0 and tmp.shape[1] > 0 or tmp.shape[0] == 0 and tmp.shape[1] == 0:
            tempimg[:, :, :, k] = imresample(tmp, (24, 24))
        else:
            return np.empty()
    tempimg = (tempimg - 127.5) * 0.0078125
    tempimg1 = np.transpose(tempimg, (3, 1, 0, 2))
    out = rnet(tempimg1)
    out0 = np.transpose(out[0])
    out1 = np.transpose(out[1])
    score = out1[1, :]
    ipass = np.where(score > threshold[1])
    total_boxes = np.hstack([total_boxes[ipass[0], 0:4].copy(), np.expand_dims(score[ipass].copy(), 1)])
    mv = out0[:, ipass[0]]
    if total_boxes.shape[0] > 0:
        pick = nms(total_boxes, 0.7, 'Union')
        total_boxes = total_boxes[pick, :]
        total_boxes = bbreg(total_boxes.copy(), np.transpose(mv[:, pick]))
        total_boxes = rerec(total_boxes.copy())
```
- The boxes produced by P-Net are cropped out of the original image, normalized, and fed into R-Net.
- The outputs are first filtered by score, then further filtered with NMS.
- Bounding-box regression (bbreg) corrects the positions, and rerec converts the boxes back into squares (a sketch of bbreg follows).
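For completeness, a minimal sketch of what the bbreg step amounts to, assuming boxes holds x1, y1, x2, y2 in its first four columns and reg is an (N, 4) array of per-box offsets predicted by the network, expressed as fractions of the box width and height. This mirrors the P-Net regression code shown earlier rather than quoting the library function:

```python
import numpy as np

def bbreg_sketch(boxes, reg):
    """Shift each corner by the predicted offset, scaled by the box size."""
    w = boxes[:, 2] - boxes[:, 0] + 1
    h = boxes[:, 3] - boxes[:, 1] + 1
    boxes[:, 0] = boxes[:, 0] + reg[:, 0] * w
    boxes[:, 1] = boxes[:, 1] + reg[:, 1] * h
    boxes[:, 2] = boxes[:, 2] + reg[:, 2] * w
    boxes[:, 3] = boxes[:, 3] + reg[:, 3] * h
    return boxes
```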
#### O-Net
O-Net works in essentially the same way as R-Net; its final outputs are the bounding boxes plus the facial landmark points.
Because the MTCNN output is then handed to a downstream CNN for recognition, there is some extra post-processing after O-Net: a margin is added around each bbox, and the cropped image is whitened:
```python
det = np.squeeze(bounding_boxes[0, 0:4])
bb = np.zeros(4, dtype=np.int32)
bb[0] = np.maximum(det[0] - margin / 2, 0)
bb[1] = np.maximum(det[1] - margin / 2, 0)
bb[2] = np.minimum(det[2] + margin / 2, img_size[1])
bb[3] = np.minimum(det[3] + margin / 2, img_size[0])
cropped = img[bb[1]:bb[3], bb[0]:bb[2], :]
aligned = misc.imresize(cropped, (image_size, image_size), interp='bilinear')
prewhitened = facenet_util.prewhiten(aligned)
```
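The whitening normalizes the crop to zero mean and roughly unit standard deviation before it is handed to the recognition network. A minimal sketch of what facenet_util.prewhiten amounts to, assuming it follows facenet's standard prewhiten:

```python
import numpy as np

def prewhiten_sketch(x):
    """Zero-mean, unit-variance normalization with a lower bound on the std."""
    mean = np.mean(x)
    std = np.std(x)
    std_adj = np.maximum(std, 1.0 / np.sqrt(x.size))  # guard against near-constant crops
    return (x - mean) / std_adj
```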
Note that the official code does not perform face alignment.