In the previous post we covered data preprocessing and loading. In this one we walk through the YOLOv3 training procedure. To really understand this part, it helps to be familiar with the inference code as well.
Data loading
def train(
        cfg,
        data_cfg,
        img_size=416,
        resume=False,
        epochs=273,  # 500200 batches at bs 64, dataset length 117263
        batch_size=16,
        accumulate=1,
        multi_scale=False,
        freeze_backbone=False,
        transfer=False  # Transfer learning (train only YOLO layers)
):
    init_seeds()
    weights = 'weights' + os.sep  # os.sep keeps the path portable: the separator is '\' on Windows and '/' on Linux
    latest = weights + 'latest.pt'
    best = weights + 'best.pt'
    device = torch_utils.select_device()

    # Configure run
    train_path = parse_data_cfg(data_cfg)['train']  # data_cfg='data/coco.data', train=../coco/trainvalno5k.txt

    # Initialize model
    model = Darknet(cfg, img_size).to(device)

    # Optimizer
    optimizer = optim.SGD(model.parameters(), lr=hyp['lr0'], momentum=hyp['momentum'], weight_decay=hyp['weight_decay'])

    cutoff = -1  # backbone reaches to cutoff layer
    start_epoch = 0
    best_loss = float('inf')
    nf = int(model.module_defs[model.yolo_layers[0] - 1]['filters'])  # yolo layer size (i.e. 255)

    if resume:  # Load previously saved model
        if transfer:  # Transfer learning
            chkpt = torch.load(weights + 'yolov3.pt', map_location=device)
            model.load_state_dict({k: v for k, v in chkpt['model'].items() if v.numel() > 1 and v.shape[0] != 255},
                                  strict=False)
            for p in model.parameters():
                p.requires_grad = True if p.shape[0] == nf else False
        else:  # resume from latest.pt
            chkpt = torch.load(latest, map_location=device)  # load checkpoint
            model.load_state_dict(chkpt['model'])
            start_epoch = chkpt['epoch'] + 1
            if chkpt['optimizer'] is not None:
                optimizer.load_state_dict(chkpt['optimizer'])
                best_loss = chkpt['best_loss']
        del chkpt
    else:  # Initialize model with backbone (optional)
        if '-tiny.cfg' in cfg:
            cutoff = load_darknet_weights(model, weights + 'yolov3-tiny.conv.15')
        else:
            cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')

    # Scheduler
    lf = lambda x: 1 - 10 ** (hyp['lrf'] * (1 - x / epochs))  # inverse exp ramp to lr0 * 1e-2
    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lf, last_epoch=start_epoch - 1)

    # Dataset
    dataset = LoadImagesAndLabels(train_path, img_size=img_size, augment=True)
    # dataset.__getitem__(0)

    # Initialize distributed training
    if torch.cuda.device_count() > 1:
        dist.init_process_group(backend=opt.backend, init_method=opt.dist_url, world_size=opt.world_size, rank=opt.rank)
        model = torch.nn.parallel.DistributedDataParallel(model)
        sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    else:
        sampler = None

    # Dataloader
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            num_workers=opt.num_workers,
                            shuffle=True,
                            pin_memory=True,
                            collate_fn=dataset.collate_fn,
                            sampler=sampler)
This block implements the data-loading pipeline described in the previous post, along with path configuration and mode selection (resume, transfer learning, or training from a pretrained backbone). I won't go through it line by line; feel free to ask in the comments if anything is unclear.
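For orientation, here is a minimal sketch of how train() might be invoked directly. In the repo, train() is actually driven by an argparse wrapper at the bottom of train.py, and the paths below are assumptions, not the repo's exact defaults.

# Hypothetical direct call; in the repo an argparse CLI fills these arguments.
if __name__ == '__main__':
    train(
        cfg='cfg/yolov3.cfg',        # model definition file (assumed path)
        data_cfg='data/coco.data',   # dataset config that points at trainvalno5k.txt
        img_size=416,
        epochs=273,
        batch_size=16,
        resume=False,                # False: start from the darknet53.conv.74 backbone weights
    )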
Training loop and loss
for epoch in range(start_epoch, epochs):
    model.train()
    print(('\n%8s%12s' + '%10s' * 7) % ('Epoch', 'Batch', 'xy', 'wh', 'conf', 'cls', 'total', 'nTargets', 'time'))

    # Update scheduler
    scheduler.step()

    # Freeze backbone at epoch 0, unfreeze at epoch 1
    if freeze_backbone and epoch < 2:
        for name, p in model.named_parameters():
            if int(name.split('.')[1]) < cutoff:  # if layer < 75
                p.requires_grad = False if epoch == 0 else True

    mloss = torch.zeros(5).to(device)  # mean losses
    for i, (imgs, targets, _, _) in enumerate(dataloader):
        imgs = imgs.to(device)
        targets = targets.to(device)
        nt = len(targets)
        # if nt == 0:  # if no targets continue
        #     continue

        # Plot images with bounding boxes
        if epoch == 0 and i == 0:
            plot_images(imgs=imgs, targets=targets, fname='train_batch0.jpg')

        # SGD burn-in
        if epoch == 0 and i <= n_burnin:
            lr = hyp['lr0'] * (i / n_burnin) ** 4
            for x in optimizer.param_groups:
                x['lr'] = lr

        # Run model
        pred = model(imgs)

        # Compute loss
        loss, loss_items = compute_loss(pred, targets, model)
The training procedure:
1. Set the epochs parameter, which determines how many passes are made over the entire dataset.
2. Inside the epoch loop, the model is switched to training mode, and for i, (imgs, targets, _, _) in enumerate(dataloader): fetches the preprocessed images and labels. Note that both the images and the labels have been resized to 416*416, so they no longer match the raw values in the label .txt files.
3. The batch of images imgs is fed into the model, i.e. the YOLOv3 Darknet. It has three YOLO layers, and each outputs a feature map with a shape such as (bs, 3, 13, 13, 85), where bs is the batch size, 3 is the number of anchors per grid cell, 13*13 is the grid resolution, and 85 = xywh + objectness confidence + 80 class scores.
class YOLOLayer(nn.Module):  # x = module[0](x, img_size)
    def __init__(self, anchors, nc, img_size, yolo_layer, cfg):
        super(YOLOLayer, self).__init__()
        self.anchors = torch.Tensor(anchors)
        self.na = len(anchors)  # number of anchors (3)
        self.nc = nc  # number of classes (80)
        self.img_size = 0
        if ONNX_EXPORT:  # grids must be computed in __init__
            stride = [32, 16, 8][yolo_layer]  # stride of this layer
            if cfg.endswith('yolov3-tiny.cfg'):
                stride *= 2
            ng = (int(img_size[0] / stride), int(img_size[1] / stride))  # number grid points
            create_grids(self, max(img_size), ng)

    def forward(self, p, img_size, var=None):
        if ONNX_EXPORT:
            bs = 1  # batch size
        else:
            bs, nx, ny = p.shape[0], p.shape[-2], p.shape[-1]
            if self.img_size != img_size:
                create_grids(self, img_size, (nx, ny), p.device)

        # p.view(bs, 255, 13, 13) --> (bs, 3, 13, 13, 85)  # (bs, anchors, grid, grid, classes + xywh)
        p = p.view(bs, self.na, self.nc + 5, self.nx, self.ny).permute(0, 1, 3, 4, 2).contiguous()  # prediction

        if self.training:
            return p
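To make the view/permute at the end of YOLOLayer concrete, here is a small shape-only sketch; bs=2, 80 classes and the 13*13 scale are assumed values.

import torch

bs, na, nc, ny, nx = 2, 3, 80, 13, 13        # batch, anchors, classes, grid size
p = torch.randn(bs, na * (nc + 5), ny, nx)   # raw conv output: (2, 255, 13, 13)

# same reshape as in YOLOLayer.forward
p = p.view(bs, na, nc + 5, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
print(p.shape)                               # torch.Size([2, 3, 13, 13, 85])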
4. loss, loss_items = compute_loss(pred, targets, model): the outputs of the three YOLO layers are compared against the targets built from the labels to obtain the loss.
5. Back-propagate the loss and step the configured optimizer (see the sketch after this list).
6. Save the final weights.
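The training-loop snippet above stops at compute_loss. For steps 5 and 6, the rest of the loop in the repo roughly looks like the following simplified sketch (gradient accumulation is kept; multi-scale, logging and the exact best-checkpoint logic are omitted):

        # Step 5: back-propagate and step the optimizer
        loss.backward()
        if (i + 1) % accumulate == 0:    # support gradient accumulation when accumulate > 1
            optimizer.step()
            optimizer.zero_grad()

    # Step 6: save a checkpoint at the end of each epoch
    chkpt = {'epoch': epoch,
             'best_loss': best_loss,
             'model': model.state_dict(),
             'optimizer': optimizer.state_dict()}
    torch.save(chkpt, latest)            # always overwrite weights/latest.pt
    # the repo also copies latest.pt to best.pt whenever the loss improves
    del chkpt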
How the loss is composed:
First, I'll start from the idea behind the YOLOv3 algorithm so you have the overall picture, then go down to the code level. That way you can see how the code reproduces the algorithm and really understand YOLOv3, which helps directly whether you want to train on your own data or adapt YOLOv3 to your own setting.
The idea of YOLOv3: first design Darknet, a ResNet-like backbone, and pretrain it on ImageNet. Then remove the final fully-connected layer and add outputs at three scales (13, 26 and 52), i.e. the YOLO layers (the extra scales improve detection of small objects), and fine-tune on COCO. The COCO labels provide the class and the ground-truth box (x, y, w, h). The loss consists of four parts: lxy, lwh, lcls and lconf (each component is explained in the code below).
This raises an important question: each scale's feature map has three anchors, so for a given ground-truth box, which anchor is responsible for matching it? As in YOLOv1, if the center of a ground truth falls inside a cell, the three anchor boxes of that cell are the candidates to predict it. Which one actually does is decided during training: the anchor box with the largest IoU against the ground truth predicts it, and the remaining two are not matched to that ground truth. YOLOv3 effectively assumes each cell contains at most one ground truth, which in practice is almost always the case. The matched anchor box contributes coordinate loss, confidence loss (with target 1) and classification loss, while the other anchor boxes contribute only confidence loss (with target 0).
def compute_loss(p, targets, model):  # predictions, targets, model
    ft = torch.cuda.FloatTensor if p[0].is_cuda else torch.Tensor
    lxy, lwh, lcls, lconf = ft([0]), ft([0]), ft([0]), ft([0])
    # build_targets picks, on the 13/26/52 grids, the anchor boxes above the IoU threshold
    # that best match each label and turns them into training targets
    txy, twh, tcls, indices = build_targets(model, targets)
    # txy: target xy offsets, twh: target wh, indices = (image index, anchor index, gj, gi) per layer

    # Define criteria
    MSE = nn.MSELoss()
    CE = nn.CrossEntropyLoss()
    BCE = nn.BCEWithLogitsLoss()

    # Compute losses
    h = model.hyp  # hyperparameters
    bs = p[0].shape[0]  # batch size
    k = h['k'] * bs  # loss gain
    for i, pi0 in enumerate(p):  # layer i predictions
        b, a, gj, gi = indices[i]  # image, anchor, grid_y, grid_x
        tconf = torch.zeros_like(pi0[..., 0])  # conf

        # Compute losses
        if len(b):  # number of targets
            pi = pi0[b, a, gj, gi]  # predictions at the positions matched to targets
            tconf[b, a, gj, gi] = 1  # conf
            # pi[..., 2:4] = torch.sigmoid(pi[..., 2:4])  # wh power loss (uncomment)

            lxy += (k * h['xy']) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i])  # xy loss
            lwh += (k * h['wh']) * MSE(pi[..., 2:4], twh[i])  # wh yolo loss
            lcls += (k * h['cls']) * CE(pi[..., 5:], tcls[i])  # class_conf loss

        # pos_weight = ft([gp[i] / min(gp) * 4.])
        # BCE = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        lconf += (k * h['conf']) * BCE(pi0[..., 4], tconf)  # obj_conf loss
    loss = lxy + lwh + lconf + lcls

    return loss, torch.cat((lxy, lwh, lconf, lcls, loss)).detach()
Now let's look at the code in detail.
txy, twh, tcls, indices = build_targets(model, targets): build_targets returns the target tensors we need, so let's see what the function actually does.
def build_targets(model, targets):
    # targets = [image index, class, x (normalized center), y, w (normalized width), h]
    # e.g. [0.00000, 20.00000, 0.72913, 0.48770, 0.13595, 0.08381]
    iou_thres = model.hyp['iou_t']  # hyperparameter
    if type(model) in (nn.parallel.DataParallel, nn.parallel.DistributedDataParallel):
        model = model.module

    nt = len(targets)
    txy, twh, tcls, indices = [], [], [], []
    for i in model.yolo_layers:
        layer = model.module_list[i][0]  # layer -> YOLOLayer()

        # iou of targets-anchors
        t, a = targets, []
        gwh = targets[:, 4:6] * layer.ng  # layer.ng is the grid size (13/26/52); this maps the normalized wh onto that grid
        if nt:
            iou = [wh_iou(x, gwh) for x in layer.anchor_vec]  # anchor_vec holds the anchor boxes' wh in grid units
            iou, a = torch.stack(iou, 0).max(0)  # best iou and anchor index a for each label

            # reject below threshold ious (OPTIONAL, increases P, lowers R)
            reject = True
            if reject:
                j = iou > iou_thres
                t, a, gwh = targets[j], a[j], gwh[j]

        # Indices; recall targets = [image, class, x, y, w, h] (normalized)
        b, c = t[:, :2].long().t()  # target image, class
        gxy = t[:, 2:4] * layer.ng
        gi, gj = gxy.long().t()  # grid_i, grid_j
        indices.append((b, a, gj, gi))

        # XY coordinates: in YOLOv3 this is Gx, Gy minus the grid cell's top-left corner Cx, Cy
        txy.append(gxy - gxy.floor())

        # Width and height
        twh.append(torch.log(gwh / layer.anchor_vec[a]))  # wh yolo method
        # twh.append((gwh / layer.anchor_vec[a]) ** (1 / 3) / 2)  # wh power method

        # Class
        tcls.append(c)
        if c.shape[0]:
            assert c.max() <= layer.nc, 'Target classes exceed model classes'

    return txy, twh, tcls, indices
build_targets(model, targets) takes two arguments: the model, and targets, the resized label tensor read from the label files, laid out as [image index, class, x (normalized center), y (normalized center), w (normalized width), h (normalized height)]. iou_thres is a hyperparameter we set ourselves; its role is explained below.
for i in model.yolo_layers: there are three YOLO layers and i is the layer index. targets[:, 4:6] holds the normalized width and height of each ground truth, so gwh maps them back onto the 13*13 (or 26*26 / 52*52) feature map.
iou = [wh_iou(x, gwh) for x in layer.anchor_vec]: anchor_vec holds the anchor boxes' width and height, so we compute the IoU between each of the layer's three anchor boxes and every ground truth, keep the anchor with the largest IoU (recording its index in a), and discard matches whose IoU falls below iou_thres. b records the image index, c the class, gxy the box center on the feature map, and gi, gj the grid cell responsible for that ground truth. (A sketch of what a wh-only IoU looks like follows below.)
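wh_iou compares only widths and heights, i.e. it computes the IoU as if both boxes shared the same center. A minimal sketch of such a function (the version in the repo's utils may differ in details):

import torch

def wh_iou(anchor_wh, gwh):
    # IoU between one anchor (w, h) and n ground-truth (w, h) rows, ignoring box centers
    w1, h1 = anchor_wh[0], anchor_wh[1]
    w2, h2 = gwh[:, 0], gwh[:, 1]
    inter = torch.min(w1, w2) * torch.min(h1, h2)   # overlap when both boxes are center-aligned
    union = w1 * h1 + w2 * h2 - inter + 1e-16
    return inter / union                            # shape (n,), one IoU per ground truth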
(The figure here shows the YOLOv3 box regression formulas: bx = sigmoid(tx) + Cx, by = sigmoid(ty) + Cy, bw = Pw * e^tw, bh = Ph * e^th, where Cx, Cy is the grid cell's top-left corner and Pw, Ph are the anchor priors.)
txy = gxy - gxy.floor(): in YOLOv3 terms, this is the box center Gx, Gy minus the grid cell's top-left corner Cx, Cy.
twh = torch.log(gwh / layer.anchor_vec[a]) follows from inverting the formula above. txy and twh are the regression targets for the loss; they are offsets rather than absolute coordinates, so the predicted offsets we optimize still have to be transformed back before they give the real box (see the sketch below).
tcls = c, where c is simply the class index. With that we understand what every target returned by build_targets means.
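For completeness, a sketch of the inverse transform that turns predicted offsets back into boxes at inference time, following the formulas above (simplified; in the repo this happens inside YOLOLayer.forward when the model is not in training mode):

import torch

def decode(txy, twh, grid_xy, anchor_wh, stride):
    # Map predicted offsets back to boxes in input-image pixels (simplified sketch).
    # grid_xy: cell top-left corners Cx, Cy; anchor_wh: priors Pw, Ph in grid units;
    # stride: pixels per grid cell (32, 16 or 8 for the three scales).
    bxy = torch.sigmoid(txy) + grid_xy              # bx = sigmoid(tx) + Cx, by = sigmoid(ty) + Cy
    bwh = anchor_wh * torch.exp(twh)                # bw = Pw * exp(tw),     bh = Ph * exp(th)
    return torch.cat((bxy, bwh), dim=-1) * stride   # grid units -> pixels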
for i, pi0 in enumerate(p):  # layer i predictions
    b, a, gj, gi = indices[i]  # image, anchor, grid_y, grid_x
    tconf = torch.zeros_like(pi0[..., 0])  # conf

    # Compute losses
    if len(b):  # number of targets
        pi = pi0[b, a, gj, gi]  # predictions at the positions matched to targets
        tconf[b, a, gj, gi] = 1  # conf
        # pi[..., 2:4] = torch.sigmoid(pi[..., 2:4])  # wh power loss (uncomment)

        lxy += (k * h['xy']) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i])  # xy loss
        lwh += (k * h['wh']) * MSE(pi[..., 2:4], twh[i])  # wh yolo loss
        lcls += (k * h['cls']) * CE(pi[..., 5:], tcls[i])  # class_conf loss

    # pos_weight = ft([gp[i] / min(gp) * 4.])
    # BCE = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    lconf += (k * h['conf']) * BCE(pi0[..., 4], tconf)  # obj_conf loss
loss = lxy + lwh + lconf + lcls
for i, pi0 in enumerate(p): p is the list of feature maps returned by the YOLO layers, i.e. p = [(bs, 3, 13, 13, 85), (bs, 3, 26, 26, 85), (bs, 3, 52, 52, 85)], and pi0 is the prediction at scale i.
b, a, gj, gi = indices[i]: read the best anchor and the grid coordinates from the targets we built. Since each ground truth is matched to exactly one anchor box, these indices are what locate that anchor box inside the prediction tensor.
pi = pi0[b, a, gj, gi]: pi0 has shape (bs, 3, 13, 13, 85); this gathers the entries of pi0 that correspond to the ground truths (a tiny indexing example follows below).
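pi0[b, a, gj, gi] relies on PyTorch advanced indexing: b, a, gj, gi are equal-length index tensors, so the result contains one 85-dimensional prediction per matched target. A tiny self-contained example:

import torch

pi0 = torch.randn(4, 3, 13, 13, 85)   # (bs, anchors, grid_y, grid_x, 85)
b  = torch.tensor([0, 2])             # image indices of two matched targets
a  = torch.tensor([1, 0])             # their best-anchor indices
gj = torch.tensor([6, 9])             # grid rows
gi = torch.tensor([7, 3])             # grid columns

pi = pi0[b, a, gj, gi]
print(pi.shape)                       # torch.Size([2, 85]): one prediction per target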
tconf[b, a, gj, gi] = 1: set the objectness target to 1 at the matched positions; everywhere else it stays 0.
lxy += (k * h['xy']) * MSE(torch.sigmoid(pi[..., 0:2]), txy[i]): anchor boxes matched to a ground truth contribute coordinate loss, confidence loss (target 1) and classification loss, while unmatched anchor boxes contribute only confidence loss (target 0). Here pi holds only the predictions of the matched anchor boxes. The sigmoid constrains xy to (0, 1), because the center must fall inside its grid cell. MSE is the mean squared error.
lwh += (k * h['wh']) * MSE(pi[..., 2:4], twh[i]): same idea, for width and height.
lcls += (k * h['cls']) * CE(pi[..., 5:], tcls[i]): the classification loss is computed only for anchor boxes matched to a ground truth. CE is the cross-entropy loss.
lconf += (k * h['conf']) * BCE(pi0[..., 4], tconf): tconf is 1 at positions matched to a ground truth and 0 everywhere else, and this objectness loss runs over the whole feature map pi0, not just the matched positions.
That is how the loss is obtained. Some parts may not be described perfectly, and I may have misunderstood a few details; discussion is welcome.