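Source: What are some clever tricks for semantic segmentation? (有關語義分割的奇技淫巧有哪些?) - Zhihu
Author: AlexL
The author's GitHub project: liaopeiyuan/ml-arsenal-public.
The project holds the source code for every Kaggle competition the author has taken part in. It currently includes two top 1% solutions: TGS Salt and Quick Draw Doodle.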
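1. How to optimize IoU
In segmentation we sometimes use IoU (intersection over union) to measure how well a model performs. For a predicted region X and a ground-truth region Y it is defined as:
IoU = |X ∩ Y| / |X ∪ Y|
With this definition in hand, we can decide, for example, that a predicted instance matched against an actual instance counts as a positive when their IoU is greater than 0.5.
On top of that, one can build broader metrics such as F1 and F2.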
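As a concrete illustration of the definition and the 0.5 threshold, here is a minimal NumPy sketch. The helper names `mask_iou`, `is_positive` and `f_beta` are made up for this example and do not come from the original answer; treating two empty masks as a perfect match is also an assumption.

```python
import numpy as np


def mask_iou(pred, truth):
    """IoU of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return np.logical_and(pred, truth).sum() / union


def is_positive(pred, truth, threshold=0.5):
    """A predicted instance counts as a hit when its IoU exceeds the threshold."""
    return mask_iou(pred, truth) > threshold


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from counts of hits, false alarms and misses."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True   # 4 ground-truth pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True    # 6 predicted pixels, 4 of them overlap

print(mask_iou(pred, truth))     # 4 / 6
print(is_positive(pred, truth))  # True
print(f_beta(tp=3, fp=1, fn=2, beta=2))  # 0.625
```

Counting hits at a fixed threshold and then feeding the counts into F1 or F2 is one common way such competition metrics are assembled; the exact recipe varies by task.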
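So how do we go about optimizing IoU?
Take binary classification as an example. For a baseline, start by throwing in binary cross-entropy and see how it does, which gives the following implementation (PyTorch):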
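import torch
import torch.nn as nn


class BCELoss2d(nn.Module):
    """Binary cross-entropy over all pixels, computed from raw logits."""

    def __init__(self, weight=None, reduction='mean'):
        super().__init__()
        # `reduction` replaces the deprecated `size_average` argument.
        self.bce_loss = nn.BCELoss(weight=weight, reduction=reduction)

    def forward(self, logits, targets):
        probs = torch.sigmoid(logits)  # F.sigmoid is deprecated
        probs_flat = probs.view(-1)
        targets_flat = targets.view(-1)
        return self.bce_loss(probs_flat, targets_flat)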
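The problem, however, is that optimizing BCE is not equivalent to optimizing IoU.
Reference paper: The Lovasz-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks - 2018
Paper implementation: GitHub - LovaszSoftmax
Intuitively, the pixels in a minibatch do not all carry the same weight. Take two images: one has 1000 positive pixels and the other has only 4. A single pixel of the second image costs as much IoU as 250 pixels of the first.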
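The arithmetic behind that comparison can be checked directly. This sketch assumes the simplest case, where the prediction matches the ground truth except for some missed positive pixels; the function name is illustrative only.

```python
def iou_after_missing(num_positive, num_missed):
    """IoU when the prediction equals the ground truth except that
    `num_missed` positive pixels are predicted as background."""
    intersection = num_positive - num_missed
    union = num_positive  # no false positives, so the union is the ground truth
    return intersection / union


drop_large = 1 - iou_after_missing(1000, 1)  # one pixel out of 1000
drop_small = 1 - iou_after_missing(4, 1)     # one pixel out of 4

print(drop_large, drop_small)
print(drop_small / drop_large)  # roughly 250
```

BCE averages over pixels and would charge both mistakes the same amount, which is one way to see why it is a poor proxy for IoU.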
Can we optimize IoU directly, then?
Yes, but it is certainly not optimal:
from keras import backend as K


def iou_coef(y_true, y_pred, smooth=1):
    """
    IoU = (|X and Y|) / (|X or Y|)
    """
    intersection = K.sum(K.abs(y_true * y_pred), axis=-1)
    # |X or Y| = |X| + |Y| - |X and Y|
    union = K.sum(y_true, -1) + K.sum(y_pred, -1) - intersection
    return (intersection + smooth) / (union + smooth)


def iou_coef_loss(y_true, y_pred):
    return -iou_coef(y_true, y_pred)
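This time the problem is unstable training. As a model goes from bad to good, we want the loss/metric that supervises it to change smoothly, and plugging in IoU by brute force clearly does not give us that.
Hence LovaszSoftmax. As for exactly why this loss is better than BCE/Jaccard, I would rather not make claims I cannot back up, but in my own experience it works remarkably well.
There is also one interesting detail.
Take this line from the original implementation: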
loss = torch.dot(F.relu(errors_sorted), Variable(grad))
If you replace relu with elu+1, the result is sometimes better. The author's guess is that this may be because elu+1 is smoother than relu.
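To make the relu versus elu+1 swap concrete, here is a NumPy re-expression of the flat binary Lovasz hinge. It follows the structure of the reference implementation (sort the hinge errors, then take a dot product with the Lovasz gradient), but it is a sketch for illustration rather than the author's training code. The `activation` parameter and the helper names `relu` and `elu_plus_one` are assumptions introduced here.

```python
import numpy as np


def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss,
    for labels sorted by decreasing error."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard


def relu(x):
    return np.maximum(x, 0.0)


def elu_plus_one(x):
    # elu(x) + 1 equals x + 1 for x > 0 and exp(x) otherwise, so it is never zero.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def lovasz_hinge_flat(logits, labels, activation=relu):
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    signs = 2.0 * labels - 1.0
    errors = 1.0 - logits * signs   # hinge errors
    order = np.argsort(-errors)     # largest error first
    grad = lovasz_grad(labels[order])
    return float(np.dot(activation(errors[order]), grad))


# Three confidently correct pixels: every hinge error is -4.
logits, labels = [5.0, -5.0, 5.0], [1, 0, 1]
print(lovasz_hinge_flat(logits, labels, relu))          # 0.0
print(lovasz_hinge_flat(logits, labels, elu_plus_one))  # exp(-4), small but non-zero
```

With relu, pixels that already satisfy the margin contribute nothing at all, while elu+1 keeps a small, smoothly decaying contribution from them. That is consistent with the author's guess about smoothness, though it does not by itself explain the observed gains.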
1.1. If you don't mind the training time
Try this:
def symmetric_lovasz(outputs, targets):
    return (lovasz_hinge(outputs, targets) +
            lovasz_hinge(-outputs, 1 - targets)) / 2
1.2. If the model can't cope with hard examples
Add this after your loss:
import torch.nn.functional as F


def focal_loss(self, output, target, alpha, gamma, OHEM_percent):
    output = output.contiguous().view(-1)
    target = target.contiguous().view(-1)

    # Numerically stable binary cross-entropy computed from logits.
    max_val = (-output).clamp(min=0)
    loss = output - output * target + max_val + ((-max_val).exp() + (-output - max_val).exp()).log()

    # With p = sigmoid(output), this is log(1 - p) if y is 1 and log(p) if y is 0.
    invprobs = F.logsigmoid(-output * (target * 2 - 1))
    focal_loss = alpha * (invprobs * gamma).exp() * loss

    # Online Hard Example Mining: top x% losses (pixel-wise).
    # Refer to http://www.robots.ox.ac.uk/~tvg/publications/2017/0026.pdf
    OHEM, _ = focal_loss.topk(k=int(OHEM_percent * focal_loss.shape[0]))
    return OHEM.mean()
2. Hacking U-Net
The original U-Net (Keras):
from keras.layers import (Activation, BatchNormalization, Conv2D,
                          Conv2DTranspose, Input, MaxPooling2D,
                          SpatialDropout2D, concatenate)
from keras.models import Model


def conv_block(neurons, block_input, bn=False, dropout=None):
    # Returns the block output and the shortcut to use in the uppooling blocks.
    conv1 = Conv2D(neurons, (3,3), padding='same',
                   kernel_initializer='glorot_normal')(block_input)
    if bn:
        conv1 = BatchNormalization()(conv1)
    conv1 = Activation('relu')(conv1)
    if dropout is not None:
        conv1 = SpatialDropout2D(dropout)(conv1)
    conv2 = Conv2D(neurons, (3,3), padding='same',
                   kernel_initializer='glorot_normal')(conv1)
    if bn:
        conv2 = BatchNormalization()(conv2)
    conv2 = Activation('relu')(conv2)
    if dropout is not None:
        conv2 = SpatialDropout2D(dropout)(conv2)
    pool = MaxPooling2D((2,2))(conv2)
    return pool, conv2
def middle_block(neurons, block_input, bn=False, dropout=None):
    conv1 = Conv2D(neurons, (3,3), padding='same',
                   kernel_initializer='glorot_normal')(block_input)
    if bn:
        conv1 = BatchNormalization()(conv1)
    conv1 = Activation('relu')(conv1)
    if dropout is not None:
        conv1 = SpatialDropout2D(dropout)(conv1)
    conv2 = Conv2D(neurons, (3,3), padding='same',
                   kernel_initializer='glorot_normal')(conv1)
    if bn:
        conv2 = BatchNormalization()(conv2)
    conv2 = Activation('relu')(conv2)
    if dropout is not None:
        conv2 = SpatialDropout2D(dropout)(conv2)
    return conv2
def deconv_block(neurons, block_input, shortcut, bn=False, dropout=None):
    deconv = Conv2DTranspose(neurons,
                             (3, 3),
                             strides=(2, 2),
                             padding="same")(block_input)
    uconv = concatenate([deconv, shortcut])
    uconv = Conv2D(neurons, (3, 3), padding="same",
                   kernel_initializer='glorot_normal')(uconv)
    if bn:
        uconv = BatchNormalization()(uconv)
    uconv = Activation('relu')(uconv)
    if dropout is not None:
        uconv = SpatialDropout2D(dropout)(uconv)
    uconv = Conv2D(neurons, (3, 3), padding="same",
                   kernel_initializer='glorot_normal')(uconv)
    if bn:
        uconv = BatchNormalization()(uconv)
    uconv = Activation('relu')(uconv)
    if dropout is not None:
        uconv = SpatialDropout2D(dropout)(uconv)
    return uconv
def build_model(start_neurons, bn=False, dropout=None):
    input_layer = Input((128, 128, 1))
    # 128 -> 64
    conv1, shortcut1 = conv_block(start_neurons, input_layer, bn, dropout)
    # 64 -> 32
    conv2, shortcut2 = conv_block(start_neurons * 2, conv1, bn, dropout)
    # 32 -> 16
    conv3, shortcut3 = conv_block(start_neurons * 4, conv2, bn, dropout)
    # 16 -> 8
    conv4, shortcut4 = conv_block(start_neurons * 8, conv3, bn, dropout)
    # Middle
    convm = middle_block(start_neurons * 16, conv4, bn, dropout)
    # 8 -> 16
    deconv4 = deconv_block(start_neurons * 8, convm, shortcut4, bn, dropout)
    # 16 -> 32
    deconv3 = deconv_block(start_neurons * 4, deconv4, shortcut3, bn, dropout)
    # 32 -> 64
    deconv2 = deconv_block(start_neurons * 2, deconv3, shortcut2, bn, dropout)
    # 64 -> 128
    deconv1 = deconv_block(start_neurons, deconv2, shortcut1, bn, dropout)
    #uconv1 = Dropout(0.5)(uconv1)
    output_layer = Conv2D(1, (1,1), padding="same", activation="sigmoid")(deconv1)
    model = Model(input_layer, output_layer)
    return model
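In general, though, rather than a transposed convolution we prefer upsampling followed by a 3*3 conv. For the reasons, see this article: Deconvolution and Checkerboard Artifacts (Distill is strongly recommended; the quality of its posts is remarkably high).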
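The effect described in that article can be reproduced in one dimension with a few lines of NumPy. This is a toy sketch with all-ones inputs and kernels, chosen only to expose the overlap pattern; the function names are illustrative and this is not how a real layer is implemented.

```python
import numpy as np


def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution (no padding)."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        # Each input element stamps a scaled copy of the kernel into the output.
        out[i * stride:i * stride + k] += value * kernel
    return out


def upsample_then_conv1d(x, kernel, scale):
    """Nearest-neighbour upsampling followed by an ordinary convolution."""
    upsampled = np.repeat(x, scale)
    return np.convolve(upsampled, kernel, mode='valid')


x = np.ones(4)
kernel = np.ones(3)
print(transposed_conv1d(x, kernel, stride=2))     # [1. 1. 2. 1. 2. 1. 2. 1. 1.]
print(upsample_then_conv1d(x, kernel, scale=2))   # [3. 3. 3. 3. 3. 3.]
```

With kernel size 3 and stride 2 the stamped kernels overlap unevenly, so a constant input already produces an alternating output; in 2-D the same pattern shows up as a checkerboard. Upsampling first and convolving afterwards covers every output position equally.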
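Going a step further: in a real project there is often not that much training resource available, so we need a way to embed pretrained classification models into the U-Net.
The trick in replacing the encoder with a pretrained model lies in how to pull out the information the pretrained model extracts at different scales, and how to attach it efficiently to the decoder.
Models commonly used for this kind of grafting include Inception and MobileNet, but here we will look at the more intuitive ResNet/ResNeXt family: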
def forward(self, x):
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), -1)
    x = self.fc(x)
    return x
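It is easy to see that the feature maps at different scales are extracted by different layers, so we can pick a few of them to concat, upsample, and conv. The one thing to watch out for is never to concat feature maps from mismatched scales, otherwise the final output may end up a different size from the input image.
Below is one workable way to wire it up. To raise the resolution of every feature map, the author removed the pool that follows conv1 in the original ResNet: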
import torch
import torch.nn as nn
from torchvision import models

# ConvBn2d is not defined in this post; Decoder is listed further below.


def __init__(self):
    super().__init__()
    self.resnet = models.resnet34(pretrained=True)
    # The ResNet stem without its max-pool.
    self.conv1 = nn.Sequential(
        self.resnet.conv1,
        self.resnet.bn1,
        self.resnet.relu,
    )
    self.encoder2 = self.resnet.layer1  # 64
    self.encoder3 = self.resnet.layer2  # 128
    self.encoder4 = self.resnet.layer3  # 256
    self.encoder5 = self.resnet.layer4  # 512
    self.center = nn.Sequential(
        ConvBn2d(512, 512, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        ConvBn2d(512, 256, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )
    self.decoder5 = Decoder(256 + 512, 512, 64)
    self.decoder4 = Decoder(64 + 256, 256, 64)
    self.decoder3 = Decoder(64 + 128, 128, 64)
    self.decoder2 = Decoder(64 + 64, 64, 64)
    self.decoder1 = Decoder(64, 32, 64)
    # 384 = 6 hypercolumn inputs x 64 channels each.
    self.logit = nn.Sequential(
        nn.Conv2d(384, 64, kernel_size=3, padding=1),
        nn.ELU(inplace=True),
        nn.Conv2d(64, 1, kernel_size=1, padding=0),
    )
def forward(self, x):
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]
    # Replicate the single-channel input into three normalized channels.
    x = torch.cat([
        (x - mean[2]) / std[2],
        (x - mean[1]) / std[1],
        (x - mean[0]) / std[0],
    ], 1)
    e1 = self.conv1(x)
    e2 = self.encoder2(e1)
    e3 = self.encoder3(e2)
    e4 = self.encoder4(e3)
    e5 = self.encoder5(e4)
    f = self.center(e5)
    d5 = self.decoder5(f, e5)
    d4 = self.decoder4(d5, e4)
    d3 = self.decoder3(d4, e3)
    d2 = self.decoder2(d3, e2)
    d1 = self.decoder1(d2)
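The warning about misaligned concatenation can be checked with simple bookkeeping before building anything. The sketch below traces the spatial size of each tensor in the forward pass above, assuming a 128 x 128 input (an assumption; the post does not state the PyTorch model's input size) and torchvision's ResNet-34 strides (stride 2 in conv1, stride 1 in layer1, stride 2 in layer2 to layer4). `trace_sizes` is a hypothetical helper, not part of the author's code.

```python
def trace_sizes(input_size=128):
    """Spatial size (height = width) of each tensor in the forward pass."""
    sizes = {}
    sizes['e1'] = input_size // 2    # conv1: stride-2 convolution, max-pool removed
    sizes['e2'] = sizes['e1']        # layer1 keeps the resolution
    sizes['e3'] = sizes['e2'] // 2   # layer2
    sizes['e4'] = sizes['e3'] // 2   # layer3
    sizes['e5'] = sizes['e4'] // 2   # layer4
    sizes['f'] = sizes['e5'] // 2    # center ends with a 2x2 max-pool
    previous = 'f'
    pairs = [('d5', 'e5'), ('d4', 'e4'), ('d3', 'e3'), ('d2', 'e2'), ('d1', None)]
    for name, skip in pairs:
        upsampled = sizes[previous] * 2  # each Decoder upsamples by 2 before concat
        if skip is not None:
            assert upsampled == sizes[skip], f'{name}: cannot concat with {skip}'
        sizes[name] = upsampled
        previous = name
    return sizes


sizes = trace_sizes(128)
print(sizes)
```

Under these assumptions every decoder input lines up with its skip connection, and d1 comes back at the input resolution. The same table tells you which scale factor each feature map needs if you later want to bring them all to one size.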
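As for designing the decoder, there are two more small tricks worth borrowing.
The first is Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks - 2018. It can be understood as a form of attention that recalibrates the feature maps with very few parameters. See the paper for details; for the implementation, the following PyTorch code can serve as a reference: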
import torch
import torch.nn as nn
import torch.nn.functional as F


class sSE(nn.Module):
    """Spatial gate: one attention value per pixel."""

    def __init__(self, out_channels):
        super(sSE, self).__init__()
        self.conv = ConvBn2d(in_channels=out_channels,
                             out_channels=1,
                             kernel_size=1,
                             padding=0)

    def forward(self, x):
        x = self.conv(x)
        #print('spatial',x.size())
        x = torch.sigmoid(x)  # F.sigmoid is deprecated
        return x


class cSE(nn.Module):
    """Channel gate: one attention value per channel."""

    def __init__(self, out_channels):
        super(cSE, self).__init__()
        self.conv1 = ConvBn2d(in_channels=out_channels,
                              out_channels=int(out_channels/2),
                              kernel_size=1,
                              padding=0)
        self.conv2 = ConvBn2d(in_channels=int(out_channels/2),
                              out_channels=out_channels,
                              kernel_size=1,
                              padding=0)

    def forward(self, x):
        # Global average pooling over the full spatial extent.
        x = nn.AvgPool2d(x.size()[2:])(x)
        #print('channel',x.size())
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = torch.sigmoid(x)
        return x


class Decoder(nn.Module):
    def __init__(self, in_channels, channels, out_channels):
        super(Decoder, self).__init__()
        self.conv1 = ConvBn2d(in_channels, channels,
                              kernel_size=3, padding=1)
        self.conv2 = ConvBn2d(channels, out_channels,
                              kernel_size=3, padding=1)
        self.spatial_gate = sSE(out_channels)
        self.channel_gate = cSE(out_channels)

    def forward(self, x, e=None):
        # F.interpolate replaces the deprecated F.upsample.
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
        #print('x',x.size())
        #print('e',e.size())
        if e is not None:
            x = torch.cat([x, e], 1)
        x = F.relu(self.conv1(x), inplace=True)
        x = F.relu(self.conv2(x), inplace=True)
        #print('x_new',x.size())
        g1 = self.spatial_gate(x)
        #print('g1',g1.size())
        g2 = self.channel_gate(x)
        #print('g2',g2.size())
        x = g1*x + g2*x
        return x
The second: to further encourage robustness across scales, we can bring in hypercolumns and directly concatenate the feature maps from every scale:
f = torch.cat((
    F.interpolate(e1, scale_factor=2,
                  mode='bilinear', align_corners=False),
    d1,
    F.interpolate(d2, scale_factor=2,
                  mode='bilinear', align_corners=False),
    F.interpolate(d3, scale_factor=4,
                  mode='bilinear', align_corners=False),
    F.interpolate(d4, scale_factor=8,
                  mode='bilinear', align_corners=False),
    F.interpolate(d5, scale_factor=16,
                  mode='bilinear', align_corners=False),
), 1)
# F.dropout2d defaults to training=True, so pass the module's mode
# explicitly to switch dropout off during evaluation.
f = F.dropout2d(f, p=0.50, training=self.training)
logit = self.logit(f)
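An even more striking approach is to compare the feature map at each scale directly against a downsized ground truth to compute a loss, and then take a weighted average of the losses across scales. For details see the discussion here: Deep semi-supervised learning | Kaggle. I won't go into it further.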
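One possible reading of that idea, sketched in NumPy. The source does not say how the ground truth is downsized, which per-scale loss is used, or how the weights are chosen, so average pooling, BCE, and the example weights below are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def downsize_mask(mask, factor):
    """Average-pool a (H, W) mask by an integer factor (assumed downsizing rule)."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and (soft) targets."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(-(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean())


def multi_scale_loss(probs_by_factor, gt, weights):
    """Weighted average of per-scale losses.

    probs_by_factor maps a downscale factor to the prediction at that scale."""
    total = sum(weights[f] * bce(p, downsize_mask(gt, f))
                for f, p in probs_by_factor.items())
    return total / sum(weights[f] for f in probs_by_factor)


gt = np.zeros((8, 8))
gt[:4, :] = 1.0  # top half is foreground

# Hypothetical predictions at full, 1/2 and 1/4 resolution.
preds = {1: np.full((8, 8), 0.5), 2: np.full((4, 4), 0.5), 4: np.full((2, 2), 0.5)}
weights = {1: 1.0, 2: 0.5, 4: 0.25}
print(multi_scale_loss(preds, gt, weights))  # log(2), since every prediction is 0.5
```

In a real network each entry of `preds` would come from a small prediction head attached to the decoder output at that scale.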
3. Training
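Training, I think, really is case by case: heuristics that work on task A can do worse on task B. So I will introduce just one trick that works in most situations: Cosine Annealing w. Snapshot Ensemble
It sounds pretty fancy, but in practice it simply warm-restarts the learning rate at regular intervals. In the same amount of time you then get several converged local minima instead of one, so you have many more models on hand for ensembling.
Here are a few figures to give a feel for it:
The implementation is actually quite simple: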
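import numpy as np

CYCLE = 8000
LR_INIT = 0.1
LR_MIN = 0.001
# Cosine annealing from LR_INIT down to LR_MIN, restarting every CYCLE iterations.
scheduler = lambda x: ((LR_INIT - LR_MIN) / 2) * (np.cos(np.pi * (np.mod(x - 1, CYCLE) / CYCLE)) + 1) + LR_MIN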
Then just call scheduler(iteration) every batch/epoch to update the learning rate.
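A sketch of how the schedule and the snapshots might fit together, using only the standard library. `cosine_lr` restates the same formula with `math` instead of NumPy; `is_snapshot_step` and `set_lr` are hypothetical helpers, and `set_lr` assumes an optimizer that exposes PyTorch-style `param_groups`. Saving a snapshot at the end of each cycle, where the learning rate is lowest, is the usual snapshot-ensemble convention rather than something the post spells out.

```python
import math

CYCLE = 8000
LR_INIT = 0.1
LR_MIN = 0.001


def cosine_lr(iteration):
    """Cosine-annealed learning rate for 1-based iteration counts."""
    phase = ((iteration - 1) % CYCLE) / CYCLE
    return (LR_INIT - LR_MIN) / 2 * (math.cos(math.pi * phase) + 1) + LR_MIN


def is_snapshot_step(iteration):
    """True on the last iteration of a cycle, where the learning rate is lowest."""
    return iteration % CYCLE == 0


def set_lr(optimizer, lr):
    # Assumes PyTorch-style optimizers, whose param_groups is a list of dicts.
    for group in optimizer.param_groups:
        group['lr'] = lr


for it in (1, CYCLE // 2 + 1, CYCLE, CYCLE + 1):
    print(it, cosine_lr(it), is_snapshot_step(it))
```

In a training loop you would call `set_lr(optimizer, cosine_lr(it))` before each step and save a checkpoint whenever `is_snapshot_step(it)` is true; the saved checkpoints are then the models to ensemble.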
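4. Other small tricks (updated over time)
What comes to mind for now is the first-place solution of DSB2018. Rather than using Mask R-CNN for instance segmentation, they had a U-Net generate class probability maps and then used watershed to carefully separate instances that sit close together. In the end they finished with a clear lead over second place. It has to be said that, sometimes, studying the data and distilling the key insight pays off more than studying models.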
Last modified: January 16, 2019, 11:44 AM