VoxCeleb Speaker Recognition Challenge

VoxSRC news:

The 2020 VoxCeleb Speaker Recognition Challenge (VoxSRC) will be held in Shanghai on 30 October 2020, in conjunction with the Interspeech conference.

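Contents

VoxCeleb Speaker Recognition Challenge
- Abstract
- VoxSRC
- Metric learning and encoders
- High-dimensional data visualization with t-SNE
- The NSML deep learning platform
- References: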

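Abstract

“Speaker recognition in the wild” is a very challenging task: the speech contains many sources of uncertainty, such as complex noise, background sound at varying levels, and short bursts of laughter. Working from the corpus provided by VoxSRC and the experimental results reported for various models, three directions look promising for improving recognition performance: finding a suitable utterance-level encoder, designing a sound metric learning model, and analyzing which properties of the data degrade performance. This article summarizes the experimental results provided by VoxSRC and the related papers, and closes with an outlook.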

VoxSRC

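The 2020 VoxCeleb Speaker Recognition Challenge (VoxSRC) aims to study how well existing speaker recognition methods perform on speech data collected “in the wild”. The challenge provides a speech corpus drawn from YouTube interview videos of celebrities. Compared with traditional telephone or microphone speech, this kind of dataset contains more interference and uncertainty.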

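The challenge consists of three tasks:

Supervised speaker verification with fixed training data (Fixed-Full): the VoxCeleb2 dev set is the training data;
Supervised speaker verification with unrestricted training data (Open-Full): any dataset other than the VoxSRC test data may be used for training;
Self-supervised speaker verification with fixed training data (Fixed-Self): the VoxCeleb2 dev set is the training data, but speaker labels may not be used. Other labels, such as cross-modal visual frames, are allowed, while pretrained models of any modality are not.

The organizers provide a supervised speaker verification baseline for tasks 1 and 2, and a self-supervised speaker verification baseline for task 3.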

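The three task settings make the organizers' intent fairly clear. With the evaluation data held fixed:

In task 1, the training set is fixed, so the task is about designing the best learning algorithm;
In task 2, the training set is open, so besides designing a sound learning algorithm, participants must choose training data that improves performance on the evaluation data. The task is therefore about cross-domain knowledge transfer;
In task 3, the training set is fixed and has no speaker labels, but cross-domain labels are available, so the task is about cross-task knowledge transfer.

From this analysis, the three tasks build on one another and become progressively more complex. Learning-method design, transfer learning, and cross-domain/cross-task methods should all help in addressing them.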

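Metric learning and encoders

The paper Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System examines how several (utterance-level) encoders and several loss functions affect speaker recognition performance. The encoders are temporal average pooling (TAP), self-attentive pooling (SAP), and learnable dictionary encoding (LDE); the loss functions are Softmax, Center, and angular softmax (ASoftmax). These encoders and loss functions are integrated into an end-to-end model and evaluated on the VoxCeleb1 dataset. With cosine similarity as the scoring function, the configurations that reach an EER of 4.90% or lower rank as follows (EER in %, lower is better):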

LDE-ASoftmax (4.56) > TAP-Center (4.75) > SAP-ASoftmax (4.90)。
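
To make the difference between the two simpler encoders concrete, here is a minimal NumPy sketch of TAP and SAP. It follows the usual formulation of self-attentive pooling (a tanh projection followed by a softmax over frames); the parameter names W, b, u and all sizes are illustrative assumptions, and in a real system these parameters would be learned rather than drawn at random. LDE is left out to keep the sketch short.

```python
import numpy as np

def temporal_average_pooling(frames):
    """TAP: average frame-level features (T, D) into one utterance vector (D,)."""
    return frames.mean(axis=0)

def self_attentive_pooling(frames, W, b, u):
    """SAP: weighted average of the frames, with weights from a small attention net.

    h_t = tanh(W x_t + b), w_t = softmax_t(h_t . u), output = sum_t w_t x_t.
    W, b and u are hypothetical parameters passed in as plain arrays.
    """
    hidden = np.tanh(frames @ W.T + b)        # (T, H)
    scores = hidden @ u                       # (T,)
    scores = scores - scores.max()            # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ frames, weights

# Toy input: 50 frames of 8-dimensional features (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
W = rng.normal(size=(4, 8))
b = np.zeros(4)
u = rng.normal(size=4)

tap_vec = temporal_average_pooling(frames)
sap_vec, sap_weights = self_attentive_pooling(frames, W, b, u)
print(tap_vec.shape, sap_vec.shape)
```

When the attention vector u is all zeros, every frame gets the same weight and SAP reduces to TAP, which is a convenient sanity check.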

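The paper In defence of metric learning for speaker recognition examines how a range of loss functions (both classification losses and metric learning losses) affect CNN-based learning. It evaluates a VGG-M-40 model and a Thin ResNet-34 model on the VoxCeleb datasets, using an evaluation protocol consistent with VoxSRC task 1 (Fixed-Full). The loss functions are:

Classification objectives: Softmax, AM-Softmax (CosFace), and AAM-Softmax (ArcFace);
Metric learning objectives: Triplet, Prototypical, Generalised end-to-end (GE2E), and Angular Prototypical.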
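
The two margin-based classification objectives differ only in where the margin is applied to the target-class cosine. The sketch below is a rough NumPy illustration, not the paper's code: the function names are made up for this example, and the margin (0.2) and scale (30) are placeholder values. It also ignores the special handling that practical AAM-Softmax implementations add when the angle plus the margin exceeds pi.

```python
import numpy as np

def margin_logits(embeddings, class_weights, labels, kind, margin=0.2, scale=30.0):
    """Scaled cosine logits with a margin on the target class.

    kind="am":  scale * (cos(theta) - margin)   (AM-Softmax / CosFace)
    kind="aam": scale * cos(theta + margin)     (AAM-Softmax / ArcFace)
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(x @ w.T, -1.0, 1.0)         # (N, C) cosine similarities
    rows = np.arange(len(labels))
    target = cos[rows, labels]
    if kind == "am":
        target = target - margin
    elif kind == "aam":
        target = np.cos(np.arccos(target) + margin)
    else:
        raise ValueError(f"unknown kind: {kind}")
    logits = cos.copy()
    logits[rows, labels] = target             # only the target logit changes
    return scale * logits

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy batch: 6 embeddings, 4 speaker classes, 16 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 16))
class_weights = rng.normal(size=(4, 16))
labels = np.array([0, 1, 2, 3, 0, 1])

plain = margin_logits(embeddings, class_weights, labels, "am", margin=0.0)
am = margin_logits(embeddings, class_weights, labels, "am")
aam = margin_logits(embeddings, class_weights, labels, "aam")
print(cross_entropy(plain, labels), cross_entropy(am, labels), cross_entropy(aam, labels))
```

With the margin set to zero both variants fall back to a plain normalized softmax, so the margin is the only thing that makes the target class harder to satisfy during training.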

With the mean of the $\Vert\cdot\Vert$ distances over $10 \times 10$ pairs as the scoring function, the loss functions rank as follows (Thin ResNet-34 only, since VGG-M-40 performs worse here; EER in %, lower is better):

Classification objectives: AAM-Softmax (2.36) > AM-Softmax (2.40) > Softmax (5.82)

Metric learning objectives: Angular Prototypical (2.21) > Prototypical (2.34) > GE2E (2.52) > Triplet (2.53)
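
The scoring rule used for these rankings can be sketched in a few lines. The assumptions here are that each test segment is represented by ten crop embeddings and that the distance is Euclidean; the function name and the toy data are made up for illustration. A smaller score means the two segments are more likely to come from the same speaker.

```python
import numpy as np

def mean_pairwise_distance(crops_a, crops_b):
    """Mean Euclidean distance over all pairs of crop embeddings.

    crops_a, crops_b: arrays of shape (10, D), giving 10 x 10 = 100 distances.
    """
    diffs = crops_a[:, None, :] - crops_b[None, :, :]   # (10, 10, D)
    dists = np.linalg.norm(diffs, axis=-1)               # (10, 10)
    return dists.mean()

# Toy embeddings: two segments from speaker A, one from speaker B.
rng = np.random.default_rng(0)
centroid_a = np.zeros(16)
centroid_b = np.full(16, 5.0)
seg_a1 = centroid_a + 0.1 * rng.normal(size=(10, 16))
seg_a2 = centroid_a + 0.1 * rng.normal(size=(10, 16))
seg_b1 = centroid_b + 0.1 * rng.normal(size=(10, 16))

same_speaker_score = mean_pairwise_distance(seg_a1, seg_a2)
diff_speaker_score = mean_pairwise_distance(seg_a1, seg_b1)
print(same_speaker_score, diff_speaker_score)
```

Averaging over many crop pairs smooths out the effect of any single noisy crop, which matters for in-the-wild recordings.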

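Among the classification objectives, AAM-Softmax is more sensitive to hyperparameters than AM-Softmax, with its EER ranging from 2.36% to 10.55%. Compared with the classification losses, metric learning achieves better performance.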
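
As an illustration of the best-performing metric learning objective, here is a small NumPy sketch of an angular prototypical loss. It assumes a batch of N speakers with M utterances each, uses the last utterance of each speaker as the query and the mean of the others as the centroid, and scores queries against centroids with a scaled, shifted cosine. The scale w and offset b are learnable in a real model; here they are fixed placeholder constants.

```python
import numpy as np

def angular_prototypical_loss(batch, w=10.0, b=-5.0):
    """batch: array of shape (N speakers, M utterances, D dims)."""
    queries = batch[:, -1, :]                     # (N, D) one query per speaker
    centroids = batch[:, :-1, :].mean(axis=1)     # (N, D) mean of the support set
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = w * (q @ c.T) + b                       # (N, N) scaled cosine similarities
    shifted = sim - sim.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Each query should be closest to its own speaker's centroid (the diagonal).
    return -np.mean(np.diag(log_softmax))

# Toy batch: 4 well-separated speakers, 3 utterances each, 8 dimensions.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(4, 8)
separated = centers[:, None, :] + 0.05 * rng.normal(size=(4, 3, 8))

# Same data, but every query is swapped to the wrong speaker.
mismatched = separated.copy()
mismatched[:, -1, :] = np.roll(separated[:, -1, :], 1, axis=0)

loss_separated = angular_prototypical_loss(separated)
loss_mismatched = angular_prototypical_loss(mismatched)
print(loss_separated, loss_mismatched)
```

The loss is close to zero when each query sits next to its own centroid and grows sharply when queries land near another speaker's centroid.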

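In terms of data, using VoxCeleb2 as training data improves results on VoxCeleb1 markedly: from 4.56% EER to 2.21% EER, a relative reduction of roughly 50%. The two figures come from the two papers above, so the models and loss functions differ as well. Still, one may conjecture that adding training data helps the learning algorithm improve.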
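
Since EER is the metric behind every number in this section, a small pure-Python sketch may help. The first function approximates the equal error rate by sweeping a threshold over the observed scores and picking the point where the false accept and false reject rates are closest; evaluation toolkits typically interpolate the curve instead. The second function reproduces the relative reduction quoted above. The function names and the toy trial list are made up for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate EER. Higher score = more likely same speaker.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        frr = sum(s < threshold for s in targets) / len(targets)         # false rejects
        far = sum(s >= threshold for s in nontargets) / len(nontargets)  # false accepts
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

def relative_reduction(baseline_eer, new_eer):
    """Fraction by which the EER dropped relative to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer

# Toy trial list: four target trials followed by four non-target trials.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_eer = equal_error_rate(scores, labels)
print(f"toy EER: {toy_eer:.2%}")                 # toy EER: 25.00%

reduction = relative_reduction(4.56, 2.21)
print(f"relative reduction: {reduction:.1%}")    # relative reduction: 51.5%
```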

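High-dimensional data visualization with t-SNE

Speaker feature representations remain hard to interpret; it is often difficult to understand what the learned speaker features actually look like. The t-SNE visualization method, published in 2008, maps high-dimensional data to a low-dimensional manifold and offers a workable way to visualize speaker representations.

t-SNE projects distances between high-dimensional features onto distances between low-dimensional features, using a probabilistic model to describe the distances between data points. Its learning process resembles an iterative data synthesis procedure. One might boldly imagine that introducing this kind of method directly into speaker modeling could improve the interpretability of speaker features.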
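
The probabilistic model can be sketched as follows: Gaussian affinities in the high-dimensional space, Student-t affinities in the low-dimensional space, and a KL divergence between the two that t-SNE minimizes by moving the low-dimensional points. This is a simplified illustration, not the sklearn implementation: it uses one fixed bandwidth sigma for all points, whereas t-SNE tunes a bandwidth per point from the perplexity, and it only evaluates the objective without optimizing it.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Symmetrized high-dimensional affinities p_ij (fixed sigma is an assumption)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P_cond = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)      # conditional p_{j|i}
    return (P_cond + P_cond.T) / (2 * len(X))        # joint p_ij, sums to 1

def student_t_affinities(Y):
    """Low-dimensional affinities q_ij from a Student-t kernel (1 degree of freedom)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + sq)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity t-SNE minimizes over the low-dimensional layout."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps)))

# Toy data: two clusters of 5 points each in 10 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([0.3 * rng.normal(size=(5, 10)),
               3.0 + 0.3 * rng.normal(size=(5, 10))])

# A 2-D layout that keeps the clusters apart, and one that mixes them up.
Y_good = np.vstack([0.1 * rng.normal(size=(5, 2)),
                    np.array([10.0, 0.0]) + 0.1 * rng.normal(size=(5, 2))])
Y_bad = Y_good[np.array([0, 5, 1, 6, 2, 7, 3, 8, 4, 9])]

P = gaussian_affinities(X)
kl_good = kl_divergence(P, student_t_affinities(Y_good))
kl_bad = kl_divergence(P, student_t_affinities(Y_bad))
print(kl_good, kl_bad)
```

The layout that preserves the cluster structure has a much lower divergence, which is why minimizing this quantity tends to keep neighbors together in the 2-D plot.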

With practicality in mind, the author looked at the sklearn implementation of t-SNE, which provides a handwritten digits example:

from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import manifold, datasets, discriminant_analysis

# Prepare digits dataset
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30

# Scale and visualize the embedding vectors
def plot_embedding(X, title=None, sub_num=111):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    # plt.figure()
    ax = plt.subplot(sub_num)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(y[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

# Plot images of the digits
print("Showing selected digits")
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
plt.figure(figsize=(12, 10))
plt.subplot(2,2,1)
plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')

# t-SNE embedding of the digits dataset
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time %.2fs)" %
               (time() - t0), sub_num=222)

# Projection on to the first 2 linear discriminant components
print("Computing Linear Discriminant Analysis projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # Make X invertible
t0 = time()
X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2
                                                         ).fit_transform(X2, y)
plot_embedding(X_lda,
               "Linear Discriminant projection of the digits (time %.2fs)" %
               (time() - t0), sub_num=223)

# Isomap projection of the digits dataset
print("Computing Isomap projection")
t0 = time()
X_iso = manifold.Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(X)
plot_embedding(X_iso,
               "Isomap projection of the digits (time %.2fs)" %
               (time() - t0), sub_num=224)

print("0 and 1 are Red.\n2 is Blue.\n3 is Green.\n4 is Purple.\n5 is Orange.")
plt.tight_layout()
plt.savefig('t-SNE.png')


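The NSML deep learning platform

VoxSRC uses NSML, a platform from South Korea that offers researchers many automated features so that developers can focus on model design. This fits well with what a deep learning platform is expected to provide. In China, many deep learning competitions also run on platforms of this kind, for example Alibaba Cloud, Tencent Cloud, Baidu Cloud, JD Cloud, Huawei Cloud, and UCloud.

The author has experimented with a single-machine deep learning platform, but the high barrier to entry has been the main difficulty in building one, both technically and in terms of design. Readers who are interested are very welcome to join the author's team and work on this together.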

References:

VoxCeleb Speaker Recognition Challenge (VoxSRC): Chung, J.S., Huh, J., Mun, S., Lee, M., Heo, H.S., Choe, S., Ham, C., Jung, S., Lee, B.-J., Han, I., 2020. In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982.
Encoders and loss functions for speaker and language recognition: Cai, W., Chen, J., Li, M., 2018. Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System, in: Odyssey 2018 The Speaker and Language Recognition Workshop. ISCA, Les Sables d’Olonne, France, pp. 74–81. https://doi.org/10.21437/odyssey.2018-11
High-dimensional data visualization with t-SNE: Van Der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Deep learning platform: Sung, N., Kim, M., Jo, H., Yang, Y., Kim, J., Lausen, L., Kim, Y., Lee, G., Kwak, D.-H., Ha, J.-W., Kim, S., 2017. NSML: A Machine Learning Platform That Enables You to Focus on Your Models. CoRR, arXiv preprint.

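Author: Wang Rui, PhD student, Department of Computer Science, Tongji University

Email: [email protected]

CSDN: https://blog.csdn.net/i_love_home

Github: https://github.com/mechanicalsea

If you are interested in taking part in the 2020 VoxSRC competition, feel free to get in touch.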
