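Big-Tech Technical Implementations | The Parallel Two-Tower CTR Structure in Tencent's Feed Recommendation Ranking @ Recommendation and Computational Advertising Series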

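The two-tower model is one of the most common and classic structures in algorithm practice across recommendation, search, and advertising. In real deployments, companies often upgrade the structure inside each tower, replacing the fully connected DNN with newer network structures from CTR prediction. This article looks at a solution from the Tencent browser team's recommendation scenario, which applies CTR models to the two towers in a parallel arrangement.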

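💡 Authors: 韓信子@ShowMeAI, Joan@Tencent

📘 Big-Tech Solutions tutorial series: https://www.showmeai.tech/tutorials/50

📘 Article URL: https://www.showmeai.tech/article-detail/64

📢 Notice: All rights reserved. For reprints, please contact the platform and the authors and credit the source.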


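The dataset used in this article is the 🏆 CTR prediction methods dataset and code, which can be downloaded through ShowMeAI's Baidu Netdisk link. A lot of care went into organizing the dataset and code; PRs and Stars are welcome!

🏆 Dataset download for the Big-Tech Technical Implementations series (Baidu Netdisk): reply『大廠』to the WeChat official account『ShowMeAI研究中心』to get the『CTR prediction methods dataset and code』for this article on the parallel two-tower CTR structure in Tencent's feed recommendation ranking.

⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub/multi-task-learning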

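1. Two-Tower Model Structure

1.1 Model structure

The two-tower model is widely used in the recall and ranking stages of recommendation, search, and advertising. In this structure the left side is the User tower and the right side is the Item tower, and the features are split into two corresponding groups:

User features: basic user information, demographic attributes, the sequence of Items the user has interacted with, and so on. Context features, if available, can be fed into the user-side tower.
Item features: basic Item information, attribute information, and so on.

In the original version, both towers are classic DNNs (fully connected structures): the feature Embeddings pass through several MLP hidden layers, and the two towers output a User Embedding and an Item Embedding respectively.

During training, the inner product or cosine similarity of the User Embedding and the Item Embedding is computed, so that the current User moves closer to positive Items and farther from negative Items in the Embedding space. The loss can be the standard cross-entropy loss (treating the task as classification), or BPR or hinge loss (treating it as representation learning).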
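The scoring functions and losses named above can be sketched in a few lines. This is a minimal illustration in plain Python, not the team's training code: the toy vectors, the `margin` default, and all function names are assumptions made for the example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy_loss(user_vec, item_vec, label):
    """Classification view: binary cross-entropy on sigmoid(<user, item>)."""
    p = sigmoid(dot(user_vec, item_vec))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def bpr_loss(user_vec, pos_vec, neg_vec):
    """Representation-learning view: -log sigmoid(score_pos - score_neg)."""
    return -math.log(sigmoid(dot(user_vec, pos_vec) - dot(user_vec, neg_vec)))


def hinge_loss(user_vec, pos_vec, neg_vec, margin=1.0):
    """Pairwise hinge loss: the positive should beat the negative by `margin`."""
    return max(0.0, margin - dot(user_vec, pos_vec) + dot(user_vec, neg_vec))


# Toy tower outputs (hypothetical values).
user = [0.5, -0.2, 0.1]
pos_item = [0.6, -0.1, 0.3]
neg_item = [-0.4, 0.3, 0.0]

print(dot(user, pos_item), dot(user, neg_item))
print(bpr_loss(user, pos_item, neg_item), hinge_loss(user, pos_item, neg_item))
```

Both pairwise losses shrink as the positive Item's score moves above the negative Item's score, which is the "pull closer, push apart" behavior described above.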

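1.2 Strengths and weaknesses of the two-tower model

The strengths of the two-tower model are clear:

Clean structure. User and Item are modeled separately and then interact to produce the prediction.
Efficient online inference after training. At serving time the Item vectors are precomputed; the User vector is computed once from the features that change, and then only an inner product or cosine is needed.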
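The serving pattern can be sketched as follows. This is an illustrative toy under stated assumptions: `ITEM_INDEX`, `user_tower`, `recommend`, and the feature names are hypothetical stand-ins, and a production system would typically use an approximate nearest-neighbor index rather than a full scan.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Offline step: Item vectors are produced once by the Item tower and stored.
ITEM_INDEX = {
    "novel_a": [0.9, 0.1, 0.0],
    "novel_b": [0.1, 0.8, 0.2],
    "novel_c": [-0.3, 0.2, 0.7],
}


def user_tower(features):
    """Stand-in for the real User tower; returns a fixed-size vector."""
    return [features["likes_fantasy"], features["likes_romance"], features["likes_scifi"]]


def recommend(features, k=2):
    """Compute the User vector once, then score every stored Item by inner product."""
    user_vec = user_tower(features)
    scored = [(dot(user_vec, vec), item_id) for item_id, vec in ITEM_INDEX.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


top_items = recommend({"likes_fantasy": 1.0, "likes_romance": 0.2, "likes_scifi": 0.0})
print(top_items)
```

The Item tower never runs at request time here, which is where the serving efficiency comes from.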

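The two-tower model also has drawbacks:

In the original structure the features are restricted: cross features cannot be used.
Because User and Item are built separately and can only interact through the final inner product, the structure is poorly suited to learning User-Item interactions.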

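1.3 Optimizing the two-tower model

Working from these limitations, the Tencent feed team (QQ Browser novel recommendation scenario) optimized the two-tower structure and obtained solid gains by strengthening the model structure. The approach:

Replace the simple DNN in each tower with a "parallel" arrangement of effective CTR modules (MLP, DCN, FM, FFM, CIN), exploiting the feature-crossing strengths of the different structures and widening the model to ease the bottleneck of the two-tower inner product.
Use LR to learn the weights of the "parallel" two-tower branches. The LR weights are ultimately folded into the User Embedding, so the final model still keeps the inner-product form.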

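2. Parallel Two-Tower Model Structure

The parallel two-tower model can be divided into three layers: the input layer, the representation layer, and the matching layer. These correspond to the three levels in the figure, and each is handled as follows.

2.1 Input Layer

The Tencent QQ Browser novel scenario has two broad groups of features:

User features: user id, user profile (age, gender, city), behavior sequences (clicks, reads, favorites), and external behavior (browser news, Tencent Video, etc.).
Item features: novel content features (novel id, category, tags, etc.), statistical features, and so on.

Both User and Item features are discretized and mapped to Feature Embeddings, which the representation layer then builds its networks on.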
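As a rough picture of "discretize, then look up an Embedding", here is a small sketch. Everything in it is an assumption for illustration: the bucket boundaries, the hash-space size `NUM_BUCKETS`, the use of `zlib.crc32` for id hashing, and the field names are not taken from the team's system.

```python
import random
import zlib

EMBED_DIM = 4
NUM_BUCKETS = 1000  # hypothetical hash-space size per field

random.seed(0)
# One randomly initialized embedding table per feature field.
tables = {
    field: [[random.gauss(0.0, 0.01) for _ in range(EMBED_DIM)] for _ in range(NUM_BUCKETS)]
    for field in ("user_id", "age_bucket", "novel_id")
}


def hash_bucket(value, num_buckets=NUM_BUCKETS):
    """Map an arbitrary id to a stable bucket index (crc32 is deterministic)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def age_to_bucket(age, boundaries=(18, 25, 35, 50)):
    """Discretize a continuous value: count how many boundaries it reaches."""
    return sum(age >= b for b in boundaries)


def embed(field, index):
    """Look up the Embedding row for a discretized feature value."""
    return tables[field][index]


user_features = [
    embed("user_id", hash_bucket("u_12345")),
    embed("age_bucket", age_to_bucket(29)),
]
print(len(user_features), len(user_features[0]))
```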

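2.2 Representation Layer

The inputs are fed to deep CTR modules (MLP, DCN, FM, CIN, etc.); each module learns to fuse and cross the input-layer features in its own way.
The representations learned by the different modules are arranged in parallel for the matching layer.
User-User and Item-Item feature interactions (crossing within a tower) can be handled inside each tower branch, whereas User-Item feature interactions can only be realized in the layer above.

2.3 Matching Layer

For each parallel module, the Hadamard product of the User and Item vectors from the representation layer is computed; the results are concatenated and then fused by LR to produce the final score.
At serving time, the LR weight of each dimension can be folded into the User Embedding in advance, so that online scoring remains an inner product.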
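Why the LR weights can be folded into the User side follows from a one-line identity: \(\sum_d w_d u_d i_d = \langle w \odot u, i \rangle\). The sketch below checks this numerically on toy values. The two branches, their widths, and the weights are hypothetical, and any final sigmoid is left out because it does not change the ranking.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hadamard(u, v):
    """Elementwise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]


# Outputs of two parallel tower pairs (say an MLP pair and a DCN pair), toy values.
user_vecs = [[0.2, -0.5], [0.7, 0.1, 0.4]]
item_vecs = [[0.3, 0.6], [-0.2, 0.9, 0.5]]
lr_weights = [[1.5, -0.8], [0.4, 0.3, -1.1]]  # one LR weight per Hadamard dimension
lr_bias = 0.05

# Training-time form: Hadamard product per branch -> concatenate -> LR.
concat_hadamard = [x for u, i in zip(user_vecs, item_vecs) for x in hadamard(u, i)]
flat_weights = [w for ws in lr_weights for w in ws]
train_score = dot(flat_weights, concat_hadamard) + lr_bias

# Serving-time form: scale the User side by the LR weights once, then a plain inner product.
fused_user = [w * x for ws, u in zip(lr_weights, user_vecs) for w, x in zip(ws, u)]
flat_item = [x for i in item_vecs for x in i]
serve_score = dot(fused_user, flat_item) + lr_bias

print(train_score, serve_score)
```

The Item vectors are untouched by the fusion, so a precomputed Item index stays valid when the LR weights are absorbed into the User Embedding.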

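3. Representation-Layer Structures in the Two Towers: MLP/DCN

Towers generally use an MLP (stacked fully connected layers). The Tencent QQ Browser team also introduced the Cross Network from DCN to construct high-order feature interactions explicitly, following DCN-Mix, the improved version from Google's paper.

3.1 DCN structure

DCN's distinguishing feature is the Cross Network, which extracts crossed feature combinations and so avoids the manual feature engineering of traditional machine learning. The network is simple, its complexity is controllable, and each additional layer yields higher-order crosses. The DCN structure is shown in the figure:

At the bottom is the Embedding layer, whose Embeddings are stacked.
Above it, a Cross Network and a Deep Network run in parallel.
At the top, a Combination Layer stacks the outputs of the Cross Network and the Deep Network to produce the Output.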

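3.2 Introducing the optimized DCN-V2 structure

Building on DCN, Google proposed the improved DCN-Mix/DCN-V2, which revises the Cross Network. We focus on how the Cross Network computation changes:

1) Original Cross Network computation

With the original formula, stacking layers explicitly learns high-order feature interactions. The problem is that the final k-th order result \(x_{k}\) has been proven to equal \(x_{0}\) multiplied by a scalar (the scalar differs for different \(x_{0}\), so \(x_{0}\) and \(x_{k}\) are not linearly related), which limits the expressiveness of the Cross Network.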
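The scalar-multiple property can be checked numerically. The sketch below uses the cross layer from the original DCN paper, \(x_{l+1} = x_{0} x_{l}^{\top} w_{l} + b_{l} + x_{l}\), with the bias \(b_{l}\) dropped, which is the simplifying assumption under which the property holds exactly. The input and weight values are arbitrary toy numbers.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_v1(x0, xl, w):
    """Original DCN cross layer without bias: x_{l+1} = x0 * (xl . w) + xl."""
    scale = dot(xl, w)  # x_l^T w collapses to a single scalar
    return [a * scale + b for a, b in zip(x0, xl)]


x0 = [1.0, -2.0, 0.5]
layer_weights = [[0.3, 0.1, -0.4], [-0.2, 0.5, 0.7], [0.6, -0.3, 0.2]]

x = x0
for w in layer_weights:
    x = cross_v1(x0, x, w)

# If x is a scalar multiple of x0, every elementwise ratio is the same number.
ratios = [xi / x0i for xi, x0i in zip(x, x0)]
print(ratios)
```

All three ratios coincide, so after three layers the output still points along \(x_{0}\); only the scalar, which itself depends on \(x_{0}\), has changed.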

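2) Improved Cross Network computation

Google's improved DCN-Mix makes the following changes:

\(W\) changes from a vector to a matrix; the larger parameter count brings more expressive power (in practice the matrix \(W\) can also be factorized into low-rank factors).
The feature interaction changes: the outer product is replaced by the Hadamard product.

3) DCN-V2 reference code

For a DCN-v2 implementation and CTR application examples, see Google's official implementation: https://github.com/tensorflow/models/tree/master/official/recommendation/ranking

The core of the improved deep cross layer is as follows: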

from typing import Optional, Text, Union

import tensorflow as tf


class Cross(tf.keras.layers.Layer):
  """Cross Layer in Deep & Cross Network to learn explicit feature interactions.

  A layer that creates explicit and bounded-degree feature interactions
  efficiently. The `call` method accepts `inputs` as a tuple of size 2
  tensors. The first input `x0` is the base layer that contains the original
  features (usually the embedding layer); the second input `xi` is the output
  of the previous `Cross` layer in the stack, i.e., the i-th `Cross`
  layer. For the first `Cross` layer in the stack, x0 = xi.

  The output is x_{i+1} = x0 .* (W * xi + bias + diag_scale * xi) + xi,
  where .* designates elementwise multiplication, W could be a full-rank
  matrix, or a low-rank matrix U*V to reduce the computational cost, and
  diag_scale increases the diagonal of W to improve training stability (
  especially for the low-rank case).

  References:
    1. [R. Wang et al.](https://arxiv.org/pdf/2008.13535.pdf)
      See Eq. (1) for full-rank and Eq. (2) for low-rank version.
    2. [R. Wang et al.](https://arxiv.org/pdf/1708.05123.pdf)

  Example:
    ```python
    # after embedding layer in a functional model:
    input = tf.keras.Input(shape=(None,), name='index', dtype=tf.int64)
    x0 = tf.keras.layers.Embedding(input_dim=32, output_dim=6)(input)
    x1 = Cross()(x0, x0)
    x2 = Cross()(x0, x1)
    logits = tf.keras.layers.Dense(units=10)(x2)
    model = tf.keras.Model(input, logits)
    ```

  Args:
    projection_dim: project dimension to reduce the computational cost.
      Default is `None` such that a full (`input_dim` by `input_dim`) matrix
      W is used. If enabled, a low-rank matrix W = U*V will be used, where U
      is of size `input_dim` by `projection_dim` and V is of size
      `projection_dim` by `input_dim`. `projection_dim` need to be smaller
      than `input_dim`/2 to improve the model efficiency. In practice, we've
      observed that `projection_dim` = d/4 consistently preserved the
      accuracy of a full-rank version.
    diag_scale: a non-negative float used to increase the diagonal of the
      kernel W by `diag_scale`, that is, W + diag_scale * I, where I is an
      identity matrix.
    use_bias: whether to add a bias term for this layer. If set to False,
      no bias term will be used.
    kernel_initializer: Initializer to use on the kernel matrix.
    bias_initializer: Initializer to use on the bias vector.
    kernel_regularizer: Regularizer to use on the kernel matrix.
    bias_regularizer: Regularizer to use on bias vector.

  Input shape: A tuple of 2 (batch_size, `input_dim`) dimensional inputs.
  Output shape: A single (batch_size, `input_dim`) dimensional output.
  """

  def __init__(
      self,
      projection_dim: Optional[int] = None,
      diag_scale: Optional[float] = 0.0,
      use_bias: bool = True,
      kernel_initializer: Union[
          Text, tf.keras.initializers.Initializer] = "truncated_normal",
      bias_initializer: Union[Text,
                              tf.keras.initializers.Initializer] = "zeros",
      kernel_regularizer: Union[Text, None,
                                tf.keras.regularizers.Regularizer] = None,
      bias_regularizer: Union[Text, None,
                              tf.keras.regularizers.Regularizer] = None,
      **kwargs):
    super(Cross, self).__init__(**kwargs)

    self._projection_dim = projection_dim
    self._diag_scale = diag_scale
    self._use_bias = use_bias
    self._kernel_initializer = tf.keras.initializers.get(kernel_initializer)
    self._bias_initializer = tf.keras.initializers.get(bias_initializer)
    self._kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
    self._bias_regularizer = tf.keras.regularizers.get(bias_regularizer)
    self._input_dim = None

    self._supports_masking = True

    if self._diag_scale < 0:
      raise ValueError(
          "`diag_scale` should be non-negative. Got `diag_scale` = {}".format(
              self._diag_scale))

  def build(self, input_shape):
    last_dim = input_shape[-1]

    if self._projection_dim is None:
      # Full-rank case: a single dense layer plays the role of W (and bias).
      self._dense = tf.keras.layers.Dense(
          last_dim,
          kernel_initializer=self._kernel_initializer,
          bias_initializer=self._bias_initializer,
          kernel_regularizer=self._kernel_regularizer,
          bias_regularizer=self._bias_regularizer,
          use_bias=self._use_bias,
      )
    else:
      # Low-rank case: W = U * V, implemented as two stacked dense layers.
      self._dense_u = tf.keras.layers.Dense(
          self._projection_dim,
          kernel_initializer=self._kernel_initializer,
          kernel_regularizer=self._kernel_regularizer,
          use_bias=False,
      )
      self._dense_v = tf.keras.layers.Dense(
          last_dim,
          kernel_initializer=self._kernel_initializer,
          bias_initializer=self._bias_initializer,
          kernel_regularizer=self._kernel_regularizer,
          bias_regularizer=self._bias_regularizer,
          use_bias=self._use_bias,
      )
    self.built = True

  def call(self, x0: tf.Tensor, x: Optional[tf.Tensor] = None) -> tf.Tensor:
    """Computes the feature cross.

    Args:
      x0: The input tensor
      x: Optional second input tensor. If provided, the layer will compute
        crosses between x0 and x; if not provided, the layer will compute
        crosses between x0 and itself.

    Returns:
     Tensor of crosses.
    """
    if not self.built:
      self.build(x0.shape)

    if x is None:
      x = x0

    if x0.shape[-1] != x.shape[-1]:
      raise ValueError(
          "`x0` and `x` dimension mismatch! Got `x0` dimension {}, and x "
          "dimension {}. This case is not supported yet.".format(
              x0.shape[-1], x.shape[-1]))

    if self._projection_dim is None:
      prod_output = self._dense(x)
    else:
      prod_output = self._dense_v(self._dense_u(x))

    if self._diag_scale:
      prod_output = prod_output + self._diag_scale * x

    # Hadamard product with x0 plus the residual connection.
    return x0 * prod_output + x

  def get_config(self):
    config = {
        "projection_dim":
            self._projection_dim,
        "diag_scale":
            self._diag_scale,
        "use_bias":
            self._use_bias,
        "kernel_initializer":
            tf.keras.initializers.serialize(self._kernel_initializer),
        "bias_initializer":
            tf.keras.initializers.serialize(self._bias_initializer),
        "kernel_regularizer":
            tf.keras.regularizers.serialize(self._kernel_regularizer),
        "bias_regularizer":
            tf.keras.regularizers.serialize(self._bias_regularizer),
    }
    base_config = super(Cross, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))

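4. Representation-Layer Structures in the Two Towers: FM/FFM/CIN

Another family of structures common in CTR prediction is the FM series; typical models include FM, FFM, DeepFM, and xDeepFM. Their particular way of modeling can also mine useful information, and the Tencent QQ Browser team's final model uses substructures from these models as well.

The feature crossing in the MLP and DCN described above cannot explicitly specify which features interact. The FM / FFM / CIN structures in the FM series, by contrast, operate explicitly on feature-level interactions, and their formulas all have a convenient inner-product form, which makes it straightforward to model feature-level User-Item interactions in a two-tower structure.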

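4.1 Introducing the FM structure

\[y = \omega_{0}+\sum_{i=1}^{n} \omega_{i} x_{i}+\sum_{i=1}^{n-1} \sum_{j=i+1}^{n}\left\langle v_{i}, v_{j}\right\rangle x_{i} x_{j}\]

FM is the most common model structure in CTR prediction; it builds second-order feature interactions through matrix factorization. In the formula this appears as pairwise inner products of the feature vectors \(v_{i}\) and \(v_{j}\), summed (in deep learning terms, pairwise inner products of feature Embeddings). By the distributive law of the inner product, this can be rewritten as summing first and then taking a single inner product.

\[\begin{array}{c}
y=\sum_{i} \sum_{j}\left\langle v_{i}, v_{j}\right\rangle=\left\langle\sum_{i} v_{i}, \sum_{j} v_{j}\right\rangle \\
i \in \text{user features}, \quad j \in \text{item features}
\end{array}\]

In the Tencent QQ Browser team's novel recommendation scenario, only User-Item interactions are considered (second-order interactions within the User side or within the Item side are already captured by the models described above).

As the formula shows, \(i\) ranges over the User-side features and \(j\) over the Item-side features. By the distributive law, the second-order User-Item interaction becomes: sum the User feature vectors and the Item feature vectors separately (sum pooling in a neural network), then take the inner product. This maps directly onto a two-tower structure.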
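The identity is easy to verify on toy numbers. In this sketch the three User-side and two Item-side Embeddings are made-up values, and the feature values \(x_i\) are assumed to be 1 (one-hot features), so they drop out of the formula.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sum_pool(vectors):
    """Elementwise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


user_embs = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]  # one Embedding per User feature
item_embs = [[0.4, 0.1], [-0.2, 0.3]]              # one Embedding per Item feature

# Left-hand side: every User-Item pair contributes one inner product.
pairwise = sum(dot(u, i) for u in user_embs for i in item_embs)

# Right-hand side: sum-pool inside each tower, then a single inner product.
user_vec = sum_pool(user_embs)
item_vec = sum_pool(item_embs)
two_tower = dot(user_vec, item_vec)

print(pairwise, two_tower)
```

The pooled form needs one inner product at serving time regardless of how many features each side has.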

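4.2 Introducing the FFM structure

FFM is an upgraded version of FM that adds the concept of a field. FFM groups features of the same nature into one field, and the latent vectors it builds depend on the field as well as the feature. Feature interactions can therefore take place in different latent spaces, which improves discriminative power. FFM can also be converted into a two-tower inner-product structure.

\[y(\mathbf{x})=w_{0}+\sum_{i=1}^{n} w_{i} x_{i}+\sum_{i=1}^{n} \sum_{j=i+1}^{n}\left\langle\mathbf{v}_{i f_{j}}, \mathbf{v}_{j f_{i}}\right\rangle x_{i} x_{j}\]

Suppose User has 2 feature fields and Item has 3 feature fields; in the figure, every pair of interacting features has its own independent Embedding vectors. By the FFM formula, computing the second-order User-Item interaction means computing all of these inner products and summing them. An example of the conversion:

If we reorder the User and Item feature Embeddings and then concatenate them, FFM can also be turned into a two-tower inner product. The User-User and Item-Item terms of FFM sit inside a single tower, so they can be computed in advance and placed in the first-order term.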
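The "reorder and concatenate" step can be sketched for the 2-field by 3-field case. This is a toy under assumptions: one active feature per field with value 1, random latent vectors, and hypothetical field names. The dictionary `v[(f, g)]` holds the latent vector of the active feature in field `f` that is used when it meets field `g`.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


random.seed(0)
DIM = 3
user_fields = ["age", "city"]                # 2 User fields (hypothetical names)
item_fields = ["category", "tag", "author"]  # 3 Item fields (hypothetical names)

# Field-aware latent vectors for every User-Item field pair, in both directions.
v = {}
for u in user_fields:
    for i in item_fields:
        v[(u, i)] = [random.uniform(-1, 1) for _ in range(DIM)]
        v[(i, u)] = [random.uniform(-1, 1) for _ in range(DIM)]

# FFM cross term restricted to User-Item pairs: sum of <v_{u,f_i}, v_{i,f_u}>.
ffm_cross = sum(dot(v[(u, i)], v[(i, u)]) for u in user_fields for i in item_fields)

# Rearrangement: walk the pairs in one fixed order on both sides and concatenate.
pairs = [(u, i) for u in user_fields for i in item_fields]
user_vec = [x for u, i in pairs for x in v[(u, i)]]  # User tower output
item_vec = [x for u, i in pairs for x in v[(i, u)]]  # Item tower output
two_tower = dot(user_vec, item_vec)

print(len(user_vec), ffm_cross, two_tower)
```

The tower width is (number of User fields) x (number of Item fields) x (Embedding dimension), here 2 x 3 x 3 = 18, so it grows quickly as fields are added.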

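In practice the Tencent QQ Browser team found that the FFM two-tower improved AUC markedly on the training data, but the extra parameters caused severe overfitting. After the restructuring above, the towers also became extremely wide (possibly on the order of ten thousand dimensions), which hurt serving performance noticeably. The team tried the following further optimizations:

Manually select the User and Item feature fields that take part in FFM interactions, keeping the tower width at around 1,000.
Adjust the initialization of the FFM Embedding parameters (close to 0) and lower the learning rate.

The final results were still not satisfactory, so the team did not use FFM in production.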

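4.3 Introducing the CIN structure

FM and FFM handle second-order feature interactions, while the CIN structure proposed in xDeepFM can realize higher-order ones (third order and above, such as User-User-Item, User-User-Item-Item, and User-Item-Item). The Tencent QQ Browser team tried two ways of applying CIN in the two-tower structure:

1) CIN(User) * CIN(Item)

Each tower generates its own multi-order CIN results for User or Item and sum-pools them into a User or Item vector; the User and Item vectors are then combined by inner product.

Expanding the "sum pooling, then inner product" formula with the distributive law shows that this computation already implements multi-order User-Item interactions internally:

\[\left(U^{1}+U^{2}+U^{3}\right) * \left(I^{1}+I^{2}+I^{3}\right) = U^{1} I^{1}+U^{1} I^{2}+U^{1} I^{3}+U^{2} I^{1}+U^{2} I^{2}+U^{2} I^{3}+U^{3} I^{1}+U^{3} I^{2}+U^{3} I^{3}\]

This usage is also easy to implement: in the two-tower structure, run CIN inside each tower to produce the results of each order, sum-pool them, and finally, as in FM, use the inner product to realize User-Item interactions of every order.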
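The expansion can be checked on toy vectors. In this sketch the per-order vectors \(U^{1..3}\) and \(I^{1..3}\) are made-up numbers standing in for the pooled CIN outputs of each order; the CIN computation itself is not reproduced here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sum_pool(vectors):
    """Elementwise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


# Stand-ins for the order-1..3 CIN results of each tower (hypothetical values).
user_orders = [[0.2, 0.1], [0.05, -0.3], [0.4, 0.0]]  # U^1, U^2, U^3
item_orders = [[0.3, -0.2], [0.1, 0.6], [-0.5, 0.2]]  # I^1, I^2, I^3

# Two-tower form: pool the orders inside each tower, then one inner product.
pooled_score = dot(sum_pool(user_orders), sum_pool(item_orders))

# Expanded form: every (User order, Item order) pair contributes one term.
cross_terms = {
    (a + 1, b + 1): dot(u, i)
    for a, u in enumerate(user_orders)
    for b, i in enumerate(item_orders)
}
expanded_score = sum(cross_terms.values())

print(pooled_score, expanded_score, len(cross_terms))
```

The key `(2, 1)`, for example, is the \(U^{2} I^{1}\) term, a User-User-Item interaction obtained without ever crossing the towers before the final inner product.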

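This approach has a drawback: the resulting second-order and higher User-Item interactions share a limitation with FM. For example, \(U^{1}\) is the sum pooling of several User-side features; when \(U^{1}\) takes an inner product with the Item-side results, the sum pooling makes every User feature equally important.

2) CIN( CIN(User), CIN(Item) )

In the second approach, after each tower has produced its multi-order CIN results for User and Item, CIN is applied again pairwise between the User and Item CIN results for explicit interaction (instead of sum pooling followed by an inner product), and the result is converted to a two-tower inner product, as shown in the figure below:

The figure below gives the CIN formula; after sum pooling over the multiple convolution results the form is unchanged (a weighted sum of pairwise Hadamard products).

CIN has a form similar to FFM and can likewise be converted to a two-tower inner product by "rearrange + concatenate"; the resulting towers are also very wide (on the order of ten thousand dimensions). Unlike FFM, however, all CIN feature interactions share the underlying feature Embeddings, whereas FFM has an independent Embedding for every second-order interaction.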
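A simplified sketch of that conversion: a weighted sum of pairwise Hadamard products, summed over the Embedding dimension, is rewritten as one inner product of two concatenated vectors. The assumptions are loud here: a single output map with one weight per (User map, Item map) pair, random toy values, and a hypothetical `tower_width` helper; the full CIN has more structure than this.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


random.seed(0)
A, B, D = 3, 2, 4  # User-side maps, Item-side maps, Embedding dimension (toy sizes)
user_maps = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(A)]
item_maps = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(B)]
weights = [[random.uniform(-1, 1) for _ in range(B)] for _ in range(A)]

# Direct form: sum_{a,b} w_ab * sum_d (U_a ⊙ I_b)_d.
score = sum(
    weights[a][b] * dot(user_maps[a], item_maps[b])
    for a in range(A)
    for b in range(B)
)

# Two-tower form: repeat and concatenate in the same (a, b) order on both sides,
# with the weights absorbed into the User side.
user_vec = [weights[a][b] * x for a in range(A) for b in range(B) for x in user_maps[a]]
item_vec = [x for a in range(A) for b in range(B) for x in item_maps[b]]


def tower_width(num_user_maps, num_item_maps, dim):
    """Width of the concatenated tower output in this simplified setting."""
    return num_user_maps * num_item_maps * dim


print(score, dot(user_vec, item_vec), tower_width(A, B, D))
```

With hypothetical counts of 20 maps per side and dimension 32, `tower_width(20, 20, 32)` is already 12,800, which gives a feel for why the width reaches the ten-thousand range.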

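As a result, the Tencent QQ Browser team saw essentially no overfitting in their experiments, and in terms of results the second approach was slightly better than the first.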

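5. Results on Tencent's Business

The following are the experimental results of these methods on the Tencent QQ Browser novel recommendation business (comparing the single CTR models with the parallel two-tower structures):

5.1 The team's analysis

① Among the single-structure two-tower models, CIN2 performs best, followed by the DCN and CIN1 two-tower structures.

② The parallel two-tower structures also improve clearly over the single two-tower structures.

③ Parallel scheme 2 uses the CIN2 structure, and its tower width exceeds 20,000, which is a challenge for online serving performance. Weighing effectiveness against deployment efficiency, parallel two-tower scheme 1 is the reasonable choice.

5.2 Training details and lessons from the team

① Given the computational complexity of FM/FFM/CIN, these structures are trained only on a selected subset of features, mainly high-dimensional categorical features such as user id, behavior-history ids, novel id, and tag ids, plus a few statistical features. Fewer than 20 feature fields were selected on each of the User and Item sides.

② The parallel two-tower branches do not share the underlying feature Embeddings; each model trains its own.

③ Feature Embedding dimensions: for MLP/DCN, categorical features use dimension \(16\) and non-categorical features use dimension \(32\).

④ For FM/FFM/CIN, the feature Embedding dimension is uniformly \(32\).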

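6. Tencent Team's Experiment Results

An A/B test was run in the pre-ranking (coarse ranking) stage of the novel recommendation scenario. The experiment group's click-through-rate and read-conversion-rate models used "parallel two-tower scheme 1", and the control group used the "MLP two-tower model". As shown in the figure below, the business metrics improved clearly:

Click conversion rate \(+6.8752\%\)
Read conversion rate \(+6.2250\%\)
Add-to-bookshelf conversion rate \(+6.5775\%\)
Total reading time \(+3.3796\%\)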

References

[1] Po-Sen Huang, et al. "Learning deep structured semantic models for web search using clickthrough data." Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013.

[2] S. Rendle. "Factorization machines." Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.

[3] Yuchin Juan, et al. "Field-aware Factorization Machines for CTR Prediction." Proceedings of the 10th ACM Conference on Recommender Systems, September 2016, pp. 43–

[4] Jianxun Lian, et al. "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2018, pp. 1754–1763.

[5] Ruoxi Wang, et al. "Deep & Cross Network for Ad Click Predictions." Proceedings of the ADKDD'17, August 2017, Article No. 12, pp. 1–

[6] Ruoxi Wang, et al. "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems." Proceedings of the Web Conference 2021 (WWW '21). doi:10.1145/3442381.3450078

