PyTorch Study Notes: Chapter 9, the RNN Poet

utils
data
model
main
- Train
- Prefix-poem generation
- Acrostic-poem generation

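These notes record some issues I ran into while learning and using PyTorch. I strongly recommend the book 《深度學習架構PyTorch：入門與實戰》. It is very well written and the author clearly put a lot of care into it, so it is worth a read for everyone. This post is my study notes for Chapter 9, the RNN poet.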

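The notes walk through the four source files of the implementation (main, data, model, utils), which together cover the model definition, training, poem generation, and output display for the whole project.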

utils

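There is not much to say about this file: it wraps a visdom object and adds convenience functions for plotting several points at once and for showing several images in a grid. (This project never displays images, so the file was probably carried over from earlier projects.)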

data

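The raw data is structured JSON and has to be read and reshaped into a form the network accepts. Each poem becomes an array of length 125: shorter poems are padded, longer ones are truncated.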


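For the later embedding lookup and the classification-style prediction, every character also needs an integer index, so each poem in the input data ends up as a sequence of such indices.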
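The character-to-index mapping can be sketched as follows. This is a minimal illustration, not the project's code: the two-line corpus, the helper names `build_vocab` and `encode`, and the `<EOP>` and `</s>` token names are assumptions made for the example (only `<START>` appears in the snippets quoted in this post).

```python
# Hypothetical toy corpus; the real project reads its poems from JSON files.
poems = ["床前明月光", "明月幾時有"]

# Special tokens. Only "<START>" is confirmed by the post; the others are assumed names.
PAD, START, EOP = "</s>", "<START>", "<EOP>"


def build_vocab(poems):
    """Build forward (char -> index) and reverse (index -> char) dictionaries."""
    chars = sorted({ch for poem in poems for ch in poem})
    vocab = [PAD, START, EOP] + chars
    word2ix = {w: i for i, w in enumerate(vocab)}
    ix2word = {i: w for w, i in word2ix.items()}
    return word2ix, ix2word


def encode(poem, word2ix):
    """Turn one poem into a list of indices, wrapped in start/end tokens."""
    return [word2ix[START]] + [word2ix[ch] for ch in poem] + [word2ix[EOP]]


word2ix, ix2word = build_vocab(poems)
ids = encode(poems[0], word2ix)
# Round trip: drop the start/end tokens and map the indices back to characters.
decoded = "".join(ix2word[i] for i in ids[1:-1])
```

The reverse dictionary is what lets the generation code later turn a predicted index back into a character.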


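data.py contains three functions. get_data is the main entry point. It calls _parseRawData, which decodes the JSON files and preprocesses the poems (stripping irrelevant digits, punctuation, and so on) and then calls pad_sequences to bring every poem to the same length. Finally, get_data itself builds the forward and reverse index dictionaries, saves them together with the data, and returns them. The padding function is fairly general and can be reused in many other settings.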

import numpy as np


def pad_sequences(sequences,
                  maxlen=None,
                  dtype='int32',
                  padding='pre',
                  truncating='pre',
                  value=0.):
    """
    code from keras
    Pads each sequence to the same length (length of the longest sequence).
    If maxlen is provided, any sequence longer
    than maxlen is truncated to maxlen.
    Truncation happens off either the beginning (default) or
    the end of the sequence.
    Supports post-padding and pre-padding (default).
    Arguments:
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequence.
        padding: 'pre' or 'post', pad either before or after each sequence.
        truncating: 'pre' or 'post', remove values from sequences larger than
            maxlen either in the beginning or in the end of the sequence
        value: float, value to pad the sequences to the desired value.
    Returns:
        x: numpy array with dimensions (number_of_sequences, maxlen)
    Raises:
        ValueError: in case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.
    """
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    # if maxlen is not given, pad everything to the longest sequence in the data
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:  # pylint: disable=g-explicit-length-test
            sample_shape = np.asarray(s).shape[1:]
            break
    # allocate the output array, pre-filled with the padding value
    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        # skip empty sequences, then truncate from the head ('pre') or the tail ('post')
        if not len(s):  # pylint: disable=g-explicit-length-test
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]  # pylint: disable=invalid-unary-operand-type
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError(
                'Shape of sample %s of sequence at position %s is different from '
                'expected shape %s'
                % (trunc.shape[1:], idx, sample_shape))
        # pad at the tail ('post') or at the head ('pre')
        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x
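To see what the `padding` and `truncating` options do, here is a simplified pure-Python stand-in for a single flat sequence. The helper `pad_one` is hypothetical and written only for illustration; it mirrors the slicing logic of the function above but skips the shape checks and the NumPy array, and it assumes `maxlen >= 1`.

```python
def pad_one(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one flat list to exactly `maxlen` items (assumes maxlen >= 1)."""
    if truncating == "pre":
        trunc = seq[-maxlen:]   # keep the last maxlen items, dropping the head
    elif truncating == "post":
        trunc = seq[:maxlen]    # keep the first maxlen items, dropping the tail
    else:
        raise ValueError('Truncating type "%s" not understood' % truncating)

    fill = [value] * (maxlen - len(trunc))
    if padding == "pre":
        return fill + trunc     # padding goes in front of the data
    elif padding == "post":
        return trunc + fill     # padding goes after the data
    raise ValueError('Padding type "%s" not understood' % padding)


short = [1, 2, 3]
long_ = [1, 2, 3, 4, 5, 6]

pre_padded = pad_one(short, 5)                        # [0, 0, 1, 2, 3]
post_padded = pad_one(short, 5, padding="post")       # [1, 2, 3, 0, 0]
pre_truncated = pad_one(long_, 4)                     # [3, 4, 5, 6]
post_truncated = pad_one(long_, 4, truncating="post") # [1, 2, 3, 4]
```

Note that with the defaults (`'pre'` for both), a short poem gets its padding at the front and a long poem loses its beginning, not its end.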

model

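The model for this project is fairly simple. It starts from the basic CharRNN, which learns character-level patterns in order to generate text: in effect it uses the preceding n-1 characters to predict the n-th, which makes it a classification problem whose number of classes is the number of distinct characters. Two changes are made on top of that: the RNN is replaced by an LSTM, and the one-hot character encoding is replaced by learned embedding vectors.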

import torch.nn as nn


class PoetryModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(PoetryModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, self.hidden_dim, num_layers=2)
        # fully connected layer that classifies the next character
        self.linear1 = nn.Linear(self.hidden_dim, vocab_size)

    def forward(self, input, hidden=None):
        seq_len, batch_size = input.size()
        if hidden is None:
            # the first dimension is 2 because the LSTM has two layers
            #  h_0 = 0.01*torch.Tensor(2, batch_size, self.hidden_dim).normal_().cuda()
            #  c_0 = 0.01*torch.Tensor(2, batch_size, self.hidden_dim).normal_().cuda()
            h_0 = input.data.new(2, batch_size, self.hidden_dim).fill_(0).float()
            c_0 = input.data.new(2, batch_size, self.hidden_dim).fill_(0).float()
        else:
            h_0, c_0 = hidden
        # size: (seq_len,batch_size,embeding_dim)
        embeds = self.embeddings(input)
        # output size: (seq_len,batch_size,hidden_dim)
        output, hidden = self.lstm(embeds, (h_0, c_0))

        # size: (seq_len*batch_size,vocab_size)
        # i.e. a next-character prediction (as classification) for every position
        output = self.linear1(output.view(seq_len * batch_size, -1))
        return output, hidden
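The shape comments in `forward` can be checked with a small standalone sketch. The three layers are rebuilt here outside the class with the same constructor arguments; the concrete sizes (vocabulary of 50, embedding of 8, hidden size of 16, sequence length 5, batch of 3) are arbitrary assumptions chosen only to make the shapes easy to read. The hidden state is left out of the `nn.LSTM` call, in which case PyTorch initializes it to zeros, which is what the explicit `h_0`/`c_0` above do as well.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative sizes, not the project's real hyperparameters.
vocab_size, embedding_dim, hidden_dim = 50, 8, 16
seq_len, batch_size = 5, 3

embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2)
linear = nn.Linear(hidden_dim, vocab_size)

# Input layout is (seq_len, batch_size), as in PoetryModel.forward.
tokens = torch.randint(0, vocab_size, (seq_len, batch_size))

embeds = embeddings(tokens)          # (seq_len, batch_size, embedding_dim)
output, (h_n, c_n) = lstm(embeds)    # output: (seq_len, batch_size, hidden_dim)
logits = linear(output.view(seq_len * batch_size, -1))  # (seq_len*batch_size, vocab_size)

shapes = {
    "embeds": tuple(embeds.shape),
    "output": tuple(output.shape),
    "h_n": tuple(h_n.shape),
    "logits": tuple(logits.shape),
}
```

Each row of `logits` is one score vector over the vocabulary, so flattening to `seq_len * batch_size` rows produces one next-character prediction per input position.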

main

The main file contains the training function and the poem-generation functions (acrostic poems and prefix poems).

Train

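Training is ordinary classification training, but pay attention to how inputs and labels relate: the character that follows each input character is the prediction target for that character, so the labels are simply the inputs shifted by one position.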
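The one-position shift is easiest to see on a single encoded poem. The index values below are made up for illustration.

```python
# Hypothetical encoded poem: <START>, three characters, <EOP>.
ids = [1, 7, 4, 9, 2]

inputs = ids[:-1]    # everything except the last token
targets = ids[1:]    # everything except the first token

# At step t the network sees inputs[t] and should predict targets[t] == ids[t + 1].
pairs = list(zip(inputs, targets))
```

For a batch tensor laid out as (seq_len, batch_size), which is the layout the model's `forward` expects, the same two slices would be taken along the first dimension, and the targets flattened to match the (seq_len*batch_size, vocab_size) output.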

Prefix-poem generation

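Generating a poem from given opening words is straightforward: first feed the opening words into the network one at a time to compute the corresponding hidden state.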

if i < start_word_len:
    w = results[i]
    input = input.data.new([word2ix[w]]).view(1, 1)

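After that, the network predicts one character at a time, and each prediction becomes the input for the next step, until an end token appears or the length limit is reached.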

top_index = output.data[0].topk(1)[1][0].item()
w = ix2word[top_index]
results.append(w)
input = input.data.new([top_index]).view(1, 1)

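Optionally you can also supply a line of verse to imitate. That line is fed through the model first to change the hidden state, so that the generated poem better matches its mood and length. (Its outputs are not kept in the result; it only serves as fresh input to the network.)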

if prefix_words:
    for word in prefix_words:
        output, hidden = model(input, hidden)
        input = input.data.new([word2ix[word]]).view(1, 1)
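The three fragments above fit together into one loop. The sketch below shows the control flow only: `toy_model` is a hypothetical stand-in for the trained network (a fixed lookup table that returns the next token and a dummy hidden state), and it works on characters directly instead of index tensors.

```python
# Hypothetical stand-in for the trained network: a fixed "next token" table.
NEXT = {"<START>": "春", "春": "眠", "眠": "不", "不": "覺", "覺": "曉", "曉": "<EOP>"}


def toy_model(token, hidden=None):
    """Return (predicted next token, new hidden state); hidden is just a step counter."""
    hidden = 0 if hidden is None else hidden
    return NEXT.get(token, "<EOP>"), hidden + 1


def generate(start_words, prefix_words="", max_gen_len=10):
    results = list(start_words)
    token, hidden = "<START>", None

    # Optional warm-up: only the hidden state is kept, no output is recorded.
    for ch in prefix_words:
        _, hidden = toy_model(token, hidden)
        token = ch

    for i in range(max_gen_len):
        predicted, hidden = toy_model(token, hidden)
        if i < len(start_words):
            # Still consuming the given opening: ignore the prediction.
            token = results[i]
        else:
            if predicted == "<EOP>":
                break
            results.append(predicted)
            token = predicted   # feed the prediction back in
    return "".join(results)


poem = generate("春眠")
short_poem = generate("春眠", max_gen_len=3)
warmed_up = generate("春眠", prefix_words="床前")
```

With a real network the warm-up would change the hidden state and therefore the continuation; the toy table ignores it, so all this example demonstrates is that the prefix never appears in the result.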

Acrostic-poem generation

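An acrostic poem is generated in much the same way as the ordinary poem above, except that the given head characters are not all fed into the network at the start. Only the first one is fed in, and the network then generates characters on its own until it emits end-of-line punctuation (the code checks for 。 and ！, plus the <START> token), which signals the start of a new line. At that point the next head character is stored in the result and fed into the network.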

    for i in range(opt.max_gen_len):
        output, hidden = model(input, hidden)
        top_index = output.data[0].topk(1)[1][0].item()
        w = ix2word[top_index]

        if (pre_word in {u'。', u'！', '<START>'}):
            # a line has just ended (or nothing has been generated yet):
            # feed in the next head character

            if index == start_word_len:
                # every head character has already been used, so stop
                break
            else:
                # use the head character as the next model input
                w = start_words[index]
                index += 1
                input = (input.data.new([word2ix[w]])).view(1, 1)
        else:
            # otherwise feed the word just predicted back in as the next input
            input = (input.data.new([word2ix[w]])).view(1, 1)
        results.append(w)
        pre_word = w
    return results
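The same control flow can be run end to end with a stand-in model. Everything about `toy_model` here is an assumption made for the demonstration: it emits two filler characters and then a 。 after each head character, and it works on characters directly instead of index tensors.

```python
def toy_model(token, hidden=None):
    """Hypothetical network: after a head character, emit 花, 花, then 。."""
    count = 0 if hidden is None else hidden
    if token in ("<START>", "。"):
        # Start of a new line; this prediction gets overridden by a head character.
        return "天", 0
    count += 1
    return ("。" if count == 3 else "花"), count


def gen_acrostic(start_words, max_gen_len=50):
    results = []
    token, hidden = "<START>", None
    pre_word = "<START>"
    index = 0   # how many head characters have been used so far

    for _ in range(max_gen_len):
        predicted, hidden = toy_model(token, hidden)
        if pre_word in {"。", "！", "<START>"}:
            if index == len(start_words):
                break                    # all head characters used: done
            w = start_words[index]       # override the prediction with a head character
            index += 1
        else:
            w = predicted                # keep the model's own prediction
        token = w
        results.append(w)
        pre_word = w
    return "".join(results)


acrostic = gen_acrostic("春夏")
lines = [line for line in acrostic.split("。") if line]
```

Each line of the result starts with one of the given head characters, and generation stops once the last head character's line has ended or `max_gen_len` is reached, whichever comes first.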
