PyTorch Study Notes - Chapter 9: RNN Poet

utils
data
model
main
- Train
- Prefix poem generation
- Acrostic poem generation

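These notes record some problems I ran into while learning and using PyTorch. I strongly recommend the book 《深度学习框架PyTorch：入门与实战》. It is very well written and the author clearly put a lot of care into it, so it is worth a read. This post is my study notes for Chapter 9, the RNN poet.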

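The notes walk through the four source files, main, data, model and utils, which together cover the whole project: defining the model structure, training, generation, and displaying the output.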

utils

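There is not much to say about this file. It wraps a visdom object and adds convenience functions for plotting several points at once and for showing several images in a grid (this project does not display images, so the file was presumably carried over from the earlier projects).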

data

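The raw data is structured JSON, which has to be read and reorganized into a form the network can accept. Each poem becomes an array of length 125: shorter poems are padded and longer ones are truncated.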


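For the later embedding and prediction (classification) steps, every character also needs an integer index, so each poem in the input data ends up as a sequence of indices.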
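
The indexing step can be sketched roughly as follows. This is a minimal illustration, not the code in data.py: the two sample lines, the `<EOP>` end marker and the `</s>` padding token are assumptions made for the example.

```python
# Toy corpus; the real project reads its poems from JSON files.
poems = ["床前明月光", "明月松间照"]

# Wrap each poem in start/end markers (the marker names are assumptions).
tokenized = [["<START>"] + list(p) + ["<EOP>"] for p in poems]

# Collect the vocabulary and reserve one extra token for padding.
vocab = sorted({w for poem in tokenized for w in poem}) + ["</s>"]
word2ix = {w: i for i, w in enumerate(vocab)}
ix2word = {i: w for w, i in word2ix.items()}

# Each poem becomes a list of integer indices.
encoded = [[word2ix[w] for w in poem] for poem in tokenized]

# Decoding with the reverse dictionary recovers the original characters.
decoded = "".join(ix2word[i] for i in encoded[0][1:-1])
print(encoded[0], decoded)
```

Keeping both dictionaries around matters later: word2ix is needed to feed characters into the network, and ix2word to turn predicted indices back into characters.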


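data.py contains three functions. get_data is the main entry point. It calls _parseRawData to decode the JSON files and preprocess the poems (removing irrelevant digits, punctuation and so on), then calls pad_sequences to bring the data to a uniform length, and finally builds the forward and reverse index dictionaries itself and saves and returns them along with the data. The padding function is fairly general and can be reused in many other settings.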

import numpy as np


def pad_sequences(sequences,
                  maxlen=None,
                  dtype='int32',
                  padding='pre',
                  truncating='pre',
                  value=0.):
    """
    code from keras
    Pads each sequence to the same length (length of the longest sequence).
    If maxlen is provided, any sequence longer
    than maxlen is truncated to maxlen.
    Truncation happens off either the beginning (default) or
    the end of the sequence.
    Supports post-padding and pre-padding (default).
    Arguments:
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequence.
        padding: 'pre' or 'post', pad either before or after each sequence.
        truncating: 'pre' or 'post', remove values from sequences larger than
            maxlen either in the beginning or in the end of the sequence
        value: float, value to pad the sequences to the desired value.
    Returns:
        x: numpy array with dimensions (number_of_sequences, maxlen)
    Raises:
        ValueError: in case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.
    """
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    # If maxlen is not given, pad everything to the longest sequence in the data
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:  # pylint: disable=g-explicit-length-test
            sample_shape = np.asarray(s).shape[1:]
            break
    # Pre-fill the output array with the padding value
    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        # Skip empty sequences, then truncate from the head ('pre') or the tail ('post')
        if not len(s):  # pylint: disable=g-explicit-length-test
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]  # pylint: disable=invalid-unary-operand-type
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError(
                'Shape of sample %s of sequence at position %s is different from '
                'expected shape %s'
                % (trunc.shape[1:], idx, sample_shape))
        # Pad at the tail ('post') or at the head ('pre')
        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x
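
To make the four padding/truncation combinations concrete, here is a simplified 1-D sketch in plain Python. The helper name `pad_1d` is hypothetical and it handles only flat lists with a positive maxlen; it mirrors the slicing logic above rather than replacing it.

```python
def pad_1d(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Simplified 1-D version of the padding logic (assumes maxlen >= 1)."""
    # Truncate: 'pre' drops items from the front, 'post' drops them from the back.
    trunc = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(trunc))
    # Pad: 'pre' puts the fill before the data, 'post' puts it after.
    return fill + trunc if padding == "pre" else trunc + fill


short_seq, long_seq = [1, 2, 3], [1, 2, 3, 4, 5, 6]
print(pad_1d(short_seq, 5))                     # [0, 0, 1, 2, 3]
print(pad_1d(short_seq, 5, padding="post"))     # [1, 2, 3, 0, 0]
print(pad_1d(long_seq, 5))                      # [2, 3, 4, 5, 6]
print(pad_1d(long_seq, 5, truncating="post"))   # [1, 2, 3, 4, 5]
```

Note that both defaults are 'pre', so without extra arguments a long poem loses its beginning and a short one gets its padding in front.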

model

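The model definition in this project is fairly simple. It starts from a basic CharRNN (which learns character-level patterns in order to generate text: the preceding characters are used to predict the next one, which makes it a classification problem, albeit one with as many classes as there are distinct characters) and makes two changes: the RNN is replaced with an LSTM, and the one-hot character encoding is replaced with embedding vectors.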

import torch.nn as nn


class PoetryModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(PoetryModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, self.hidden_dim, num_layers=2)
        # Fully connected layer that predicts (classifies) the next character
        self.linear1 = nn.Linear(self.hidden_dim, vocab_size)

    def forward(self, input, hidden=None):
        seq_len, batch_size = input.size()
        if hidden is None:
            # The leading 2 is the number of LSTM layers
            #  h_0 = 0.01*torch.Tensor(2, batch_size, self.hidden_dim).normal_().cuda()
            #  c_0 = 0.01*torch.Tensor(2, batch_size, self.hidden_dim).normal_().cuda()
            h_0 = input.data.new(2, batch_size, self.hidden_dim).fill_(0).float()
            c_0 = input.data.new(2, batch_size, self.hidden_dim).fill_(0).float()
        else:
            h_0, c_0 = hidden
        # size: (seq_len,batch_size,embeding_dim)
        embeds = self.embeddings(input)
        # output size: (seq_len,batch_size,hidden_dim)
        output, hidden = self.lstm(embeds, (h_0, c_0))

        # size: (seq_len*batch_size,vocab_size)
        # In effect, a next-character prediction (classification) is made at every position
        output = self.linear1(output.view(seq_len * batch_size, -1))
        return output, hidden
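
The tensor shapes in the forward pass can be checked with a small stand-alone sketch. It rebuilds the same three layers outside the class; all sizes are hypothetical values chosen only for this check.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for this shape check.
vocab_size, embedding_dim, hidden_dim = 50, 8, 16
seq_len, batch_size = 5, 3

embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2)
linear = nn.Linear(hidden_dim, vocab_size)

# Input is (seq_len, batch_size) because nn.LSTM defaults to batch_first=False.
tokens = torch.randint(0, vocab_size, (seq_len, batch_size))
h_0 = torch.zeros(2, batch_size, hidden_dim)
c_0 = torch.zeros(2, batch_size, hidden_dim)

embeds = embeddings(tokens)                     # (seq_len, batch_size, embedding_dim)
output, (h_n, c_n) = lstm(embeds, (h_0, c_0))   # (seq_len, batch_size, hidden_dim)
logits = linear(output.reshape(seq_len * batch_size, -1))  # one row per position

print(embeds.shape, output.shape, logits.shape, h_n.shape)
```

The final reshape is what turns the sequence problem into a flat classification problem: every (time step, batch item) pair gets its own row of vocabulary scores.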

main

The main file contains the model training code and the poem generation functions (acrostic poems and prefix poems).

Train

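Training is ordinary classification training, but note the relationship between input and target: the character that follows each input character is the prediction wanted at that position, so the target is the input shifted by one position.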
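
A single training step with this one-position offset might look like the sketch below. The random batch, the layer sizes, the Adam optimizer and the learning rate are all assumptions for illustration; the stand-in layers follow the model section but are not the project's training script.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, seq_len, batch_size = 20, 6, 4

# Toy batch of token indices, shape (seq_len, batch_size).
data = torch.randint(0, vocab_size, (seq_len, batch_size))

# Offset by one: position t of the input is trained to predict position t+1.
input_, target = data[:-1, :], data[1:, :]

# Stand-in model (an assumption): embedding -> 2-layer LSTM -> linear over the vocabulary.
embed = nn.Embedding(vocab_size, 8)
lstm = nn.LSTM(8, 16, num_layers=2)
linear = nn.Linear(16, vocab_size)
params = list(embed.parameters()) + list(lstm.parameters()) + list(linear.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()

optimizer.zero_grad()
output, _ = lstm(embed(input_))
logits = linear(output.reshape(-1, 16))        # ((seq_len-1)*batch_size, vocab_size)
loss = criterion(logits, target.reshape(-1))   # targets flattened in the same order
loss.backward()
optimizer.step()
print(input_.shape, target.shape, loss.item())
```

Both the logits and the targets are flattened time-major, so row t*batch_size + b of the logits lines up with the character that follows position t in sequence b.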

Prefix poem generation

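Generating a prefix poem is fairly simple. The given opening words are first fed into the network one at a time to compute the corresponding hidden state.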

if i < start_word_len:
    w = results[i]
    input = input.data.new([word2ix[w]]).view(1, 1)

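After that, the model predicts and feeds each prediction back in as the next input, until the end-of-poem token appears or the maximum length is reached.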

top_index = output.data[0].topk(1)[1][0].item()
w = ix2word[top_index]
results.append(w)
input = input.data.new([top_index]).view(1, 1)

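Optionally, a line of verse to imitate can be supplied. It is fed into the model first to change the hidden state, so that the generated poem better matches its mood and length. (Its outputs are not kept in the result; the line only serves as input to the network.)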

if prefix_words:
    for word in prefix_words:
        output, hidden = model(input, hidden)
        input = input.data.new([word2ix[word]]).view(1, 1)
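
Putting the three fragments together, the whole loop can be sketched as below. The tiny vocabulary, the untrained `TinyModel` stand-in and the `<EOP>` end marker are assumptions for illustration, so the generated text is meaningless; only the control flow is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = ["<START>", "<EOP>", "春", "风", "花", "月", "，", "。"]
word2ix = {w: i for i, w in enumerate(vocab)}
ix2word = {i: w for w, i in word2ix.items()}


class TinyModel(nn.Module):
    """Untrained stand-in with the same (input, hidden) -> (output, hidden) interface."""

    def __init__(self, vocab_size, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2)
        self.linear = nn.Linear(dim, vocab_size)

    def forward(self, inp, hidden=None):
        output, hidden = self.lstm(self.embed(inp), hidden)
        return self.linear(output.reshape(-1, output.size(-1))), hidden


def generate(model, start_words, prefix_words=None, max_gen_len=12):
    results = list(start_words)
    inp = torch.tensor([[word2ix["<START>"]]])
    hidden = None
    with torch.no_grad():
        # Priming: run the prefix through the model, keeping only the hidden state.
        for word in prefix_words or "":
            _, hidden = model(inp, hidden)
            inp = torch.tensor([[word2ix[word]]])
        for i in range(max_gen_len):
            output, hidden = model(inp, hidden)
            if i < len(start_words):
                w = results[i]                           # still feeding the given opening
            else:
                w = ix2word[output[0].argmax().item()]   # greedy choice
                results.append(w)
            inp = torch.tensor([[word2ix[w]]])
            if w == "<EOP>":
                del results[-1]                          # drop the end marker and stop
                break
    return results


poem = generate(TinyModel(len(vocab)), "春风", prefix_words="花月")
print("".join(poem))
```

Greedy decoding always picks the single most likely character, matching the topk(1) call in the fragment above.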

Acrostic poem generation

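An acrostic poem is generated in much the same way as the ordinary poem above, except that the given head characters are not all fed into the network at the start. Instead, one head character is fed in and the network generates characters on its own until it emits sentence-ending punctuation (。 or ！ in the code below), which marks the start of a new line; the next head character is then saved to the result and fed into the network.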

    for i in range(opt.max_gen_len):
        output, hidden = model(input, hidden)
        top_index = output.data[0].topk(1)[1][0].item()
        w = ix2word[top_index]

        if (pre_word in {u'。', u'！', '<START>'}):
            # A sentence has just ended (or the poem is just starting): place the next head character

            if index == start_word_len:
                # All head characters have already been used, so stop
                break
            else:
                # Feed the head character into the model as the next input
                w = start_words[index]
                index += 1
                input = (input.data.new([word2ix[w]])).view(1, 1)
        else:
            # Otherwise feed the previous prediction back in as the next input
            input = (input.data.new([word2ix[w]])).view(1, 1)
        results.append(w)
        pre_word = w
    return results
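
The control flow is easier to follow with the network swapped out for a deterministic stub. In the sketch below, `toy_model` is a hypothetical stand-in that ignores its input and simply emits '。' on every fourth call, so the output can be traced by hand. It is not a trained model.

```python
def toy_model(token, hidden):
    """Hypothetical stand-in for the trained LSTM.

    It ignores the input token and uses `hidden` as a call counter,
    predicting '。' on every fourth call and a filler character otherwise.
    """
    hidden = (hidden or 0) + 1
    return ("。" if hidden % 4 == 0 else "〇"), hidden


def gen_acrostic(model, start_words, max_gen_len=20):
    results = []
    index = 0                        # next head character to place
    token, hidden = "<START>", None
    pre_word = "<START>"
    for _ in range(max_gen_len):
        w, hidden = model(token, hidden)
        if pre_word in {"。", "！", "<START>"}:
            if index == len(start_words):
                break                # every head character has been used
            w = start_words[index]   # override the prediction with the head character
            index += 1
        token = w                    # fed back in as the next input
        results.append(w)
        pre_word = w
    return results


print("".join(gen_acrostic(toy_model, "春风")))  # 春〇〇。风〇〇。
```

The model is still called on the step where a head character is placed. Its prediction is discarded, but the hidden state keeps advancing, as in the loop above.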
