Contents

Before You Start:
- What are dilated convolutions?
- What is NER?
- Why can CNNs be used for text?
Overall architecture
- input data
- embedding layer
- dilated convolution layer or BiLSTM
  - BiLSTM
  - dilated convolution layer
- projection layer
  - dilated convolution classifier
  - BiLSTM classifier
- loss layer
Labeling and preprocessing the data
- Raw data
- Labeling the data
- Prepare jieba and build the label dictionary
- Label the data and split it into three groups (train, validation, test) along the way; the result is IOB-format tags
- Convert the IOB tags to IOBES

Before You Start:

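What are dilated convolutions?

In a typical CNN, each point of the new feature map summarizes one filter window of the previous feature map, usually followed by pooling. Pooling is downsampling: the feature map keeps shrinking, so the resolution keeps dropping. In other words, the network trades resolution for receptive field: standing higher lets you see farther, but you can no longer make out the details. Pooling is what causes the loss of resolution.

To keep the details, remove the pooling. Without pooling, however, the receptive field grows more slowly than it would with pooling, and this is where dilation comes in. The dilation is the spacing between the points inside a filter window. Suppose the filter window is 3×3, so it sees 9 points that form a contiguous square. With dilation = 2 it still sees 9 points, but neighboring points now have one skipped point between them, so the window spans 5×5. Stacked on top of an ordinary 3×3 layer, such a layer gives each output a 7×7 receptive field.

Dilated convolutions are generally used to enlarge the receptive field.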
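
The arithmetic behind the window size can be checked with a small sketch. The helper names `effective_kernel_size` and `tap_offsets` are made up for this illustration and do not come from the original code; the sketch assumes an odd kernel size and looks at one axis only.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in input points) covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def tap_offsets(kernel_size: int, dilation: int) -> list:
    """Input offsets, relative to the window centre, sampled along one axis.

    Assumes an odd kernel_size so that the window has a centre point.
    """
    half = (kernel_size - 1) // 2
    return [dilation * (i - half) for i in range(kernel_size)]


# A 3-wide kernel without dilation covers 3 neighbouring points.
print(effective_kernel_size(3, 1), tap_offsets(3, 1))  # 3 [-1, 0, 1]
# With dilation 2 it still has 3 taps, but they span 5 points.
print(effective_kernel_size(3, 2), tap_offsets(3, 2))  # 5 [-2, 0, 2]
```

The number of weights stays the same; only the spacing between the sampled points changes.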

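What is NER?

Named entity recognition (NER) means identifying the proper nouns in a piece of text. By analogy with images, where recognition picks out the objects in a picture, NER picks out the entities in a passage.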

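Why can CNNs be used for text?

From the receptive-field point of view: processing text means looking at context. The filter window is a 1×N window, where N is the number of characters it sees, and each character's feature_dim plays the role of the filter window's channels.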
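
To make the 1×N window concrete, here is a dependency-free sketch of a convolution that slides over a sentence one character at a time. It is only an illustration: `conv1d_same`, the toy sentence, and the all-ones filter are invented here, and zero padding at both ends is assumed so the output has one row per character.

```python
def conv1d_same(sequence, filters):
    """Slide 1xN filters over a sentence.

    sequence: num_steps vectors, one per character, each of feature_dim floats.
    filters:  out_channels filters; each is N vectors of feature_dim weights.
    Returns num_steps vectors of out_channels values (zero padding, stride 1).
    """
    num_steps = len(sequence)
    feature_dim = len(sequence[0])
    width = len(filters[0])
    left = (width - 1) // 2          # padding on the left side
    zero = [0.0] * feature_dim
    output = []
    for t in range(num_steps):
        row = []
        for filt in filters:
            total = 0.0
            for k in range(width):
                pos = t - left + k
                vec = sequence[pos] if 0 <= pos < num_steps else zero
                # feature_dim acts as the channel axis of the window
                total += sum(w * x for w, x in zip(filt[k], vec))
            row.append(total)
        output.append(row)
    return output


# 4 characters with feature_dim = 2
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# One filter with N = 3 that sums every feature inside its window
sum_filter = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(conv1d_same(sentence, [sum_filter]))  # [[2.0], [4.0], [5.0], [4.0]]
```

Each output row depends on a character and its immediate neighbors, which is exactly the "context" the window provides.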

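Overall architecture

input data

The input is the labeled text. Concretely, each batch has four components: the segmented original text, the integer ids of its characters, a segmentation feature for each character (0, 1, 2 or 3, encoding its position within its word), and the tag of each character.

_, chars, segs, tags = batch

chars and segs are the model inputs during training; tags is used to compute the loss.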
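
A minimal sketch of how chars and segs could be derived for one sentence. The character ids and the 0/1/2/3 pattern follow the 高血糖和高血压 example given in the comments of the embedding layer code; the helper name `seg_features` and the treatment of two-character words as [1, 3] are assumptions.

```python
def seg_features(words):
    """Position-in-word feature per character.

    0 = single-character word, 1 = first, 2 = middle, 3 = last character.
    """
    feats = []
    for word in words:
        if len(word) == 1:
            feats.append(0)
        else:
            feats.extend([1] + [2] * (len(word) - 2) + [3])
    return feats


# Toy dictionary; a real one is built from the training corpus.
char_to_id = {"高": 3, "血": 22, "糖": 23, "和": 24, "压": 25}
words = ["高血糖", "和", "高血压"]   # output of a word segmenter

chars = [char_to_id[c] for w in words for c in w]
segs = seg_features(words)
print(chars)  # [3, 22, 23, 24, 3, 22, 25]
print(segs)   # [1, 2, 3, 0, 1, 2, 3]
```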

embedding layer

chars and segs each pass through their own embedding layer, and the two outputs are concatenated (100 + 20 = 120 dimensions) to form the final feature map.

def embedding_layer(self, char_inputs, seg_inputs, config, name=None):
        """
        :param char_inputs: character ids of the sentence (indices into the character dictionary)
        :param seg_inputs: segmentation feature
        :param config: whether to use the segmentation feature
        :return: [1, num_steps, embedding size]
        """
        # Dictionary ids 高:3 血:22 糖:23 和:24 压:25, so 高血糖和高血压 gives char_inputs=[3,22,23,24,3,22,25]
        # Words 高血糖 / 和 / 高血压 give seg features [1,2,3] / [0] / [1,2,3], so seg_inputs=[1,2,3,0,1,2,3]
        embedding = []
        self.char_inputs_test=char_inputs
        self.seg_inputs_test=seg_inputs
        with tf.variable_scope("char_embedding" if not name else name), tf.device('/cpu:0'):
            self.char_lookup = tf.get_variable(
                    name="char_embedding",
                    shape=[self.num_chars, self.char_dim],
                    initializer=self.initializer)
            # e.g. the input character '常' maps to dictionary index 8
            # self.char_lookup is a [2677, 100] matrix; char_inputs holds each character's dictionary index
            embedding.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs))
            #self.embedding1.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs))
            if config["seg_dim"]:
                with tf.variable_scope("seg_embedding"), tf.device('/cpu:0'):
                    self.seg_lookup = tf.get_variable(
                        name="seg_embedding",
                        # shape=[4, 20]
                        shape=[self.num_segs, self.seg_dim],
                        initializer=self.initializer)
                    embedding.append(tf.nn.embedding_lookup(self.seg_lookup, seg_inputs))
            embed = tf.concat(embedding, axis=-1)
        self.embed_test=embed
        self.embedding_test=embedding
        return embed
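
What the lookup and concatenation amount to can be sketched without TensorFlow. The table sizes (2677×100 and 4×20) come from the comments in the code above; the helper names and the random initial values are assumptions for illustration only.

```python
import random


def make_table(num_rows, dim, seed):
    """Stand-in for an embedding matrix: num_rows vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed(char_ids, seg_ids, char_table, seg_table):
    """Look up both embeddings per character and concatenate them."""
    return [char_table[c] + seg_table[s] for c, s in zip(char_ids, seg_ids)]


char_table = make_table(2677, 100, seed=0)   # [num_chars, char_dim]
seg_table = make_table(4, 20, seed=1)        # [num_segs, seg_dim]

features = embed([3, 22, 23, 24, 3, 22, 25], [1, 2, 3, 0, 1, 2, 3],
                 char_table, seg_table)
print(len(features), len(features[0]))  # 7 120
```

Every character ends up with a 120-dimensional vector: the first 100 values come from its character id, the last 20 from its segmentation feature.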

dilated convolution layer or BiLSTM

BiLSTM

def biLSTM_layer(self, model_inputs, lstm_dim, lengths, name=None):
        """
        :param model_inputs: [batch_size, num_steps, emb_size]
        :return: [batch_size, num_steps, 2*lstm_dim]
        """
        with tf.variable_scope("char_BiLSTM" if not name else name):
            lstm_cell = {}
            for direction in ["forward", "backward"]:
                with tf.variable_scope(direction):
                    lstm_cell[direction] = tf.contrib.rnn.CoupledInputForgetGateLSTMCell(
                        lstm_dim,
                        use_peepholes=True,
                        initializer=self.initializer,
                        state_is_tuple=True)
            outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
                lstm_cell["forward"],
                lstm_cell["backward"],
                model_inputs,
                dtype=tf.float32,
                sequence_length=lengths)
        return tf.concat(outputs, axis=2)

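dilated convolution layer

Shape of the input: [which sentence in the batch, 1, number of characters in the sentence, feature_dim of each character]

shape of input = [batch, in_height, in_width, in_channels]

Shape of the filter: [1, how many characters the window sees, feature_dim of each character, number of filters]

shape of filter = [filter_height, filter_width, in_channels, out_channels]

The layer first applies one ordinary convolution and then repeats a block self.repeat_times times. Each block consists of three dilated convolutions with dilation=1, dilation=1 and dilation=2, so the receptive field keeps growing.

Note that dilation=1 is just an ordinary convolution, but it still enlarges the receptive field: stacking convolutions widens the receptive field even without pooling.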
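
How fast the receptive field grows can be estimated with a short calculation. This is a sketch under assumptions: `idcnn_receptive_field` is a made-up helper, the filter width of 3 is taken from the filter shape noted in the code comments, and the repeat counts are arbitrary example values.

```python
def idcnn_receptive_field(filter_width, dilations, repeat_times):
    """Number of input characters that influence one output position."""
    rf = 1 + (filter_width - 1)              # the initial ordinary convolution
    for _ in range(repeat_times):
        for dilation in dilations:
            # each stride-1 layer adds dilation * (filter_width - 1) positions
            rf += dilation * (filter_width - 1)
    return rf


for repeats in range(1, 5):
    print(repeats, idcnn_receptive_field(3, [1, 1, 2], repeats))
# 1 11
# 2 19
# 3 27
# 4 35
```

Even the dilation=1 layers contribute two extra positions each, which is the point made above about stacking convolutions without pooling.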

[Figure: Deep Learning in Practice: NER with BiLSTM or dilated convolutions]

def IDCNN_layer(self, model_inputs, 
                    name=None):
        """
        :param model_inputs: [batch_size, num_steps, emb_size]
        :return: [batch_size * num_steps, cnn_output_width]
        """
        # tf.expand_dims inserts a dimension of size 1 at the given axis (axes are numbered from 0).
        model_inputs = tf.expand_dims(model_inputs, 1)
        self.model_inputs_test=model_inputs
        reuse = False
        if self.dropout == 1.0:
            reuse = True
        with tf.variable_scope("idcnn" if not name else name):
            # filter shape is [1, 3, 120, 100]
            # shape=[1, self.filter_width, self.embedding_dim,
            #            self.num_filter]
            # print(shape)
            filter_weights = tf.get_variable(
                "idcnn_filter",
                shape=[1, self.filter_width, self.embedding_dim,
                       self.num_filter],
                initializer=self.initializer)
            
            """
            shape of input = [batch, in_height, in_width, in_channels]
            shape of filter = [filter_height, filter_width, in_channels, out_channels]
            """
            layerInput = tf.nn.conv2d(model_inputs,
                                      filter_weights,
                                      strides=[1, 1, 1, 1],
                                      padding="SAME",
                                      name="init_layer",use_cudnn_on_gpu=False)
            self.layerInput_test=layerInput
            finalOutFromLayers = []
            
            totalWidthForLastDim = 0
            for j in range(self.repeat_times):
                for i in range(len(self.layers)):
                    # dilation rates of the three layers: 1, 1, 2
                    dilation = self.layers[i]['dilation']
                    isLast = True if i == (len(self.layers) - 1) else False
                    with tf.variable_scope("atrous-conv-layer-%d" % i,
                                           reuse=True
                                           if (reuse or j > 0) else False):
                        # w: [filter height, filter width, input channels, number of filters]
                        w = tf.get_variable(
                            "filterW",
                            shape=[1, self.filter_width, self.num_filter,
                                   self.num_filter],
                            initializer=tf.contrib.layers.xavier_initializer())
                        if j==1 and i==1:
                            self.w_test_1=w
                        if j==2 and i==1:
                            self.w_test_2=w                            
                        b = tf.get_variable("filterB", shape=[self.num_filter])
                        # tf.nn.atrous_conv2d(value, filters, rate, padding, name=None)
                        # Besides name, the op takes four arguments:
                        # value: the 4-D input tensor of shape [batch, height, width, channels].
                        # filters: the 4-D kernel of shape [filter_height, filter_width, channels, out_channels];
                        #   its third dimension must match the fourth dimension of value.
                        # rate: a positive int. It is the sampling interval on the input: (rate - 1) zeros are
                        #   inserted between neighboring kernel weights, which leaves "holes" in the kernel.
                        #   With rate=1 no zeros are inserted and the op is an ordinary convolution.
                        # padding: either "SAME" or "VALID"; selects how the borders are padded.
                        # There is no stride argument: the stride is fixed at 1.
                        # Output shape: with "SAME" it is [batch, height, width, out_channels]; with "VALID" it is
                        #   [batch, height - rate*(filter_height-1), width - rate*(filter_width-1), out_channels].
                        conv = tf.nn.atrous_conv2d(layerInput,
                                                   w,
                                                   rate=dilation,
                                                   padding="SAME")
                        self.conv_test=conv 
                        conv = tf.nn.bias_add(conv, b)
                        conv = tf.nn.relu(conv)
                        if isLast:
                            finalOutFromLayers.append(conv)
                            totalWidthForLastDim += self.num_filter
                        layerInput = conv
            finalOut = tf.concat(axis=3, values=finalOutFromLayers)
            keepProb = 1.0 if reuse else 0.5
            finalOut = tf.nn.dropout(finalOut, keepProb)
            # tf.squeeze removes dimensions of size 1 from a tensor. By default it removes all of them;
            # passing specific axes (here [1]) removes only those dimensions.
            finalOut = tf.squeeze(finalOut, [1])
            finalOut = tf.reshape(finalOut, [-1, totalWidthForLastDim])
            self.cnn_output_width = totalWidthForLastDim
            return finalOut
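
The effect of the rate argument is easiest to see on a tiny 1-D signal. The function below is a simplified single-channel sketch written for this note, not the TensorFlow implementation; it assumes stride 1 and zero padding split as evenly as possible between both ends.

```python
def atrous_conv1d_same(signal, kernel, rate):
    """1-D, single-channel dilated convolution with zero 'SAME' padding."""
    n = len(signal)
    span = rate * (len(kernel) - 1)   # effective kernel width minus one
    left = span // 2                  # zero padding on the left side
    out = []
    for t in range(n):
        total = 0.0
        for i, weight in enumerate(kernel):
            pos = t - left + i * rate   # taps are `rate` positions apart
            if 0 <= pos < n:            # positions outside count as zeros
                total += weight * signal[pos]
        out.append(total)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 1.0, 1.0]

# rate=1 behaves like an ordinary convolution over adjacent positions
print(atrous_conv1d_same(signal, kernel, rate=1))  # [3.0, 6.0, 9.0, 12.0, 15.0, 11.0]
# rate=2 skips every other position, yet the output keeps the input length
print(atrous_conv1d_same(signal, kernel, rate=2))  # [4.0, 6.0, 9.0, 12.0, 8.0, 10.0]
```

Because the output length never changes, the layers can be stacked freely and every character keeps its own feature vector for tagging.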

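projection layer

The convolution layers only extract features, so a fully connected (FC) layer is added after them to classify each character.

dilated convolution classifier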

#Project layer for idcnn by crownpku
    #Delete the hidden layer, and change bias initializer
    def project_layer_idcnn(self, idcnn_outputs, name=None):
        """
        :param idcnn_outputs: [batch_size * num_steps, cnn_output_width]
        :return: [batch_size, num_steps, num_tags]
        """
        with tf.variable_scope("project"  if not name else name):
            
            # project to score of tags
            with tf.variable_scope("logits"):
                W = tf.get_variable("W", shape=[self.cnn_output_width, self.num_tags],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b",  initializer=tf.constant(0.001, shape=[self.num_tags]))

                pred = tf.nn.xw_plus_b(idcnn_outputs, W, b)

            return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
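
The projection is a single matrix multiplication over the flattened features, followed by a reshape back to one row of tag scores per character. The pure-Python sketch below mirrors that idea on toy numbers; the helper names, the weights, and the sizes (batch of 2, 3 steps, 2 features, 2 tags) are all invented for illustration.

```python
def xw_plus_b(x, w, b):
    """x: [rows, in_dim], w: [in_dim, out_dim], b: [out_dim] -> [rows, out_dim]."""
    return [[sum(xi * w[i][j] for i, xi in enumerate(row)) + b[j]
             for j in range(len(b))]
            for row in x]


def reshape_to_steps(rows, num_steps):
    """[batch * num_steps, num_tags] -> [batch, num_steps, num_tags]."""
    return [rows[i:i + num_steps] for i in range(0, len(rows), num_steps)]


# Flattened features: batch_size=2 sentences of num_steps=3 characters each
flat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
        [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
W = [[1.0, 0.0], [0.0, -1.0]]   # [feature width, num_tags]
b = [0.5, 0.5]

scores = reshape_to_steps(xw_plus_b(flat, W, b), num_steps=3)
print(len(scores), len(scores[0]), len(scores[0][0]))  # 2 3 2
print(scores[0])  # [[1.5, 0.5], [0.5, -0.5], [1.5, -0.5]]
```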

BiLSTM classifier

def project_layer_bilstm(self, lstm_outputs, name=None):
        """
        hidden layer between lstm layer and logits
        :param lstm_outputs: [batch_size, num_steps, 2*lstm_dim]
        :return: [batch_size, num_steps, num_tags]
        """
        with tf.variable_scope("project"  if not name else name):
            with tf.variable_scope("hidden"):
                W = tf.get_variable("W", shape=[self.lstm_dim*2, self.lstm_dim],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b", shape=[self.lstm_dim], dtype=tf.float32,
                                    initializer=tf.zeros_initializer())
                output = tf.reshape(lstm_outputs, shape=[-1, self.lstm_dim*2])
                hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))

            # project to score of tags
            with tf.variable_scope("logits"):
                W = tf.get_variable("W", shape=[self.lstm_dim, self.num_tags],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b", shape=[self.num_tags], dtype=tf.float32,
                                    initializer=tf.zeros_initializer())

                pred = tf.nn.xw_plus_b(hidden, W, b)

            return tf.reshape(pred, [-1, self.num_steps, self.num_tags])

loss layer

Sequence labeling tasks in NLP commonly use a conditional random field (CRF) for the loss.

def loss_layer(self, project_logits, lengths, name=None):
        """
        calculate crf loss
        :param project_logits: [1, num_steps, num_tags]
        :return: scalar loss
        """
        with tf.variable_scope("crf_loss"  if not name else name):
            small = -1000.0
            # pad logits for crf loss
            start_logits = tf.concat(
                [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
            pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
            logits = tf.concat([project_logits, pad_logits], axis=-1)
            logits = tf.concat([start_logits, logits], axis=1)
            targets = tf.concat(
                [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)

            self.trans = tf.get_variable(
                "transitions",
                shape=[self.num_tags + 1, self.num_tags + 1],
                initializer=self.initializer)
            # crf_log_likelihood computes the log-likelihood of tag sequences under a CRF.
            # inputs: a [batch_size, max_seq_len, num_tags] tensor of unary scores,
            #   typically the BiLSTM (or IDCNN) output projected to this shape.
            # tag_indices: a [batch_size, max_seq_len] matrix of gold tag ids.
            # sequence_lengths: a [batch_size] vector holding the length of each sequence.
            # transition_params: a [num_tags, num_tags] transition matrix.
            # Returns log_likelihood (one value per sequence) and transition_params.
            # Here the tag dimension is num_tags + 1 because of the extra start tag padded in above.
            log_likelihood, self.trans = crf_log_likelihood(
                inputs=logits,
                tag_indices=targets,
                transition_params=self.trans,
                sequence_lengths=lengths+1)
            return tf.reduce_mean(-log_likelihood)
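
For intuition, the quantity being minimized can be computed by hand for a single short sequence: the score of the gold tag path minus the log of the summed scores of all paths. The sketch below is a plain linear-chain CRF written for this note, not the library routine; it leaves out the start-tag padding, and all numbers are made up.

```python
import math
from itertools import product


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions, transitions, tags):
    """Unary score of each tag plus the transition score between neighbors."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores of all tag paths."""
    num_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [logsumexp([alpha[i] + transitions[i][j] for i in range(num_tags)])
                 + emissions[t][j]
                 for j in range(num_tags)]
    return logsumexp(alpha)


def brute_force_log_partition(emissions, transitions):
    """Same quantity by enumerating every path; only feasible for tiny inputs."""
    num_tags = len(emissions[0])
    paths = product(range(num_tags), repeat=len(emissions))
    return logsumexp([path_score(emissions, transitions, p) for p in paths])


def sequence_log_likelihood(emissions, transitions, tags):
    return path_score(emissions, transitions, tags) - log_partition(emissions, transitions)


emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]   # [num_steps, num_tags]
transitions = [[0.1, -0.2], [0.0, 0.3]]            # [num_tags, num_tags]
gold = [0, 1, 1]

loss = -sequence_log_likelihood(emissions, transitions, gold)
print(loss > 0.0)  # True: the gold path is only one of 8 possible paths
```

The forward recursion gives the same normalizer as the brute-force enumeration but in time linear in the sequence length, which is why CRF layers stay practical for long sentences.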

Labeling and preprocessing the data

Raw data

The data was scraped from a website and looks like the following:

患者精神状况好，无发热，诉右髋部疼痛，饮食差，二便正常，查体：神清，各项生命体征平稳，心肺腹查体未见异常。右髋部压痛，右下肢皮牵引固定好，无松动，右足背动脉搏动好，足趾感觉运动正常。

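Labeling the data

Prepare jieba and build the label dictionary

#%% for jieba
import csv
import jieba

dics=csv.reader(open("DICT_NOW.csv",'r',encoding='utf8'))
#%% get word and class
for row in dics:                                                   # add each medical term and its label to the jieba dictionary
    if len(row)==2:
        jieba.add_word(row[0].strip(),tag=row[1].strip())            # add_word registers the term so it is kept as one token
        jieba.suggest_freq(row[0].strip(),tune=True)                 # tune=True applies the adjusted frequency so the term can be segmented out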

Label the data and split it into three groups (train, validation, test) along the way; the result is IOB-format tags

for file in os.listdir(c_root):
    if "txtoriginal.txt" in file:
        fp=open(c_root+file,'r',encoding='utf8')
        for line in fp:
            split_num+=1
            words=pseg.cut(line)
            for key,value in words: 
                #print(key)
                #print(value)
                if value.strip() and key.strip():
                    # Route each line to a split: '1' -> dev, '2' -> test, '3' -> train
                    remainder=split_num%15
                    index='1' if remainder<2 else '2' if remainder<4 else '3'
                    if value not in biaoji:
                        value='O'
                        for achar in key.strip():
                            if achar and achar.strip() in fuhao:
                                string=achar+" "+value.strip()+"\n"+"\n"
                                dev.write(string) if index=='1' else test.write(string) if index=='2' else train.write(string) 
                            elif achar.strip() and achar.strip() not in fuhao:
                                string = achar + " " + value.strip() + "\n"
                                dev.write(string) if index=='1' else test.write(string) if index=='2' else train.write(string) 
        
                    elif value.strip()  in biaoji:
                        begin=0
                        for char in key.strip():
                            if begin==0:
                                begin+=1
                                string1=char+' '+'B-'+value.strip()+'\n'
                                if index=='1':                               
                                    dev.write(string1)
                                elif index=='2':
                                    test.write(string1)
                                elif index=='3':
                                    train.write(string1)
                                else:
                                    pass
                            else:
                                string1 = char + ' ' + 'I-' + value.strip() + '\n'
                                if index=='1':                               
                                    dev.write(string1)
                                elif index=='2':
                                    test.write(string1)
                                elif index=='3':
                                    train.write(string1)
                                else:
                                    pass
                    else:
                        continue
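
The two ideas in the loop above, routing lines to a split by line number and expanding a word into per-character B-/I- tags, can be isolated into small functions. This is a sketch: `split_name`, `iob_tag_word`, and the label `DIS` are hypothetical names used only for illustration.

```python
def split_name(line_number):
    """Out of every 15 lines: 2 go to dev, 2 to test, 11 to train."""
    remainder = line_number % 15
    if remainder < 2:
        return "dev"
    if remainder < 4:
        return "test"
    return "train"


def iob_tag_word(word, label=None):
    """Tag every character of a word; label=None means the word is not an entity."""
    if label is None:
        return [(ch, "O") for ch in word]
    return [(ch, ("B-" if i == 0 else "I-") + label) for i, ch in enumerate(word)]


counts = {"dev": 0, "test": 0, "train": 0}
for n in range(1, 16):
    counts[split_name(n)] += 1
print(counts)                       # {'dev': 2, 'test': 2, 'train': 11}

print(iob_tag_word("高血压", "DIS"))  # [('高', 'B-DIS'), ('血', 'I-DIS'), ('压', 'I-DIS')]
print(iob_tag_word("和"))            # [('和', 'O')]
```

A modulo split like this is deterministic, so rerunning the script sends each line to the same file.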

Convert the IOB tags to IOBES

    # Use selected tagging scheme (IOB / IOBES); I: inside, O: outside, B: begin | E: end, S: single
    update_tag_scheme(train_sentences, FLAGS.tag_schema)
    update_tag_scheme(test_sentences, FLAGS.tag_schema)
    update_tag_scheme(dev_sentences, FLAGS.tag_schema)

def update_tag_scheme(sentences, tag_scheme):
    """
    Check and update sentences tagging scheme to IOB2.
    Only IOB1 and IOB2 schemes are accepted.
    """
    for i, s in enumerate(sentences):
        tags = [w[-1] for w in s]
        # Check that tags are given in the IOB format
        if not iob2(tags):
            s_str = '\n'.join(' '.join(w) for w in s)
            raise Exception('Sentences should be given in IOB format! ' +
                            'Please check sentence %i:\n%s' % (i, s_str))
        if tag_scheme == 'iob':
            # If format was IOB1, we convert to IOB2
            for word, new_tag in zip(s, tags):
                word[-1] = new_tag
        elif tag_scheme == 'iobes':
            new_tags = iob_iobes(tags)
            for word, new_tag in zip(s, new_tags):
                word[-1] = new_tag
        else:
            raise Exception('Unknown tagging scheme!')
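
The code above calls an iob_iobes helper that is not shown. A stand-in with the behavior one would expect from such a conversion is sketched below; `iob_to_iobes` is a hypothetical name, the input is assumed to already be valid IOB2, and the labels are invented.

```python
def iob_to_iobes(tags):
    """Convert IOB2 tags to IOBES.

    An entity's last character becomes E-, a one-character entity becomes S-.
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == "O":
            new_tags.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # the entity continues only if the next tag is an I- of the same entity
        continues = i + 1 < len(tags) and tags[i + 1] == "I-" + label
        if prefix == "B":
            new_tags.append(tag if continues else "S-" + label)
        elif prefix == "I":
            new_tags.append(tag if continues else "E-" + label)
        else:
            raise ValueError("Invalid IOB tag: %s" % tag)
    return new_tags


iob = ["B-DIS", "I-DIS", "I-DIS", "O", "B-SYM", "O", "B-BODY", "I-BODY"]
print(iob_to_iobes(iob))
# ['B-DIS', 'I-DIS', 'E-DIS', 'O', 'S-SYM', 'O', 'B-BODY', 'E-BODY']
```

The extra E and S tags mark entity boundaries explicitly, which gives the CRF transition matrix more structure to learn from.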
