
Deep Learning in Practice: NER with BiLSTM or Dilated Convolutions

Contents

  • Before You Start:
    • What are dilated convolutions?
    • What is NER?
    • Why can CNNs be used for text processing?
  • Overall architecture
    • input data
    • embedding layer
    • dilated convolution layer or BiLSTM
      • BiLSTM
      • dilated convolution layer
    • projection layer
      • dilated convolution classification
      • BiLSTM classification
    • loss layer
  • Labeling the data, data preprocessing
    • Raw data
    • Labeling the data
    • Prepare jieba and build the labeling dictionary
    • Label the data, splitting it into 3 groups (train, validation, test) as we go; the result is IOB-format labels
    • Convert the IOB labels to IOBES format

Before You Start:

What are dilated convolutions?

In a CNN, each point on the new feature map is a summary of one filter window on the old feature map (followed by pooling). Pooling is downsampling: the feature map keeps shrinking, i.e. resolution keeps decaying. You can think of it as trading resolution for receptive field: standing higher you see farther, but you can no longer make out the details. Pooling is what causes the resolution decay.

To keep the details, we can drop pooling, but dropping pooling shrinks the receptive field (compared with the pooled version). This is where dilation comes in. Dilation is the spacing inside the filter window: a 3×3 filter window normally sees 9 adjacent points forming a square; with dilation=2 it still sees 9 points, but each sampled point is now separated from the next by one skipped point, so the window spans a 5×5 region (and stacked on top of a dilation=1 layer, the receptive field grows to 7×7).

Dilated convolutions are generally used to enlarge the receptive field.
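A minimal sketch (TensorFlow 1.x, matching the code later in this post) that checks the effective span: a 3×3 all-ones filter with rate=2 applied to a 5×5 all-ones input has exactly one valid position, and it still sums only 9 points.

import numpy as np
import tensorflow as tf  # TF 1.x, as used in the rest of this post

# Single-channel 5x5 input of ones; 3x3 all-ones filter with dilation rate 2.
x = tf.constant(np.ones((1, 5, 5, 1), dtype=np.float32))
w = tf.constant(np.ones((3, 3, 1, 1), dtype=np.float32))

# With rate=2 the filter taps at offsets {-2, 0, +2}, spanning a 5x5 region.
y = tf.nn.atrous_conv2d(x, w, rate=2, padding='VALID')

with tf.Session() as sess:
    out = sess.run(y)
    print(out.shape)        # (1, 1, 1, 1): only one valid position on a 5x5 input
    print(out[0, 0, 0, 0])  # 9.0: the filter still sums 9 points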

What is NER?

Named entity recognition: picking out the proper nouns (named entities) in a piece of text. Compare with images: image recognition picks out the objects in a picture; here we pick out the entities in a passage.

Why can CNNs be used for text processing?

From the receptive-field point of view: processing text means looking at context. The filter window is a 1×N window, where N is the number of characters it sees, and each character's feature_dim plays the role of the filter window's channels.
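A minimal sketch of that idea (hypothetical shapes, TF 1.x): a 7-character sentence with feature_dim 120 is convolved with a 1×3 window, so each output position summarizes a 3-character context.

import tensorflow as tf  # TF 1.x

batch, num_steps, feature_dim, num_filters = 1, 7, 120, 100

# Text as an image of height 1: [batch, 1, num_steps, feature_dim]
chars = tf.placeholder(tf.float32, [batch, 1, num_steps, feature_dim])

# 1xN window (N=3): each output position sees 3 characters of context;
# the per-character feature_dim plays the role of the input channels.
w = tf.get_variable('w', shape=[1, 3, feature_dim, num_filters])
out = tf.nn.conv2d(chars, w, strides=[1, 1, 1, 1], padding='SAME')
print(out.shape)  # (1, 1, 7, 100): one 100-dim feature per character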

Overall architecture

input data

This is the labeled text. Concretely, a batch has four components: the segmented original text, the original text mapped to ints, the word-segmentation feature (0 for a single-character word; 1/2/3 for the beginning/middle/end of a multi-character word), and the labels for the original text.

_, chars, segs, tags = batch
           

chars and segs are used in training; tags are used to compute the loss.
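A hypothetical batch entry for 高血糖和高血壓, matching the comments in the embedding layer below (tag ids are illustrative):

# One hypothetical batch entry (batch_size = 1; values illustrative):
batch = [
    [['高', '血', '糖', '和', '高', '血', '壓']],  # segmented original text
    [[3, 22, 23, 24, 3, 22, 25]],                 # chars: char-to-int ids
    [[1, 2, 3, 0, 1, 2, 3]],                      # segs: 1/2/3 = begin/middle/end, 0 = single char
    [[1, 2, 2, 0, 1, 2, 2]],                      # tags: label ids (hypothetical, e.g. B-DIS/I-DIS/O)
]
_, chars, segs, tags = batch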

embedding layer

chars and segs each pass through their own embedding layer, and the two feature sets are concatenated (100 + 20 dims) to form the final feature map.

def embedding_layer(self, char_inputs, seg_inputs, config, name=None):
        """
        :param char_inputs: one-hot encoding of sentence
        :param seg_inputs: segmentation feature
        :param config: whether to use the segmentation feature
        :return: [1, num_steps, embedding size]
        """
        # 高:3 血:22 糖:23 和:24 高:3 血:22 壓:25  ->  char_inputs=[3,22,23,24,3,22,25]
        # 高血糖 / 和 / 高血壓: 高血糖=[1,2,3], 和=[0], 高血壓=[1,2,3]  ->  seg_inputs=[1,2,3,0,1,2,3]
        embedding = []
        self.char_inputs_test = char_inputs
        self.seg_inputs_test = seg_inputs
        with tf.variable_scope("char_embedding" if not name else name), tf.device('/cpu:0'):
            self.char_lookup = tf.get_variable(
                    name="char_embedding",
                    shape=[self.num_chars, self.char_dim],
                    initializer=self.initializer)
            # E.g. the input character '常' maps to index 8 in the vocabulary;
            # self.char_lookup is a [2677 x 100] matrix looked up by those indices.
            embedding.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs))
            if config["seg_dim"]:
                with tf.variable_scope("seg_embedding"), tf.device('/cpu:0'):
                    self.seg_lookup = tf.get_variable(
                        name="seg_embedding",
                        # shape = [4, 20]: 4 segmentation values, 20-dim embedding
                        shape=[self.num_segs, self.seg_dim],
                        initializer=self.initializer)
                    embedding.append(tf.nn.embedding_lookup(self.seg_lookup, seg_inputs))
            embed = tf.concat(embedding, axis=-1)
        self.embed_test = embed
        self.embedding_test = embedding
        return embed

           

dilated convolution layer or BiLSTM

BiLSTM

def biLSTM_layer(self, model_inputs, lstm_dim, lengths, name=None):
        """
        :param model_inputs: [batch_size, num_steps, emb_size]
        :return: [batch_size, num_steps, 2*lstm_dim]
        """
        with tf.variable_scope("char_BiLSTM" if not name else name):
            lstm_cell = {}
            for direction in ["forward", "backward"]:
                with tf.variable_scope(direction):
                    lstm_cell[direction] = tf.contrib.rnn.CoupledInputForgetGateLSTMCell(
                        lstm_dim,
                        use_peepholes=True,
                        initializer=self.initializer,
                        state_is_tuple=True)
            # outputs is a (forward, backward) tuple; concatenating on the last
            # axis gives 2*lstm_dim features per step.
            outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
                lstm_cell["forward"],
                lstm_cell["backward"],
                model_inputs,
                dtype=tf.float32,
                sequence_length=lengths)
        return tf.concat(outputs, axis=2)
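A quick shape check of the bidirectional output (hypothetical dims, using a plain LSTMCell in place of the CoupledInputForgetGateLSTMCell above): the final concat yields 2*lstm_dim features per step.

import numpy as np
import tensorflow as tf  # TF 1.x

batch_size, num_steps, emb_size, lstm_dim = 2, 7, 120, 100
inputs = tf.constant(np.zeros((batch_size, num_steps, emb_size), dtype=np.float32))
lengths = tf.constant([7, 5])

cells = {d: tf.contrib.rnn.LSTMCell(lstm_dim) for d in ["forward", "backward"]}
outputs, _ = tf.nn.bidirectional_dynamic_rnn(
    cells["forward"], cells["backward"], inputs,
    dtype=tf.float32, sequence_length=lengths)
print(tf.concat(outputs, axis=2).shape)  # (2, 7, 200) = [batch, steps, 2*lstm_dim]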
           

dilated convolution layer

Shape of the input: [which sentence in the batch, 1, sentence length in characters, each character's feature_dim]

shape of input = [batch, in_height, in_width, in_channels]

Shape of the filter window: [1, how many characters the window sees, each character's feature_dim, number of filters]

shape of filter = [filter_height, filter_width, in_channels, out_channels]

First do one ordinary convolution, then repeat the dilated block self.repeat_times times; each repetition applies 3 dilated convolutions with dilation=1, dilation=1, dilation=2, so the receptive field keeps growing.

Note that dilation=1 is just an ordinary convolution, but it still enlarges the receptive field; recall that stacked convolutions grow the receptive field even without pooling.
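A quick back-of-the-envelope check (plain Python; filter_width=3 and the dilation schedule [1, 1, 2] are taken from the code below): each 1×3 convolution with dilation d adds 2·d to the receptive field along the text axis.

# Receptive field along the text axis for the IDCNN stack below
# (assumes filter_width=3 and the dilation schedule [1, 1, 2]).
def receptive_field(repeat_times, dilations=(1, 1, 2), filter_width=3):
    rf = filter_width  # the initial ordinary convolution
    for _ in range(repeat_times):
        for d in dilations:
            rf += (filter_width - 1) * d  # each 1x3 conv with dilation d adds 2*d
    return rf

print(receptive_field(1))  # 11
print(receptive_field(4))  # 35

The full IDCNN layer: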

def IDCNN_layer(self, model_inputs, name=None):
        """
        :param model_inputs: [batch_size, num_steps, emb_size]
        :return: [batch_size, num_steps, cnn_output_width]
        """
        # tf.expand_dims inserts a size-1 dimension at the given axis (axes start
        # at 0), turning the text into an "image" of height 1.
        model_inputs = tf.expand_dims(model_inputs, 1)
        self.model_inputs_test = model_inputs
        # Reuse variables when dropout is disabled (evaluation mode).
        reuse = False
        if self.dropout == 1.0:
            reuse = True
        with tf.variable_scope("idcnn" if not name else name):
            # shape = [1, self.filter_width, self.embedding_dim, self.num_filter],
            # e.g. [1, 3, 120, 100]
            filter_weights = tf.get_variable(
                "idcnn_filter",
                shape=[1, self.filter_width, self.embedding_dim,
                       self.num_filter],
                initializer=self.initializer)

            """
            shape of input = [batch, in_height, in_width, in_channels]
            shape of filter = [filter_height, filter_width, in_channels, out_channels]
            """
            # Initial ordinary convolution before the dilated stack.
            layerInput = tf.nn.conv2d(model_inputs,
                                      filter_weights,
                                      strides=[1, 1, 1, 1],
                                      padding="SAME",
                                      name="init_layer", use_cudnn_on_gpu=False)
            self.layerInput_test = layerInput
            finalOutFromLayers = []

            totalWidthForLastDim = 0
            for j in range(self.repeat_times):
                for i in range(len(self.layers)):
                    # dilation schedule: 1, 1, 2
                    dilation = self.layers[i]['dilation']
                    isLast = True if i == (len(self.layers) - 1) else False
                    # Reuse the same filter variables across repetitions (j > 0).
                    with tf.variable_scope("atrous-conv-layer-%d" % i,
                                           reuse=True
                                           if (reuse or j > 0) else False):
                        # w: [filter_height, filter_width, in_channels, out_channels]
                        w = tf.get_variable(
                            "filterW",
                            shape=[1, self.filter_width, self.num_filter,
                                   self.num_filter],
                            initializer=tf.contrib.layers.xavier_initializer())
                        if j == 1 and i == 1:
                            self.w_test_1 = w
                        if j == 2 and i == 1:
                            self.w_test_2 = w
                        b = tf.get_variable("filterB", shape=[self.num_filter])
                        # tf.nn.atrous_conv2d(value, filters, rate, padding):
                        #   value:   4-D input [batch, height, width, channels]
                        #   filters: 4-D kernel [filter_height, filter_width,
                        #            channels, out_channels]; its channels dim must
                        #            match value's last dim.
                        #   rate:    positive int dilation rate; (rate - 1) zeros are
                        #            inserted between filter taps, widening the sampling
                        #            interval on the input. rate=1 reduces to an ordinary
                        #            convolution. There is no stride argument: the stride
                        #            is fixed at 1.
                        #   padding: "SAME" keeps [batch, height, width, out_channels];
                        #            "VALID" shrinks height/width by (filter size - 1) * rate.
                        conv = tf.nn.atrous_conv2d(layerInput,
                                                   w,
                                                   rate=dilation,
                                                   padding="SAME")
                        self.conv_test = conv
                        conv = tf.nn.bias_add(conv, b)
                        conv = tf.nn.relu(conv)
                        # Only the output of the last (dilation=2) layer in each
                        # repetition is collected for the final concatenation.
                        if isLast:
                            finalOutFromLayers.append(conv)
                            totalWidthForLastDim += self.num_filter
                        layerInput = conv
            finalOut = tf.concat(axis=3, values=finalOutFromLayers)
            keepProb = 1.0 if reuse else 0.5
            finalOut = tf.nn.dropout(finalOut, keepProb)
            # tf.squeeze removes the size-1 height dimension inserted by expand_dims above.
            finalOut = tf.squeeze(finalOut, [1])
            finalOut = tf.reshape(finalOut, [-1, totalWidthForLastDim])
            self.cnn_output_width = totalWidthForLastDim
            return finalOut
           

projection layer

The convolution stack is a feature extractor; a fully connected (FC) layer is added on top for classification.

dilated convolution classification

#Project layer for idcnn by crownpku
    #Delete the hidden layer, and change bias initializer
    def project_layer_idcnn(self, idcnn_outputs, name=None):
        """
        :param idcnn_outputs: [batch_size * num_steps, cnn_output_width]
        :return: [batch_size, num_steps, num_tags]
        """
        with tf.variable_scope("project"  if not name else name):
            
            # project to score of tags
            with tf.variable_scope("logits"):
                W = tf.get_variable("W", shape=[self.cnn_output_width, self.num_tags],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b",  initializer=tf.constant(0.001, shape=[self.num_tags]))

                pred = tf.nn.xw_plus_b(idcnn_outputs, W, b)

            return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
           

BiLSTM classification

def project_layer_bilstm(self, lstm_outputs, name=None):
        """
        hidden layer between lstm layer and logits
        :param lstm_outputs: [batch_size, num_steps, 2 * lstm_dim]
        :return: [batch_size, num_steps, num_tags]
        """
        with tf.variable_scope("project"  if not name else name):
            with tf.variable_scope("hidden"):
                W = tf.get_variable("W", shape=[self.lstm_dim*2, self.lstm_dim],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b", shape=[self.lstm_dim], dtype=tf.float32,
                                    initializer=tf.zeros_initializer())
                output = tf.reshape(lstm_outputs, shape=[-1, self.lstm_dim*2])
                hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))

            # project to score of tags
            with tf.variable_scope("logits"):
                W = tf.get_variable("W", shape=[self.lstm_dim, self.num_tags],
                                    dtype=tf.float32, initializer=self.initializer)

                b = tf.get_variable("b", shape=[self.num_tags], dtype=tf.float32,
                                    initializer=tf.zeros_initializer())

                pred = tf.nn.xw_plus_b(hidden, W, b)

            return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
           

loss layer

For sequence labeling in NLP, a conditional random field (CRF) layer is typically used on top.

def loss_layer(self, project_logits, lengths, name=None):
        """
        calculate crf loss
        :param project_logits: [1, num_steps, num_tags]
        :return: scalar loss
        """
        # assumes: from tensorflow.contrib.crf import crf_log_likelihood
        with tf.variable_scope("crf_loss" if not name else name):
            small = -1000.0
            # Pad the logits for the CRF: a synthetic start tag (id num_tags) is
            # prepended to every sequence, which is why the transition matrix is
            # [num_tags + 1, num_tags + 1] and the lengths get + 1 below.
            start_logits = tf.concat(
                [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
            pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
            logits = tf.concat([project_logits, pad_logits], axis=-1)
            logits = tf.concat([start_logits, logits], axis=1)
            targets = tf.concat(
                [tf.cast(self.num_tags * tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)

            self.trans = tf.get_variable(
                "transitions",
                shape=[self.num_tags + 1, self.num_tags + 1],
                initializer=self.initializer)
            # crf_log_likelihood computes the log-likelihood of tag sequences in a CRF:
            #   inputs: [batch_size, max_seq_len, num_tags] unary scores, usually
            #           the (reshaped) BiLSTM/IDCNN projection output.
            #   tag_indices: [batch_size, max_seq_len] matrix of gold labels.
            #   sequence_lengths: [batch_size] vector of sequence lengths.
            #   transition_params: [num_tags, num_tags] transition matrix.
            # It returns the scalar log_likelihood and the transition matrix.
            log_likelihood, self.trans = crf_log_likelihood(
                inputs=logits,
                tag_indices=targets,
                transition_params=self.trans,
                sequence_lengths=lengths + 1)
            return tf.reduce_mean(-log_likelihood)
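At inference time the learned transition matrix is used for decoding. A minimal sketch (TF 1.x tf.contrib.crf, with made-up scores) of how the unary scores and self.trans would be combined:

import numpy as np
from tensorflow.contrib.crf import viterbi_decode

# Made-up unary scores for one 4-step sentence with 3 tags plus the synthetic
# start tag (num_tags + 1 columns, mirroring the padding in loss_layer above).
scores = np.random.randn(4, 3 + 1).astype(np.float32)
trans = np.random.randn(3 + 1, 3 + 1).astype(np.float32)  # stands in for self.trans

# viterbi_decode works on numpy arrays outside the graph and returns the
# best-scoring tag sequence and its score.
best_path, best_score = viterbi_decode(scores, trans)
print(best_path)  # e.g. [2, 0, 1, 1]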
           

Labeling the data, data preprocessing

Raw data

Data crawled from the web, like the following clinical-record snippet (kept in Chinese, since the task is Chinese NER):

患者精神狀況好,無發熱,訴右髋部疼痛,飲食差,二便正常,查體:神清,各項生命體征平穩,心肺腹查體未見異常。右髋部壓痛,右下肢皮牽引固定好,無松動,右足背動脈搏動好,足趾感覺運動正常。
           

Labeling the data

Prepare jieba and build the labeling dictionary

#%% for jieba
import csv
import jieba

dics = csv.reader(open("DICT_NOW.csv", 'r', encoding='utf8'))
#%% get word and class
for row in dics:                                                   # add the medical terms and their labels to the jieba dictionary
    if len(row) == 2:
        jieba.add_word(row[0].strip(), tag=row[1].strip())         # add_word keeps the added term from being cut apart
        jieba.suggest_freq(row[0].strip())                         # tunes a term's frequency so it can (or cannot) be segmented out
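A quick illustration of what the custom dictionary buys us (hypothetical entry; the surrounding POS tags are indicative): pseg.cut now returns whole medical terms carrying their custom labels.

import jieba
import jieba.posseg as pseg

jieba.add_word('高血壓', tag='DIS')   # hypothetical entry, like a DICT_NOW.csv row
jieba.suggest_freq('高血壓')

print([(w, t) for w, t in pseg.cut('患者有高血壓病史')])
# e.g. [('患者', 'n'), ('有', 'v'), ('高血壓', 'DIS'), ('病史', 'n')]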
           

Label the data, splitting it into 3 groups (train, validation, test) as we go; the result is IOB-format labels

import os
import jieba.posseg as pseg  # POS-tagged segmentation; uses the custom dictionary built above

# Assumes c_root (corpus directory), split_num (counter), biaoji (set of entity
# labels), fuhao (set of sentence-ending punctuation) and the dev/test/train
# output files have been defined earlier in the script.
for file in os.listdir(c_root):
    if "txtoriginal.txt" in file:
        fp = open(c_root + file, 'r', encoding='utf8')
        for line in fp:
            split_num += 1
            words = pseg.cut(line)
            for key, value in words:
                if value.strip() and key.strip():
                    # Split roughly 2/15 dev, 2/15 test, 11/15 train.
                    index = str(1) if split_num % 15 < 2 else str(2) if split_num % 15 > 1 and split_num % 15 < 4 else str(3)
                    if value not in biaoji:
                        # Not an entity label: every character is tagged 'O'; a blank
                        # line is written after punctuation to close the sentence.
                        value = 'O'
                        for achar in key.strip():
                            if achar and achar.strip() in fuhao:
                                string = achar + " " + value.strip() + "\n" + "\n"
                                dev.write(string) if index == '1' else test.write(string) if index == '2' else train.write(string)
                            elif achar.strip() and achar.strip() not in fuhao:
                                string = achar + " " + value.strip() + "\n"
                                dev.write(string) if index == '1' else test.write(string) if index == '2' else train.write(string)
                    elif value.strip() in biaoji:
                        # Entity label: the first character gets B-<label>, the rest I-<label>.
                        begin = 0
                        for char in key.strip():
                            if begin == 0:
                                begin += 1
                                string1 = char + ' ' + 'B-' + value.strip() + '\n'
                            else:
                                string1 = char + ' ' + 'I-' + value.strip() + '\n'
                            if index == '1':
                                dev.write(string1)
                            elif index == '2':
                                test.write(string1)
                            elif index == '3':
                                train.write(string1)
                    else:
                        continue
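The resulting files hold one character and one tag per line, with a blank line after punctuation to close each sentence. A hypothetical fragment (entity labels such as B-BOD/B-SYM depend on what DICT_NOW.csv defines):

右 B-BOD
髋 I-BOD
部 I-BOD
疼 B-SYM
痛 I-SYM
, O

飲 O
食 O
差 O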
           

Convert the IOB labels to IOBES format

# Use selected tagging scheme (IOB / IOBES)               I: inside, O: outside, B: begin | E: end, S: single
    update_tag_scheme(train_sentences, FLAGS.tag_schema)
    update_tag_scheme(test_sentences, FLAGS.tag_schema)
    update_tag_scheme(dev_sentences, FLAGS.tag_schema)
           
def update_tag_scheme(sentences, tag_scheme):
    """
    Check and update sentences tagging scheme to IOB2.
    Only IOB1 and IOB2 schemes are accepted.
    """
    for i, s in enumerate(sentences):
        tags = [w[-1] for w in s]
        # Check that tags are given in the IOB format
        if not iob2(tags):
            s_str = '\n'.join(' '.join(w) for w in s)
            raise Exception('Sentences should be given in IOB format! ' +
                            'Please check sentence %i:\n%s' % (i, s_str))
        if tag_scheme == 'iob':
            # If format was IOB1, we convert to IOB2
            for word, new_tag in zip(s, tags):
                word[-1] = new_tag
        elif tag_scheme == 'iobes':
            new_tags = iob_iobes(tags)
            for word, new_tag in zip(s, new_tags):
                word[-1] = new_tag
        else:
            raise Exception('Unknown tagging scheme!')
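update_tag_scheme relies on two helpers, iob2 and iob_iobes, that this post doesn't show. A minimal sketch of iob_iobes under the usual convention (B- becomes S- for a single-token entity, and the last I- of an entity becomes E-):

def iob_iobes(tags):
    """Minimal sketch: convert IOB2 tags to IOBES."""
    new_tags = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else 'O'
        if tag.startswith('B-'):
            # single-token entity -> S-, otherwise keep B-
            new_tags.append(tag if nxt == 'I-' + tag[2:] else 'S-' + tag[2:])
        elif tag.startswith('I-'):
            # last token of an entity -> E-, otherwise keep I-
            new_tags.append(tag if nxt == 'I-' + tag[2:] else 'E-' + tag[2:])
        else:  # 'O'
            new_tags.append(tag)
    return new_tags

# e.g. ['B-DIS', 'I-DIS', 'I-DIS', 'O', 'B-SYM']
#   -> ['B-DIS', 'I-DIS', 'E-DIS', 'O', 'S-SYM']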