Table of Contents
- Before You Start:
  - What are dilated convolutions?
  - What is NER?
  - Why can CNNs be used for text?
- Overall architecture
  - input data
  - embedding layer
  - dilated convolution layer or BiLSTM
    - BiLSTM
    - dilated convolution layer
  - projection layer
    - dilated convolution classification
    - BiLSTM classification
  - loss layer
- Labeling the data, data preprocessing
  - Raw data
  - Labeling the data
  - Prepare jieba and build the labeling dictionary
  - Label the data; while labeling, split it into 3 groups for train, validation, and test; the final output is IOB-format labels
  - Convert the IOB-format labels to IOBES format
Before You Start:
What are dilated convolutions?
In a CNN, each point on a new feature map summarizes a filter window on the old feature map (followed by pooling). Pooling is downsampling: the feature maps keep shrinking, i.e., resolution keeps decaying. You can think of it as sacrificing resolution to gain receptive field: standing higher lets you see farther, but you can no longer make out the details. Pooling is the cause of the resolution decay.
To avoid losing detail, remove the pooling; but removing pooling shrinks the receptive field (compared with the pooled case). This is where dilation comes in. Dilation is the spacing inside the filter window: a 3×3 filter window normally sees 9 adjacent points forming a square. With dilation=2 it still sees 9 points, but now there is a one-point gap between neighboring taps, so the window covers a 5×5 area.
Dilated convolutions are generally used to enlarge the receptive field.
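To make the geometry concrete, here is a minimal sketch (plain Python; the helper name is mine) of which input positions a 3-tap filter samples at a given dilation:

def taps(kernel_size, dilation, center=0):
    """Input offsets sampled by a dilated filter centered at `center`."""
    half = kernel_size // 2
    return [center + i * dilation for i in range(-half, half + 1)]

print(taps(3, 1))  # [-1, 0, 1] -> an ordinary 3-wide window
print(taps(3, 2))  # [-2, 0, 2] -> the same 3 taps now span 5 positions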
What is NER?
Named entity recognition: identifying the named entities (proper nouns) in a piece of text. Compare with images: image recognition picks the objects out of a picture; here we pick the entities out of a passage.
Why can CNNs be used for text?
From the receptive-field perspective: processing text means looking at context. The filter window is a 1×N window, where N is the number of characters it sees, and each character's feature_dim serves as the filter window's channels.
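As a minimal sketch (TensorFlow 1.x; the sizes are placeholders chosen to match the shapes used in IDCNN_layer below), a sentence can be treated as an image of height 1 and convolved with a 1×3 window:

import tensorflow as tf

batch, num_steps, feature_dim, num_filter = 20, 100, 120, 100
embeddings = tf.placeholder(tf.float32, [batch, num_steps, feature_dim])
x = tf.expand_dims(embeddings, 1)            # [batch, 1, num_steps, channels]
w = tf.get_variable("w", [1, 3, feature_dim, num_filter])
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")
# y: [batch, 1, num_steps, num_filter] -- each output position
# summarizes a window of 3 neighboring characters.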
Overall architecture
input data
This is the labeled text. Concretely, a batch has four components: the segmented original text, the text mapped to integer ids, the word-segmentation features (values 0-3 marking each character's position within a word), and the tag sequence for the text.
_, chars, segs, tags = batch
chars and segs are used in training; tags are used to compute the loss.
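A minimal sketch (a hypothetical helper, reconstructing the encoding from the comments in embedding_layer below) of how the segs feature can be derived with jieba: 0 = single-character word, 1 = word begin, 2 = word middle, 3 = word end.

import jieba

def get_seg_features(sentence):
    feats = []
    for word in jieba.cut(sentence):
        if len(word) == 1:
            feats.append(0)                               # single-char word
        else:
            feats.extend([1] + [2] * (len(word) - 2) + [3])  # begin/middle/end
    return feats

print(get_seg_features("高血糖和高血壓"))  # [1, 2, 3, 0, 1, 2, 3]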
embedding layer
chars and segs each pass through their own embedding layer, and the two features are concatenated (100 + 20 dims) to form the final feature map.
def embedding_layer(self, char_inputs, seg_inputs, config, name=None):
    """
    :param char_inputs: one-hot encoding of sentence
    :param seg_inputs: segmentation feature
    :param config: whether to use the segmentation feature
    :return: [1, num_steps, embedding size]
    """
    # 高:3 血:22 糖:23 和:24 高:3 血:22 壓:25  ->  char_inputs=[3,22,23,24,3,22,25]
    # 高血糖 和 高血壓: 高血糖=[1,2,3], 和=[0], 高血壓=[1,2,3]  ->  seg_inputs=[1,2,3,0,1,2,3]
    embedding = []
    self.char_inputs_test = char_inputs
    self.seg_inputs_test = seg_inputs
    with tf.variable_scope("char_embedding" if not name else name), tf.device('/cpu:0'):
        self.char_lookup = tf.get_variable(
            name="char_embedding",
            shape=[self.num_chars, self.char_dim],
            initializer=self.initializer)
        # e.g. the input character '常' has index 8 in the dictionary;
        # self.char_lookup is a [2677, 100] matrix, and each character index
        # selects the corresponding row as its embedding
        embedding.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs))
        if config["seg_dim"]:
            with tf.variable_scope("seg_embedding"), tf.device('/cpu:0'):
                self.seg_lookup = tf.get_variable(
                    name="seg_embedding",
                    # shape = [4, 20]
                    shape=[self.num_segs, self.seg_dim],
                    initializer=self.initializer)
                embedding.append(tf.nn.embedding_lookup(self.seg_lookup, seg_inputs))
        embed = tf.concat(embedding, axis=-1)
    self.embed_test = embed
    self.embedding_test = embedding
    return embed
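With the dimensions from the text (char_dim=100, seg_dim=20), embed has shape [batch, num_steps, 120]; this 120 is the embedding_dim that the first IDCNN filter below expects.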
dilated convolution layer or BiLSTM
BiLSTM
def biLSTM_layer(self, model_inputs, lstm_dim, lengths, name=None):
    """
    :param model_inputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, 2*lstm_dim]
    """
    with tf.variable_scope("char_BiLSTM" if not name else name):
        lstm_cell = {}
        for direction in ["forward", "backward"]:
            with tf.variable_scope(direction):
                lstm_cell[direction] = tf.contrib.rnn.CoupledInputForgetGateLSTMCell(
                    lstm_dim,
                    use_peepholes=True,
                    initializer=self.initializer,
                    state_is_tuple=True)
        outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
            lstm_cell["forward"],
            lstm_cell["backward"],
            model_inputs,
            dtype=tf.float32,
            sequence_length=lengths)
    return tf.concat(outputs, axis=2)
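bidirectional_dynamic_rnn returns a (forward, backward) pair of output tensors; concatenating them on the last axis gives width 2*lstm_dim, which is exactly what project_layer_bilstm below expects as its input width.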
dilated convolution layer
Input shape: [which sentence in the batch, 1, how many characters the window sees, each character's feature_dim]
shape of input = [batch, in_height, in_width, in_channels]
Filter shape: [1, how many characters the window sees, each character's feature_dim, number of filters]
shape of filter = [filter_height, filter_width, in_channels, out_channels]
First do one ordinary convolution, then repeat a block self.repeat_times times; each block applies 3 dilated convolutions with dilation=1, dilation=1, dilation=2, so the receptive field keeps growing.
Note that dilation=1 is equivalent to an ordinary convolution, but the receptive field still grows: recall that stacked convolutions enlarge the receptive field even without pooling.
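A back-of-the-envelope sketch of that growth (filter_width=3 as in the code below; the value of repeat_times is a placeholder set by the model config):

filter_width = 3
dilations = [1, 1, 2]        # one block, as described above
repeat_times = 4             # placeholder; comes from the config

rf = 1 + (filter_width - 1)  # the initial ordinary convolution
for _ in range(repeat_times):
    for d in dilations:
        rf += (filter_width - 1) * d
print(rf)                    # 3 + 4*(2+2+4) = 35 characters of context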

def IDCNN_layer(self, model_inputs, name=None):
    """
    :param model_inputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, cnn_output_width]
    """
    # tf.expand_dims inserts a size-1 dimension at the given axis (axes start
    # at 0), turning [batch, num_steps, emb_size] into [batch, 1, num_steps, emb_size]
    model_inputs = tf.expand_dims(model_inputs, 1)
    self.model_inputs_test = model_inputs
    reuse = False
    if self.dropout == 1.0:
        reuse = True
    with tf.variable_scope("idcnn" if not name else name):
        # shape = [1, filter_width, embedding_dim, num_filter] = [1, 3, 120, 100]
        filter_weights = tf.get_variable(
            "idcnn_filter",
            shape=[1, self.filter_width, self.embedding_dim,
                   self.num_filter],
            initializer=self.initializer)
        """
        shape of input = [batch, in_height, in_width, in_channels]
        shape of filter = [filter_height, filter_width, in_channels, out_channels]
        """
        layerInput = tf.nn.conv2d(model_inputs,
                                  filter_weights,
                                  strides=[1, 1, 1, 1],
                                  padding="SAME",
                                  name="init_layer", use_cudnn_on_gpu=False)
        self.layerInput_test = layerInput
        finalOutFromLayers = []
        totalWidthForLastDim = 0
        for j in range(self.repeat_times):
            for i in range(len(self.layers)):
                # dilation schedule: 1, 1, 2
                dilation = self.layers[i]['dilation']
                isLast = True if i == (len(self.layers) - 1) else False
                with tf.variable_scope("atrous-conv-layer-%d" % i,
                                       reuse=True
                                       if (reuse or j > 0) else False):
                    # w: [filter height, filter width, in channels, number of filters]
                    w = tf.get_variable(
                        "filterW",
                        shape=[1, self.filter_width, self.num_filter,
                               self.num_filter],
                        initializer=tf.contrib.layers.xavier_initializer())
                    if j == 1 and i == 1:
                        self.w_test_1 = w
                    if j == 2 and i == 1:
                        self.w_test_2 = w
                    b = tf.get_variable("filterB", shape=[self.num_filter])
                    # tf.nn.atrous_conv2d(value, filters, rate, padding, name=None)
                    # value: the input, a 4-D tensor of shape
                    #   [batch, height, width, channels].
                    # filters: the kernel, a 4-D tensor of shape
                    #   [filter_height, filter_width, channels, out_channels];
                    #   its third dimension must match value's fourth.
                    # rate: a positive int. Atrous convolution has no stride
                    #   parameter (stride is fixed at 1); instead, rate sets the
                    #   sampling interval on the input. You can think of it as
                    #   inserting (rate-1) zeros between the taps of the kernel,
                    #   punching "holes" into it, so the kernel samples the input
                    #   at a larger spacing. With rate=1 no zeros are inserted and
                    #   this reduces to an ordinary convolution.
                    # padding: either "SAME" or "VALID", choosing the edge-padding
                    #   scheme. With "VALID" padding the result has shape
                    #   [batch, height - (filter_height-1)*rate,
                    #    width - (filter_width-1)*rate, out_channels];
                    #   with "SAME" padding it is [batch, height, width, out_channels].
                    conv = tf.nn.atrous_conv2d(layerInput,
                                               w,
                                               rate=dilation,
                                               padding="SAME")
                    self.conv_test = conv
                    conv = tf.nn.bias_add(conv, b)
                    conv = tf.nn.relu(conv)
                    if isLast:
                        finalOutFromLayers.append(conv)
                        totalWidthForLastDim += self.num_filter
                    layerInput = conv
        finalOut = tf.concat(axis=3, values=finalOutFromLayers)
        keepProb = 1.0 if reuse else 0.5
        finalOut = tf.nn.dropout(finalOut, keepProb)
        # tf.squeeze removes the listed size-1 dimensions: the height axis goes
        # away, giving [batch, num_steps, totalWidthForLastDim]
        finalOut = tf.squeeze(finalOut, [1])
        finalOut = tf.reshape(finalOut, [-1, totalWidthForLastDim])
        self.cnn_output_width = totalWidthForLastDim
        return finalOut
projection layer
The convolutions serve as a feature extractor; a fully connected (FC) layer is added afterwards for classification.
dilated convolution classification
# Project layer for idcnn by crownpku
# Delete the hidden layer, and change bias initializer
def project_layer_idcnn(self, idcnn_outputs, name=None):
    """
    :param idcnn_outputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, num_tags]
    """
    with tf.variable_scope("project" if not name else name):
        # project to score of tags
        with tf.variable_scope("logits"):
            W = tf.get_variable("W", shape=[self.cnn_output_width, self.num_tags],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", initializer=tf.constant(0.001, shape=[self.num_tags]))
            pred = tf.nn.xw_plus_b(idcnn_outputs, W, b)
        return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
BiLSTM classification
def project_layer_bilstm(self, lstm_outputs, name=None):
    """
    hidden layer between lstm layer and logits
    :param lstm_outputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, num_tags]
    """
    with tf.variable_scope("project" if not name else name):
        with tf.variable_scope("hidden"):
            W = tf.get_variable("W", shape=[self.lstm_dim*2, self.lstm_dim],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", shape=[self.lstm_dim], dtype=tf.float32,
                                initializer=tf.zeros_initializer())
            output = tf.reshape(lstm_outputs, shape=[-1, self.lstm_dim*2])
            hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))
        # project to score of tags
        with tf.variable_scope("logits"):
            W = tf.get_variable("W", shape=[self.lstm_dim, self.num_tags],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", shape=[self.num_tags], dtype=tf.float32,
                                initializer=tf.zeros_initializer())
            pred = tf.nn.xw_plus_b(hidden, W, b)
        return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
loss layer
For sequence labeling in NLP, a conditional random field (CRF) is generally used as the final layer: it models transitions between adjacent tags instead of classifying each position independently.
def loss_layer(self, project_logits, lengths, name=None):
    """
    calculate crf loss
    :param project_logits: [1, num_steps, num_tags]
    :return: scalar loss
    """
    with tf.variable_scope("crf_loss" if not name else name):
        small = -1000.0
        # pad logits for crf loss
        start_logits = tf.concat(
            [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]),
             tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
        pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
        logits = tf.concat([project_logits, pad_logits], axis=-1)
        logits = tf.concat([start_logits, logits], axis=1)
        targets = tf.concat(
            [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)
        self.trans = tf.get_variable(
            "transitions",
            shape=[self.num_tags + 1, self.num_tags + 1],
            initializer=self.initializer)
        # crf_log_likelihood computes the log-likelihood of tag sequences in a CRF.
        # inputs: a [batch_size, max_seq_len, num_tags] tensor, typically the
        #   (reshaped) output of the BiLSTM/IDCNN layers, used as unary scores.
        # tag_indices: a [batch_size, max_seq_len] matrix of gold tags.
        # sequence_lengths: a [batch_size] vector with the length of each sequence.
        # transition_params: a [num_tags, num_tags] transition matrix.
        # Returns log_likelihood (a scalar per sequence) and the transition matrix.
        log_likelihood, self.trans = crf_log_likelihood(
            inputs=logits,
            tag_indices=targets,
            transition_params=self.trans,
            sequence_lengths=lengths+1)
        return tf.reduce_mean(-log_likelihood)
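For completeness, a minimal sketch (not from the original post; it follows the same virtual-start-tag and padding convention as loss_layer above) of the matching inference step, where the learned transition matrix drives Viterbi decoding:

import numpy as np
from tensorflow.contrib.crf import viterbi_decode

def decode(logits, lengths, trans, num_tags):
    """logits: [batch, num_steps, num_tags]; trans: [num_tags+1, num_tags+1]."""
    small = -1000.0
    start = np.asarray([[small] * num_tags + [0]])   # virtual start tag
    paths = []
    for score, length in zip(logits, lengths):
        score = score[:length]
        pad = small * np.ones([length, 1])           # extra column for the start tag
        score = np.concatenate([score, pad], axis=1)
        score = np.concatenate([start, score], axis=0)
        path, _ = viterbi_decode(score, trans)
        paths.append(path[1:])                       # drop the virtual start
    return paths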
Labeling the data, data preprocessing
Raw data
Data crawled from websites, for example:
患者精神狀況好,無發熱,訴右髋部疼痛,飲食差,二便正常,查體:神清,各項生命體征平穩,心肺腹查體未見異常。右髋部壓痛,右下肢皮牽引固定好,無松動,右足背動脈搏動好,足趾感覺運動正常。
Labeling the data
Prepare jieba and build the labeling dictionary
#%% for jieba
import csv
import jieba

dics = csv.reader(open("DICT_NOW.csv", 'r', encoding='utf8'))
#%% get word and class
for row in dics:  # add the medical terms and their tags to the jieba dictionary
    if len(row) == 2:
        jieba.add_word(row[0].strip(), tag=row[1].strip())  # add_word ensures the added word will not be cut apart
        jieba.suggest_freq(row[0].strip())  # adjusts the word's frequency so that it can (or cannot) be segmented out
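A minimal sketch of the effect (the entry 高血壓 with tag "DIS" is a hypothetical dictionary row): after add_word, jieba's POS segmenter keeps the term intact and emits the custom tag as its flag.

import jieba
import jieba.posseg as pseg

jieba.add_word("高血壓", tag="DIS")
print([(w.word, w.flag) for w in pseg.cut("患者有高血壓病史")])
# e.g. [('患者', 'n'), ('有', 'v'), ('高血壓', 'DIS'), ('病史', 'n')]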
Label the data; while labeling, split it into 3 groups for train, validation, and test; the final output is IOB-format labels
import os
import jieba.posseg as pseg

# biaoji: the set of entity tag names loaded from DICT_NOW.csv;
# fuhao: sentence-ending punctuation (a blank line is written after it);
# dev/test/train: the three open output files (all defined elsewhere)
for file in os.listdir(c_root):
    if "txtoriginal.txt" in file:
        fp = open(c_root + file, 'r', encoding='utf8')
        for line in fp:
            split_num += 1
            words = pseg.cut(line)
            for key, value in words:
                if value.strip() and key.strip():
                    # split sentences 2/15 : 2/15 : 11/15 into dev, test, train
                    index = str(1) if split_num % 15 < 2 else str(2) if split_num % 15 > 1 and split_num % 15 < 4 else str(3)
                    if value not in biaoji:
                        # not an entity word: every character is tagged O
                        value = 'O'
                        for achar in key.strip():
                            if achar and achar.strip() in fuhao:
                                string = achar + " " + value.strip() + "\n" + "\n"
                                dev.write(string) if index == '1' else test.write(string) if index == '2' else train.write(string)
                            elif achar.strip() and achar.strip() not in fuhao:
                                string = achar + " " + value.strip() + "\n"
                                dev.write(string) if index == '1' else test.write(string) if index == '2' else train.write(string)
                    elif value.strip() in biaoji:
                        # entity word: first character gets B-<tag>, the rest I-<tag>
                        begin = 0
                        for char in key.strip():
                            if begin == 0:
                                begin += 1
                                string1 = char + ' ' + 'B-' + value.strip() + '\n'
                            else:
                                string1 = char + ' ' + 'I-' + value.strip() + '\n'
                            if index == '1':
                                dev.write(string1)
                            elif index == '2':
                                test.write(string1)
                            elif index == '3':
                                train.write(string1)
                    else:
                        continue
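The result is one character per line in IOB format, e.g. (hypothetical tag names; a blank line after sentence-ending punctuation marks a sentence boundary):

右 B-body
髋 I-body
部 I-body
疼 B-sym
痛 I-sym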
Convert the IOB-format labels to IOBES format
# Use selected tagging scheme (IOB / IOBES). I: inside, O: other, B: begin | E: end, S: single
update_tag_scheme(train_sentences, FLAGS.tag_schema)
update_tag_scheme(test_sentences, FLAGS.tag_schema)
update_tag_scheme(dev_sentences, FLAGS.tag_schema)
def update_tag_scheme(sentences, tag_scheme):
    """
    Check and update sentences tagging scheme to IOB2.
    Only IOB1 and IOB2 schemes are accepted.
    """
    for i, s in enumerate(sentences):
        tags = [w[-1] for w in s]
        # Check that tags are given in the IOB format
        if not iob2(tags):
            s_str = '\n'.join(' '.join(w) for w in s)
            raise Exception('Sentences should be given in IOB format! ' +
                            'Please check sentence %i:\n%s' % (i, s_str))
        if tag_scheme == 'iob':
            # If format was IOB1, we convert to IOB2
            for word, new_tag in zip(s, tags):
                word[-1] = new_tag
        elif tag_scheme == 'iobes':
            new_tags = iob_iobes(tags)
            for word, new_tag in zip(s, new_tags):
                word[-1] = new_tag
        else:
            raise Exception('Unknown tagging scheme!')
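update_tag_scheme relies on a helper iob_iobes that is not shown in the post; here is a sketch of the standard conversion (a single-token entity's B becomes S, and the last I of an entity becomes E):

def iob_iobes(tags):
    """Convert a list of IOB2 tags to IOBES."""
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'B':
            # B stays B if an I follows; otherwise it is a single-token entity
            if i + 1 < len(tags) and tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('B-', 'S-'))
        elif tag.split('-')[0] == 'I':
            # I stays I if another I follows; otherwise it ends the entity
            if i + 1 < len(tags) and tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('I-', 'E-'))
        else:
            raise Exception('Invalid IOB format!')
    return new_tags

print(iob_iobes(['B-DIS', 'I-DIS', 'O', 'B-SYM']))
# ['B-DIS', 'E-DIS', 'O', 'S-SYM']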