天天看點

蛋白序列 | 基于深度學習的蛋白質序列家族分類

背景簡介

蛋白質資料集來自于結構生物資訊學研究協作組織(RCSB)的蛋白質資料庫(PDB)。

RCSB :  Research Collaboratory for Structural Bioinformatics

PDB :  Protein Data Bank

PDB是原子坐标和描述蛋白質和其他重要生物大分子的資訊儲存庫。結構生物學家使用諸如X射線晶體學、NMR和低溫電子顯微術的方法來确定分子中每個原子的相對于彼此的位置。

不斷發展的PDB反映了世界各地實驗室正在進行的研究,使得在研究和教育中使用資料庫既令人興奮又具有挑戰性。結構可用于生命過程中涉及的許多蛋白質和核酸,是以可以通路PDB資料庫查找核糖體、癌基因、藥物靶标甚至整個病毒的結構。但是,找到所需資訊可能是一項挑戰,因為PDB會存在許多不同的結構,經常發現給定分子或部分結構的多種結構,或已經從其天然形式修飾或失活的結構。

蛋白質序列家族分類

根據氨基酸序列對蛋白質家族進行分類。 工作基于自然語言處理(NLP)中深度學習模型,并假設蛋白質序列可以被視為一種語言。

#import library
 
import numpy as np
import pandas as pd
 
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten,AtrousConvolution1D,SpatialDropout1D, Dropout
from keras.layers.normalization import BatchNormalization
 
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.callbacks import EarlyStopping
from keras.preprocessing import text, sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
 
%matplotlib inline      
# Merge the two Data set together
# Drop duplicates
df = pd.read_csv('pdbdata_no_dups.csv').merge(pd.read_csv('pdbdata_seq.csv'), how='inner', on='structureId').drop_duplicates(["sequence"]) # ,"classification"
# Drop rows with missing labels
df = df[[type(c) == type('') for c in df.classification.values]]
df = df[[type(c) == type('') for c in df.sequence.values]]
# select proteins
df = df[df.macromoleculeType_x == 'Protein']
df.reset_index()
print(df.shape)
df.head()      

(87761, 18)

蛋白序列 | 基于深度學習的蛋白質序列家族分類
print(df.residueCount_x.quantile(0.9))
df.residueCount_x.describe()      

count     87761.000000

mean        922.913937

std        3173.118920

min           3.000000

25%         234.000000

50%         451.000000

75%         880.000000

max      157478.000000

Name: residueCount_x, dtype: float64

df = df.loc[df.residueCount_x<1200]
print(df.shape[0])
df.residueCount_x.describe()      

count    73140.000000

mean       433.599918

std        289.671609

min          3.000000

25%        198.000000

50%        373.000000

75%        618.000000

max       1198.000000

EDA

import matplotlib.pyplot as plt
from collections import Counter
 
# count numbers of instances per class
cnt = Counter(df.classification)
# select only K most common classes! - was 10 by default
top_classes = 10
# sort classes
sorted_classes = cnt.most_common()[:top_classes]
classes = [c[0] for c in sorted_classes]
counts = [c[1] for c in sorted_classes]
print("at least " + str(counts[-1]) + " instances per class")
 
# apply to dataframe
print(str(df.shape[0]) + " instances before")
df = df[[c in classes for c in df.classification]]
print(str(df.shape[0]) + " instances after")
 
seqs = df.sequence.values
lengths = [len(s) for s in seqs]
 
# visualize
fig, axarr = plt.subplots(1,2, figsize=(20,5))
axarr[0].bar(range(len(classes)), counts)
plt.sca(axarr[0])
plt.xticks(range(len(classes)), classes, rotation='vertical')
axarr[0].set_ylabel('frequency')
 
axarr[1].hist(lengths, bins=50, normed=False)
axarr[1].set_xlabel('sequence length')
axarr[1].set_ylabel('# sequences')
plt.show()      
蛋白序列 | 基于深度學習的蛋白質序列家族分類

轉換标簽

首先,資料集被簡化為十個最常見類之一的樣本。 序列的長度範圍從極少數氨基酸到幾千個氨基酸。

其次,使用sklearn.preprocessing中的LabelBinarizer将字元串中的标簽轉換為一個one hot表示。

from sklearn.preprocessing import LabelBinarizer
 
# Transform labels to one-hot
lb = LabelBinarizer()
Y = lb.fit_transform(df.classification)      

用keras進一步預處理序列

使用keras庫進行文本處理:

Tokenizer:将序列的每個字元轉換為數字

pad_sequences:確定每個序列具有相同的長度(max_length)。

train_test_split:将資料分成訓練和測試樣本。

# maximum length of sequence, everything afterwards is discarded! Default 256
# max_length = 256
max_length = 300
 
#create and fit tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(seqs)
#represent input data as word rank number sequences
X = tokenizer.texts_to_sequences(seqs)
X = sequence.pad_sequences(X, maxlen=max_length)      

建立keras模型并适合

NLP中取得的成功建議使用已經實作為keras層(嵌入)的字嵌入。本例中,隻有20個不同的單詞(每個氨基酸)。可以嘗試對序列的one hot表示進行2D卷積,而不是單詞嵌入,但是這種方法側重于将NLP理論應用于蛋白質序列。

為了提高性能,還可以使用深層體系結構(後續的卷積和池化層),這裡有兩層,其中第一層有64個濾波器,卷積大小為6,第二層有32個濾波器,大小為3。

将激活平展并傳遞到完全相同的層,其中最後一層是softmax激活,并且大小對應于類的數量。

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2)      
embedding_dim = 11 # orig 8
 
# create the model
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, embedding_dim, input_length=max_length))
# model.add(Conv1D(filters=64, kernel_size=6, padding='same', activation='relu')) #orig
model.add(Conv1D(filters=128, kernel_size=9, padding='valid', activation='relu',dilation_rate=1))
# model.add(BatchNormalization())
model.add(MaxPooling1D(pool_size=2))
# model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')) # orig
model.add(Conv1D(filters=64, kernel_size=4, padding='valid', activation='relu')) 
model.add(BatchNormalization())
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(200, activation='elu')) # 128
# model.add(BatchNormalization())
# model.add(Dense(64, activation='elu')) # 128
model.add(BatchNormalization())
model.add(Dense(top_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())      
蛋白序列 | 基于深度學習的蛋白質序列家族分類
es = EarlyStopping(monitor='val_acc', verbose=1, patience=4)
 
model.fit(X_train, y_train,  batch_size=128, verbose=1, validation_split=0.15,callbacks=[es],epochs=25) # epochs=15, # batch_size=128      
蛋白序列 | 基于深度學習的蛋白質序列家族分類
from keras.layers import Conv1D, MaxPooling1D, Concatenate, Input
from keras.models import Sequential,Model
 
units = 256
num_filters = 32
filter_sizes=(3,5, 9,15,21)
conv_blocks = []
 
embedding_layer = Embedding(len(tokenizer.word_index)+1, embedding_dim, input_length=max_length)
es2 = EarlyStopping(monitor='val_acc', verbose=1, patience=4)
 
sequence_input = Input(shape=(max_length,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
 
z = Dropout(0.1)(embedded_sequences)
 
for sz in filter_sizes:
    conv = Conv1D(
        filters=num_filters,
        kernel_size=sz,
        padding="valid",
        activation="relu",
        strides=1)(z)
    conv = MaxPooling1D(pool_size=2)(conv)
    conv = Flatten()(conv)
    conv_blocks.append(conv)
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Dropout(0.25)(z)
z = BatchNormalization()(z)
z = Dense(units, activation="relu")(z)
z = BatchNormalization()(z)
predictions = Dense(top_classes, activation="softmax")(z)
model2 = Model(sequence_input, predictions)
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model2.summary())
 
model2.fit(X_train, y_train,  batch_size=64, verbose=1, validation_split=0.15,callbacks=[es],epochs=30)      
蛋白序列 | 基于深度學習的蛋白質序列家族分類
蛋白序列 | 基于深度學習的蛋白質序列家族分類
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import itertools
 
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
print("train-acc = " + str(accuracy_score(np.argmax(y_train, axis=1), np.argmax(train_pred, axis=1))))
print("test-acc = " + str(accuracy_score(np.argmax(y_test, axis=1), np.argmax(test_pred, axis=1))))
 
# Compute confusion matrix
cm = confusion_matrix(np.argmax(y_test, axis=1), np.argmax(test_pred, axis=1))
 
# Plot normalized confusion matrix
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure(figsize=(10,10))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.colorbar()
tick_marks = np.arange(len(lb.classes_))
plt.xticks(tick_marks, lb.classes_, rotation=90)
plt.yticks(tick_marks, lb.classes_)
#for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
#    plt.text(j, i, format(cm[i, j], '.2f'), horizontalalignment="center", color="white" if cm[i, j] > cm.max() / 2. else "black")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
 
print(classification_report(np.argmax(y_test, axis=1), np.argmax(test_pred, axis=1), target_names=lb.classes_))      
蛋白序列 | 基于深度學習的蛋白質序列家族分類
train_pred = model2.predict(X_train)
test_pred = model2.predict(X_test)
print("train-acc = " + str(accuracy_score(np.argmax(y_train, axis=1), np.argmax(train_pred, axis=1))))
print("test-acc = " + str(accuracy_score(np.argmax(y_test, axis=1), np.argmax(test_pred, axis=1))))
 
# Compute confusion matrix
cm = confusion_matrix(np.argmax(y_test, axis=1), np.argmax(test_pred, axis=1))
 
# Plot normalized confusion matrix
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.set_printoptions(precision=2)
plt.figure(figsize=(10,10))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.colorbar()
tick_marks = np.arange(len(lb.classes_))
plt.xticks(tick_marks, lb.classes_, rotation=90)
plt.yticks(tick_marks, lb.classes_)
#for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
#    plt.text(j, i, format(cm[i, j], '.2f'), horizontalalignment="center", color="white" if cm[i, j] > cm.max() / 2. else "black")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
 
print(classification_report(np.argmax(y_test, axis=1), np.argmax(test_pred, axis=1), target_names=lb.classes_))      
蛋白序列 | 基于深度學習的蛋白質序列家族分類