I thought I knew read.table/read.delim well, and then I fell into a pit anyway.
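I have a dataset of more than 30,000 rows that contains sample expression values and annotation information.
When I read it in, those 30,000-odd rows shrank to a little over 10,000, and read.delim and read.table dropped different numbers of rows. After I opened the file in Excel, saved it again as txt, and read that copy, the row count went back to the expected 30,000+.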
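# Original file: both calls return far fewer rows than expected
MP <- read.delim("combine_test.txt", sep = "\t", header = TRUE)
MP1 <- read.table("combine_test.txt", sep = "\t", header = TRUE)
# Copy re-saved from Excel: the row count is correct
MP2 <- read.delim("new_combine_test.txt", sep = "\t", header = TRUE)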
So I wondered whether RStudio was to blame. I ran the same test on Linux, and the result was even stranger.
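MP <- read.table("combine_test2.txt", header = TRUE, sep = "\t")
dim(MP)
MP2 <- read.delim("combine_test2.txt", header = TRUE, sep = "\t")
dim(MP2)
# Write each data frame to its own file so the second call does not overwrite the first
write.table(MP, "out.txt", col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)
write.table(MP2, "out2.txt", col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)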
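dim reported a little over 10,000 rows in both cases, yet the data written straight back out had more than 30,000 lines!
At that point I realized the problem was in the data format. Let's try readr: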
library(readr)
MP2 <- as.data.frame(read_delim("combine_test.txt", delim = "\t"))
Back to normal. Is base R really worse than tidyverse??? I searched online and finally found the cause: it all comes down to the quote parameter.
MP3 <- read.table("combine_test.txt", sep = "\t", quote = "", header = TRUE)
MP4 <- read.delim("combine_test.txt", sep = "\t", quote = "", header = TRUE)
As for the quote parameter, the answer I found explains it like this:
Explanation: Your data has a single quote on 59th line (( pyridoxamine 5'-phosphate oxidase (predicted)). Then there is another single quote, which complements the single quote on line 59, is on line 137 (5'-hydroxyl-kinase activity...). Everything within quote will be read as a single field of data, and quotes can include the newline character also. That's why you lose the lines in between. quote = "" disables quoting altogether.
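Put simply, my data contains single quotes ('). Everything between two single quotes is treated as one field, line breaks included, so I have to pass quote = "" to switch quote handling off. I checked, and the KEGG descriptions in my data do contain quote characters.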
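The effect is easy to reproduce on a small made-up file. The sketch below is only an illustration: the ten-row table, the column names id and desc, and the temporary files are assumptions, not the real combine_test.txt. The two apostrophes are placed well below the header so that they are only met when the body of the file is parsed.

```r
# Hypothetical stand-in for the real data: 10 rows, two of which
# (rows 7 and 9) contain a single apostrophe.
desc <- paste("protein", 1:10)
desc[7] <- "pyridoxamine 5'-phosphate oxidase"
desc[9] <- "5'-hydroxyl-kinase activity"
path <- tempfile(fileext = ".txt")
writeLines(c("id\tdesc", paste0("g", 1:10, "\t", desc)), path)

# read.table's default quote set is "\"'": the two apostrophes pair up
# and the lines between them are swallowed into one field.
bad <- read.table(path, sep = "\t", header = TRUE)

# quote = "" switches quote handling off, so every line becomes its own row.
good <- read.table(path, sep = "\t", header = TRUE, quote = "")

nrow(bad)   # fewer than 10
nrow(good)  # 10

# Writing the short data frame back out restores the line count, because
# the merged field still carries the embedded line breaks.
out <- tempfile(fileext = ".txt")
write.table(bad, out, col.names = TRUE, row.names = FALSE,
            sep = "\t", quote = FALSE)
length(readLines(out)) == length(readLines(path))
```

This would also account for the odd result on Linux: dim counts records, while the written file counts physical lines, and a merged record spans several of them.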
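Reading can also go wrong when a field's string itself contains double quotes (") or other special characters. To check for this kind of error, use count.fields to count the fields on each line. An NA in the result means a quoted field ran across a line break, so the data were read incorrectly.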
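# With the default quote characters: NA marks lines that end inside an open quote
num.fields <- count.fields("combine_test.txt", sep = "\t")
# With quoting disabled: a well-formed table should report the same count on every line
num.fields.noquote <- count.fields("combine_test.txt", sep = "\t", quote = "")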
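To go from "there is an NA somewhere" to the actual offending lines, something like the following may help. It is a sketch on a made-up file (the ten rows, the id and desc columns, and the names suspect, n_default and n_noquote are all assumptions for illustration), not output from the real dataset.

```r
# Hypothetical 10-row file with apostrophes in rows 7 and 9 (not the real data).
desc <- paste("protein", 1:10)
desc[7] <- "pyridoxamine 5'-phosphate oxidase"
desc[9] <- "5'-hydroxyl-kinase activity"
path <- tempfile(fileext = ".txt")
writeLines(c("id\tdesc", paste0("g", 1:10, "\t", desc)), path)

# Default quote characters: lines that end inside an open quote come back as NA.
n_default <- count.fields(path, sep = "\t")
suspect <- which(is.na(n_default))

# Quoting disabled: each physical line is counted on its own.
n_noquote <- count.fields(path, sep = "\t", quote = "")

# Show the raw text of the suspect lines to see what opened the quote,
# and tabulate the field counts once quoting is off.
readLines(path)[suspect]
table(n_noquote)
```

If table(n_noquote) shows a single value, the file is rectangular once quoting is off, and quote = "" is probably all the read call needs.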
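read.csv does not seem to trip over the apostrophes. That is because read.csv and read.delim default to quote = "\"" and only treat double quotes as quote characters, whereas read.table defaults to quote = "\"'" and treats single quotes that way as well, which would explain why read.delim and read.table dropped different numbers of rows. A stray double quote can still break read.csv and read.delim. So read.table really can fail in ways you do not expect. It is worth getting to know fread and the readr family better.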
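The different defaults can be checked directly from the function definitions. The snippet below is a small sketch: the formals() calls inspect base R itself, while the ten-row file with two apostrophes is a made-up example, not the real data.

```r
# Default quote sets, read straight from the function definitions.
q_table <- formals(read.table)$quote  # double and single quotes
q_delim <- formals(read.delim)$quote  # double quotes only
q_csv   <- formals(read.csv)$quote    # double quotes only

# Hypothetical 10-row file with apostrophes in rows 7 and 9.
desc <- paste("protein", 1:10)
desc[7] <- "pyridoxamine 5'-phosphate oxidase"
desc[9] <- "5'-hydroxyl-kinase activity"
path <- tempfile(fileext = ".txt")
writeLines(c("id\tdesc", paste0("g", 1:10, "\t", desc)), path)

# Same file, default settings: only read.table is affected by apostrophes.
n_table <- nrow(read.table(path, sep = "\t", header = TRUE))
n_delim <- nrow(read.delim(path))

c(read.table = n_table, read.delim = n_delim)
```

Passing quote = "" explicitly removes the dependence on these defaults for files whose fields are never quoted.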