[R] Why do read.table/read.delim read in fewer rows than the file has?

2022-06-01 10:35:47

I thought I knew read.table/read.delim well, yet I fell into a pit again.

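I have a dataset of more than 30,000 rows containing sample expression values and annotation information.

The file has more than 30,000 rows, but only a little over 10,000 were read in, and read.delim and read.table lost different numbers of rows. When I opened the file in Excel, saved it again as txt, and read that copy, the row count went back to the expected 30,000+.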

MP  <- read.delim("combine_test.txt", sep = "\t", header = TRUE)
MP1 <- read.table("combine_test.txt", sep = "\t", header = TRUE)
MP2 <- read.delim("new_combine_test.txt", sep = "\t", header = TRUE)  # copy re-saved from Excel

So I wondered whether RStudio was to blame. I ran a test on Linux, and the result was even stranger.

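MP <- read.table("combine_test2.txt", header = TRUE, sep = "\t")
dim(MP)
MP2 <- read.delim("combine_test2.txt", header = TRUE, sep = "\t")
dim(MP2)
# Write each data frame back out unchanged (separate files so the second does not overwrite the first)
write.table(MP, "out.txt", col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)
write.table(MP2, "out2.txt", col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)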

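dim reports a little over 10,000 rows in both cases, yet the files written straight back out have more than 30,000 lines!

That is when I realized the problem was in the data format. Let me try readr: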

library(readr)
MP2 <- as.data.frame(read_delim("combine_test.txt", delim = "\t"))

The row count is back to normal. Is base R really worse than tidyverse??? I searched online and finally found the cause: it all comes down to the quote parameter.

MP3 <- read.table("combine_test.txt", sep = "\t", quote = "", header = TRUE)
MP4 <- read.delim("combine_test.txt", sep = "\t", quote = "", header = TRUE)

The answer I found explains the quote parameter like this:

Explanation: Your data has a single quote on 59th line (( pyridoxamine 5'-phosphate oxidase (predicted)). Then there is another single quote, which complements the single quote on line 59, is on line 137 (5'-hydroxyl-kinase activity...). Everything within quote will be read as a single field of data, and quotes can include the newline character also. That's why you lose the lines in between. quote = "" disables quoting altogether.

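Put simply, my data contains single quotes ('), and everything between two single quotes is treated as one field, line breaks included. I need to set quote="" so that quote characters are not interpreted at all. I checked, and my KEGG descriptions do contain quotes.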
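To make the mechanism concrete, here is a minimal sketch that reproduces the symptom on a toy file. The gene ids and most descriptions are made up for illustration (only the two quoted descriptions echo the answer above), and the file is a temporary one, not my real data. It also shows why writing the data frame back out gives the original line count again: the swallowed lines are still there, hidden inside a single field.

```r
# Hypothetical toy data: 10 rows, two descriptions each contain one single quote
desc <- c("kinase", "ligase", "transporter", "hydrolase", "isomerase", "lyase",
          "pyridoxamine 5'-phosphate oxidase", "transferase",
          "5'-hydroxyl-kinase activity", "reductase")
toy <- data.frame(id = paste0("g", 1:10), desc = desc)
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

bad  <- read.table(tmp, sep = "\t", header = TRUE)              # default quote = "\"'"
good <- read.table(tmp, sep = "\t", header = TRUE, quote = "")  # quoting disabled

nrow(good)  # all 10 rows
nrow(bad)   # fewer rows: the lines between the two quotes were merged into one field

# The merged field contains embedded newlines ...
any(grepl("\n", bad$desc, fixed = TRUE))

# ... so writing it back out produces more lines than the data frame has rows
out <- tempfile(fileext = ".txt")
write.table(bad, out, sep = "\t", quote = FALSE, row.names = FALSE)
length(readLines(out))
```

On this toy file read.delim(tmp) keeps every row, because its default is quote = "\"" and a single quote is not special to it.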

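The same thing can go wrong when a field contains double quotes (") or other special characters. The defaults also differ between the two functions: read.table uses quote = "\"'" while read.delim uses quote = "\"", which probably explains why they lost different numbers of rows on my file. To check for this kind of error, use count.fields to count the fields on each line. An NA in the result means that line was swallowed into a quoted field spanning several lines, so the data will be read incorrectly.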

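# Default quoting: lines inside a multi-line quoted field show up as NA
num.fields <- count.fields("combine_test.txt", sep = "\t")
# Quoting disabled: every line should report the same number of fields
num.fields <- count.fields("combine_test.txt", sep = "\t", quote = "")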
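As a rough sketch of how this check can be used to locate the offending lines, the example below rebuilds a small toy file (hypothetical ids and descriptions, written to a temporary file) and compares count.fields with and without quoting. The grep pattern for quote characters is just one simple way to find candidate lines.

```r
# Hypothetical toy file: header plus 10 rows; rows 7 and 9 contain a single quote
desc <- c("kinase", "ligase", "transporter", "hydrolase", "isomerase", "lyase",
          "pyridoxamine 5'-phosphate oxidase", "transferase",
          "5'-hydroxyl-kinase activity", "reductase")
toy <- data.frame(id = paste0("g", 1:10), desc = desc)
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

n_default <- count.fields(tmp, sep = "\t")              # default quote = "\"'"
n_noquote <- count.fields(tmp, sep = "\t", quote = "")  # quoting disabled

which(is.na(n_default))  # file lines swallowed by a multi-line quoted field
table(n_noquote)         # with quoting off, every line has the same field count

# File line numbers (header is line 1) that contain a quote character
quote_lines <- grep("['\"]", readLines(tmp))
quote_lines
```

If which(is.na(...)) returns anything, rerunning the read with quote = "" is the first thing to try.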

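read.csv does not seem to hit this particular problem, because its default is quote = "\"", so a single quote is not treated as a quote character (an unmatched double quote would still cause trouble). Clearly read.table can fail in ways you do not expect. It is worth getting to know fread and the readr family better.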
