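Sorting and deduplication are among the most common data operations. This topic covers both, and closes with a less common but instructive example: order-insensitive deduplication across multiple columns.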
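Contents
1 Sorting
1.1 sort: single-column sort that returns values
1.2 order: single-column sort that returns indices
1.3 rank: single-column sort that returns ranks
1.4 arrange: multi-column sorting
1.5 reorder: used in plotting
2 Deduplication
2.1 unique: deduplicating a single vector, or rows duplicated across all columns
2.2 The duplicated function
3 Order-insensitive deduplication across multiple columns
Note: order-insensitive duplication across columns is the part most worth studying.
Main Text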
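1 Sorting
1.1 sort: single-column sort that returns values
Summary: sort sorts a vector directly and returns the sorted values themselves (not their indices).
# sort usage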
sort(x, decreasing = FALSE, ...)
## Default S3 method:
sort(x, decreasing = FALSE, na.last = NA, ...)
sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE,
method = c("auto", "shell", "quick", "radix"), index.return = FALSE)
sort example
> set.seed(416)
> x <- round(runif(10,1,20))
> x;sort(x)
[1] 9 13 7 13 20 16 4 1 6 17
[1] 1 4 6 7 9 13 13 16 17 20 # sort returns the values of the original vector in sorted order
# Given a matrix, sort treats it as a plain vector (the dimensions are dropped)
> set.seed(416)
> x <- round(runif(10,1,20))
> y <- matrix(x,nrow = 5)
> y;sort(y)
[,1] [,2]
[1,] 9 16
[2,] 13 4
[3,] 7 1
[4,] 13 6
[5,] 20 17
[1] 1 4 6 7 9 13 13 16 17 20 #sort(y)
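The usage shown earlier also lists decreasing and na.last, which the example does not exercise. A minimal sketch on a small hand-made vector (the values and variable names are illustrative and not part of the original example) shows how both arguments behave:

```r
# Illustrative vector with one missing value (not from the original example)
v <- c(3, NA, 1, 2)

ascending  <- sort(v)                     # NA is dropped by default (na.last = NA)
descending <- sort(v, decreasing = TRUE)  # largest first, NA still dropped
with_na    <- sort(v, na.last = TRUE)     # keep NA and place it at the end

print(ascending)
print(descending)
print(with_na)
```

Because sort drops NA by default, the result can be shorter than the input; pass na.last = TRUE or FALSE when the length must be preserved.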
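1.2 order: single-column sort that returns indices
Summary: order returns the indices that would put the values in sorted order, so x[order(x)] gives the same result as sort(x).
# order usage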
order(..., na.last = TRUE, decreasing = FALSE,
method = c("auto", "shell", "radix"))
order example
> set.seed(416)
> x <- round(runif(10,1,20))
> x
[1] 9 13 7 13 20 16 4 1 6 17
> order(x)
[1] 8 7 9 3 1 2 4 6 10 5 # order returns index positions into x
> sort(x)
[1] 1 4 6 7 9 13 13 16 17 20
> x[order(x)]
[1] 1 4 6 7 9 13 13 16 17 20 # indexing x with those positions sorts it
# Given a matrix, order treats it as a vector in column-major order and returns an index vector
> set.seed(416)
> x <- round(runif(10,1,20))
> y <- matrix(x,nrow = 5)
> y
[,1] [,2]
[1,] 9 16
[2,] 13 4
[3,] 7 1
[4,] 13 6
[5,] 20 17
> order(y)
[1] 8 7 9 3 1 2 4 6 10 5 # str(order(y)) shows an integer vector
> sort(y)
[1] 1 4 6 7 9 13 13 16 17 20
> y[order(y)]
[1] 1 4 6 7 9 13 13 16 17 20
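Since order accepts several keys through its ... argument, it can also sort the rows of a data frame by more than one column. The sketch below uses a made-up data frame d (hypothetical values, not from the original) and sorts by g ascending, then by v descending:

```r
# Illustrative data frame (hypothetical values)
d <- data.frame(g = c("b", "a", "b", "a"), v = c(2, 5, 9, 1))

# Later keys break ties in earlier keys.
# Negating a numeric key reverses its sort direction.
idx <- order(d$g, -d$v)
sorted <- d[idx, ]

print(idx)
print(sorted)
```

The negation trick only applies to numeric keys; for other types, the decreasing argument is the usual alternative.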
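1.3 rank: single-column sort that returns ranks
Summary: rank returns the rank of each element of the original data; tied values share a rank (by default, the average of the positions they occupy).
Concept: the rank of a sample value is the position it occupies among all sample values when they are ordered by size.
# rank usage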
rank(x, na.last = TRUE,
ties.method = c("average", "first", "last", "random", "max", "min"))
rank example
> set.seed(416)
> x <- round(runif(10,1,20))
> x
[1] 9 13 7 13 20 16 4 1 6 17
> rank(x) # rank returns the rank of each element of x
[1] 5.0 6.5 4.0 6.5 10.0 8.0 2.0 1.0 3.0 9.0
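The two 13s above receive rank 6.5 because the default ties.method is "average". A short sketch on a four-element vector (an illustrative subset chosen here, with hypothetical variable names) compares three of the tie-handling options listed in the usage:

```r
x <- c(9, 13, 7, 13)   # illustrative vector with one tie

r_avg   <- rank(x)                         # "average": tied values share the mean rank
r_min   <- rank(x, ties.method = "min")    # tied values share the smallest rank
r_first <- rank(x, ties.method = "first")  # ties are ranked in order of appearance

print(r_avg)
print(r_min)
print(r_first)
```

With "first", the ranks form a permutation of 1..n, which is useful when a strict ordering is needed.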
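1.4 arrange: multi-column sorting
Summary: arrange is the sorting function in the dplyr package; it orders the rows of a data frame by one or more columns.
> library(dplyr) # load dplyr
> arrange(mtcars, cyl, disp) # sort mtcars by cyl, then disp, both ascending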
mpg cyl disp hp drat wt qsec vs am gear carb
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
……
6 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
……
23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
……
26 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
27 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
……
32 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
> arrange(mtcars, desc(disp)) # sort mtcars by disp in descending order
mpg cyl disp hp drat wt qsec vs am gear carb
1 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
2 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
3 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
……
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
15 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
……
27 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
……
32 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
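To sort by one column ascending and another descending, the two keys can be combined. The sketch below does this in base R with order, so it runs without dplyr; the dplyr form would be arrange(mtcars, cyl, desc(disp)). The variable names are illustrative.

```r
# Base-R sketch: cyl ascending, then disp descending within each cyl value.
idx <- order(mtcars$cyl, -mtcars$disp)
by_cyl_then_disp <- mtcars[idx, ]

# Show only the two sort keys for the first few rows
print(head(by_cyl_then_disp[, c("cyl", "disp")], 3))
```

Unlike the arrange output shown above, base-R indexing keeps the original row names (the car models).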
1.5 reorder: used in plotting
1.5.1 In the graphics plotting system
require(graphics)
bymedian <- with(InsectSprays, reorder(spray, count, median))
boxplot(count ~ bymedian, data = InsectSprays,
xlab = "Type of spray", ylab = "Insect count",
main = "InsectSprays data", varwidth = TRUE,
col = "lightgray")
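1.5.2 In ggplot2, for example to order the x-axis categories by the y values (shown here with box plots)
Note: reorder converts the variable being reordered into a factor whose levels are ordered by a per-group score (the mean by default).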
> attach(mtcars)
> str(reorder(gear,disp))
Factor w/ 3 levels "4","5","3": 1 1 1 3 3 3 3 1 1 1 ...
- attr(*, "scores")= num [1:3(1d)] 326 123 202
..- attr(*, "dimnames")=List of 1
.. ..$ : chr [1:3] "3" "4" "5"
> str(factor(gear))
Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
> detach(mtcars)
library(ggplot2)
data(mtcars)
head(mtcars)
ggplot(mtcars,aes(x=reorder(gear,disp), y= disp)) + geom_boxplot() + labs(title = "Figure 1")
ggplot(mtcars,aes(x=factor(gear), y= disp)) + geom_boxplot() + labs(title = "Figure 2")
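What reorder changes is the order of the factor levels, and that order is what the plotting functions use for the axis. A minimal sketch with made-up groups and values (grp and val are hypothetical) makes the level order visible without drawing anything:

```r
# Illustrative groups and values (hypothetical data)
grp <- factor(c("a", "a", "b", "b", "c", "c"))
val <- c(5, 7, 1, 3, 10, 12)

by_mean        <- reorder(grp, val)                 # default FUN = mean, ascending
by_median_desc <- reorder(grp, -val, FUN = median)  # negate to get descending order

print(levels(grp))             # alphabetical
print(levels(by_mean))         # ordered by group mean
print(levels(by_median_desc))  # ordered by group median, largest first
```

Passing a different FUN, as the InsectSprays example does with median, changes the score used to order the levels.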
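2 Deduplication
2.1 unique: deduplicating a single vector, or rows duplicated across all columns
Summary: unique defaults to fromLast = FALSE, so when a value is repeated the first occurrence is kept; with fromLast = TRUE the last occurrence is kept instead. Column names are unchanged, and the remaining rows keep their original row names.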
> df <- data.frame(x = c("A","B","C","D","E","B","C","B"), y = c("B","A","D","E","B","C","A","A"))
> df
x y
1 A B
2 B A
3 C D
4 D E
5 E B
6 B C
7 C A
8 B A
> unique(df)
x y
1 A B
2 B A
3 C D
4 D E
5 E B
6 B C
7 C A
> unique(df,fromLast = TRUE)
x y
1 A B
3 C D
4 D E
5 E B
6 B C
7 C A
8 B A
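The same behavior can be checked on a plain vector and through the row names of a small data frame. This is a minimal sketch with made-up data (v and d are hypothetical):

```r
# Vector case: first occurrences are kept, in their original order
v <- c("B", "A", "B", "C", "A")
u <- unique(v)

# Data frame case: rows 1 and 3 are identical
d <- data.frame(x = c("A", "B", "A"), y = c(1, 2, 1))
kept_first <- rownames(unique(d))                   # keeps the first of each duplicate
kept_last  <- rownames(unique(d, fromLast = TRUE))  # keeps the last of each duplicate

print(u)
print(kept_first)
print(kept_last)
```

Because the original row names survive, they can be used to trace which occurrence of a duplicated row was retained.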
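2.2 The duplicated function
Summary: duplicated checks one or more columns of a data frame for duplicates and returns a logical vector that can be used as an index.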
> df <- data.frame(x = c("A","B","C","D","E","B","C","B"), y = c("B","A","D","E","B","C","A","A"))
> df
x y
1 A B
2 B A
3 C D
4 D E
5 E B
6 B C
7 C A
8 B A
> df_index <- duplicated(df$x) # build a logical vector (usable as an index)
> df_index
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
> df[!df_index,] # keep only the rows not flagged as duplicates
x y
1 A B
2 B A
3 C D
4 D E
5 E B
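The example above deduplicates on the single column x. For the multi-column case, duplicated can be given a data frame of the columns of interest, in which case a row is flagged only when the whole combination has appeared before. A minimal sketch using the same df (the names dup_xy, dedup_xy, and dup_any are illustrative):

```r
df <- data.frame(x = c("A","B","C","D","E","B","C","B"),
                 y = c("B","A","D","E","B","C","A","A"))

# TRUE only where the (x, y) pair has already appeared in an earlier row
dup_xy <- duplicated(df[, c("x", "y")])
dedup_xy <- df[!dup_xy, ]

# Flag every row that takes part in a duplicate, including the first occurrence
dup_any <- dup_xy | duplicated(df[, c("x", "y")], fromLast = TRUE)

print(which(dup_xy))
print(which(dup_any))
```

Deduplicating on x alone removed three rows, while deduplicating on the (x, y) pair removes only row 8, which repeats row 2.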
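3 Order-insensitive deduplication across multiple columns
Summary: order-insensitive multi-column deduplication does not compare rows column by column. Instead, each row is treated as an unordered set of values, and a row counts as a duplicate if the same values have already appeared in an earlier row, regardless of which columns they sit in.
For example, matrix(c("a","b"),nrow = 1) and matrix(c("b","a"),nrow = 1) count as duplicates of each other: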
> data.frame(matrix(c("a","b"),nrow = 1))
X1 X2
1 a b
> data.frame(matrix(c("b","a"),nrow = 1))
X1 X2
1 b a
# Generate the test data
> df <- data.frame(x = c("A","B","C","D","E","B","C","B"), y = c("B","C","D","E","B","C","A","A"),z = c(1:8))
# Sort the values of df[,c(1:2)] within each row and paste the sorted values into one key
> df$merge <- apply(df[,c(1:2)],1,function(x) paste(sort(x),collapse=''))
# Flag duplicated keys with duplicated, then negate the logical vector to keep the non-duplicates
> df_du<-df[!duplicated(df$merge),]
> df
x y z merge
1 A B 1 AB
2 B C 2 BC
3 C D 3 CD
4 D E 4 DE
5 E B 5 BE
6 B C 6 BC
7 C A 7 AC
8 B A 8 AB
> df_du
x y z merge
1 A B 1 AB
2 B C 2 BC
3 C D 3 CD
4 D E 4 DE
5 E B 5 BE
7 C A 7 AC
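For exactly two columns, the same result can be reached without apply or a pasted key. The sketch below is an alternative, not the approach used above: it takes the row-wise smaller and larger value with pmin and pmax and deduplicates on that pair. The names lo, hi, and df_du2 are illustrative, and the columns are assumed to be character (hence stringsAsFactors = FALSE).

```r
df <- data.frame(x = c("A","B","C","D","E","B","C","B"),
                 y = c("B","C","D","E","B","C","A","A"),
                 z = 1:8, stringsAsFactors = FALSE)

# Row-wise smaller and larger value: the pair (lo, hi) ignores column order
lo <- pmin(df$x, df$y)
hi <- pmax(df$x, df$y)

# Deduplicate on the normalized pair, keeping the first occurrence
df_du2 <- df[!duplicated(data.frame(lo, hi)), ]

print(df_du2)
```

Keeping the two normalized values as separate columns also avoids a pitfall of pasting with collapse='': with multi-character values, pairs such as ("AB", "C") and ("A", "BC") would collapse to the same key. If the paste approach is kept, a separator that cannot occur in the data serves the same purpose.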