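Common Sorting and Deduplication Operations in Data Cleaning

Sorting and deduplication are among the most common data operations. This article covers both, and ends with a less common but instructive example: order-insensitive deduplication across multiple columns.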

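Contents

1 Sorting

1.1 sort: sort a single vector, return the values

1.2 order: sort a single vector, return the indices

1.3 rank: sort a single vector, return the ranks

1.4 arrange: sort by multiple columns

1.5 reorder: used in plotting

2 Deduplication

2.1 unique: deduplicate a single vector or fully duplicated rows across multiple columns

2.2 The duplicated function

3 Order-insensitive deduplication across multiple columns

Note: order-insensitive deduplication across multiple columns is the part most worth studying.

Main Text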

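1 Sorting

1.1 sort: sort a single vector, return the values

Summary: sort sorts a vector directly and returns the sorted values (not their indices).

# sort usage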
sort(x, decreasing = FALSE, ...)  
## Default S3 method: 
sort(x, decreasing = FALSE, na.last = NA, ...)  
sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE, 
         method = c("auto", "shell", "quick", "radix"), index.return = FALSE)      

sort example

> set.seed(416)  
> x <- round(runif(10,1,20))  
> x;sort(x) 
 [1]  9 13  7 13 20 16  4  1  6 17 
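 [1]  1  4  6  7  9 13 13 16 17 20 # sort returns the values of x in ascending order; x itself is unchanged
  
  
# Given a matrix, sort treats it as a plain vector and returns a sorted vector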
> set.seed(416)  
> x <- round(runif(10,1,20))  
> y <- matrix(x,nrow = 5)  
> y;sort(y) 
     [,1] [,2] 
[1,]    9   16  
[2,]   13    4  
[3,]    7    1  
[4,]   13    6  
[5,]   20   17  
 [1]  1  4  6  7  9 13 13 16 17 20 #sort(y)      
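
The decreasing and na.last arguments from the usage above are not exercised in the example, so here is a minimal sketch of what they do. The vector v is a small hand-written illustration (not the random x used above), chosen so the results are easy to check by eye.

```r
# Hypothetical illustration vector with one missing value
v <- c(3, NA, 1, 2)

sort(v)                      # NA is dropped by default (na.last = NA): 1 2 3
sort(v, decreasing = TRUE)   # descending order: 3 2 1
sort(v, na.last = TRUE)      # keep the NA and put it last: 1 2 3 NA

# sort never modifies its argument; assign the result if you need to keep it
v_sorted <- sort(v)
```

Note that v still holds the original order afterwards; only v_sorted is sorted.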

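1.2 order: sort a single vector, return the indices

Summary: order returns the indices that would sort the values, i.e. the position in the original vector of each element of the sorted result.

# order usage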
order(..., na.last = TRUE, decreasing = FALSE,  
      method = c("auto", "shell", "radix"))      

order example

> set.seed(416)  
> x <- round(runif(10,1,20))  
> x 
 [1]  9 13  7 13 20 16  4  1  6 17 
> order(x) 
 [1]  8  7  9  3  1  2  4  6 10  5  # order returns the indices into x, not the values
> sort(x)  
 [1]  1  4  6  7  9 13 13 16 17 20 
> x[order(x)]  
 [1]  1  4  6  7  9 13 13 16 17 20 # indexing x with order(x) sorts it
  
  
# Given a matrix, order treats it as a vector in column-major order and returns the index vector
> set.seed(416)  
> x <- round(runif(10,1,20))  
> y <- matrix(x,nrow = 5)  
> y 
     [,1] [,2] 
[1,]    9   16  
[2,]   13    4  
[3,]    7    1  
[4,]   13    6  
[5,]   20   17  
> order(y) 
 [1]  8  7  9  3  1  2  4  6 10  5 # str(order(y)) reports an int vector
> sort(y)  
 [1]  1  4  6  7  9 13 13 16 17 20 
> y[order(y)]  
 [1]  1  4  6  7  9 13 13 16 17 20      
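
Because order returns indices rather than values, it can also sort the rows of a data frame, and it accepts several keys through its ... argument. The sketch below uses a small made-up data frame (the names df, id, grp and val are illustrative only) and sorts by one column ascending and another descending.

```r
# Hypothetical data frame for illustration
df <- data.frame(id  = 1:5,
                 grp = c("b", "a", "b", "a", "a"),
                 val = c(2, 3, 1, 3, 1),
                 stringsAsFactors = FALSE)

# Sort by grp ascending, then by val descending.
# Negating a numeric key reverses its direction.
idx <- order(df$grp, -df$val)
idx          # 2 4 5 1 3
df[idx, ]    # rows of df in the requested order
```

Rows 2 and 4 tie on both keys and keep their original relative order.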

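1.3 rank: sort a single vector, return the ranks

Summary: rank returns the rank of each element of the original data; tied values share a rank (by default, the average of the positions they occupy).

Concept: a rank is a statistic giving the position a sample value occupies when the whole sample is ordered by size.

# rank usage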
rank(x, na.last = TRUE, 
     ties.method = c("average", "first", "last", "random", "max", "min"))      

rank example

> set.seed(416)  
> x <- round(runif(10,1,20))  
> x 
 [1]  9 13  7 13 20 16  4  1  6 17 
> rank(x) # rank returns the rank of each element of x; the two 13s tie at 6.5
 [1]  5.0  6.5  4.0  6.5 10.0  8.0  2.0  1.0  3.0  9.0      
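
The ties.method argument decides how tied values such as the two 13s are ranked. A minimal sketch on a short hand-written vector (v is illustrative, taken from the first four values of x above):

```r
# Hypothetical illustration vector containing one tie
v <- c(9, 13, 7, 13)

rank(v)                          # "average" (default): 2.0 3.5 1.0 3.5
rank(v, ties.method = "min")     # ties get the lowest position: 2 3 1 3
rank(v, ties.method = "max")     # ties get the highest position: 2 4 1 4
rank(v, ties.method = "first")   # ties broken by order of appearance: 2 3 1 4

# With ties broken by appearance, rank is the inverse permutation of order
order(order(v))                  # 2 3 1 4
```

The last line shows how rank and order relate: order gives the positions that sort the vector, and applying order again turns those positions back into ranks.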

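1.4 arrange: sort by multiple columns

Summary: arrange is the sorting function in the dplyr package; it orders the rows of a data frame by one or more columns.

> library(dplyr) # load dplyr    
> arrange(mtcars, cyl, disp) # sort mtcars by cyl, then disp, both ascending   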
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb  
1  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1  
2  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2  
……  
6  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  
7  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  
……  
23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2  
……  
26 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4  
27 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  
……  
32 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4  
    
> arrange(mtcars, desc(disp)) # sort mtcars by disp in descending order   
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb  
1  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4  
2  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4  
3  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4  
……  
12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3  
13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3  
14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3  
15 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  
……  
27 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  
28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2  
……  
32 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1      
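
The two calls above sort by two ascending keys and by one descending key. Mixed directions work the same way: wrap only the descending column in desc. The sketch below assumes dplyr is installed; the base R line is an equivalent written with order for comparison, and the object names are illustrative.

```r
library(dplyr)

# cyl ascending, then disp descending within each cyl group
by_cyl_disp <- arrange(mtcars, cyl, desc(disp))
head(by_cyl_disp[, c("mpg", "cyl", "disp")])

# Base R equivalent: negate the numeric key that should run descending
base_version <- mtcars[order(mtcars$cyl, -mtcars$disp), ]
```

Both objects list the cars in the same cyl/disp order; arrange is mainly more readable when there are many keys.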

1.5 reorder: used in plotting

1.5.1 In the base graphics system

require(graphics) 
  
bymedian <- with(InsectSprays, reorder(spray, count, median))  
boxplot(count ~ bymedian, data = InsectSprays,  
        xlab = "Type of spray", ylab = "Insect count",  
        main = "InsectSprays data", varwidth = TRUE,  
        col = "lightgray")      

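1.5.2 In ggplot2: ordering the x axis by the y values, e.g. for a bar chart

Note: reorder turns the variable being reordered into a factor whose levels are ordered by a summary (by default the mean) of the second variable.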

> attach(mtcars) 
> str(reorder(gear,disp))  
 Factor w/ 3 levels "4","5","3": 1 1 1 3 3 3 3 1 1 1 ... 
 - attr(*, "scores")= num [1:3(1d)] 326 123 202  
  ..- attr(*, "dimnames")=List of 1 
  .. ..$ : chr [1:3] "3" "4" "5"  
> str(factor(gear))  
 Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ... 
> detach(mtcars)      
library(ggplot2)  
data(mtcars)  
head(mtcars)  
ggplot(mtcars,aes(x=reorder(gear,disp), y= disp)) + geom_boxplot() + labs(title = "Figure 1")   
ggplot(mtcars,aes(x=factor(gear), y= disp)) + geom_boxplot() + labs(title = "Figure 2")      
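
To see what reorder does to the factor levels without drawing anything, the following sketch uses a tiny made-up data frame (d, grp and val are illustrative names). It also shows two common variations: passing a different summary function, and negating the score to get descending order.

```r
# Hypothetical data: group means are a = 6, b = 2, c = 10
d <- data.frame(grp = factor(c("a", "a", "b", "b", "c", "c")),
                val = c(5, 7, 1, 3, 9, 11))

levels(d$grp)                           # alphabetical: "a" "b" "c"
levels(reorder(d$grp, d$val))           # by ascending mean of val: "b" "a" "c"
levels(reorder(d$grp, d$val, median))   # third argument picks the summary function
levels(reorder(d$grp, -d$val))          # negate the score for descending: "c" "a" "b"
```

A plotting function that lays out categories in level order will then draw the groups in that order.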

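2 Deduplication

2.1 unique: deduplicate a single vector or fully duplicated rows across multiple columns

Summary: unique defaults to fromLast = FALSE, so when a value or row occurs more than once the first occurrence is kept; with fromLast = TRUE the last occurrence is kept instead. Column names are unchanged, and the rows that remain keep their original row names.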

> df <- data.frame(x = c("A","B","C","D","E","B","C","B"), y = c("B","A","D","E","B","C","A","A"))  
> df  
  x y 
1 A B 
2 B A 
3 C D 
4 D E 
5 E B 
6 B C 
7 C A 
8 B A 
> unique(df) 
  x y 
1 A B 
2 B A 
3 C D 
4 D E 
5 E B 
6 B C 
7 C A 
> unique(df,fromLast = TRUE) 
  x y 
1 A B 
3 C D 
4 D E 
5 E B 
6 B C 
7 C A 
8 B A      
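
The same fromLast behaviour applies to plain vectors, and the retained row names can be reset if the gaps are unwanted. A minimal sketch with hand-written data (v, df2 and u are illustrative names, not objects from the example above):

```r
# Vector case: which occurrence survives depends on fromLast
v <- c(3, 1, 3, 2, 1)
unique(v)                    # keeps first occurrences: 3 1 2
unique(v, fromLast = TRUE)   # keeps last occurrences:  3 2 1

# Data frame case: surviving rows keep their original row names
df2 <- data.frame(x = c("A", "A", "B"),
                  y = c("B", "B", "A"),
                  stringsAsFactors = FALSE)
u <- unique(df2)
rownames(u)                  # "1" "3" (row 2 was dropped)

# Reset to consecutive row names if the gaps are not wanted
rownames(u) <- NULL
rownames(u)                  # "1" "2"
```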

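2.2 The duplicated function

Summary: duplicated can be applied to a single column or to several columns of a data frame; it returns a logical vector marking each element or row that repeats an earlier one, which can then be used as an index.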

> df <- data.frame(x = c("A","B","C","D","E","B","C","B"), y = c("B","A","D","E","B","C","A","A"))  
> df  
  x y 
1 A B 
2 B A 
3 C D 
4 D E 
5 E B 
6 B C 
7 C A 
8 B A 
> df_index <- duplicated(df$x) # build a logical vector to use as an index  
> df_index  
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE 
> df[!df_index,] # keep only the rows not flagged as duplicates 
  x y 
1 A B 
2 B A 
3 C D 
4 D E 
5 E B      
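
The example above deduplicates on the single column x. The multi-column case works the same way: pass the columns of interest as a data frame. The sketch below rebuilds the same df and also shows one way to flag every copy of a duplicated row rather than only the later ones, by combining the forward and fromLast passes (dup_xy and all_copies are illustrative names).

```r
df <- data.frame(x = c("A", "B", "C", "D", "E", "B", "C", "B"),
                 y = c("B", "A", "D", "E", "B", "C", "A", "A"),
                 stringsAsFactors = FALSE)

# Multi-column: a row is a duplicate only if the whole (x, y) pair repeats
dup_xy <- duplicated(df[, c("x", "y")])
which(dup_xy)     # 8
df[!dup_xy, ]     # same rows as unique(df)

# Flag every copy of a repeated pair, including the first occurrence
all_copies <- dup_xy | duplicated(df[, c("x", "y")], fromLast = TRUE)
which(all_copies) # 2 8
```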

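3 Order-insensitive deduplication across multiple columns

Summary: order-insensitive deduplication across multiple columns means that rows are not compared column by column; instead, each row is checked against earlier rows for the same set of values, regardless of which column each value is in.

For example, matrix(c("a","b"),nrow = 1) and matrix(c("b","a"),nrow = 1) count as duplicates: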

> data.frame(matrix(c("a","b"),nrow = 1))  
  X1 X2 
1  a  b 
> data.frame(matrix(c("b","a"),nrow = 1))  
  X1 X2 
1  b  a      
# build a test data set 
> df <- data.frame(x = c("A","B","C","D","E","B","C","B"), y = c("B","C","D","E","B","C","A","A"),z = c(1:8)) 
# sort the values of df[,c(1:2)] within each row and paste them into a single key  
> df$merge <- apply(df[,c(1:2)],1,function(x) paste(sort(x),collapse='')) 
# flag duplicated keys, then keep the rows that are NOT flagged (note the negation)  
> df_du<-df[!duplicated(df$merge),] 
> df 
  x y z merge 
1 A B 1    AB 
2 B C 2    BC 
3 C D 3    CD 
4 D E 4    DE 
5 E B 5    BE 
6 B C 6    BC 
7 C A 7    AC 
8 B A 8    AB 
> df_du  
  x y z merge 
1 A B 1    AB 
2 B C 2    BC 
3 C D 3    CD 
4 D E 4    DE 
5 E B 5    BE 
7 C A 7    AC
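
One caveat with the key built above: pasting with collapse='' is safe for single-character values, but with longer strings two different rows can produce the same key. The sketch below uses a made-up data frame (pairs, key_plain and key_sep are illustrative names) to show the collision and one possible fix, a separator that is assumed not to occur in the data.

```r
# Hypothetical data: rows 1 and 3 are the same unordered pair, row 2 is different
pairs <- data.frame(x = c("AB", "A",  "C"),
                    y = c("C",  "BC", "AB"),
                    stringsAsFactors = FALSE)

# Key without a separator: all three rows collapse to "ABC"
key_plain <- apply(pairs[, c("x", "y")], 1,
                   function(r) paste(sort(r), collapse = ""))

# Key with a separator (assumes "|" never appears in the values)
key_sep <- apply(pairs[, c("x", "y")], 1,
                 function(r) paste(sort(r), collapse = "|"))

pairs[!duplicated(key_plain), ]   # wrongly drops row 2 as well
pairs[!duplicated(key_sep), ]     # keeps rows 1 and 2, drops only row 3
```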