天天看點

如何合并資料在R使用R合并,dplyr,或data.table

作者:數字化江湖
如何合并資料在R使用R合并,dplyr,或data.table

R有許多通過公共列連接配接資料幀的快速、優雅的方法。我想向你們展示其中的三個:

1. 基數R的merge()函數

2. Dplyr的join函數族

3. 資料。表的括号文法

一、擷取并導入資料

在這個例子中,我将使用我最喜歡的示範資料集之一——來自美國交通統計局的航班延誤時間。如果您想跟随,請通路http://bit.ly/USFlightDelays并下載下傳您選擇的時間段的資料,包括航班日期、Reporting_Airline、出發地、目的地和出發時間。還可以擷取Reporting_Airline的查找表。

或者,你可以下載下傳這兩個資料集,加上我在一個檔案中的R代碼和一個解釋不同類型的資料合并的PowerPoint,在這裡:

要用基本R讀入檔案,我首先解壓縮航班延誤檔案,然後用read.csv()導入航班延誤資料和代碼查找檔案。如果您正在運作該代碼,則您下載下傳的延遲檔案的名稱可能與下面代碼中的名稱不同。另外,請注意查找檔案不尋常的.csv_擴充名。

unzip("673598238_T_ONTIME_REPORTING.zip")

mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",

sep = ",", quote="\"")

mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",

quote="\"", sep = "," )

接下來,我将用head()檢視這兩個檔案:

head(mydf)

FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW X1 2019-08-01 DL ATL DFW 31 NA2 2019-08-01 DL DFW ATL 0 NA3 2019-08-01 DL IAH ATL 40 NA4 2019-08-01 DL PDX SLC 0 NA5 2019-08-01 DL SLC PDX 0 NA6 2019-08-01 DL DTW ATL 10 NA

head(mylookup)

Code Description1 02Q Titan Airways2 04Q Tradewind Aviation3 05Q Comlux Aviation, AG4 06Q Master Top Linhas Aereas Ltd.5 07Q Flair Airlines Ltd.6 09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern

二、與底R合并

mydf延遲資料幀隻有航空公司資訊的代碼。我想用mylookup中的航空公司名稱添加一列。一種基于R的方法是使用merge()函數,使用基本文法merge(df1, df2)。資料幀1和資料幀2的順序無關緊要,但無論哪個是第一個都被認為是x,第二個是y。華東CIO大會、華東CIO聯盟、CDLC中國數字化燈塔大會、CXO數字化研學之旅、數字化江湖-講武堂,數字化江湖-大俠傳、數字化江湖-論劍、CXO系列管理論壇(陸家嘴CXO管理論壇、甯波東錢湖CXO管理論壇等)、數字化轉型網,走進燈塔工廠系列、ECIO大會等

如果你想要連接配接的列沒有相同的名稱,你需要告訴歸并你想要連接配接的列:by。X為X資料幀的列名,由。Y表示Y,比如merge(df1, df2, by。x = "df1ColName", by。y = "df2ColName")。

您還可以告訴歸并是否需要包含參數all的所有行,包括沒有比對的行,還是隻需要比對的行。X和所有。y。在這種情況下,我想要所有的行從延遲資料;如果查找表中沒有航空公司代碼,我仍然需要該資訊。但我不需要查找表中不在延遲資料中的行(其中有一些已不再飛行的舊航空公司的代碼)。是以,所有。x = TRUE但所有。y = FALSE。代碼如下:

joined_df <- merge(mydf, mylookup, by.x = "OP_UNIQUE_CARRIER",

by.y = "Code", all.x = TRUE, all.y = FALSE)

新的連接配接資料幀包括一個名為Description的列,其中包含基于航空公司代碼的航空公司名稱:

head(joined_df)

OP_UNIQUE_CARRIER FL_DATE ORIGIN DEST DEP_DELAY_NEW X Description1 9E 2019-08-12 JFK SYR 0 NA Endeavor Air Inc.2 9E 2019-08-12 TYS DTW 0 NA Endeavor Air Inc.3 9E 2019-08-12 ORF LGA 0 NA Endeavor Air Inc.4 9E 2019-08-13 IAH MSP 6 NA Endeavor Air Inc.5 9E 2019-08-12 DTW JFK 58 NA Endeavor Air Inc.6 9E 2019-08-12 SYR JFK 0 NA Endeavor Air Inc.

三、與dplyr連接配接

dplyr包的連接配接函數使用SQL資料庫文法。左連接配接意味着:包括左邊的所有内容(merge()中的x資料幀是什麼)和從右邊(y)資料幀比對的所有行。如果聯接列有相同的名稱,你隻需要left_join(x, y).如果它們沒有相同的名稱,你需要一個by參數,比如left_join(x, y, by = c("df1ColName"= "df2ColName"))。

注意by的文法:它是一個命名向量,左右列名都用引号括起來。

更新:從dplyr 1.1.0版本開始(2023年1月29日在CRAN上),dplyr連接配接有一個額外的by文法,使用join_by():

left_join(x, y, by = join_by(df1ColName == df2ColName))

新的join_by()幫助函數使用了不帶引号的列名和==布爾運算符,包的作者說,這個運算符在R上下文中比在c上下文中更有意義("col1" = "col2"),因為=是為了給變量指派,而不是測試是否相等。

左連接配接保留左資料幀中的所有行,隻比對來自右資料幀的行。

下面是使用left_join()導入和合并兩個資料集的代碼。它首先加載dplyr和readr包,然後用read_csv()讀入這兩個檔案。當使用read_csv()時,我不需要先解壓縮檔案。

library(dplyr)

library(readr)

mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")

mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")

joined_tibble <- left_join(mytibble, mylookup_tibble,

by = join_by(OP_UNIQUE_CARRIER == Code))

注意,dplyr的舊by文法沒有join_by()仍然有效

joined_tibble <- left_join(mytibble, mylookup_tibble,

by = c("OP_UNIQUE_CARRIER" = "Code"))

Read_csv()建立tibbles,這是一種具有一些額外功能的資料幀類型。Left_join()将兩者合并。看一下文法:在這種情況下,順序很重要。Left_join()意味着包含左邊或第一個資料集的所有行,但隻包含與第二個資料集比對的行。并且,因為我需要通過兩個不同名稱的列來連接配接,是以我包含了一個by參數。

在dplyr的開發版中,新的連接配接文法是:

joined_tibble2 <- left_join(mytibble, mylookup_tibble,

by = join_by(OP_UNIQUE_CARRIER == Code))

但是,由于大多數人可能都有CRAN版本,是以在本文的其餘部分中,我将使用dplyr最初的命名向量文法,直到join_by()成為CRAN版本的一部分。

我們可以使用dplyr的glimpse()函數檢視結果的結構,這是檢視資料幀頂部幾項的另一種方式:

glimpse(joined_tibble)Observations: 658,461Variables: 7

$ FL_DATE <date> 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01…

$ OP_UNIQUE_CARRIER <chr> "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…

$ ORIGIN <chr> "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…

$ DEST <chr> "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS…

$ DEP_DELAY_NEW <dbl> 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0, …

$ X6 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

$ Description <chr> "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …

這個合并的資料集現在有一個新列,列中包含航空公司的名稱。如果您自己運作這段代碼的一個版本,您可能會注意到dplyr比基數R快得多。

原文:

R has a number of quick, elegant ways to join data frames by a common column. I’d like to show you three of them:

· base R’s merge() function

· dplyr’s join family of functions

· data.table’s bracket syntax

Get and import the data

For this example I’ll use one of my favorite demo data sets—flight delay times from the U.S. Bureau of Transportation Statistics. If you want to follow along, head to http://bit.ly/USFlightDelays and download data for the time frame of your choice with the columns Flight Date, Reporting_Airline, Origin, Destination, and DepartureDelayMinutes. Also get the lookup table for Reporting_Airline.

Or, you can download these two data sets—plus my R code in a single file and a PowerPoint explaining different types of data merges—here:

To read in the file with base R, I’d first unzip the flight delay file and then import both flight delay data and the code lookup file with read.csv(). If you’re running the code, the delay file you downloaded will likely have a different name than in the code below. Also, note the lookup file’s unusual .csv_ extension.

unzip("673598238_T_ONTIME_REPORTING.zip")

mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",

sep = ",", quote="\"")

mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",

quote="\"", sep = "," )

Next, I’ll take a peek at both files with head():

head(mydf)

FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW X1 2019-08-01 DL ATL DFW 31 NA2 2019-08-01 DL DFW ATL 0 NA3 2019-08-01 DL IAH ATL 40 NA4 2019-08-01 DL PDX SLC 0 NA5 2019-08-01 DL SLC PDX 0 NA6 2019-08-01 DL DTW ATL 10 NA

head(mylookup)

Code Description1 02Q Titan Airways2 04Q Tradewind Aviation3 05Q Comlux Aviation, AG4 06Q Master Top Linhas Aereas Ltd.5 07Q Flair Airlines Ltd.6 09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern

Merges with base R

The mydf delay data frame only has airline information by code. I’d like to add a column with the airline names from mylookup. One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2). The order of data frame 1 and data frame 2 doesn't matter, but whichever one is first is considered x and the second one is y.

If the columns you want to join by don’t have the same name, you need to tell merge which columns you want to join by: by.x for the x data frame column name, and by.y for the y one, such as merge(df1, df2, by.x = "df1ColName", by.y = "df2ColName").

You can also tell merge whether you want all rows, including ones without a match, or just rows that match, with the arguments all.x and all.y. In this case, I’d like all the rows from the delay data; if there’s no airline code in the lookup table, I still want the information. But I don’t need rows from the lookup table that aren’t in the delay data (there are some codes for old airlines that don’t fly anymore in there). So, all.x equals TRUE but all.y equals FALSE. Here's the code:

joined_df <- merge(mydf, mylookup, by.x = "OP_UNIQUE_CARRIER",

by.y = "Code", all.x = TRUE, all.y = FALSE)

The new joined data frame includes a column called Description with the name of the airline based on the carrier code:

head(joined_df)

OP_UNIQUE_CARRIER FL_DATE ORIGIN DEST DEP_DELAY_NEW X Description1 9E 2019-08-12 JFK SYR 0 NA Endeavor Air Inc.2 9E 2019-08-12 TYS DTW 0 NA Endeavor Air Inc.3 9E 2019-08-12 ORF LGA 0 NA Endeavor Air Inc.4 9E 2019-08-13 IAH MSP 6 NA Endeavor Air Inc.5 9E 2019-08-12 DTW JFK 58 NA Endeavor Air Inc.6 9E 2019-08-12 SYR JFK 0 NA Endeavor Air Inc.

Joins with dplyr

The dplyr package uses SQL database syntax for its join functions. A left join means: Include everything on the left (what was the x data frame in merge()) and all rows that match from the right (y) data frame. If the join columns have the same name, all you need is left_join(x, y). If they don’t have the same name, you need a by argument, such as left_join(x, y, by = c("df1ColName" = "df2ColName")).

Note the syntax for by: It’s a named vector, with both the left and right column names in quotation marks.

Update: Starting with dplyr version 1.1.0 (on CRAN as of January 29, 2023), dplyr joins have an additional by syntax using join_by():

left_join(x, y, by = join_by(df1ColName == df2ColName))

The new join_by() helper function uses unquoted column names and the == boolean operator, which package authors say makes more sense in an R context than c("col1" = "col2"), since = is meant for assigning a value to a variable, not testing for equality.

A left join keeps all rows in the left data frame and only matching rows from the right data frame.

The code to import and merge both data sets using left_join() is below. It starts by loading the dplyr and readr packages, and then reads in the two files with read_csv(). When using read_csv(), I don’t need to unzip the file first.

library(dplyr)

library(readr)

mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")

mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")

joined_tibble <- left_join(mytibble, mylookup_tibble,

by = join_by(OP_UNIQUE_CARRIER == Code))

Note that dplyr's older by syntax without join_by() still works

joined_tibble <- left_join(mytibble, mylookup_tibble,

by = c("OP_UNIQUE_CARRIER" = "Code"))

read_csv() creates tibbles, which are a type of data frame with some extra features. left_join() merges the two. Take a look at the syntax: In this case, order matters. left_join() means include all rows on the left, or first, data set, but only rows that match from the second one. And, because I need to join by two differently named columns, I included a by argument.

The new join syntax in the development-only version of dplyr would be:

joined_tibble2 <- left_join(mytibble, mylookup_tibble,

by = join_by(OP_UNIQUE_CARRIER == Code))

Since most people likely have the CRAN version, however, I will use dplyr's original named-vector syntax in the rest of this article, until join_by() becomes part of the CRAN version.

We can look at the structure of the result with dplyr’s glimpse() function, which is another way to see the top few items of a data frame:

glimpse(joined_tibble)Observations: 658,461Variables: 7

$ FL_DATE <date> 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01…

$ OP_UNIQUE_CARRIER <chr> "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…

$ ORIGIN <chr> "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…

$ DEST <chr> "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS…

$ DEP_DELAY_NEW <dbl> 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0, …

$ X6 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

$ Description <chr> "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …

This joined data set now has a new column with the name of the airline. If you run a version of this code yourself, you’ll probably notice that dplyr is way faster than base R.

CXO聯盟(CXO union)是一家聚焦于CIO,CDO,cto,ciso,cfo,coo,chro,cpo,ceo等人群的平台組織,其中在CIO會議領域的領頭羊,目前舉辦了大量的CIO大會、CIO論壇、CIO活動、CIO會議、CIO峰會、CIO會展。如華東CIO會議、華南cio會議、華北cio會議、中國cio會議、西部CIO會議。在這裡,你可以參加大量的IT大會、IT行業會議、IT行業論壇、IT行業會展、數字化論壇、數字化轉型論壇,在這裡你可以認識很多的首席資訊官、首席數字官、首席财務官、首席技術官、首席人力資源官、首席營運官、首席執行官、IT總監、财務總監、資訊總監、營運總監、采購總監、供應鍊總監。

數字化轉型網(www.szhzxw.cn資訊媒體,是企業數字化轉型的必讀參考,在這裡你可以學習大量的知識,如财務數字化轉型、供應鍊數字化轉型、營運數字化轉型、生産數字化轉型、人力資源數字化轉型、市場營銷數字化轉型。通過關注我們的公衆号,你就知道如何實作企業數字化轉型?數字化轉型如何做?

【CXO聯盟部分會員】興達投資集團CEO、江蘇中超投資集團CEO、江蘇江潤銅業CEO、浙江協和集團CEO、山東墾利石化集團CEO、江蘇長電科技股份CEO、重慶鋼鐵股份CEO、天合光能股份CEO、江蘇恒瑞醫藥股份CEO、奧盛集團CEO、山鷹國際控股股份公司CEO、成都蛟龍投資CEO、宜賓天原集團股份CEO、勝達集團CEO、三寶集團股份CEO、山西晉城鋼鐵控股集團CEO、河北鑫達鋼鐵集團CEO、深圳市中金嶺南有色金屬股份CEO、甯波申洲針織CEO、廣西柳工集團CEO、蘇州創元投資發展(集團)CEO、邯鄲正大制管CEO、人福醫藥集團股份公司CEO、浙江升華控股集團CEO、河南濟源鋼鐵(集團)CEO、達利食品集團CEO、廣西汽車集團CEO、孝義市鵬飛實業CEO、宗申産業集團CEO、天津紡織集團(控股)CEO、晶澳太陽能科技股份CEO、唐山瑞豐鋼鐵(集團)CEO、重慶萬達薄闆CEO、唐山三友集團CEO、淩源鋼鐵集團CEO、甯波博洋控股集團CEO、天津市醫藥集團CEO、福建省汽車工業集團CEO、山西安泰控股集團CEO、牧原食品股份CEO、上海儀電(集團)CEO、上海勝華電纜(集團)CEO、大亞科技集團CEO、天津恒興集團CEO、攀枝花鋼城集團CEO、桂林力源糧油食品集團CEO、萬馬聯合控股集團CEO、山東鑫海科技股份CEO、江蘇上上電纜集團CEO、廣西貴港鋼鐵集團CEO、久立集團股份CEO