Merging two data frames in R (data frames not merging correctly)


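I am having some trouble using the merge function in R on two data frames. I have two large data frames that share several columns containing the same data, and those are the columns I use to merge them.

For example, DataFrame1 has the columns Instrument, RecordDate, HourMinuteSecond, MilliSecond, and so on, and DataFrame2 has the columns Instrument, RecordDate, HourMinuteSecond, MilliSecond, and so on. Beyond these, each data frame has several other columns that the other does not. I am currently using the merge function as follows: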

DataFrame3 <- merge(DataFrame2, DataFrame1, by=c("Instrument", "RecordDate","HourMinuteSecond","MilliSecond"))

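Note that there are other columns, but I have omitted them. I then compared the key columns as if they were vectors. First, I used the identical function to compare the individual values, which gave the following: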

> identical(DataFrame1[120486,1] ,DataFrame2[65,1])
[1] FALSE
> identical(DataFrame1[120486,2] ,DataFrame2[65,2])
[1] TRUE
> identical(DataFrame1[120486,3] ,DataFrame2[65,3])
[1] FALSE
> identical(DataFrame1[120486,4] ,DataFrame2[65,4])
[1] TRUE

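Judging by identical, the values in the Instrument and HourMinuteSecond columns differ from each other. Can anyone tell me what is causing this? Thanks in advance.

Edit: here is the dput output; I hope this is what you were referring to: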

> dput(droplevels(DataFrame2[65,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"), 
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"), 
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"), 
MilliSecond = 42L), .Names = c("Instrument", "RecordDate", 
"HourMinuteSecond", "MilliSecond"), row.names = 65L, class = "data.frame")

> dput(droplevels(DataFrame1[120486,1:4]))
structure(list(Instrument = structure(1L, .Label = "DTE", class = "factor"), 
RecordDate = structure(1L, .Label = "6/4/2012", class = "factor"), 
HourMinuteSecond = structure(1L, .Label = "16:10:27", class = "factor"), 
MilliSecond = 42L), .Names = c("Instrument", "RecordDate", 
"HourMinuteSecond", "MilliSecond"), row.names = 120486L, class = "data.frame")

Here is the output of str:

> str(DataFrame1)
'data.frame':   317495 obs. of  9 variables:
 $ Instrument      : Factor w/ 4 levels "CDD","DTE","ERA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ RecordDate      : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HourMinuteSecond: Factor w/ 21763 levels "10:02:02","10:02:03",..: 14 14 14 17 19 22 24 25 25 25 ...
 $ MilliSecond     : int  26 57 158 70 73 8 926 448 457 458 ...
 $ L1BidPrice      : num  6.91 6.91 6.91 6.91 6.91 6.91 6.9 6.9 6.89 6.89 ...
 $ L1BidVolume     : int  520 504 504 504 504 508 20 4 20 20 ...
 $ L1AskPrice      : num  6.92 6.92 6.92 6.92 6.92 6.92 6.91 6.91 6.9 6.9 ...
 $ L1AskVolume     : int  3917 3917 3915 3932 3915 3915 3407 3407 13 30 ...
 $ Midquote        : num  6.92 6.92 6.92 6.92 6.92 ...

> str(DataFrame2)
'data.frame':   577 obs. of  15 variables:
 $ Instrument       : Factor w/ 2 levels "DTE","ERA": 1 1 1 1 1 1 1 1 1 1 ...
 $ RecordDate       : Factor w/ 30 levels "5/18/2012","5/21/2012",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ HourMinuteSecond : Factor w/ 317 levels "10:02:10","10:02:21",..: 301 301 301 301 301 301 2 98 129 130 ...
 $ MilliSecond      : int  45 45 45 45 45 45 485 6 92 300 ...
 $ RecordType       : Factor w/ 1 level "TRADE": 1 1 1 1 1 1 1 1 1 1 ...
 $ Price            : num  0.195 0.195 0.195 0.195 0.195 0.195 0.2 0.19 0.19 0.185 ...
 $ Volume           : int  2686 6350 6350 6350 1620 3064 1 13986 25000 23092 ...
 $ UndisclosedVolume: Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
 $ DollarValue      : num  524 1238 1238 1238 316 ...
 $ Qualifiers       : Factor w/ 4 levels "\\N","AC","Bi",..: 2 2 2 2 2 2 4 4 3 4 ...
 $ BidID            : num  6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
 $ AskID            : num  6.13e+18 6.13e+18 6.13e+18 6.13e+18 6.13e+18 ...
 $ BidOrAsk         : Factor w/ 1 level "\\N": 1 1 1 1 1 1 1 1 1 1 ...
 $ BuyerBrokerID    : int  229 229 229 229 229 229 236 129 229 112 ...
 $ SellerBrokerID   : int  297 210 210 210 110 157 229 229 299 229 ...

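Comments:

Please use dput to post here the 2 records that merge considers dissimilar. Perhaps you have different factors and levels?

Posting some dput output, e.g. dput(DataFrame2[65,1:4]), or dput(droplevels(DataFrame2[65,1:4])) if that is too long, and the same for the other row, would make everything clear.

More likely, if you run str on both frames, you will find that in the first column one data frame holds the values as a factor and the other as character data ("DTE"), or that the factors have different levels. The times in the third column look the same, but they may have different time formats or be factors with different levels... use str to sort it out!

Hi, I have added the dput output. I used droplevels because the output was very large and I could not seem to upload a file. I will also try the str function now.

Regarding str, how do different levels affect the data? Unfortunately I am not very familiar with this function or what it means. Do you know what I should do to fix it?

Thank you for writing this out, but I have another problem: it turns out my version of R is not compatible with the data.table package. I have the latest version of R / RStudio, yet it cannot be installed. Do you know how to solve this? I have been looking for a workaround but still cannot find anything :/

devtools will not install either; it also seems to be incompatible with my version of R.

Thanks for the help. Running getRversion() in RStudio gives '2.15.2'. For example, when I try to install data.table I get the message below (devtools gives a similar one): Installing package(s) into 'C:/Users/Daniel/Documents/R/win-library/2.15' (as 'lib' is unspecified). Warning in install.packages: package 'data.table' is not available (for R version 2.15.2)

Thanks heaps! I think it works! I just have one last question. In your answer there is the line setDT(df1)[df2, on = c('Instrument', 'RecordDate', 'HourMinuteSecond', 'MilliSecond')], and then you write setDT(df1), setDT(df2), merge(df1, df2, by = c('Instrument', 'RecordDate', 'HourMinuteSecond', 'MilliSecond')). Did you write this to show two different ways of merging the data frames, or is the second part meant to be a final step? I hope my question makes sense.

Thank you, I really appreciate your help! Yes, your approach seems to give the correct result, so I think it is better :)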
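The point about factor levels raised in the comments can be illustrated with a small sketch. It assumes two made-up single-value factors, f1 and f2 (hypothetical stand-ins for DataFrame1[120486,1] and DataFrame2[65,1]), whose level sets are modeled on the str output. The fourth level of the first factor is elided there, so only the three visible ones are used. Under that assumption, identical() reports FALSE even though both hold the label "DTE", and droplevels() hides the difference, which may be why the two dput outputs look the same:

```r
# Hypothetical stand-ins: same label, different level sets.
f1 <- factor("DTE", levels = c("CDD", "DTE", "ERA"))
f2 <- factor("DTE", levels = c("DTE", "ERA"))

# identical() compares the whole object: integer codes and the levels attribute.
same_object <- identical(f1, f2)

# Comparing the labels as character strings ignores the level sets.
same_label <- identical(as.character(f1), as.character(f2))

# droplevels() removes unused levels, so both factors collapse to one level "DTE".
same_after_drop <- identical(droplevels(f1), droplevels(f2))

print(c(same_object = same_object,
        same_label = same_label,
        same_after_drop = same_after_drop))
print(c(code_f1 = as.integer(f1), code_f2 = as.integer(f2)))
```

In this sketch "DTE" is stored as code 2 in f1 and as code 1 in f2, so the objects differ although the values a reader sees are the same. Comparing with as.character() is the safer check when the level sets may differ.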

# load the data.table package, which is designed for large data sets (df1 and df2 are defined further down)
library('data.table')

# convert factors into character
col1 <- colnames(df1)[sapply(df1, is.factor)]  # get columns that are factors for df1
col2 <- colnames(df2)[sapply(df2, is.factor)]  # get columns that are factors for df2

for(col in col1){   # df1
  set(df1, , col, as.character( df1[[col]] ) )    # for more info on set() function, read ?`:=`
}

for(col in col2){   # df2
  set(df2, , col, as.character( df2[[col]] ) )
}

# join two data frames by the selected columns in 'on' argument
setDT(df1)[df2, on = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond')]   # setDT converts data frame to data table by reference
#    Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote i.L1BidPrice i.L1BidVolume i.L1AskPrice i.L1AskVolume
# 1:        DTE   6/4/2012         16:10:27          42       6.91         520       6.92        3917     6.92            7             8            9            10
#    i.Midquote
# 1:         11

# data.table's merge method is faster than base R's merge; convert both data frames to data.tables first
setDT(df1)
setDT(df2)
merge(df1, df2, by = c('Instrument', 'RecordDate', 'HourMinuteSecond','MilliSecond'))

Data:

df1 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27", 
                      MilliSecond = 42L, L1BidPrice = 6.91, L1BidVolume = 520, 
                      L1AskPrice = 6.92, L1AskVolume = 3917, Midquote = 6.92), .Names = c("Instrument", 
                                                                                          "RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice", 
                                                                                          "L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = c(NA, -1L), class = "data.frame") 

df2 <- structure(list(Instrument = "DTE", RecordDate = "6/4/2012", HourMinuteSecond = "16:10:27", 
                      MilliSecond = 42L, L1BidPrice = 7, L1BidVolume = 8, L1AskPrice = 9, 
                      L1AskVolume = 10, Midquote = 11), .Names = c("Instrument", 
                                                                   "RecordDate", "HourMinuteSecond", "MilliSecond", "L1BidPrice", 
                                                                   "L1BidVolume", "L1AskPrice", "L1AskVolume", "Midquote"), row.names = 120486L, class = "data.frame")


df1
#    Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 1:        DTE   6/4/2012         16:10:27          42       6.91         520       6.92        3917     6.92

df2
#        Instrument RecordDate HourMinuteSecond MilliSecond L1BidPrice L1BidVolume L1AskPrice L1AskVolume Midquote
# 120486:        DTE   6/4/2012         16:10:27          42          7           8          9          10       11
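
For readers who, like the asker on R 2.15.2, cannot install data.table, the same idea (convert the factor key columns to character, then merge on the four keys) can be sketched in base R. This is only an illustration under assumptions: quotes and trades are small made-up frames rather than the asker's data, and factors_to_character is a hypothetical helper name, not a function from any package.

```r
# Small stand-in frames (hypothetical values, not the asker's data).
quotes <- data.frame(
  Instrument       = factor(c("CDD", "DTE")),
  RecordDate       = factor(c("6/4/2012", "6/4/2012")),
  HourMinuteSecond = factor(c("16:10:26", "16:10:27")),
  MilliSecond      = c(7L, 42L),
  L1BidPrice       = c(6.90, 6.91)
)
trades <- data.frame(
  Instrument       = factor("DTE"),
  RecordDate       = factor("6/4/2012"),
  HourMinuteSecond = factor("16:10:27"),
  MilliSecond      = 42L,
  Price            = 0.195
)

# Hypothetical helper: convert every factor column to character,
# leaving all other columns untouched.
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], as.character)
  df
}

keys <- c("Instrument", "RecordDate", "HourMinuteSecond", "MilliSecond")

# Base R inner join on the four key columns.
merged <- merge(factors_to_character(trades),
                factors_to_character(quotes),
                by = keys)
print(merged)
```

With character keys, the comparison no longer depends on how each frame happened to encode its factor levels, at the cost of some extra memory on very large frames.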