R: comparing two columns element-wise


I have a large data set df (354903 rows) with two columns named df$CompleteName and df$CompleteName.1:

head(df)
       CompleteName       CompleteName.1
1   Lefebvre Arnaud Lefebvre Schuhl Anne
1.1 Lefebvre Arnaud              Abe Lyu
1.2 Lefebvre Arnaud              Abe Lyu
1.3 Lefebvre Arnaud       Louvet Nicolas
1.4 Lefebvre Arnaud   Muller Jean Michel
1.5 Lefebvre Arnaud  De Dinechin Florent

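I am trying to create a label showing whether the two names are the same [1 if they are the same, 0 if not]. When I try it on a small subset, it seems to work: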

> match(df$CompleteName[1], df$CompleteName.1[1], nomatch = 0)
[1] 0
> match(df$CompleteName[1:10], df$CompleteName.1[1:10], nomatch = 0)
[1] 0 0 0 0 0 0 0 0 0 0

But as soon as I pass in the complete columns, it gives completely different values, which make no sense to me:

> match(df$CompleteName, df$CompleteName.1, nomatch = 0)
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[23] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[45] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101

Should I use sapply? I could not figure it out; I tried this and got an error:

 sapply(df, function(x) match(x$CompleteName, x$CompleteName.1, nomatch = 0))

Please help.

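From the manual page for match:

"match returns a vector of the positions of (first) matches of its first argument in its second."

So your data seem to indicate that the first match of "Lefebvre Arnaud" (the first element of the first argument) is at row 101 of the second column. I believe what you intend is a simple element-by-element comparison, which is just the equality operator ==.

Some sample data: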

> a <- rep ("Lefebvre Arnaud", 6)
> b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
> x <- data.frame(a, b, stringsAsFactors = FALSE)
> x
                a                   b
1 Lefebvre Arnaud             Abe Lyu
2 Lefebvre Arnaud             Abe Lyu
3 Lefebvre Arnaud     Lefebvre Arnaud
4 Lefebvre Arnaud De Dinechin Florent
5 Lefebvre Arnaud De Dinechin Florent
6 Lefebvre Arnaud De Dinechin Florent
> x$a == x$b
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
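
To make the contrast concrete, here is a minimal sketch on the same sample vectors. It only illustrates the difference between the two calls; the names pos and same are made up for this example. match() reports, for every element of a, the position of its first occurrence anywhere in b, which is how the question ends up with 101 repeated. The == operator compares the vectors position by position.

```r
a <- rep("Lefebvre Arnaud", 6)
b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))

# match(): position of the first occurrence of each element of a within b.
# "Lefebvre Arnaud" first appears at position 3 of b, so every entry is 3.
pos <- match(a, b, nomatch = 0)
print(pos)   # 3 3 3 3 3 3

# ==: element-wise comparison, converted to the 1/0 label asked for.
same <- as.integer(a == b)
print(same)  # 0 0 1 0 0 0
```

On the 10-row subset in the question, match() returned all zeros only because "Lefebvre Arnaud" did not occur anywhere in the first 10 values of the second column.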


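Edit: Also, make sure you are comparing apples to apples, so double-check the data types of the columns. Use str(df) to see whether the columns are character strings or factors. You can either construct the data frame with stringsAsFactors = FALSE, or convert the factors to character afterwards. There are several ways to do this.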
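
One possible way to do the conversion, as a sketch rather than the method the answer had in mind: find the factor columns and convert them to character in place before comparing. The data frame below is made-up sample data, stringsAsFactors = TRUE is set explicitly so the example behaves the same on any R version, and the column name same is an assumption for illustration.

```r
df <- data.frame(
  CompleteName   = rep("Lefebvre Arnaud", 6),
  CompleteName.1 = c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud",
                     rep("De Dinechin Florent", 3)),
  stringsAsFactors = TRUE   # force factors to mimic the problematic case
)

# Find the factor columns and convert each of them to character
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

str(df)

# With character columns, == compares the strings row by row
df$same <- as.integer(df$CompleteName == df$CompleteName.1)
print(df$same)  # 0 0 1 0 0 0
```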


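As others have pointed out, match is not the right tool here. What you want is equality, which you can test with ==; that gives you TRUE/FALSE. Wrapping the result in as.numeric then gives you the 1/0 you want, or wrapping it in which gives you the indices of the matching rows.

However, you may still have a factor problem: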
 # making up some similar data (adapted from the earlier answer);
 # stringsAsFactors = TRUE reproduces the default of R versions before 4.0.0
 a <- rep("Lefebvre Arnaud", 6)
 b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
 df <- data.frame(CompleteName = a, CompleteName.1 = b, stringsAsFactors = TRUE)
 which(df$CompleteName == df$CompleteName.1)
 # fails with an error: level sets of factors are different

 str(df)
 # 'data.frame':    6 obs. of  2 variables:
 # $ CompleteName  : Factor w/ 1 level "Lefebvre Arnaud": 1 1 1 1 1 1
 # $ CompleteName.1: Factor w/ 3 levels "Abe Lyu","De Dinechin Florent",..: 1 1 3 2 2 2
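
As a hedged illustration of the points above: comparing two factors whose level sets differ raises an error, while comparing their character values works. This sketch uses made-up sample data with stringsAsFactors = TRUE set explicitly. The names err, eq, label and idx are assumptions for this example only.

```r
a <- rep("Lefebvre Arnaud", 6)
b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
df <- data.frame(CompleteName = a, CompleteName.1 = b, stringsAsFactors = TRUE)

# Factors with different level sets cannot be compared with ==;
# capture the error message instead of stopping the script.
err <- tryCatch(df$CompleteName == df$CompleteName.1,
                error = function(e) conditionMessage(e))
print(err)

# Comparing the underlying strings works regardless of the level sets
eq <- as.character(df$CompleteName) == as.character(df$CompleteName.1)

label <- as.numeric(eq)  # 1/0 label per row
idx   <- which(eq)       # indices of the rows where the names agree
print(label)  # 0 0 1 0 0 0
print(idx)    # 3
```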

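To avoid this problem in the future, run options(stringsAsFactors = FALSE) at the start of your R session (or put it at the top of your .R script). Since R 4.0.0, stringsAsFactors = FALSE is already the default for data.frame().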


Here is a solution using data.table, with a performance comparison against the data.frame solution, based on the same number of records as in your case:
col1 = sample(x = letters, size = 354903, replace = TRUE)
col2 = sample(x = letters, size = 354903, replace = TRUE)

library(data.table)
dt = data.table(col1 = col1, col2 = col2)
df = data.frame(col1 = col1, col2 = col2)

# comparing the 2 columns
system.time(dt$col1==dt$col2)
system.time(df$col1==df$col2)

# storing the comparison in the table/frame itself
system.time(dt[, col3:= (col1==col2)])
system.time({df$col3 = (df$col1 == df$col2)})

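The data.table approach gave a noticeable speedup on my machine: from 0.020s to 0.008s. Try it yourself. I know it does not matter much for this few rows, but multiply the row count by 1000 and you will see a big difference.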
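
Single system.time() calls in the tens of milliseconds are noisy, so it may help to repeat the measurement. The following is a rough sketch in base R only (data.table is not needed to run it). median_elapsed is a hypothetical helper written for this example, and the seed is arbitrary. It also times an element-by-element loop, roughly what an sapply-style approach would amount to, to show where most of the time goes. No timings are quoted here because they depend on the machine.

```r
set.seed(42)   # arbitrary seed, only so the sample is reproducible
n <- 354903    # same number of rows as in the question
col1 <- sample(letters, n, replace = TRUE)
col2 <- sample(letters, n, replace = TRUE)

# Hypothetical helper: median elapsed seconds of f() over `reps` runs
median_elapsed <- function(f, reps = 5) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

# Vectorised comparison: one call over the whole columns
vectorised <- col1 == col2
t_vectorised <- median_elapsed(function() col1 == col2)

# Element-by-element comparison, one function call per row
compare_loop <- function() {
  vapply(seq_len(n), function(i) col1[i] == col2[i], logical(1))
}
looped <- compare_loop()
t_looped <- median_elapsed(compare_loop, reps = 1)

# Both approaches give the same answer; only the cost differs
print(identical(vectorised, looped))
print(c(vectorised = t_vectorised, looped = t_looped))
```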
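Comments:

You probably don't want match: it gives the position of the match in the second column, not whether the two values are equal. If you have strings, you can use as.numeric(df$CompleteName == df$CompleteName.1)

Also, use stringsAsFactors = FALSE when building the data.frame

@thelatemail As others have pointed out, match does not work here. My comment was meant as an addition to @jeremycg's.

There is also no evidence that these columns are factor columns, is there?

@jaimedash - That is not what I meant. I meant that we do not know whether the OP has factor columns. Nothing in the question tells us whether they are factor or character. It is no big deal, though. This is one of the reasons dput() is preferred when posting data in a question.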