Slow loop in R: how can I make it faster?
I have a list of emails, and I want to use the longest common substring to compare patterns (similarity) between rows. The data is a data frame containing the emails:
V1
1 "01003@163.com"
2 "cloud@coldmail.com"
3 "den_smukk_kiilar@hotmail.com"
4 "Esteban.verduzco@gmail.com"
5 "freiheitmensch@gmail.com"
6 "mitsoanastos@yahoo.com"
7 "ahmedsir744@yahoo.com"
8 ...
Here is my code:
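library(stringdist)

# output3 and dfrow are not defined in the snippet as posted; initialized here
output3 <- data.frame(email1 = character(), email2 = character(),
                      stringsAsFactors = FALSE)
dfrow <- 1
n <- nrow(data)

for (i in seq_len(n - 1)) {          # stop at n - 1 so that (i + 1):n never counts backwards
  sample <- as.character(data[i, ])
  for (j in (i + 1):n) {
    candidate <- as.character(data[j, ])
    # LCS distance <= 3: at most 3 unmatched characters, counted across both strings
    if (stringdist(candidate, sample, method = "lcs") <= 3) {
      output3[dfrow, ] <- c(sample, candidate)
      dfrow <- dfrow + 1
    }
  }
}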
I have 300,000 emails, and this takes a very long time.
Is there a better way?
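Much of the cost in a loop like this comes from calling stringdist() once per pair and from growing the result one row at a time. Since stringdist() is vectorized over its arguments, the inner loop can be replaced by a single call per row. The following is a minimal sketch, assuming the emails are available as a character vector (for example data$V1); find_similar and max_dist are hypothetical names, not part of the stringdist package.

```r
library(stringdist)

# Hypothetical helper: returns every pair (i < j) whose LCS distance is <= max_dist.
find_similar <- function(emails, max_dist = 3) {
  emails <- as.character(emails)
  empty <- data.frame(email1 = character(), email2 = character(),
                      stringsAsFactors = FALSE)
  n <- length(emails)
  if (n < 2) return(empty)

  hits <- vector("list", n - 1)          # preallocate instead of growing a data frame
  for (i in seq_len(n - 1)) {
    rest <- emails[(i + 1):n]
    # One vectorized call compares email i with every later email.
    d <- stringdist(emails[i], rest, method = "lcs")
    keep <- d <= max_dist
    if (any(keep)) {
      hits[[i]] <- data.frame(email1 = emails[i], email2 = rest[keep],
                              stringsAsFactors = FALSE)
    }
  }

  hits <- Filter(Negate(is.null), hits)  # drop rows that produced no match
  if (length(hits) == 0) return(empty)
  do.call(rbind, hits)
}
```

This still performs n(n-1)/2 comparisons in total, but each iteration only holds one vector of distances in memory, so it should remain usable where a full pairwise matrix does not fit.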
Thanks.

Here is an attempt:
library(stringdist)
library(stringi)
library(dplyr)
library(tidyr)
# Hypothetical data frame
data <- data.frame(
  V1 = paste0(stri_rand_strings(5, 3, "[a-z]"), "@",
              stri_rand_strings(5, 2, "[a-z]"), ".com"),
  stringsAsFactors = FALSE
)
Which gives:
# email1 email2
#1 kty@hm.com ryu@iq.com
#2 brs@wk.com pib@uo.com
#3 brs@wk.com ryu@iq.com
#4 pib@uo.com brs@wk.com
#5 ryu@iq.com kty@hm.com
#6 ryu@iq.com brs@wk.com
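The step that turns data into the table of pairs is not shown here. As a rough illustration only, and not necessarily what was used to produce the output above, an all-pairs comparison can be sketched with stringdistmatrix(). The names pairs_by_matrix and max_dist are hypothetical, and unlike the output above this sketch lists each pair only once.

```r
library(stringdist)

# Hypothetical helper: all-pairs comparison through a full distance matrix.
pairs_by_matrix <- function(emails, max_dist = 3) {
  emails <- as.character(emails)
  # n x n matrix of LCS distances; memory grows quadratically with n.
  m <- stringdistmatrix(emails, emails, method = "lcs")
  # Keep each unordered pair once: upper triangle only, diagonal excluded.
  idx <- which(m <= max_dist & upper.tri(m), arr.ind = TRUE)
  data.frame(email1 = emails[idx[, 1]], email2 = emails[idx[, 2]],
             stringsAsFactors = FALSE)
}
```

Any approach that materializes all pairs at once scales poorly: for 300,000 strings a full matrix has 9e10 entries, which is roughly 720 GB at 8 bytes per value.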
Thanks Steven, but this solution does not work for a list of 300,000 emails. It runs out of memory...
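One possible way to reduce both memory use and the number of comparisons without changing the result: the LCS distance between two strings can never be smaller than the difference of their lengths, so any pair whose lengths differ by more than the threshold can be skipped without computing a distance. Below is a minimal sketch built on that observation, assuming the emails are in a character vector; find_similar_pruned and max_dist are hypothetical names.

```r
library(stringdist)

# Hypothetical helper: skip pairs whose length difference already exceeds max_dist.
find_similar_pruned <- function(emails, max_dist = 3) {
  emails <- as.character(emails)
  empty <- data.frame(email1 = character(), email2 = character(),
                      stringsAsFactors = FALSE)
  n <- length(emails)
  if (n < 2) return(empty)

  emails <- emails[order(nchar(emails))]   # sort by length, shortest first
  len <- nchar(emails)
  hits <- vector("list", n - 1)

  for (i in seq_len(n - 1)) {
    # Index of the last string that is at most max_dist characters longer.
    last <- findInterval(len[i] + max_dist, len)
    if (last <= i) next                    # no candidate within the length window
    cand <- emails[(i + 1):last]
    d <- stringdist(emails[i], cand, method = "lcs")
    keep <- d <= max_dist
    if (any(keep)) {
      hits[[i]] <- data.frame(email1 = emails[i], email2 = cand[keep],
                              stringsAsFactors = FALSE)
    }
  }

  hits <- Filter(Negate(is.null), hits)
  if (length(hits) == 0) return(empty)
  do.call(rbind, hits)
}
```

Memory stays proportional to the size of one length window rather than to all pairs. How much time this saves depends on how spread out the email lengths are; if most addresses have similar lengths, the number of comparisons remains close to quadratic.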