R arulesSequences-事务中存在哪些频繁序列?需要更通用的方法
前面的问题R arulesSequences-事务中存在哪些频繁序列?需要更通用的方法,r,arules,R,Arules,前面的问题 在中,我问如何提取所谓的tidList,该tidList提供有关所发现的频繁序列是否存在于用于挖掘这些频繁序列的每个事务中的信息。更具体地说,如何提取布尔矩阵(表示序列的存在或不存在),从而使行顺序与原始事务数据集中的行顺序相同? 最终,通过使用存储在类序列对象中的tidList的transactionInfo属性,这变得非常容易 新问题 这个问题与前面的问题有点不同:在给定一组频繁序列的情况下,如何对新事务进行“评分”。即,给定序列类型的对象,如何从事务类型的新对象获取tidLis
在中,我问如何提取所谓的tidList,该tidList提供有关所发现的频繁序列是否存在于用于挖掘这些频繁序列的每个事务中的信息。更具体地说,如何提取布尔矩阵(表示序列的存在或不存在),从而使行顺序与原始事务数据集中的行顺序相同?
最终,通过使用存储在类序列对象中的tidList的transactionInfo属性,这变得非常容易 新问题
这个问题与前面的问题有点不同:在给定一组频繁序列的情况下,如何对新事务进行“评分”。即,给定序列类型的对象,如何从事务类型的新对象获取tidList类型的对象 为了说明这一点,我使用一些玩具数据集设计了一个示例:
library(arules)
library(arulesSequences)
library(stringr)
#Function used to convert character string into an object of type transactions.
#Source: https://github.com/cran/clickstream/blob/master/R/Clickstream.r.
as.transactions <- function(clickstreamList) {
transactionID <- unlist(lapply(seq(1, length(clickstreamList), 1), FUN =
function(x) rep(names(clickstreamList)[x], length(clickstreamList[[x]]))), use.names = F)
sequenceID <- unlist(lapply(seq(1, length(clickstreamList), 1), FUN =
function(x) rep(x, length(clickstreamList[[x]]))))
eventID <- unlist(lapply(clickstreamList, FUN = function(x)
1:length(x)), use.names = F)
transactionInfo <- data.frame(transactionID, sequenceID, eventID)
tr <- as(as.data.frame(unlist(clickstreamList, use.names = F)), "transactions")
transactionInfo(tr) <- transactionInfo
itemInfo(tr)$labels <- itemInfo(tr)$levels
return(tr)
}
#Dataset to mine frequent sequences from
data_mine_freq_seq <- data.frame(id = 1:10,
transaction = c("A B B A",
"A B C B D C B B B F A",
"A A B",
"B A B A",
"A B B B B",
"A A A B",
"A B B A B B",
"E F F A C B D A B C D E",
"A B B A B",
"A B"))
#Convert data to list containing character vectors
data_for_fseq_mining <- str_split(string = data_mine_freq_seq$transaction, pattern = " ")
#Include identifiers as names
names(data_for_fseq_mining) <- data_mine_freq_seq$id
#Convert to object of type transactions
data_for_fseq_mining_trans <- as.transactions(clickstreamList = data_for_fseq_mining)
#Mine frequent sequences with cspade, given some parameters.
sequences <- cspade(data = data_for_fseq_mining_trans,
parameter = list(support = 0.10, maxlen = 4, maxgap = 2),
control = list(tidList = TRUE, verbose = TRUE))
#Create a data frame that contains all sequences and their support (167 sequences in total).
sequences_df <- cbind(sequence = labels(sequences),
support = sequences@quality)
这将产生预期和期望的结果:
tidLists in sparse format with
167 items/itemsets (rows) and
1 transactions (columns)
但当新事务包含不在原始数据集中的元素时,会发生错误:
#Added a 'G' at the end of the transaction. Element 'G' is not an element in
#'data_mine_freq_seq'.
data_score <- data.frame(id = 11, transaction = "A B B C D A G")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
#Score 'data_score_trans' using 'sequences' again:
supportingTransactions(x = sequences, transactions = data_score_trans)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
#在交易结束时添加了一个“G”。元素“G”不是中的元素
#“数据挖掘频率顺序”。
data_score我想出了一个利用正则表达式功能的解决方法。我定义了以下函数:
score_pattern <- function(pattern, events){
regex_elements <- str_extract_all(string = pattern, pattern = "\\{.*?\\}")
regex_elements <- str_replace_all(string = unlist(regex_elements),
pattern = "\\{|\\}", replacement = "")
expr <- ""
for(i in 1:length(regex_elements)){
if(i == 1){
expr <- paste0(expr, "(^| )", regex_elements[i], collapse = "")
} else {
expr <- paste0(expr, "( | .*? )", regex_elements[i], collapse = "")
}
}
expr <- paste0(expr, "( |$)", collapse = "")
print(expr)
score_pattern <- ifelse(test = grepl(pattern = expr, x = events) == TRUE,
yes = 1, no = 0)
return(score_pattern)
}
score\u模式
#Added a 'G' at the end of the transaction. Element 'G' is not an element in
#'data_mine_freq_seq'.
data_score <- data.frame(id = 11, transaction = "A B B C D A G")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
#Score 'data_score_trans' using 'sequences' again:
supportingTransactions(x = sequences, transactions = data_score_trans)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
score_pattern <- function(pattern, events){
regex_elements <- str_extract_all(string = pattern, pattern = "\\{.*?\\}")
regex_elements <- str_replace_all(string = unlist(regex_elements),
pattern = "\\{|\\}", replacement = "")
expr <- ""
for(i in 1:length(regex_elements)){
if(i == 1){
expr <- paste0(expr, "(^| )", regex_elements[i], collapse = "")
} else {
expr <- paste0(expr, "( | .*? )", regex_elements[i], collapse = "")
}
}
expr <- paste0(expr, "( |$)", collapse = "")
print(expr)
score_pattern <- ifelse(test = grepl(pattern = expr, x = events) == TRUE,
yes = 1, no = 0)
return(score_pattern)
}
score_pattern(pattern = "<{B},{A}>", events = data_score$transaction)
[1] "(^| )B( | .*? )A( |$)"
[1] 1