How to find phrases shared between strings in R
Suppose I have the following strings:
c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
When you build n-grams with unnest_tokens from tidytext, you cannot specify that numbers or other unwanted characters should be removed. Switching to the quanteda package helps in this case. The comments in the code explain each step.
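library(quanteda)
# as of quanteda 3.0, textstat_frequency() lives in the quanteda.textstats package
library(quanteda.textstats)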
text <- c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
# tokenize text and remove punctuation and numbers
toks <- tokens(text, remove_numbers = TRUE, remove_punct = TRUE)
# create 1, 2 and 3 ngrams.
toks_grams <- tokens_ngrams(toks, n = 1:3)
# transform into a document feature matrix (step can be included in next one)
my_dfm <- dfm(toks_grams)
# build a frequency table of the terms, then drop the ones that occur only once
# depending on your needs, you can drop particular n-grams or filter on a higher frequency cutoff.
freqs <- textstat_frequency(my_dfm)
freqs[freqs$frequency > 1, ]
                    feature frequency rank docfreq group
1                        of         9    1       9   all
2                 procedure         9    1       9   all
3              of_procedure         9    1       9   all
4                   patient         6    4       6   all
5                      name         6    4       6   all
6              patient_name         6    4       6   all
7                     label         6    4       6   all
8                      type         5    8       5   all
9                   type_of         5    8       5   all
10        type_of_procedure         5    8       5   all
11                     date         4   11       4   all
12                  date_of         4   11       4   all
13        date_of_procedure         4   11       4   all
14              colonoscopy         3   14       3   all
15    procedure_colonoscopy         3   14       3   all
16 of_procedure_colonoscopy         3   14       3   all
17                      ogd         2   17       2   all
18            procedure_ogd         2   17       2   all
19         of_procedure_ogd         2   17       2   all
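If quanteda is not available, the same counting idea can be approximated in base R. The sketch below is not the answer's method: `shared_ngrams` is a hypothetical helper written for illustration. It assumes that stripping everything except letters and spaces is acceptable cleaning, and it counts each n-gram once per string (a document frequency), which matches the table above only because no string repeats a term.

```r
text <- c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
          ">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
          "Patient name Bilbo baggins", "Patient name: Jonny Begood",
          "Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
          "Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
          "Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
          "Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
          "Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")

# Hypothetical helper (not from any package): n-grams found in more than one string.
shared_ngrams <- function(x, n = 1:3) {
  # lower-case, then turn digits and punctuation into spaces
  cleaned <- gsub("[^a-z ]+", " ", tolower(x))
  words <- strsplit(trimws(cleaned), " +")
  grams <- lapply(words, function(w) {
    out <- character(0)
    for (k in n) {
      if (length(w) >= k) {
        starts <- seq_len(length(w) - k + 1)
        out <- c(out, vapply(starts, function(i) {
          paste(w[i:(i + k - 1)], collapse = "_")
        }, character(1)))
      }
    }
    unique(out)  # count each n-gram at most once per string
  })
  tab <- table(unlist(grams))
  counts <- setNames(as.integer(tab), names(tab))
  counts <- sort(counts, decreasing = TRUE)
  counts[counts > 1]  # keep only n-grams shared by at least two strings
}

res <- shared_ngrams(text)

# keep only multi-word phrases, dropping single words such as "of" or "label"
phrases <- res[grepl("_", names(res))]
```

Filtering on the underscore is one simple way to keep only phrases of two or more words; whether single words such as "label" should count as shared phrases depends on what you need.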
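Is there any rule for the updated text? I couldn't find a pattern in the example.
@akrun I removed the colon from the expected output. The only rule is that the extracted terms should be shared.
That could cause problems if the patient names or the dates are the same.
@akrun I have given the unique() version in the example, so the dates and names will not be the same.
By your rule, "Date of Procedure" would be returned along with its shorter variants such as "Date", multiple times.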