How to find phrases shared between strings in R (R, Text Mining)


Suppose I have the following strings:

c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<", 
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<", 
"Patient name Bilbo baggins", "Patient name: Jonny Begood", 
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy", 
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD", 
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy", 
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ", 
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")

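When you use unnest_tokens from tidytext to build n-grams, you cannot tell it to drop numbers or other unwanted characters. Switching to the quanteda package helps in this case. The comments in the code explain each step.

library(quanteda)
library(quanteda.textstats)  # textstat_frequency() moved to this package in quanteda 3.0; omit on older versions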
text <- c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<", 
          ">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<", 
          "Patient name Bilbo baggins", "Patient name: Jonny Begood", 
          "Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy", 
          "Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD", 
          "Type of procedure: Colonoscopy", "Type of procedure Colonoscopy", 
          "Type of procedure: Colonoscopy", "Label 35252", "Label 543 ", 
          "Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")

# tokenize text and remove punctuation and numbers 
toks <- tokens(text, remove_numbers = TRUE, remove_punct = TRUE)

# create 1, 2 and 3 ngrams.
toks_grams <- tokens_ngrams(toks, n = 1:3)

# transform into a document feature matrix (step can be included in next one)    
my_dfm <- dfm(toks_grams)

# turn the terms into a frequency table and drop the ones that occur only once
# depending on your needs, you can also drop single words or filter on a higher frequency threshold
freqs <- textstat_frequency(my_dfm)
freqs[freqs$frequency > 1, ]


                    feature frequency rank docfreq group
1                        of         9    1       9   all
2                 procedure         9    1       9   all
3              of_procedure         9    1       9   all
4                   patient         6    4       6   all
5                      name         6    4       6   all
6              patient_name         6    4       6   all
7                     label         6    4       6   all
8                      type         5    8       5   all
9                   type_of         5    8       5   all
10        type_of_procedure         5    8       5   all
11                     date         4   11       4   all
12                  date_of         4   11       4   all
13        date_of_procedure         4   11       4   all
14              colonoscopy         3   14       3   all
15    procedure_colonoscopy         3   14       3   all
16 of_procedure_colonoscopy         3   14       3   all
17                      ogd         2   17       2   all
18            procedure_ogd         2   17       2   all
19         of_procedure_ogd         2   17       2   all
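
The table above lists every sub-phrase next to the longer phrase that contains it (for example "date", "date_of" and "date_of_procedure" all occur 4 times). If only the longest shared phrase is wanted, one possible post-processing step is sketched below in base R. The helper `keep_maximal` is hypothetical, not part of quanteda, and it assumes that an n-gram is redundant when a longer n-gram with the same frequency contains it. The small data frame copies the first rows of the table above so that the sketch runs on its own.

```r
# Hypothetical helper (not a quanteda function): flag n-grams that are NOT
# contained in a longer n-gram with the same frequency.
keep_maximal <- function(feature, frequency) {
  # pad with "_" so that matches respect token boundaries
  padded <- paste0("_", feature, "_")
  vapply(seq_along(feature), function(i) {
    longer <- nchar(feature) > nchar(feature[i]) & frequency == frequency[i]
    !any(grepl(padded[i], padded[longer], fixed = TRUE))
  }, logical(1))
}

# First rows of the frequency table shown above
freqs <- data.frame(
  feature = c("of", "procedure", "of_procedure",
              "patient", "name", "patient_name", "label",
              "type", "type_of", "type_of_procedure",
              "date", "date_of", "date_of_procedure"),
  frequency = c(9, 9, 9, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4),
  stringsAsFactors = FALSE
)

maximal <- freqs[keep_maximal(freqs$feature, freqs$frequency), ]
print(maximal$feature)
```

On these rows the sketch keeps "of_procedure", "patient_name", "label", "type_of_procedure" and "date_of_procedure". Comparing frequencies is only a heuristic: a shorter phrase that also occurs outside the longer one has a higher count and is therefore kept.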

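Comments:

Is there any rule for the updated text? I can't find one from the example.

@akrun I removed the colon from the expected output. The only rule is that the extracted terms should be shared.

That could cause problems if patient names or dates are the same.

@akrun I have given the unique() version in the example, so the dates and names will not be the same.

By your rule, "Date of Procedure" would be returned along with its sub-phrases such as "Date", "Date of" and "of Procedure", multiple times.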