NLP: Extracting only specific sentences from a full text in R

r nlp

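I have multiple rows of text data (one document per row), and each row holds about 60-70 lines of text (more than 50,000 characters). My area of interest, however, is only 1-2 lines per document, identified by keywords. I want to extract only those sentences in which a given keyword or group of words is present. My hypothesis is that by extracting only this piece of information I can get better POS tagging and a better understanding of the sentence context, because I am only looking at the sentences I need. Is my understanding correct? And how can we achieve this in R, other than with regular expressions and full stops? This may be computationally intensive.

Eg: The Boy lives in Miami and studies in the St. Martin School. The boy has a height of 5.7" and a weight of 60 Kg's. He has interest in the Arts and crafts; and plays basketball. ...

I want to extract only the sentence "The Boy lives in Miami and studies in the St. Martin School" based on the keyword study (stemmed keyword).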

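For each document, you can first apply SnowballC::wordStem to reduce the words to their stems (strictly speaking, this is stemming rather than lemmatization), and then use tokenizers::tokenize_sentences to split the document into sentences. You can then use grepl to find the sentences that contain the keyword you are looking for.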
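
As a rough illustration of the approach just described, here is a minimal sketch. It assumes the tokenizers and SnowballC packages are installed. The variable names (doc, sentences, keyword_stem, has_keyword) are made up for this example, and the sample text is adapted from the question. Note two deliberate deviations: the sketch splits into sentences first and stems afterwards, since wordStem works on individual words, and it compares stems with %in% instead of grepl to avoid partial matches inside longer words.

```r
library(tokenizers)  # tokenize_sentences(), tokenize_words()
library(SnowballC)   # wordStem()

# Sample document (assumption: adapted from the text in the question)
doc <- paste(
  "The Boy lives in Miami and studies in the St. Martin School.",
  "The boy has a height of 5.7 and weighs 60 kg.",
  "He has interest in the Arts and crafts; and plays basketball."
)

# Split the document into sentences; the result is a list with one
# character vector per input document
sentences <- tokenize_sentences(doc)[[1]]

# Stem the keyword once, so that "study" and "studies" share a stem
keyword_stem <- wordStem("study", language = "english")

# A sentence matches if any of its stemmed words equals the stemmed keyword
has_keyword <- vapply(sentences, function(s) {
  words <- tokenize_words(s)[[1]]  # lowercases and strips punctuation
  keyword_stem %in% wordStem(words, language = "english")
}, logical(1), USE.NAMES = FALSE)

sentences[has_keyword]
```

If you prefer grepl, as suggested above, you could paste the stemmed words of each sentence back together and call grepl(keyword_stem, ...) on the result; that is simpler, but it also matches the stem inside longer words.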

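For this example I used three packages: NLP and openNLP (for sentence splitting) and SnowballC (for stemming). I did not use the tokenizers package mentioned above because I was not familiar with it. The openNLP package is an R interface to the Apache OpenNLP toolkit, which is well known and widely used by the community.

First, install the packages above with the following code. If they are already installed, skip to the next step: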

## Packages used in this example
list.of.packages <- c("NLP", "openNLP", "SnowballC")

## Packages from that list that are not installed yet
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]

## Install only the missing packages
if (length(new.packages))
  install.packages(new.packages)

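Next, convert the text to a String (with as.String, a function from the NLP package). This is necessary because the openNLP package works with the String type. In this example I used the same text you provided in the question: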

library(NLP)
library(openNLP)
library(SnowballC)

example_text <- paste0("The Boy lives in Miami and studies in the St. Martin School. ",
                       "The boy has a heiht of 5.7 and weights 60 Kg's. ",
                       "He has intrest in the Arts and crafts; and plays basketball. ")

example_text <- as.String(example_text)

#output
> example_text
The Boy lives in Miami and studies in the St. Martin School. The boy has a heiht of 5.7 and weights 60 Kg's. He has intrest in the Arts and crafts; and plays basketball.
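
A minimal sketch of how the sentence splitting and the keyword check might look with these packages. It assumes a working Java installation for openNLP. The names splited_text and sentence_index are taken from the output shown in this answer; keyword_stem, contains_keyword and the simple word-splitting regex are assumptions made for illustration. The exact pre-processing that produced the output in this answer is not known, so details such as capitalization may differ.

```r
library(NLP)        # as.String(), annotate()
library(openNLP)    # Maxent_Sent_Token_Annotator(); needs Java and openNLPdata
library(SnowballC)  # wordStem()

example_text <- as.String(paste0(
  "The Boy lives in Miami and studies in the St. Martin School. ",
  "The boy has a heiht of 5.7 and weights 60 Kg's. ",
  "He has intrest in the Arts and crafts; and plays basketball. "))

# Annotate sentence boundaries with the OpenNLP maximum-entropy sentence detector
sent_token_annotator <- Maxent_Sent_Token_Annotator()
sentence_annotation <- annotate(example_text, sent_token_annotator)

# Subsetting a String with the annotation yields one element per sentence
splited_text <- example_text[sentence_annotation]

# Stem the keyword, then stem every word of every sentence and compare
keyword_stem <- wordStem("study", language = "english")
contains_keyword <- vapply(splited_text, function(s) {
  # Assumption: a crude split on non-letters is good enough for this example
  words <- strsplit(tolower(s), "[^a-z]+")[[1]]
  keyword_stem %in% wordStem(words, language = "english")
}, logical(1), USE.NAMES = FALSE)

# Positions of the sentences that contain the keyword
sentence_index <- which(contains_keyword)
splited_text[sentence_index]
```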

After splitting the text into sentences (stored here in splited_text), the vector looks like this:

> splited_text
[1] "The Boy lives in Miami and studies in the st."                "Martin School."
[3] "The boy has a heiht of 5.7 and weights 60 Kg's."              "He has intrest in the Arts and crafts; and plays basketball."

Note that the sentence detector treats the period after the abbreviation "St." as a sentence boundary, so the first sentence is split in two. Therefore, when you check this vector for the keyword, your output will be:

> splited_text[sentence_index]
    [1] "The Boy lives in Miami and studies in the st."

I also tested the tokenizers package mentioned above, and it has the same problem. So be aware that this is an open problem in NLP annotation tasks. The logic and the algorithm above, however, work correctly.

I hope this helps.
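
One possible workaround for the "St." split seen in the output above is to glue a fragment back onto the following one whenever it ends with a known abbreviation. The sketch below uses base R only. The helper merge_abbreviation_splits and its abbreviation list are hypothetical, written for this example, and are not part of any of the packages mentioned.

```r
# Hypothetical helper: merge a fragment with the next one whenever it ends
# with a known abbreviation such as "St."
merge_abbreviation_splits <- function(sentences,
                                      abbreviations = c("st", "mr", "mrs", "dr")) {
  pattern <- paste0("\\b(", paste(abbreviations, collapse = "|"), ")\\.$")
  merged <- character(0)
  carry <- ""
  for (s in sentences) {
    current <- if (nzchar(carry)) paste(carry, s) else s
    if (grepl(pattern, current, ignore.case = TRUE, perl = TRUE)) {
      carry <- current              # looks incomplete, keep accumulating
    } else {
      merged <- c(merged, current)  # complete sentence
      carry <- ""
    }
  }
  if (nzchar(carry)) merged <- c(merged, carry)  # trailing fragment, if any
  merged
}

# The split sentences as shown in the output above
splited_text <- c("The Boy lives in Miami and studies in the st.",
                  "Martin School.",
                  "The boy has a heiht of 5.7 and weights 60 Kg's.",
                  "He has intrest in the Arts and crafts; and plays basketball.")

merged_text <- merge_abbreviation_splits(splited_text)
merged_text[grepl("stud", merged_text, ignore.case = TRUE)]
```

With this input the first two fragments are joined, so the keyword check returns the full sentence "The Boy lives in Miami and studies in the st. Martin School." The obvious limitation is that a sentence that really does end in one of the listed abbreviations would be merged with its successor by mistake.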