Extracting semi-structured text from Word documents


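I want to text mine a set of files based on the form below. I can create a corpus where each file is a document (using tm), but I think it would be better to create a corpus where each section in the second table of the form is a document with the following metadata: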

  Author       : John Smith
  DateTimeStamp: 2013-04-18 16:53:31
  Description  : 
  Heading      : Current Focus
  ID           : Smith-John_e.doc Current Focus
  Language     : en_CA
  Origin       : Smith-John_e.doc
  Name         : John Smith
  Title        : Manager
  TeamMembers  : Joe Blow, John Doe
  GroupLeader  : She who must be obeyed 
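where Name, Title, TeamMembers and GroupLeader are extracted from the first table on the form. That way, each chunk of text to be analyzed keeps some of its context.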

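What is the best way to approach this? I can think of two ways:

  • Somehow parse my corpus into sub-corpora
  • Somehow parse the documents into sub-documents and build a corpus from those
Any pointers would be much appreciated.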

The form looks like this:


A corpus containing two documents: exc[[1]] comes from a .doc file and exc[[2]] from a .docx file. Both use the form above.

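Here's a quick sketch of one approach. Hopefully it will prompt someone more talented to come by and suggest something more efficient and robust. Using the RData file from your question, I found that the doc and docx files have slightly different structures, so they need slightly different treatment (though I see in the metadata that your docx is 'fake2.txt', so is it really a docx? I see in your other question that you used a converter outside of R, which must be why it is a txt).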

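First, get the custom metadata for the doc file. As you can see, I'm no regex expert, but reading the nested gsub calls from the outside in, it's roughly "strip the trailing and leading whitespace", then "remove the label word", then "remove the punctuation".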

# create User-defined local meta data pairs
meta(exc[[1]], type = "corpus", tag = "Name1") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[1]][3])))
meta(exc[[1]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[1]][4])))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[1]][5])))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[1]][7])))
Do the same for the docx file:

# create User-defined local meta data pairs
meta(exc[[2]], type = "corpus", tag = "Name2") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[2]][2])))
meta(exc[[2]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[2]][4])))
meta(exc[[2]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[2]][6])))
meta(exc[[2]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[2]][8])))
If you have a large number of documents, you could wrap these meta calls in a function and apply it with lapply.
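As a rough illustration of the lapply idea, here is a minimal sketch in base R. It assumes each document is available as a character vector of lines. The `docs` list, the `clean_field` and `extract_fields` helpers and the sample lines are all made up for this example, and it looks up each label with grep rather than a fixed line number so the same function can serve both file types.

```r
# Hypothetical stand-in for the corpus: one character vector of lines per document
docs <- list(
  fake1 = c("| Name | John Doe |", "| Title | Manager |",
            "| Team Members | Elise Patton, Jeffrey Barnabas |"),
  fake2 = c("Name: Joe Blow", "Title: Shift Lead",
            "Team Members: Melanie Baumgartner, Toby Morrison")
)

# Remove the field label, then punctuation, then surrounding whitespace
clean_field <- function(line, label) {
  x <- sub(label, "", line, fixed = TRUE)
  x <- gsub("[[:punct:]]", "", x)
  gsub("^\\s+|\\s+$", "", x)
}

# Metadata tag (name) and the label text to search for (value)
labels <- c(Name = "Name", Title = "Title", TeamMembers = "Team Members")

# For each label, take the first line that contains it; NA if none does
extract_fields <- function(lines) {
  vapply(labels, function(lab) {
    hit <- grep(lab, lines, fixed = TRUE)
    if (length(hit) == 0) NA_character_ else clean_field(lines[hit[1]], lab)
  }, character(1))
}

fields <- lapply(docs, extract_fields)
```

Each element of `fields` is a named character vector whose values could then be assigned to the matching document with the meta calls shown above.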

Now that we have the custom metadata, we can subset the documents to exclude that part of the text:

# create new corpus that excludes part of doc that is now in metadata. We just use square bracket indexing to subset the lines that are the second table of the forms (slightly different for each doc type)
excBody <- Corpus(VectorSource(c(paste(exc[[1]][13:length(exc[[1]])], collapse = ","), 
                      paste(exc[[2]][9:length(exc[[2]])], collapse = ","))))
# get rid of all the white spaces
excBody <- tm_map(excBody, stripWhitespace)
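The documents are now ready for text mining, with the data from the upper table moved out of the document text and into the document metadata.

Of course, all of this depends on the documents being highly regular. If the first table has a different number of rows in each document, the simple indexing approach may fail (try it and see what happens), and something more robust will be needed.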
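One way to reduce the dependence on fixed line numbers is to search for the first heading of the second table and keep everything from there on. This is only a sketch: `body_lines`, `doc_a` and `doc_b` are hypothetical names and data, and it assumes the heading text "CURRENT RESEARCH FOCUS" appears verbatim in every document.

```r
# Hypothetical lines from two converted forms whose first tables differ in length
doc_a <- c("Name: John Doe", "Title: Manager", "",
           "CURRENT RESEARCH FOCUS", "Lorem ipsum",
           "MAIN AREAS OF EXPERTISE", "Vestibulum")
doc_b <- c("Name: Joe Blow", "Title: Shift Lead", "Team Members: A, B",
           "Notes: none", "",
           "CURRENT RESEARCH FOCUS", "Donec at ipsum")

# Return everything from the first heading of the second table onward
body_lines <- function(lines, first_heading = "CURRENT RESEARCH FOCUS") {
  start <- grep(first_heading, lines, fixed = TRUE)
  if (length(start) == 0) stop("heading not found: ", first_heading)
  lines[start[1]:length(lines)]
}

body_a <- body_lines(doc_a)
body_b <- body_lines(doc_b)
```

The function stops with an error when the heading is missing, so an irregular document is flagged instead of being silently cut at the wrong place.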

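UPDATE: a more robust approach

Having read the question more carefully, here's a more robust approach that doesn't depend on indexing specific lines of the documents. Instead, we use regular expressions to extract the text between two words, both to build the metadata and to split the documents.

Here's how to create the user-defined local metadata (an alternative to the approach above):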

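library(gdata) # for the trim function
txt <- paste0(as.character(exc[[1]]), collapse = ",")

Comments:

Can you add the output of dput(mycorpus), or a subset of it, to your question?

@Ben: Unfortunately, the data is confidential. But the forms are converted to text before being read into the corpus. If fake data would be useful, I can post some on Monday.

Your question will get more attention with a small, self-contained, reproducible example of your problem.

@Ben: I've added some test data, if you wouldn't mind taking a look.

Has your question been answered satisfactorily, or is there anything else you need resolved?

It looks like, to be efficient, I need to process the .doc and .docx files separately, otherwise the regular expressions would need to be either/or. Then, once everything is in the same format, I can merge the corpora. I'll give it a try now.

If you just use my "more robust approach", you should be able to handle doc and docx together. I'd be curious to know how your trial goes...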
# create User-defined local meta data pairs
meta(exc[[2]], type = "corpus", tag = "Name2") <- gsub("^\\s+|\\s+$","", gsub("Name", "", gsub("[[:punct:]]", '', exc[[2]][2])))
meta(exc[[2]], type = "corpus", tag = "Title") <- gsub("^\\s+|\\s+$","", gsub("Title", "", gsub("[[:punct:]]", '', exc[[2]][4])))
meta(exc[[2]], type = "corpus", tag = "TeamMembers") <- gsub("^\\s+|\\s+$","", gsub("Team Members", "", gsub("[[:punct:]]", '', exc[[2]][6])))
meta(exc[[2]], type = "corpus", tag = "ManagerName") <- gsub("^\\s+|\\s+$","", gsub("Name of your", "", gsub("[[:punct:]]", '', exc[[2]][8])))
# inspect
meta(exc[[2]], type = "corpus")
Available meta data pairs are:
  Author       : 
  DateTimeStamp: 2013-04-22 14:06:10
  Description  : 
  Heading      : 
  ID           : fake2.txt
  Language     : en
  Origin       : 
User-defined local meta data pairs are:
$Name2
[1] "Joe Blow"

$Title
[1] "Shift Lead"

$TeamMembers
[1] "Melanie Baumgartner Toby Morrison"

$ManagerName
[1] "Selma Furtgenstein"
# create new corpus that excludes part of doc that is now in metadata. We just use square bracket indexing to subset the lines that are the second table of the forms (slightly different for each doc type)
excBody <- Corpus(VectorSource(c(paste(exc[[1]][13:length(exc[[1]])], collapse = ","), 
                      paste(exc[[2]][9:length(exc[[2]])], collapse = ","))))
# get rid of all the white spaces
excBody <- tm_map(excBody, stripWhitespace)
inspect(excBody)
A corpus with 2 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

[[1]]
|CURRENT RESEARCH FOCUS |,| |,|Lorem ipsum dolor sit amet, consectetur adipiscing elit. |,|Donec at ipsum est, vel ullamcorper enim. |,|In vel dui massa, eget egestas libero. |,|Phasellus facilisis cursus nisi, gravida convallis velit ornare a. |,|MAIN AREAS OF EXPERTISE |,|Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel. |,|In sit amet ante non turpis elementum porttitor. |,|TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED |,| Vestibulum sed turpis id nulla eleifend fermentum. |,|Nunc sit amet elit eu neque tincidunt aliquet eu at risus. |,|Cras tempor ipsum justo, ut blandit lacus. |,|INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS) |,| Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio. |,|Etiam a justo vel sapien rhoncus interdum. |,|ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT |,|(Please include anticipated percentages of your time.) |,| Proin vitae ligula quis enim vulputate sagittis vitae ut ante. |,|ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES |,|e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign |,|Canvasser (GCWCC), OSH representative, Social Committee |,|Sed nec tellus nec massa accumsan faucibus non imperdiet nibh. |,,

[[2]]
CURRENT RESEARCH FOCUS,,* Lorem ipsum dolor sit amet, consectetur adipiscing elit.,* Donec at ipsum est, vel ullamcorper enim.,* In vel dui massa, eget egestas libero.,* Phasellus facilisis cursus nisi, gravida convallis velit ornare a.,MAIN AREAS OF EXPERTISE,* Vestibulum aliquet faucibus tortor, sed aliquet purus elementum vel.,* In sit amet ante non turpis elementum porttitor. ,TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED,* Vestibulum sed turpis id nulla eleifend fermentum.,* Nunc sit amet elit eu neque tincidunt aliquet eu at risus.,* Cras tempor ipsum justo, ut blandit lacus.,INDUSTRY PARTNERS (WITHIN THE PAST FIVE YEARS),* Pellentesque facilisis nisl in libero scelerisque mattis eu quis odio.,* Etiam a justo vel sapien rhoncus interdum.,ANTICIPATED PARTICIPATION IN PROGRAMS, EITHER APPROVED OR UNDER DEVELOPMENT ,(Please include anticipated percentages of your time.),* Proin vitae ligula quis enim vulputate sagittis vitae ut ante.,ADDITIONAL ROLES, DISTINCTIONS, ACADEMIC QUALIFICATIONS AND NOTES,e.g., First Aid Responder, Other languages spoken, Degrees, Charitable Campaign Canvasser (GCWCC), OSH representative, Social Committee,* Sed nec tellus nec massa accumsan faucibus non imperdiet nibh.,,
library(gdata) # for the trim function
txt <- paste0(as.character(exc[[1]]), collapse = ",")

# inspect the document to identify the words on either side of the string
# we want, so 'Name' and 'Title' are on either side of 'John Doe'
extract <- regmatches(txt, gregexpr("(?<=Name).*?(?=Title)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Name1") <- trim(gsub("[[:punct:]]", "", extract))

extract <- regmatches(txt, gregexpr("(?<=Title).*?(?=Team)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "Title") <- trim(gsub("[[:punct:]]","", extract))

extract <- regmatches(txt, gregexpr("(?<=Members).*?(?=Supervised)", txt, perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "TeamMembers") <- trim(gsub("[[:punct:]]","", extract))

extract <- regmatches(txt, gregexpr("(?<=your).*?(?=Supervisor)", txt,  perl=TRUE))
meta(exc[[1]], type = "corpus", tag = "ManagerName") <- trim(gsub("[[:punct:]]","", extract))

# inspect
meta(exc[[1]], type = "corpus")

Available meta data pairs are:
  Author       : 
  DateTimeStamp: 2013-04-22 13:59:28
  Description  : 
  Heading      : 
  ID           : fake1.doc
  Language     : en_CA
  Origin       : 
User-defined local meta data pairs are:
$Name1
[1] "John Doe"

$Title
[1] "Manager"

$TeamMembers
[1] "Elise Patton Jeffrey Barnabas"

$ManagerName
[1] "Selma Furtgenstein"
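The repeated regmatches calls above could be folded into one small helper. The sketch below is an assumption-laden variation, not the code used to produce the output above: `between` is a hypothetical name, it uses base R's `trimws` in place of `gdata::trim`, it returns only the first match, and it wraps the two delimiters in `\Q...\E` so they are matched literally.

```r
# Return the cleaned text between the first `left` and the next `right`.
# Both delimiters are quoted with \Q...\E, so they are treated as literal text.
between <- function(txt, left, right) {
  pattern <- paste0("(?<=\\Q", left, "\\E).*?(?=\\Q", right, "\\E)")
  m <- regmatches(txt, regexpr(pattern, txt, perl = TRUE))
  if (length(m) == 0) return(NA_character_)   # no match found
  trimws(gsub("[[:punct:]]", "", m))
}

# Hypothetical collapsed document text, similar in shape to the fake data
txt_demo <- "| Name | John Doe |,| Title | Manager |,| Team Members | A B |"
name_demo  <- between(txt_demo, "Name", "Title")
title_demo <- between(txt_demo, "Title", "Team")
```

Returning NA when a delimiter is missing makes it easier to spot forms that were filled in differently before their values are written into the metadata.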
txt <- paste0(as.character(exc[[1]]), collapse = ",")
CURRENT_RESEARCH_FOCUS <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=CURRENT RESEARCH FOCUS).*?(?=MAIN AREAS OF EXPERTISE)", txt, perl=TRUE))))
[1] "Lorem ipsum dolor sit amet consectetur adipiscing elit                             Donec at ipsum est vel ullamcorper enim                                            In vel dui massa eget egestas libero                                               Phasellus facilisis cursus nisi gravida convallis velit ornare a"


MAIN_AREAS_OF_EXPERTISE <- trim(gsub("[[:punct:]]","", regmatches(txt, gregexpr("(?<=MAIN AREAS OF EXPERTISE).*?(?=TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED)", txt, perl=TRUE))))
    [1] "Vestibulum aliquet faucibus tortor sed aliquet purus elementum vel                 In sit amet ante non turpis elementum porttitor"
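To get one text chunk per section, as the question asks, the same idea can be looped over all the headings. The following is a minimal base R sketch under stated assumptions: `split_sections` is a hypothetical helper, the sample text is made up, only three of the headings from the inspect() output are listed, and the result is a plain data frame instead of a tm corpus, since the metadata calls may differ between tm versions.

```r
# Headings of the second table (three of those shown in the inspect() output)
headings <- c("CURRENT RESEARCH FOCUS",
              "MAIN AREAS OF EXPERTISE",
              "TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED")

# Hypothetical collapsed document text (made up for illustration)
txt_demo <- paste("CURRENT RESEARCH FOCUS,* Lorem ipsum.",
                  "MAIN AREAS OF EXPERTISE,* Vestibulum aliquet.",
                  "TECHNOLOGY PLATFORMS, INSTRUMENTATION EMPLOYED,* Nunc sit amet.",
                  sep = ",")

split_sections <- function(txt, headings, origin) {
  # Start position of each heading in the text (-1 when it is missing)
  pos <- vapply(headings,
                function(h) as.integer(regexpr(h, txt, fixed = TRUE)),
                integer(1), USE.NAMES = FALSE)
  found <- pos > 0
  if (!any(found)) {
    return(data.frame(Origin = character(0), Heading = character(0),
                      Text = character(0), stringsAsFactors = FALSE))
  }
  ord <- order(pos[found])
  h <- headings[found][ord]
  p <- pos[found][ord]
  # A section runs from the end of its heading to just before the next heading
  from <- p + nchar(h)
  to <- c(p[-1] - 1, nchar(txt))
  body <- trimws(gsub("[[:punct:]]", "", substring(txt, from, to)))
  data.frame(Origin = origin, Heading = h, Text = body,
             stringsAsFactors = FALSE)
}

sections <- split_sections(txt_demo, headings, origin = "fake1.doc")
```

Each row of `sections` could then become one document, with `Heading` and `Origin` stored as metadata alongside the name, title and team fields taken from the first table.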