Search and extract WH-words from a file line by line with Python and regex
I have a file with one sentence per line. I am trying to read the file, use a regular expression to check whether each sentence is a question, extract the WH-word from the sentence, and save the extracted words to another file in the order in which they appear in the first file.
This is what I have so far:
def whWordExtractor(inputFile):
    try:
        openFileObject = open(inputFile, "r")
        try:
            whPattern = re.compile(r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)
            with openFileObject as infile:
                for line in infile:
                    whWord = whPattern.search(line)
                    print whWord
                    # Save the whWord extracted from inputFile into another whWord.txt file
                    # writeFileObject = open('whWord.txt','a')
                    # if not whWord:
                    #     writeFileObject.write('None' + '\n')
                    # else:
                    #     whQuestion = whWord
                    #     writeFileObject.write(whQuestion+ '\n')
        finally:
            print 'Done. All WH-word extracted.'
            openFileObject.close()
    except IOError:
        pass
The result after running the code above: set([])
Am I doing something wrong? I would appreciate it if someone could point it out.
Change (.*)who|what|how|where|when|why|which|whom|whose(\.*) to ".*(?:who|what|how|where|when|why|which|whom|whose).*\."
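To see why the original pattern misbehaves: `|` has the lowest precedence in a regular expression, so `(.*)who` and `whose(\.*)` are each a whole alternative rather than a prefix and suffix around the word list. The sketch below compares the original pattern with a grouped one; the sample sentence and the variable names are made up for illustration.

```python
import re

# Original pattern: the alternatives are '(.*)who', 'what', ..., 'whose(\.*)',
# because '|' binds more loosely than everything around it.
original = re.compile(
    r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)

# Grouped pattern: a non-capturing group keeps the word list together.
grouped = re.compile(
    r'(?:who|what|how|where|when|why|which|whom|whose)', re.IGNORECASE)

sentence = "Tell me who called"  # hypothetical sample line

original_match = original.search(sentence).group()
grouped_match = grouped.search(sentence).group()

print(repr(original_match))  # everything up to and including 'who'
print(repr(grouped_match))   # just the WH-word
```

In both cases `search` returns a match object (or `None`), so `.group()` is needed to get the matched text as a string.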
Not sure if this is what you are looking for, but you could try something like this:
import re

def whWordExtractor(inputFile):
    try:
        whPattern = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
        with open(inputFile, "r") as infile:
            for line in infile:
                whMatch = whPattern.search(line)
                if whMatch:
                    # search() returns a match object; group() gives the matched text
                    whWord = whMatch.group()
                    print(whWord)
                    # save to file
                else:
                    # no match
                    pass
    except IOError:
        pass
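One caveat with a bare alternation like the one above: it also matches inside longer words, and `who` is tried before `whom` and `whose`. The sketch below, which is an addition and not part of the original answer, wraps the list in `\b...\b` word boundaries; the helper name `first_wh` and the sample sentences are hypothetical.

```python
import re

plain = re.compile(
    r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
# \b anchors the match to whole words only.
bounded = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

def first_wh(pattern, sentence):
    """Return the first match of pattern in sentence, or None."""
    match = pattern.search(sentence)
    return match.group() if match else None

# 'Show' contains 'how'; 'Whose' starts with 'who'.
print(first_wh(plain, "Show me the results."))    # matches inside 'Show'
print(first_wh(bounded, "Show me the results."))  # no whole-word match
print(first_wh(plain, "Whose book is this?"))     # stops at 'Who'
print(first_wh(bounded, "Whose book is this?"))   # full word
```

Whether substring matches are acceptable depends on the input data; for natural-language sentences the bounded form is usually the safer choice.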
Something like this:
import re

def whWordExtractor(inputFile):
    try:
        with open(inputFile) as f1:
            # Group the alternatives: in '(.*)who|what|...|whose(\.*)' the '|' splits
            # the pattern into '(.*)who', 'what', ..., 'whose(\.*)'.
            whPattern = re.compile(r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)
            with open('whWord.txt', 'a') as f2:  # open file only once, to reduce I/O operations
                for line in f1:
                    whWord = whPattern.search(line)
                    print(whWord)
                    if not whWord:
                        f2.write('None' + '\n')
                    else:
                        # re.search returns a match object, not a string, so use either
                        # whWord.group() or, better, whPattern.findall(line)
                        whQuestion = whWord.group()
                        f2.write(whQuestion + '\n')
        print('Done. All WH-word extracted.')
    except IOError:
        pass
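For testing, it can help to pass the output path in and return the extracted words instead of hard-coding 'whWord.txt'. This is only a sketch of that variation: the function name `extract_wh_words`, its parameters, and the choice of 'w' mode (overwrite rather than append) are assumptions, not part of the answer above.

```python
import re

WH_PATTERN = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

def extract_wh_words(input_path, output_path):
    """Write the first WH-word of each input line (or 'None') to output_path.

    Returns the list of written values, in input order.
    """
    results = []
    # Both files are opened once; 'w' overwrites any previous output.
    with open(input_path) as infile, open(output_path, 'w') as outfile:
        for line in infile:
            match = WH_PATTERN.search(line)
            word = match.group() if match else 'None'
            outfile.write(word + '\n')
            results.append(word)
    return results
```

Because one line is written per input line, the output file keeps the same order as the input file, which is what the question asks for.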
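Does the program run correctly?
Not the way I want. It returns an empty list when it should return or print the WH-word extracted from the file. I am using print to test whether I get the correct word.
Do you only want to match the first WH-word? For example, "What is the name of the president?" would return "What", even if the sentence also contains "who".
@Wesley Baugh, actually I do want the first WH-word in the sentence to be returned, but I forgot that sometimes another WH-word exists in the same sentence. You and grc both answered my question, but I can only accept one answer, so I went with first come, first served. However, I gave you a +1 because your answer adds an extra step to reduce I/O operations.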
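On the first-versus-all question raised in the comments: `search` stops at the first match, while `findall` returns every match in order. A minimal sketch, assuming a word-boundary pattern and a made-up sample sentence (the helper name `all_wh_words` is hypothetical):

```python
import re

WH_PATTERN = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

def all_wh_words(sentence):
    """Return every WH-word in sentence, in order of appearance."""
    # The group is non-capturing, so findall returns the full matches.
    return WH_PATTERN.findall(sentence)

sentence = "What is the name of the person who won?"  # hypothetical example
words = all_wh_words(sentence)
print(words)                          # all WH-words in the sentence
print(words[0] if words else 'None')  # only the first one
```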