Searching a file line by line and extracting WH-words with Python and regex

python regex nlp


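I have a file with one sentence per line. I am trying to read the file, use a regular expression to check whether each sentence is a question, extract the WH-word from the sentence, and save the words to another file in the order in which they appear in the first file.

This is what I have so far: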

def whWordExtractor(inputFile):
    try:
        openFileObject = open(inputFile, "r")
        try:

            whPattern = re.compile(r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)
            with openFileObject as infile:
                for line in infile:

                    whWord = whPattern.search(line)
                    print whWord

# Save the whWord extracted from inputFile into another whWord.txt file
#                    writeFileObject = open('whWord.txt','a')                   
#                    if not whWord:
#                        writeFileObject.write('None' + '\n')
#                    else:
#                        whQuestion = whWord   
#                        writeFileObject.write(whQuestion+ '\n') 

        finally:
            print 'Done. All WH-word extracted.'
            openFileObject.close()
    except IOError:
        pass

The result after running the code above: set([])

Am I doing something wrong? I would appreciate it if someone could point it out.

Change

(.*)who|what|how|where|when|why|which|whom|whose(\.*)

to

".*(?:who|what|how|where|when|why|which|whom|whose).*\."
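To see why the grouping matters, the following is a minimal sketch comparing the original pattern with a grouped one. The sample sentence is made up for illustration, and the `\b` word boundaries are an extra assumption that is not part of the answer above; they are there so that `who` does not match inside `whom`, `whose`, or `show`.

```python
import re

# Hypothetical sample sentence, not taken from the asker's file.
line = "What is the name of the president who signed it?"

# Original pattern: '|' has the lowest precedence, so the alternatives are
# '(.*)who', 'what', ..., 'whose(\.*)' rather than one set of WH-words.
original = re.compile(
    r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)

# Grouped pattern: the alternation is confined to a single group, and the
# word boundaries keep short words from matching inside longer ones.
grouped = re.compile(
    r'\b(who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

# The first alternative '(.*)who' swallows everything up to the last "who".
print(repr(original.search(line).group()))

# The grouped pattern returns only the first WH-word.
print(repr(grouped.search(line).group(1)))
```

With the original pattern the match is a long prefix of the sentence rather than a single word, which is one reason the extracted text does not look like a WH-word.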

Not sure if this is what you are looking for, but you could try something like this:

import re

def whWordExtractor(inputFile):
    try:
        whPattern = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
        with open(inputFile, "r") as infile:
            for line in infile:
                whMatch = whPattern.search(line)
                if whMatch:
                    # group() returns the matched text rather than the match object
                    whWord = whMatch.group()
                    print(whWord)
                    # save to file
                else:
                    # no match
                    pass
    except IOError:
        pass

import re

def whWordExtractor(inputFile):
    try:
        with open(inputFile) as f1:
            # The (.*) and (\.*) wrappers are dropped: with them, '|' splits the
            # pattern into '(.*)who', 'what', ..., 'whose(\.*)'.
            whPattern = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
            with open('whWord.txt', 'a') as f2:  # open the file only once, to reduce I/O operations
                for line in f1:
                    whWord = whPattern.search(line)
                    print(whWord)
                    if not whWord:
                        f2.write('None' + '\n')
                    else:
                        # re.search returns a match object, not a string, so use either
                        # whWord.group() or, better, whPattern.findall(line)
                        whQuestion = whWord.group()
                        f2.write(whQuestion + '\n')
                print('Done. All WH-word extracted.')
    except IOError:
        pass
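Putting the pieces together, here is one possible sketch of the whole task: read one sentence per line and write the first WH-word of each line (or `None`) to a second file, in the same order. The function name `extract_first_wh_words`, the `output_path` parameter, the returned list, and the `\b` word boundaries are assumptions made for this example. It also opens the output with `'w'` instead of `'a'`, so repeated runs overwrite the file rather than append to it.

```python
import re

# Word boundaries stop 'who' from matching inside 'whom', 'whose' or 'show'.
WH_PATTERN = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)


def extract_first_wh_words(input_path, output_path):
    """Write the first WH-word of each input line to output_path.

    Lines without a WH-word produce the literal string 'None'.
    Returns the list of written values, in input order.
    """
    results = []
    # Both files are opened once, outside the loop.
    with open(input_path) as infile, open(output_path, 'w') as outfile:
        for line in infile:
            match = WH_PATTERN.search(line)
            # match.group() is the matched text; match itself is an object.
            word = match.group() if match else 'None'
            results.append(word)
            outfile.write(word + '\n')
    return results
```

Calling `extract_first_wh_words('sentences.txt', 'whWord.txt')` (both file names are placeholders) would leave one word per line in the output file. An `IOError` is deliberately not swallowed here, so a missing input file is reported instead of being ignored.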

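Does the program run correctly?

Not the way I want. It returns an empty list when it should return or print the WH-words extracted from the file. I am using print to test whether I get the correct words.

Do you only want to match the first WH-word? For example, "What is the name of the president who ...?" would return "What", even though it also contains "who".

Wesley Baugh, actually, I want to return the first WH-word in the sentence, but I forgot that another WH-word sometimes appears in the same sentence.

You and GRC both answered my question, but I can only accept one answer, so I went with first come, first served. However, I gave you a +1 because your answer adds an extra step to reduce I/O operations.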
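For the case raised in the comments, where a sentence contains more than one WH-word, `findall` returns every match in order, and the first element is the first WH-word. This is only a sketch: the sample sentence is invented, and the `\b` word boundaries are an assumption added so that `who` is not found inside `whom` or `whose`.

```python
import re

WH_PATTERN = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

# Hypothetical sample sentence containing two WH-words.
line = "What is the name of the president who signed it?"

# With a non-capturing group, findall returns the full matched strings.
all_words = WH_PATTERN.findall(line)

# Take the first WH-word if there is one, otherwise None.
first_word = all_words[0] if all_words else None

print(all_words)
print(first_word)
```

Keeping the full list makes it easy to switch later between "first WH-word only" and "all WH-words" without changing the pattern.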