
Search for and extract WH-words from a file line by line with Python and regex


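I have a file with one sentence per line. I am trying to read the file, use a regex to check whether each sentence is a question, extract the WH-word from the sentence, and save the words to another file in the order they appear in the first file.

This is what I have so far: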

def whWordExtractor(inputFile):
    try:
        openFileObject = open(inputFile, "r")
        try:

            whPattern = re.compile(r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)
            with openFileObject as infile:
                for line in infile:

                    whWord = whPattern.search(line)
                    print whWord

# Save the whWord extracted from inputFile into another whWord.txt file
#                    writeFileObject = open('whWord.txt','a')                   
#                    if not whWord:
#                        writeFileObject.write('None' + '\n')
#                    else:
#                        whQuestion = whWord   
#                        writeFileObject.write(whQuestion+ '\n') 

        finally:
            print 'Done. All WH-word extracted.'
            openFileObject.close()
    except IOError:
        pass

The result after running the code above: set([])
Am I doing something wrong? I would be grateful if someone could point it out.

Change
(.*)who|what|how|where|when|why|which|whom|whose(\.*)
to
".*(?:who|what|how|where|when|why|which|whom|whose).*\."
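
The reason grouping matters: `|` has the lowest precedence in a regular expression, so the pattern from the question is read as nine separate alternatives, with `(.*)` attached only to `who` and `(\.*)` only to `whose`. Here is a minimal sketch in Python 3 syntax. The sample sentence is made up for illustration, and the `\b` word boundaries in the grouped pattern are an addition that is not part of the suggestion above.

```python
import re

# Pattern from the question: (.*) binds only to "who", (\.*) only to "whose".
ungrouped = re.compile(
    r'(.*)who|what|how|where|when|why|which|whom|whose(\.*)', re.IGNORECASE)

# Grouped alternation; \b keeps the match to a whole word (assumed refinement).
grouped = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)

sentence = "What is the name of the person who called?"

# The first alternative, (.*)who, succeeds at position 0 and swallows
# everything up to and including the last "who" on the line.
ungrouped_match = ungrouped.search(sentence).group()

# The grouped pattern returns just the first WH-word.
grouped_match = grouped.search(sentence).group()

print(repr(ungrouped_match))
print(repr(grouped_match))
```

Calling `.group()` on the match object is what turns the result into a string. Printing the match object itself shows only its repr.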

Not sure if this is what you are looking for, but you could try something like this:

import re

def whWordExtractor(inputFile):
    try:
        whPattern = re.compile(r'who|what|how|where|when|why|which|whom|whose', re.IGNORECASE)
        with open(inputFile, "r") as infile:
            for line in infile:
                whMatch = whPattern.search(line)
                if whMatch:
                    # group() returns the matched text as a string
                    whWord = whMatch.group()
                    print(whWord)
                    # save to file
                else:
                    # no match
                    pass
    except IOError:
        pass

Something like this:

import re

def whWordExtractor(inputFile):
    try:
        with open(inputFile) as f1:
            # Group the alternatives and anchor them on word boundaries so that
            # a match is exactly one WH-word; in the ungrouped pattern, (.*)
            # applied only to "who" and (\.*) only to "whose".
            whPattern = re.compile(r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)
            with open('whWord.txt', 'a') as f2:  # open the file only once, to reduce I/O operations
                for line in f1:
                    whWord = whPattern.search(line)
                    print(whWord)
                    if not whWord:
                        f2.write('None' + '\n')
                    else:
                        # re.search returns a match object, not a string, so use either
                        # whWord.group() or, better, whPattern.findall(line)
                        whQuestion = whWord.group()
                        f2.write(whQuestion + '\n')
                print('Done. All WH-word extracted.')
    except IOError:
        pass
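
For reference, the two answers can be combined into one self-contained function. This is only a sketch under a few assumptions: it is written for Python 3, the names `extract_wh_words`, `input_path` and `output_path` are hypothetical, the output file is opened in `'w'` mode rather than appended to, and the `\b` word boundaries are an extra safeguard so that words such as "somehow" do not count as a match.

```python
import re

# Whole-word, case-insensitive match for any of the nine WH-words.
WH_PATTERN = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)


def extract_wh_words(input_path, output_path):
    """Write the first WH-word of each input line (or 'None') to output_path.

    One output line is written per input line, so the order of the
    input file is preserved. The written words are also returned as a list.
    """
    results = []
    # Both files are opened once, outside the loop.
    with open(input_path) as infile, open(output_path, 'w') as outfile:
        for line in infile:
            match = WH_PATTERN.search(line)
            word = match.group() if match else 'None'
            outfile.write(word + '\n')
            results.append(word)
    return results
```

Unlike the snippets above, this version lets an `IOError` propagate instead of silently passing, so a missing input file is visible to the caller.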

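Does the program run correctly?

Not the way I want it to. It returns an empty list when it should return or print the WH-words extracted from the file. I used print to test whether I was getting the right words.

Do you want to match only the first WH-word? For example, "What is the name of the president?" would return "What", even though it also contains "who".

Wesley Baugh, actually, I do want the first WH-word in the sentence to be returned, but I forgot that sometimes another WH-word appears in the same sentence.

You and grc both answered my question, but I can only accept one answer, so I went with first come, first served. I still gave you a +1, though, because your answer adds an extra step that reduces I/O operations.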
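
The difference between the first WH-word and all WH-words in a sentence, as discussed in the comments, can be illustrated with `findall`. This is a sketch in Python 3 syntax. The helper name `wh_words` and the sample sentence are made up for illustration, and the `\b` word boundaries are an assumption that goes beyond the patterns in the answers.

```python
import re

WH_PATTERN = re.compile(
    r'\b(?:who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)


def wh_words(sentence):
    """Return every WH-word in the sentence, in order of appearance."""
    # The group is non-capturing, so findall returns the full matches.
    return WH_PATTERN.findall(sentence)


sentence = "What is the name of the person who called?"
all_words = wh_words(sentence)
# search() would give the same first word; indexing the list is equivalent.
first_word = all_words[0] if all_words else None

print(all_words)
print(first_word)
```

Keeping the full list makes it easy to decide later whether only the first word or every WH-word should be written to the output file.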