
Python: How do I implement re.search() in my code?


I am working on a binary classification problem over text data. I want to classify texts based on whether their words appear in some well-defined word-class features that I have chosen. Currently, I search each word class for occurrences of every whole word in the text and increment that class's count on a match; the counts are also used to compute a frequency for each word class. Here is my code:

import nltk
import re

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
        watched dinner ate bed day house tv early boring
        yesterday watching sit"""

    conversation = """know people think person tell feel friends
talk new talking mean ask understand feelings care thinking
friend relationship realize question answer saying"""


    countHome = countConversation = 0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    conversation = nltk.word_tokenize(conversation)
    home = nltk.word_tokenize(home)
    '''
    for word in text:
        if word in conversation: #this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords
    countHome /= 1.0*totalWords

    return(countHome,countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't
see the benefits (please correct me if I'm wrong), thus I abandoned that."""

print(wordClassFeatures(text))
The downside of this is that I now have the extra overhead of stemming every word in all the word classes, because a word in the text has to match explicitly to be counted toward a class. So I am now trying to feed each word of the text in as a regular expression and search for it within each word class. This throws the error:

line 362, in wordClassFeatures
if re.search(conversation, word):
  File "/root/anaconda3/lib/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "/root/anaconda3/lib/python3.6/re.py", line 289, in _compile
    p, loc = _cache[type(pattern), pattern, flags]
TypeError: unhashable type: 'list'
I know there is a major error in the syntax, but I can't find it online, because most of the search syntax out there takes this form:

re.search("thanks|thank|advance", x)
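For reference, that kind of alternation pattern is used like this (a minimal sketch; the pattern and input string here are illustrative, not from the original post):

```python
import re

x = "thanks in advance for the help"

# re.search scans the string and returns a Match for the first
# position where any alternative matches, or None otherwise.
match = re.search("thanks|thank|advance", x)
if match:
    print(match.group())  # prints "thanks"
```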

Is there any way to implement this correctly?

I believe re.search expects a string or buffer, not the lists your code supplies for the conversation and home variables.

Also, when you tokenize, you are tokenizing the text with all of its special characters still in it, which will make the search fail.

So, first we need to strip the special characters from the text:

text = re.sub(r'\W+', ' ', text)  # strip text of all special characters
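For instance, on a small sample string (the sample is illustrative, not from the post):

```python
import re

sample = "I've implemented spectral rendering, too (it was easy)!"

# \W+ matches runs of non-word characters (punctuation, spaces),
# so each such run collapses to a single space.
cleaned = re.sub(r'\W+', ' ', sample)
print(cleaned)  # "I ve implemented spectral rendering too it was easy "
```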
Next, we keep the conversation and home variables in string form instead of tokenizing them.
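A minimal sketch of why the string form matters (variable contents abbreviated for illustration):

```python
import re

home = "woke home sleep today eat tired"

# Searching inside a plain string works as expected:
assert re.search("sleep", home) is not None

# Searching inside a tokenized list raises the TypeError
# seen in the question:
try:
    re.search("sleep", home.split())
except TypeError:
    print("re.search() rejects a list argument")
```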

With that, we get the expected answer:

(0.21301775147928995, 0.20118343195266272)
The full code is below:

import nltk
import re

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
        watched dinner ate bed day house tv early boring
        yesterday watching sit"""

    conversation = """know people think person tell feel friends
talk new talking mean ask understand feelings care thinking
friend relationship realize question answer saying"""

    text = re.sub(r'\W+', ' ', text)  # strip text of all special characters

    countHome = countConversation = 0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    #conversation = nltk.word_tokenize(conversation)
    #home = nltk.word_tokenize(home)
    '''
        for word in text:
            if word in conversation: #this is my current approach
                countConversation += 1
            if word in home:
                countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords
    countHome /= 1.0*totalWords

    return(countHome,countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't
see the benefits (please correct me if I'm wrong), thus I abandoned that."""

print(wordClassFeatures(text))
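One caveat worth noting (my observation, not part of the original answer): re.search(word, home) does substring matching, so a short word in the text can match inside a longer word in the class string and inflate the count. Wrapping the word in \b word boundaries (with re.escape for safety) restricts it to whole-word matches:

```python
import re

home = "woke home sleep today eat tired wake watch"

# Plain substring search: "at" matches inside "eat",
# even though "at" is not one of the class words.
assert re.search("at", home) is not None

def word_match(word, wordlist):
    # \b anchors the match to word boundaries; re.escape guards
    # against regex metacharacters in the searched word.
    return re.search(r'\b' + re.escape(word) + r'\b', wordlist) is not None

assert word_match("eat", home)      # whole word: counted
assert not word_match("at", home)   # substring only: not counted
```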

Comments on the question:

"It should be re.search(word, conversation)."

"@Rawing tried that. It throws this error: line 362, in wordClassFeatures, if re.search(word, conversation): File "/root/anaconda3/lib/python3.6/re.py", line 182, in search, return _compile(pattern, flags).search(string), TypeError: expected string or bytes-like object"

"This question needs a minimal reproducible example. That makes it easier for us to help you."

"If you're going to use regular expressions, use them. If you're going to use a different approach, such as nltk, then use it. You can't just mix and match arbitrarily. Regex is a red herring here: you just need to ask how to use the library, preferably with examples you've already tried from the documentation."