NLP: generating collocations

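I am working through a hands-on exercise whose expected output is

[('fans',3),('car',3),('productions',1)]

['sports car', 'sports fans']

My code is below. I get the first expected output, but I cannot get the second one right. Can anyone help me see what is wrong here?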

    from nltk.tokenize import RegexpTokenizer
    text='Thirty-five sports disciplines and four cultural activities will be offered during seven days of competitions. He skated with charisma, changing from one gear to another, from one direction to another, faster than a sports car. Armchair sports fans settling down to watch the Olympic Games could be for the high jump if they do not pay their TV licence fee. Such invitationals will attract more viewership for sports fans by sparking interest among sports fans. She barely noticed a flashy sports car almost run them over, until Eddie lunged forward and grabbed her body away. And he flatters the mother and she kind of gets prissy and he talks her into going for a ride in the sports car.'
    word='sports'
    tokenizedword = nltk.tokenize.regexp_tokenize(text, pattern = '\w*', gaps = False)
    #Step 2
    tokenizedwords = [x.lower() for x in tokenizedword if x != '']

    tokenizedwordsbigram=list(nltk.bigrams(tokenizedwords))
    stop_words = set(stopwords.words('english')) 
    filteredwords = []
    for x in tokenizedwordsbigram:
       if x not in stop_words:
          filteredwords.append(x)
     
    tokenizednonstopwordsbigram = nltk.ConditionalFreqDist(filteredwords)  
    print(tokenizednonstopwordsbigram[word].most_common(3))
    gen_text=nltk.Text(tokenizedwords)
    print(gen_text.collocations())

I ran your code after adding the required imports, `import nltk` and `from nltk.corpus import stopwords`, and got the following output.

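    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer
    # Used to find bigrams, i.e. pairs of words
    text = \
    'Thirty-five sports disciplines and four cultural activities will be offered during seven days of competitions. He skated with charisma, changing from one gear to another, from one direction to another, faster than a sports car. Armchair sports fans settling down to watch the Olympic Games could be for the high jump if they do not pay their TV licence fee. Such invitationals will attract more viewership for sports fans by sparking interest among sports fans. She barely noticed a flashy sports car almost run them over, until Eddie lunged forward and grabbed her body away. And he flatters the mother and she kind of gets prissy and he talks her into going for a ride in the sports car.'
    word = 'sports'
    tokenizedword = nltk.tokenize.regexp_tokenize(text, pattern='\w*',
                                                  gaps=False)
    # Step 2
    tokenizedwords = [x.lower() for x in tokenizedword if x != '']
    tokenizedwordsbigram = list(nltk.bigrams(tokenizedwords))
    stop_words = set(stopwords.words('english'))
    filteredwords = []
    for x in tokenizedwordsbigram:
        if x not in stop_words:
            filteredwords.append(x)
    tokenizednonstopwordsbigram = nltk.ConditionalFreqDist(filteredwords)
    print(tokenizednonstopwordsbigram[word].most_common(3))
    gen_text = nltk.Text(tokenizedwords)
    print(gen_text.collocations())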
Here is the output:

[('car', 3), ('fans', 3), ('disciplines', 1)]
sports car; sports fans
None
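
The trailing `None` appears because `Text.collocations()` prints its result itself and does not return it, so wrapping the call in `print()` also prints the return value `None`. The sketch below reproduces the effect with a stand-in function so that it runs without NLTK; `show_collocations` is a hypothetical name, not part of NLTK.

```python
import io
from contextlib import redirect_stdout


def show_collocations(pairs):
    """Stand-in for a method that prints its result and returns None."""
    print("; ".join(w1 + " " + w2 for w1, w2 in pairs))


pairs = [("sports", "car"), ("sports", "fans")]

# Capture everything written to stdout by the nested call.
buffer = io.StringIO()
with redirect_stdout(buffer):
    print(show_collocations(pairs))  # inner print runs first, outer prints None

captured_lines = buffer.getvalue().splitlines()
print(captured_lines)
```

The captured output has two lines: the phrases printed by the function itself, followed by `None` from the outer `print()`.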
Replace

    print(gen_text.collocations())

with

    print(gen_text.collocation_list())

and your program will work as expected.
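
`collocation_list()` returns the collocations instead of printing them, so the result can be shaped into the expected `['sports car', 'sports fans']`. The element type appears to differ between NLTK releases (ready-made strings in some, `(w1, w2)` tuples in others); treat that as an assumption and check the installed version. A small helper that accepts either form is one way to stay safe. `to_phrases` is a hypothetical helper, not an NLTK function.

```python
def to_phrases(collocations):
    """Return each collocation as a 'w1 w2' string.

    Accepts either ready-made strings or (w1, w2) tuples.
    """
    phrases = []
    for item in collocations:
        if isinstance(item, str):
            phrases.append(item)            # already joined
        else:
            phrases.append(" ".join(item))  # join the word pair
    return phrases


# With NLTK this would be called as: to_phrases(gen_text.collocation_list())
print(to_phrases([("sports", "car"), ("sports", "fans")]))
print(to_phrases(["sports car", "sports fans"]))
```

Both calls produce the same list of phrases, so the printing code does not need to know which form the library returned.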

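Thanks. In Jupyter Notebook I get the same output as you, but my hands-on exercise runs on HackerRank, where I get the following error --> `collectionwords = gen_text.collocations()` File "/var/ml/python3/lib/python3.7/site-packages/nltk/text.py", line 444, in collocations `w1 + ' ' + w2 for w1, w2 in self.collocation_list(num, window_size)` File "/var/ml/python3/lib/python3.7/site-packages/nltk/text.py", line 444, in <listcomp> `w1 + ' ' + w2 for w1, w2 in self.collocation_list(num, window_size)` ValueError: too many values to unpack (expected 2)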
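
The `ValueError` in this traceback is consistent with the comprehension `w1 + ' ' + w2 for w1, w2 in ...` receiving already-joined strings rather than two-element tuples. Whether that is exactly what the NLTK build on HackerRank does is an assumption; the sketch below only reproduces the same error message in plain Python.

```python
# Already-joined strings, assumed here purely for illustration.
items = ["sports car", "sports fans"]

message = None
try:
    # Unpacking a 10-character string into two names fails.
    joined = [w1 + " " + w2 for w1, w2 in items]
except ValueError as exc:
    message = str(exc)

print(message)
```

If that is the cause, calling `collocation_list()` directly and printing its result sidesteps the failing code path inside `collocations()`.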