Warning: file_get_contents(/data/phpspider/zhask/data//catemap/8/python-3.x/16.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Python 3.x 使用Browns语料库NLTK Python的条件频率分布_Python 3.x_Nltk_Corpus - Fatal编程技术网

Python 3.x 使用Browns语料库NLTK Python的条件频率分布

Python 3.x 使用Browns语料库NLTK Python的条件频率分布,python-3.x,nltk,corpus,Python 3.x,Nltk,Corpus,我试图确定以“ing”或“ed”结尾的单词。计算条件频率分布,其中条件为['government','cabiods',],事件为'ing'或'ed'。将条件频率分布存储在变量inged_cfd中 下面是我的代码:- from nltk.corpus import brown import nltk genre_word = [ (genre, word.lower()) for genre in ['government', 'hobbies']

我试图确定以“ing”或“ed”结尾的单词。计算条件频率分布,其中条件为['government','cabiods',],事件为'ing'或'ed'。将条件频率分布存储在变量inged_cfd中

下面是我的代码:-

from nltk.corpus import brown
import nltk

genre_word = [ (genre, word.lower())
              for genre in ['government', 'hobbies']
              for word in brown.words(categories = genre) if (word.endswith('ing') or word.endswith('ed')) ]
            
genre_word_list = [list(x) for x in genre_word]

for wd in genre_word_list:
    if wd[1].endswith('ing'):
      wd[1] = 'ing'
    elif wd[1].endswith('ed'):
      wd[1] = 'ed'
      
inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
        
inged_cfd.tabulate(conditions = ['government', 'hobbies'], samples = ['ed','ing'])
我想以表格格式输出到,使用上述代码,我得到的输出如下:-

            ed  ing 
government 2507 1605 
   hobbies 2561 2262
鉴于实际产出为:-

            ed  ing 
government 2507 1474 
   hobbies 2561 2169

请解决我的问题,并帮助我获得准确的输出。

需要排除stopwords。此外,在检查端部是否存在状况时,将壳体改为降下。工作守则如下:

from nltk.corpus import brown
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 
genre_word = [ (genre, word.lower()) 
for genre in brown.categories() for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
genre_word_list = [list(x) for x in genre_word]

for wd in genre_word_list:
    if wd[1].endswith('ing') and wd[1] not in stop_words:
        wd[1] = 'ing'
    elif wd[1].endswith('ed') and wd[1] not in stop_words:
        wd[1] = 'ed'
  
inged_cfd = nltk.ConditionalFreqDist(genre_word_list)    
inged_cfd.tabulate(conditions = cfdconditions, samples = ['ed','ing'])

需要排除停止词。此外,在检查端部是否存在状况时,将壳体改为降下。工作守则如下:

from nltk.corpus import brown
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 
genre_word = [ (genre, word.lower()) 
for genre in brown.categories() for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
genre_word_list = [list(x) for x in genre_word]

for wd in genre_word_list:
    if wd[1].endswith('ing') and wd[1] not in stop_words:
        wd[1] = 'ing'
    elif wd[1].endswith('ed') and wd[1] not in stop_words:
        wd[1] = 'ed'
  
inged_cfd = nltk.ConditionalFreqDist(genre_word_list)    
inged_cfd.tabulate(conditions = cfdconditions, samples = ['ed','ing'])

我使用了解决方案,但仍然无法通过一些测试。2个测试用例仍然失败

对于失败的测试用例,我的输出是:

                  good    bad better 
      adventure     39      9     30 
        fiction     60     17     27 
        mystery     45     13     29 
science_fiction     14      1      4 
                  ed  ing 
      adventure 3281 1844 
        fiction 2943 1767 
        mystery 2382 1374 
science_fiction  574  293 

我的代码是

def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    from nltk.corpus import brown
    from nltk import ConditionalFreqDist
    from nltk.corpus import stopwords
    stopword = set(stopwords.words('english'))
    cdev_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword]
    cdev_cfd = [list(x) for x in cdev_cfd]
    cdev_cfd = nltk.ConditionalFreqDist(cdev_cfd)
    a = cdev_cfd.tabulate(condition = cfdconditions, samples = cfdevents)
    inged_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
    inged_cfd = [list(x) for x in inged_cfd]
    for wd in inged_cfd:
        if wd[1].endswith('ing') and wd[1] not in stopword:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stopword:
            wd[1] = 'ed'

    inged_cfd = nltk.ConditionalFreqDist(inged_cfd)    
    b = inged_cfd.tabulate(conditions = sorted(cfdconditions), samples = ['ed','ing'])
    return(a,b)
如果有人能提供一个解决方案,那将是非常有帮助的。
谢谢

我使用了解决方案,但我仍然无法通过一些测试。2个测试用例仍然失败

对于失败的测试用例,我的输出是:

                  good    bad better 
      adventure     39      9     30 
        fiction     60     17     27 
        mystery     45     13     29 
science_fiction     14      1      4 
                  ed  ing 
      adventure 3281 1844 
        fiction 2943 1767 
        mystery 2382 1374 
science_fiction  574  293 

我的代码是

def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    from nltk.corpus import brown
    from nltk import ConditionalFreqDist
    from nltk.corpus import stopwords
    stopword = set(stopwords.words('english'))
    cdev_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword]
    cdev_cfd = [list(x) for x in cdev_cfd]
    cdev_cfd = nltk.ConditionalFreqDist(cdev_cfd)
    a = cdev_cfd.tabulate(condition = cfdconditions, samples = cfdevents)
    inged_cfd = [ (genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed')) ]
    inged_cfd = [list(x) for x in inged_cfd]
    for wd in inged_cfd:
        if wd[1].endswith('ing') and wd[1] not in stopword:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stopword:
            wd[1] = 'ed'

    inged_cfd = nltk.ConditionalFreqDist(inged_cfd)    
    b = inged_cfd.tabulate(conditions = sorted(cfdconditions), samples = ['ed','ing'])
    return(a,b)
如果有人能提供一个解决方案,那将是非常有帮助的。 谢谢在这两个地方使用相同的cfdconditions变量会产生问题。实际上,在python中,所有内容都作为对象引用工作,因此,当您第一次使用cfd条件时,它可能会在传递到cdev_cfd.tablate时发生更改,而当您下次传递时,它可能会作为更改的对象传递。如果你再初始化一个列表,然后把这个列表传给第二个列表,效果会更好

这是我的修改

from nltk.corpus import brown

from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop_words= stopwords.words('english')
    at=[i for i in cfdconditions]
    nt = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre) if word not in stop_words and word.isalpha()]

    cdv_cfd = nltk.ConditionalFreqDist(nt)
    cdv_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
    nt1 = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre) ]
    
    temp =[]
    for we in nt1:
        wd = we[1]
        if wd[-3:] == 'ing' and wd not in stop_words:
            temp.append((we[0] ,'ing'))

        if wd[-2:] == 'ed':
            temp.append((we[0] ,'ed'))
        

    inged_cfd = nltk.ConditionalFreqDist(temp)
    a=['ed','ing']
    inged_cfd.tabulate(conditions=at, samples=a)
希望有帮助

在两个位置使用相同的cfdconditions变量会产生问题。实际上,在python中,所有内容都作为对象引用工作,因此,当您第一次使用cfd条件时,它可能会在传递到cdev_cfd.tablate时发生更改,而当您下次传递时,它可能会作为更改的对象传递。如果你再初始化一个列表,然后把这个列表传给第二个列表,效果会更好

这是我的修改

from nltk.corpus import brown

from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop_words= stopwords.words('english')
    at=[i for i in cfdconditions]
    nt = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre) if word not in stop_words and word.isalpha()]

    cdv_cfd = nltk.ConditionalFreqDist(nt)
    cdv_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
    nt1 = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre) ]
    
    temp =[]
    for we in nt1:
        wd = we[1]
        if wd[-3:] == 'ing' and wd not in stop_words:
            temp.append((we[0] ,'ing'))

        if wd[-2:] == 'ed':
            temp.append((we[0] ,'ed'))
        

    inged_cfd = nltk.ConditionalFreqDist(temp)
    a=['ed','ing']
    inged_cfd.tabulate(conditions=at, samples=a)
希望有帮助

预期输出为-

                 many years 

        fiction    29    44 

      adventure    24    32 

science_fiction    11    16 

                  ed  ing 

        fiction 2943 1767 

      adventure 3281 1844 

science_fiction  574  293 

预期产量为-

                 many years 

        fiction    29    44 

      adventure    24    32 

science_fiction    11    16 

                  ed  ing 

        fiction 2943 1767 

      adventure 3281 1844 

science_fiction  574  293 


我使用了这种方法,它的代码行更少,速度更快

from nltk.corpus import brown
from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    cdev_cfd = nltk.ConditionalFreqDist([(genre, word.lower()) for genre in cfdconditions
          for word in brown.words(categories=genre) if word.lower() not in stop_words])
    
    inged_cfd = nltk.ConditionalFreqDist([(genre, word[-3:].lower() if word.lower().endswith('ing') else word[-2:].lower()) 
                                          for genre in conditions for word in brown.words(categories=genre) 
                                          if word.lower() not in stop_words and  (word.lower().endswith('ing') or word.lower().endswith('ed'))])
    
    cdev_cfd.tabulate(conditions=conditions, samples=cfdevents)
    
    inged_cfd.tabulate(conditions=conditions, samples=['ed','ing'])

我使用了这种方法,它的代码行更少,速度更快

from nltk.corpus import brown
from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    cdev_cfd = nltk.ConditionalFreqDist([(genre, word.lower()) for genre in cfdconditions
          for word in brown.words(categories=genre) if word.lower() not in stop_words])
    
    inged_cfd = nltk.ConditionalFreqDist([(genre, word[-3:].lower() if word.lower().endswith('ing') else word[-2:].lower()) 
                                          for genre in conditions for word in brown.words(categories=genre) 
                                          if word.lower() not in stop_words and  (word.lower().endswith('ing') or word.lower().endswith('ed'))])
    
    cdev_cfd.tabulate(conditions=conditions, samples=cfdevents)
    
    inged_cfd.tabulate(conditions=conditions, samples=['ed','ing'])

ishan Kankane分享了上面的代码,这是完美的工作。我注意到的不同之处是1使用了isalpha,尽管在问题中没有提到-尝试添加它2,同时生成'ing'和'ed'列表-通常我看到它是一个元组列表…但在代码中,我们使用了list of list,并尝试在生成类型时转换它3,If条件下的单词-他没有使用If单词。lower不在stopwords中,他只是在单词不在stopwords中的情况下使用-试试这个,以及Ishan Kankane分享上面的代码,这很好地工作。我注意到的不同之处是1使用了isalpha,尽管在问题中没有提到-尝试添加它2,同时生成'ing'和'ed'列表-通常我看到它是一个元组列表…但在代码中,我们使用了list of list,并尝试在生成类型时转换它3,If条件下的单词-他没有使用If单词。lower不在stopwords中,他只是在单词不在stopwords中时使用-也试试这个你好,欢迎来到SO社区!我们总是鼓励您添加一些文本来解释代码的功能,而不是自己粘贴它!您好,欢迎来到SO社区!我们总是鼓励您添加一些文本来解释代码的功能,而不是自己粘贴它!