Python 3.x: Conditional frequency distribution with NLTK using the Brown corpus
I am trying to count the words that end in 'ing' or 'ed'. Compute a conditional frequency distribution where the conditions are ['government', 'hobbies'] and the events are 'ing' or 'ed'. Store the conditional frequency distribution in the variable inged_cfd. Here is my code:
from nltk.corpus import brown
import nltk

genre_word = [(genre, word.lower())
              for genre in ['government', 'hobbies']
              for word in brown.words(categories=genre)
              if (word.endswith('ing') or word.endswith('ed'))]
genre_word_list = [list(x) for x in genre_word]
for wd in genre_word_list:
    if wd[1].endswith('ing'):
        wd[1] = 'ing'
    elif wd[1].endswith('ed'):
        wd[1] = 'ed'
inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
inged_cfd.tabulate(conditions=['government', 'hobbies'], samples=['ed', 'ing'])
I want the output in tabular format. With the code above, the output I get is:

             ed   ing
government 2507  1605
   hobbies 2561  2262

whereas the expected output is:

             ed   ing
government 2507  1474
   hobbies 2561  2169
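Please help me solve this and get the exact output.

Stop words need to be excluded. Also, convert each word to lower case before checking its ending. The working code is as follows: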
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# Conditions to tabulate; these are the genres named in the question.
cfdconditions = ['government', 'hobbies']
# Lowercase before testing the ending so that capitalized words are matched too.
genre_word = [(genre, word.lower())
              for genre in brown.categories()
              for word in brown.words(categories=genre)
              if (word.lower().endswith('ing') or word.lower().endswith('ed'))]
genre_word_list = [list(x) for x in genre_word]
for wd in genre_word_list:
    # Stop words keep their own spelling, so they never count as 'ing' or 'ed'.
    if wd[1].endswith('ing') and wd[1] not in stop_words:
        wd[1] = 'ing'
    elif wd[1].endswith('ed') and wd[1] not in stop_words:
        wd[1] = 'ed'
inged_cfd = nltk.ConditionalFreqDist(genre_word_list)
inged_cfd.tabulate(conditions=cfdconditions, samples=['ed', 'ing'])
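The two rules in this answer (drop stop words, lowercase before testing the ending) can be tried without downloading any corpus. The following is only a sketch in plain Python: STOP_WORDS is a small hypothetical stand-in for stopwords.words('english'), the helper names suffix_event and count_endings are made up for this example, and the toy words are invented rather than taken from the Brown corpus.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for set(stopwords.words('english')); the real list is much longer.
STOP_WORDS = {"being", "during", "the", "is"}


def suffix_event(word, stop_words=STOP_WORDS):
    """Return 'ing' or 'ed' for a countable word, or None if it should be skipped."""
    lowered = word.lower()      # lowercase first so 'FUNDED' and 'funded' match
    if lowered in stop_words:   # stop words are excluded from the counts
        return None
    if lowered.endswith("ing"):
        return "ing"
    if lowered.endswith("ed"):
        return "ed"
    return None


def count_endings(pairs):
    """Count suffix events per condition from (condition, word) pairs."""
    counts = defaultdict(Counter)
    for condition, word in pairs:
        event = suffix_event(word)
        if event is not None:
            counts[condition][event] += 1
    return counts


# Toy data, invented for illustration.
toy_pairs = [
    ("government", "Being"),     # stop word after lowercasing, skipped
    ("government", "FUNDED"),    # matched as 'ed' only because of lowercasing
    ("government", "Planning"),
    ("hobbies", "painted"),
    ("hobbies", "during"),       # stop word, skipped
    ("hobbies", "sailing"),
    ("hobbies", "boat"),         # neither ending, skipped
]
counts = count_endings(toy_pairs)
print(dict(counts["government"]))
print(dict(counts["hobbies"]))
```

With the real corpus, the same per-word decision would feed nltk.ConditionalFreqDist instead of the nested Counter used here.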
I used the solution, but I still cannot pass some of the tests; 2 test cases still fail. For the failing test cases, my output is:

                good   bad better
      adventure   39     9     30
        fiction   60    17     27
        mystery   45    13     29
science_fiction   14     1      4

                  ed   ing
      adventure 3281  1844
        fiction 2943  1767
        mystery 2382  1374
science_fiction  574   293

and my code is:
def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    from nltk.corpus import brown
    from nltk import ConditionalFreqDist
    from nltk.corpus import stopwords
    stopword = set(stopwords.words('english'))
    cdev_cfd = [(genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if word.lower() not in stopword]
    cdev_cfd = [list(x) for x in cdev_cfd]
    cdev_cfd = nltk.ConditionalFreqDist(cdev_cfd)
    a = cdev_cfd.tabulate(condition = cfdconditions, samples = cfdevents)
    inged_cfd = [(genre, word.lower()) for genre in cfdconditions for word in brown.words(categories=genre) if (word.lower().endswith('ing') or word.lower().endswith('ed'))]
    inged_cfd = [list(x) for x in inged_cfd]
    for wd in inged_cfd:
        if wd[1].endswith('ing') and wd[1] not in stopword:
            wd[1] = 'ing'
        elif wd[1].endswith('ed') and wd[1] not in stopword:
            wd[1] = 'ed'
    inged_cfd = nltk.ConditionalFreqDist(inged_cfd)
    b = inged_cfd.tabulate(conditions = sorted(cfdconditions), samples = ['ed','ing'])
    return(a,b)
It would be very helpful if someone could provide a solution.
Thanks.
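Using the same cfdconditions variable in both places causes the problem. In Python everything is passed as an object reference, so the first time you use cfdconditions it may be modified when it is passed to cdev_cfd.tabulate, and the next time you pass it, it may arrive as the modified object. It works better if you initialize a second list and pass that list to the second call.
Here is my modification: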
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop_words = stopwords.words('english')
    # Copy of the conditions list, used for the second tabulate call.
    at = [i for i in cfdconditions]
    nt = [(genre, word.lower())
          for genre in cfdconditions
          for word in brown.words(categories=genre)
          if word not in stop_words and word.isalpha()]
    cdv_cfd = nltk.ConditionalFreqDist(nt)
    cdv_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
    nt1 = [(genre, word.lower())
           for genre in cfdconditions
           for word in brown.words(categories=genre)]
    temp = []
    for we in nt1:
        wd = we[1]
        if wd[-3:] == 'ing' and wd not in stop_words:
            temp.append((we[0], 'ing'))
        if wd[-2:] == 'ed':
            temp.append((we[0], 'ed'))
    inged_cfd = nltk.ConditionalFreqDist(temp)
    a = ['ed', 'ing']
    inged_cfd.tabulate(conditions=at, samples=a)
Hope that helps.
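To illustrate the reference behaviour this answer describes, here is a small sketch in plain Python. mutating_consumer is a hypothetical function made up for the example; it is an assumption for demonstration and not a claim about what tabulate does internally. The sketch only shows why a copy such as `at` is independent of the original list, and that sorted() returns a new list.

```python
def mutating_consumer(conditions):
    """Hypothetical function that sorts its argument in place."""
    conditions.sort()
    return conditions


original = ["fiction", "adventure", "science_fiction"]
snapshot = list(original)      # independent copy, same idea as `at` in the answer

mutating_consumer(original)    # the caller's list changes through the shared reference
print(original)                # now in alphabetical order
print(snapshot)                # still in the order it was created with

# sorted() builds a new list and leaves its argument untouched.
alphabetical = sorted(snapshot)
print(alphabetical)
print(snapshot)
```

If a callee might reorder a list, handing it a copy is a cheap way to keep the caller's order intact.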
The expected output is:
                many years
        fiction   29    44
      adventure   24    32
science_fiction   11    16

                  ed   ing
        fiction 2943  1767
      adventure 3281  1844
science_fiction  574   293
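The expected output lists the genres in a non-alphabetical order (fiction, adventure, science_fiction). If the grader compares printed text, then passing sorted(cfdconditions) rather than the list as given would print different rows first. That is an assumption about the test, not something this thread confirms. The sketch below uses format_table, a hypothetical plain-Python helper (not NLTK's tabulate), with the counts copied from the expected output.

```python
from collections import Counter


def format_table(counts, conditions, samples):
    """Return table lines with one row per condition, in the order given."""
    width = max(len(c) for c in conditions)
    header = " " * width + "".join(f"{s:>6}" for s in samples)
    rows = [
        f"{c:>{width}}" + "".join(f"{counts[c][s]:>6}" for s in samples)
        for c in conditions
    ]
    return [header] + rows


# Counts copied from the expected output in this thread.
counts = {
    "fiction": Counter(ed=2943, ing=1767),
    "adventure": Counter(ed=3281, ing=1844),
    "science_fiction": Counter(ed=574, ing=293),
}
given_order = ["fiction", "adventure", "science_fiction"]

as_given = format_table(counts, given_order, ["ed", "ing"])
as_sorted = format_table(counts, sorted(given_order), ["ed", "ing"])
print("\n".join(as_given))
print("\n".join(as_sorted))
```

The numbers are identical in both tables; only the row order differs, which is enough to fail a text comparison.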
I used this approach; it needs fewer lines of code and runs faster:
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords

def calculateCFD(cfdconditions, cfdevents):
    stop_words = set(stopwords.words('english'))
    # Word counts per genre, with stop words removed.
    cdev_cfd = nltk.ConditionalFreqDist(
        [(genre, word.lower()) for genre in cfdconditions
         for word in brown.words(categories=genre) if word.lower() not in stop_words])
    # Map each remaining 'ing' or 'ed' word to its ending.
    inged_cfd = nltk.ConditionalFreqDist(
        [(genre, word[-3:].lower() if word.lower().endswith('ing') else word[-2:].lower())
         for genre in cfdconditions for word in brown.words(categories=genre)
         if word.lower() not in stop_words
         and (word.lower().endswith('ing') or word.lower().endswith('ed'))])
    cdev_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)
    inged_cfd.tabulate(conditions=cfdconditions, samples=['ed', 'ing'])
Ishan Kankane shared the code above, and it works perfectly. The differences I noticed are:
1. isalpha is used, although the question does not mention it. Try adding it.
2. The 'ing' and 'ed' list is built as a list of tuples, whereas our code used a list of lists and converted it while generating it. Try building tuples directly.
3. The stop-word condition in the if: he did not use word.lower() not in stop_words, only word not in stop_words. Try this as well.

Hello, and welcome to the SO community! We always encourage you to add some text explaining what the code does rather than just pasting it on its own.
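The first and third differences listed above change which tokens survive the filter. The sketch below compares the two filters on a handful of invented tokens; STOP_WORDS is a tiny hypothetical stand-in for stopwords.words('english'), and the token list is made up for illustration rather than taken from the Brown corpus.

```python
# Hypothetical stand-in for stopwords.words('english'), whose entries are all lower case.
STOP_WORDS = {"the", "being", "was"}

# Invented tokens: capitalized stop words, a number, a hyphenated word, punctuation.
tokens = ["The", "being", "Being", "walked", "1960", "well-known", "``", "was"]

# Filter as in the code shared above: case-sensitive stop-word test plus isalpha().
case_sensitive = [t.lower() for t in tokens if t not in STOP_WORDS and t.isalpha()]

# Filter that lowercases before the stop-word test and does not call isalpha().
lowered_first = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(lowered_first)
```

The case-sensitive test lets capitalized stop words such as "The" through, while isalpha() drops numbers, hyphenated words and punctuation, so the two filters can produce different counts for the same text.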