Python: How to get only the words for selected tags in NLTK part-of-speech (POS) tagging?
Sorry, I'm new to pandas and NLTK. I'm trying to build a custom set of returned POS tags. My data looks like this:
comment
0 [(have, VERB), (you, PRON), (pahae, VERB)]
1 [(radio, NOUN), (television, NOUN), (lid, NOUN)]
2 [(yes, ADV), (you're, ADJ)]
3 [(ooi, ADJ), (work, NOUN), (barisan, ADJ)]
4 [(national, ADJ), (debt, NOUN), (increased, VERB)]
Do you know how I can get only the words that match the selected tags (VERB or NOUN), as shown below? If nothing matches, return NaN.
comment
0 [(have), (pahae)]
1 [(radio), (television), (lid)]
2 [NaN]
3 [(work)]
4 [(debt), (increased)]
You can use a list comprehension and then replace the empty lists with [NaN]:
import numpy as np
import pandas as pd

# Each row holds a list of (word, tag) tuples
df = pd.DataFrame({'comment': [
[('have', 'VERB'), ('you', 'PRON'), ('pahae', 'VERB')],
[('radio', 'NOUN'), ('television', 'NOUN'), ('lid', 'NOUN')],
[('yes', 'ADV'), ("you're", 'ADJ')],
[('ooi', 'ADJ'), ('work', 'NOUN'), ('barisan', 'ADJ')],
[('national', 'ADJ'), ('debt', 'NOUN'), ('increased', 'VERB')]
]})
print(df)
comment
0 [(have, VERB), (you, PRON), (pahae, VERB)]
1 [(radio, NOUN), (television, NOUN), (lid, NOUN)]
2 [(yes, ADV), (you're, ADJ)]
3 [(ooi, ADJ), (work, NOUN), (barisan, ADJ)]
4 [(national, ADJ), (debt, NOUN), (increased, VE...
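A minimal sketch of the list-comprehension step described above, assuming the comment column holds lists of (word, tag) tuples. The helper name keep_tags, the wanted set, and the filtered column are illustrative names, not taken from the original answer, and the sketch returns plain words rather than 1-tuples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'comment': [
    [('have', 'VERB'), ('you', 'PRON'), ('pahae', 'VERB')],
    [('radio', 'NOUN'), ('television', 'NOUN'), ('lid', 'NOUN')],
    [('yes', 'ADV'), ("you're", 'ADJ')],
    [('ooi', 'ADJ'), ('work', 'NOUN'), ('barisan', 'ADJ')],
    [('national', 'ADJ'), ('debt', 'NOUN'), ('increased', 'VERB')],
]})

# Hypothetical name for the set of tags to keep
wanted = {'VERB', 'NOUN'}

def keep_tags(pairs, tags=wanted):
    """Return the words whose tag is in `tags`, or [NaN] if none match."""
    words = [word for word, tag in pairs if tag in tags]
    return words if words else [np.nan]

# Apply the filter row by row; each result is a list
df['filtered'] = df['comment'].apply(keep_tags)
print(df['filtered'].tolist())
```

Using a set for the selected tags keeps the membership test short and makes it easy to add further tags later.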
A second answer works on a pandas Series instead; its setup, solution, and a Python 3 variant (per @jezrael) follow.
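Comments:
Please add a minimal example that generates the data in your first example.
Thanks @piRSquared, but the output still contains the tags.
Thanks for the quick reply, @jezrael. Yes! It works! It took me ages to figure out. Of course, I have a lot more to learn.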
Setup:
import numpy as np
import pandas as pd
s = pd.Series([
[('have', 'VERB'), ('you', 'PRON'), ('pahae', 'VERB')],
[('radio', 'NOUN'), ('television', 'NOUN'), ('lid', 'NOUN')],
[('yes', 'ADV'), ("you're", 'ADJ')],
[('ooi', 'ADJ'), ('work', 'NOUN'), ('barisan', 'ADJ')],
[('national', 'ADJ'), ('debt', 'NOUN'), ('increased', 'VERB')]
], name='comment')
s
0 [(have, VERB), (you, PRON), (pahae, VERB)]
1 [(radio, NOUN), (television, NOUN), (lid, NOUN)]
2 [(yes, ADV), (you're, ADJ)]
3 [(ooi, ADJ), (work, NOUN), (barisan, ADJ)]
4 [(national, ADJ), (debt, NOUN), (increased, VE...
Name: comment, dtype: object
Solution (written for Python 2, where zip returns a list):
s1 = s.apply(pd.Series).stack().apply(pd.Series)
s2 = s1.loc[s1[1].isin(['VERB', 'NOUN']), 0]
s3 = s2.groupby(level=0).apply(zip).reindex_like(s)
s3.loc[s3.isnull()] = [[np.nan]]
s3
0 [(have,), (pahae,)]
1 [(radio,), (television,), (lid,)]
2 [nan]
3 [(work,)]
4 [(debt,), (increased,)]
Name: 0, dtype: object
For Python 3, per @jezrael (zip returns an iterator there, so it is wrapped in list):
s1 = s.apply(pd.Series).stack().apply(pd.Series)
s2 = s1.loc[s1[1].isin(['VERB', 'NOUN']), 0]
s3 = s2.groupby(level=0).apply(lambda x: list(zip(x))).reindex_like(s)
s3.loc[s3.isnull()] = [[np.nan]]
s3
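On pandas 0.25 or later, Series.explode offers another way to express the same reshaping. The following is only a sketch under that assumption, not part of the original answers; the variable names (exploded, pairs, kept, result) are illustrative, and it returns plain words rather than 1-tuples:

```python
import numpy as np
import pandas as pd

s = pd.Series([
    [('have', 'VERB'), ('you', 'PRON'), ('pahae', 'VERB')],
    [('radio', 'NOUN'), ('television', 'NOUN'), ('lid', 'NOUN')],
    [('yes', 'ADV'), ("you're", 'ADJ')],
    [('ooi', 'ADJ'), ('work', 'NOUN'), ('barisan', 'ADJ')],
    [('national', 'ADJ'), ('debt', 'NOUN'), ('increased', 'VERB')],
], name='comment')

# One (word, tag) tuple per row; the original index label is repeated
exploded = s.explode()

# Split the tuples into two named columns, keeping the repeated index
pairs = pd.DataFrame(exploded.tolist(), index=exploded.index,
                     columns=['word', 'tag'])

# Keep only the words whose tag is selected
kept = pairs.loc[pairs['tag'].isin(['VERB', 'NOUN']), 'word']

# Collect the words back into one list per original row;
# rows with no match become NaN after reindexing
grouped = kept.groupby(level=0).agg(list).reindex(s.index)

# Replace the missing rows with [NaN]
result = grouped.apply(lambda v: v if isinstance(v, list) else [np.nan])
print(result.tolist())
```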