Python: how to filter data in a list using tuples
# Dummy data
"Sukanya is getting married next year. " \
"Marriage is a big step in one’s life." \
"It is both exciting and frightening. " \
"But friendship is a sacred bond between people." \
"It is a special kind of love between us. " \
"Many of you must have tried searching for a friend "\
"but never found the right one."
I have a dataframe (df). Each row is a list of lists containing tuples.
Example row:
[[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')],
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')],
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')],
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')],
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')],
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'), ('never','RB'),
('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]]
Now I'm trying to filter the adjective, noun, verb, and adverb POS tags into a separate column `filtered_tags`:
def filter_pos_tags(tagged_text):
    filtered_tags = []
    for i in tagged_text:
        for j in i:
            if j[-1].startswith(("J", "V", "N", "R")):
                filtered_tags.append(j[0])
    return filtered_tags

df["filtered_tags"] = df["tagged"].apply(lambda x: get_pos_tags(x))
The result I get is:
['Sukanya', 'getting', 'married', 'next', 'year', 'Marriage', 'big', 'step', 'life', 'exciting', 'frightening', 'friendship', 'sacred', 'bond', 'people', 'special', 'kind', 'love', 'Many', 'tried', 'searching', 'friend', 'found', 'right']
Desired output:
[['Sukanya', 'getting', 'married', 'next', 'year'], ['Marriage', 'big', 'step', 'life'], ['exciting', 'frightening'], ['friendship', 'sacred', 'bond', 'people'], ['special', 'kind', 'love'], ['Many', 'tried', 'searching', 'friend'], ['found', 'right']]
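For reference, the per-sentence grouping can also be expressed as a nested list comprehension. This is a minimal sketch; the sample rows below are hypothetical and merely mirror the shape of the data above:

```python
# Hypothetical sample rows shaped like the tagged data above
tagged_text = [
    [('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'),
     ('next', 'JJ'), ('year', 'NN')],
    [('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')],
]

# Build one inner list per sentence instead of one flat list
filtered = [
    [word for word, tag in sentence if tag.startswith(("J", "V", "N", "R"))]
    for sentence in tagged_text
]
print(filtered)
# [['Sukanya', 'getting', 'married', 'next', 'year'], ['exciting', 'frightening']]
```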
You can get the desired output by changing the function so that it appends a new inner list to
`filtered_tags`
for each item in `tagged_text`.
Use the following `filter_pos_tags()`:
def filter_pos_tags(tagged_text):
    filtered_tags = []
    for index, i in enumerate(tagged_text):
        # start a new inner list for each sentence
        filtered_tags.append([])
        for j in i:
            if j[-1].startswith(("J", "V", "N", "R")):
                filtered_tags[index].append(j[0])
    return filtered_tags
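A quick sanity check of the corrected function, using one sentence from the example row above (a standalone sketch, not tied to the dataframe):

```python
def filter_pos_tags(tagged_text):
    """Return one inner list of words per sentence, keeping only
    tokens whose tag starts with J, V, N, or R."""
    filtered_tags = []
    for index, i in enumerate(tagged_text):
        filtered_tags.append([])  # new inner list per sentence
        for j in i:
            if j[-1].startswith(("J", "V", "N", "R")):
                filtered_tags[index].append(j[0])
    return filtered_tags

row = [[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'),
        ('love', 'VB'), ('us', 'PRP')]]
print(filter_pos_tags(row))  # [['special', 'kind', 'love']]
```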
Note:
the example row you provided has only 6 elements, while the dummy data appears to contain 7 sentences. Try this:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer

text = """Sukanya is getting married next year.
Marriage is a big step in one's life.
It is both exciting and frightening.
But friendship is a sacred bond between people.
It is a special kind of love between us.
Many of you must have tried searching for a friend
but never found the right one."""

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def get_pos_tags(text):
    tokenized = sent_tokenize(text)
    for i in tokenized:
        # the word tokenizer finds the words
        # and punctuation in a string
        wordsList = nltk.word_tokenize(i)
        # remove stop words from wordsList
        wordsList = [w for w in wordsList if w not in stop_words]
        # run the part-of-speech (POS) tagger
        tagged = nltk.pos_tag(wordsList, tagset='universal')
    return tagged

def get_filtered(tagged_text):
    valid_tags = set(['ADJ', 'NOUN', 'VERB', 'ADV'])
    filtered = filter(lambda word_entry: lemmatizer.lemmatize(word_entry[1]) in valid_tags, tagged_text)
    final = map(lambda match: match[0], filtered)
    return list(final)

df = pd.DataFrame({
    'text': text.split("\n")
})
df["tagged"] = df["text"].apply(lambda x: get_pos_tags(x))
df['filtered'] = df['tagged'].apply(get_filtered)
print(df['filtered'])
The output is:
0 [Sukanya, getting, married, next, year]
1 [Marriage, big, step, life]
2 [exciting, frightening]
3 [friendship, sacred, bond, people]
4 [special, kind, love]
5 [Many, must, tried, searching, friend]
6 [never, found, right]
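Since `pos_tag(..., tagset='universal')` already yields coarse tags such as NOUN, VERB, ADJ, and ADV, the filter reduces to a set-membership test; calling `lemmatize` on the tag string appears to return it unchanged, so a plain check should behave the same. A sketch without the NLTK dependency, using made-up tagged input:

```python
# Tags assumed to come from nltk.pos_tag(..., tagset='universal')
VALID_TAGS = {'ADJ', 'NOUN', 'VERB', 'ADV'}

def get_filtered(tagged_text):
    # plain set membership on the universal tag, no lemmatizer
    return [word for word, tag in tagged_text if tag in VALID_TAGS]

tagged = [('It', 'PRON'), ('exciting', 'VERB'), ('frightening', 'VERB')]
print(get_filtered(tagged))  # ['exciting', 'frightening']
```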
Why are you calling `get_pos_tags` again in
df["filtered_tags"] = df["tagged"].apply(lambda x: get_pos_tags(x))
Shouldn't you be calling `filter_pos_tags` there?

Added the lemmatizer.