Python: how to fix TypeError: unhashable type: 'Column' in a PySpark DataFrame?
I have a DataFrame in which each row of the `removed` column contains a list of words, for example:
+--------------------+-----+
| removed|stars|
+--------------------+-----+
|[giant, best, buy...| 3.0|
|[wow, surprised, ...| 4.0|
|[one, day, satisf...| 3.0|
I want to lemmatize each row:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df_list = df_removed.withColumn("removed",lemmatizer.lemmatize(df_removed["removed"]))
I get the error:
TypeError: unhashable type: 'Column'
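For context, the error happens because `lemmatizer.lemmatize` is a plain Python function: calling it on `df_removed["removed"]` passes a symbolic pyspark `Column` object rather than a string, and along the way that object ends up used as a dict/set key. `Column` overrides `==` to build expressions, so it is unhashable. A toy stand-in class reproduces the same failure (an illustration, not pyspark's actual class):

```python
class FakeColumn:
    # Like pyspark's Column, `==` builds an expression instead of
    # returning a bool; a class that overrides __eq__ without defining
    # __hash__ becomes unhashable in Python 3.
    def __eq__(self, other):
        return ("EQ", self, other)

    __hash__ = None  # explicit, mirroring what Python does automatically

try:
    cache = {FakeColumn(): "value"}  # dict keys must be hashable
except TypeError as exc:
    print(exc)  # unhashable type: 'FakeColumn'
```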
I don't want to use `rdd` and the `map` function; I just want to apply the lemmatizer on the DataFrame. How can I do that, and how do I fix this error?

The `FreqDist` function takes an iterable of hashable objects (made to be strings, but it probably works with anything). The error you are getting is because you pass in an iterable of lists. As you suggested, this is because of the change you made:
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
If I understand correctly, that line applies the nltk.word_tokenize function to a Series. word_tokenize returns a list of words, so you end up with a list of lists. As a solution, simply add the lists together before trying to apply FreqDist, like so:
allWords = []
for wordList in words:
    allWords += wordList
FreqDist(allWords)
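As a side note, the same flattening and counting can be sketched with the standard library alone: nltk's FreqDist counts like collections.Counter, so the pattern works without nltk (toy word lists used here for illustration):

```python
from collections import Counter
from itertools import chain

words = [["giant", "best", "buy"], ["wow", "best"], ["one", "best", "wow"]]

# flatten the list of lists, then count word frequencies
allWords = list(chain.from_iterable(words))
fdist = Counter(allWords)          # FreqDist behaves like a Counter here
print(fdist.most_common(2))        # [('best', 3), ('wow', 2)]
```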
Here is a more complete revision that does what you want. If you only need to identify the second set of 100, note that mclist will hold it the second time around:
df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

lists = df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

# remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]
#Out: ['the',
# ',',
# '.',
# 'of',
# 'and',
#...]
# keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist contains second-most common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
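The two mclist loops above can also be written as list comprehensions; a small self-contained sketch using collections.Counter in place of FreqDist (they share the most_common method):

```python
from collections import Counter

words = ["the", "cat", "the", "dog", "cat", "the"]
fdist = Counter(words)                        # counts like nltk's FreqDist
mclist = [w for w, _count in fdist.most_common(2)]
print(mclist)  # ['the', 'cat']

# keep ALL occurrences of the most common words, as in the answer
words = [w for w in words if w in mclist]
print(words)   # ['the', 'cat', 'the', 'cat', 'the']
```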
Comments:
Look up word_tokenize, e.g. df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)
Create a function def fun(x): return [lemmatizer.lemmatize(i) for i in x], and substitute fun for translate in the linked answer.
Why word_tokenize? I have already split my words; I only need to lemmatize them.
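Putting the comments' suggestion together for the original PySpark question: wrap the per-row list function in a UDF. Below is a minimal sketch; the pyspark/nltk wiring is shown as comments and is untested here (it assumes df_removed with its removed array column, and that the nltk WordNet data is installed). The helper takes the per-word lemmatize callable as a parameter so it can be tried with any function:

```python
def lemmatize_list(words, lemmatize):
    """Apply a per-word lemmatize callable to a list of words."""
    return [lemmatize(w) for w in words]

# Hypothetical wiring with pyspark and nltk installed:
# from nltk.stem import WordNetLemmatizer
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, StringType
# wnl = WordNetLemmatizer()
# lemmatize_udf = udf(lambda ws: lemmatize_list(ws, wnl.lemmatize),
#                     ArrayType(StringType()))
# df_list = df_removed.withColumn("removed", lemmatize_udf("removed"))

# A toy "lemmatizer" that strips a trailing 's', just to show the call shape:
print(lemmatize_list(["cats", "days", "satisfied"], lambda w: w.rstrip("s")))
# ['cat', 'day', 'satisfied']
```

Passing the list function through udf avoids handing a Column to plain Python code, which is what triggered the unhashable-type error.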