Python TypeError: normalize() argument 2 must be str, not Series, with a DataFrame of strings
I have a DataFrame containing the news headlines for each day, and I am trying to analyze the sentiment intensity of a day, that is, whether the overall sentiment of that day's news is positive, negative, or neutral. Here is the df_news DataFrame:
Date name
0 2017-10-20 Gucci debuts art installation at its Ginza sto...
1 2018-08-01 Gucci Joins Paris Fashion Week for Its Spring ...
2 2018-04-20 Gucci launches its new creative hub Gucci ArtL...
3 2017-10-20 Gucci to launch homeware line Gucci Decor - CP...
4 2017-12-07 GUCCI opens new store at Miami Design District...
5 2018-01-12 Gucci opens Gucci Garden in Florence - LUXUO
6 2018-02-26 GUCCI's wild experiment with the Fall Winter 2...
7 2018-08-09 Gucci Revamped London Flagship Store | The Imp...
8 2018-08-01 Alessandro Michele Announces new Gucci Home co...
9 2017-10-20 Before He Picks Up the CFDA’s International Aw...
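I am trying to get the sentiment intensity with the following code:
However, for some dates I get a TypeError. Because of the try/except, those rows are skipped, and I end up with the following table: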
name compound neg neu pos
Date
2017-10-20 Gucci debuts art installation at its Ginza sto...
2018-08-01 Gucci Joins Paris Fashion Week for Its Spring ...
2018-04-20 Gucci launches its new creative hub Gucci ArtL... 0.4404 0 0.756 0.244
2017-10-20 Gucci to launch homeware line Gucci Decor - CP...
2017-12-07 GUCCI opens new store at Miami Design District... 0 0 1 0
2018-01-12 Gucci opens Gucci Garden in Florence - LUXUO 0 0 1 0
2018-02-26 GUCCI's wild experiment with the Fall Winter 2... 0 0 1 0
2018-08-09 Gucci Revamped London Flagship Store | The Imp... 0.3182 0 0.602 0.398
2018-08-01 Alessandro Michele Announces new Gucci Home co...
2017-10-20 Before He Picks Up the CFDA’s International Aw...
However, when I remove the try/except to find out why it fails, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-2e9dbfc62bce> in <module>
4 for date, row in df_news.T.iteritems():
5 # try:
----> 6 sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
7 #print((sentence))
8 ss = sid.polarity_scores(str(sentence))
TypeError: normalize() argument 2 must be str, not Series
Getting the data
It should return:
Date name
0 2017-10-20 Gucci debuts art installation at its Ginza sto...
1 2018-08-01 Gucci Joins Paris Fashion Week for Its Spring ...
2 2018-04-20 Gucci launches its new creative hub Gucci ArtL...
3 2017-10-20 Gucci to launch homeware line Gucci Decor - CP...
4 2017-12-07 GUCCI opens new store at Miami Design District...
5 2018-01-12 Gucci opens Gucci Garden in Florence - LUXUO
6 2018-02-26 GUCCI's wild experiment with the Fall Winter 2...
7 2018-08-09 Gucci Revamped London Flagship Store | The Imp...
8 2018-08-01 Alessandro Michele Announces new Gucci Home co...
9 2017-10-20 Before He Picks Up the CFDA’s International Aw...
Edit:
I grouped the articles that appeared on the same day and put them in a list:
# get date out of the index to column
df_news = df_news.reset_index()
# optional
df_news['Date'] = pd.to_datetime(df_news['Date'])
# groupby and output group rows as list
df_news = df_news.groupby('Date')['name'].apply(list)
df_news.head()
It gave me:
Date
2017-10-20 [Gucci debuts art installation at its Ginza st...
2017-12-07 [GUCCI opens new store at Miami Design Distric...
2018-01-12 [Gucci opens Gucci Garden in Florence - LUXUO]
2018-02-26 [GUCCI's wild experiment with the Fall Winter ...
2018-04-20 [Gucci launches its new creative hub Gucci Art...
2018-08-01 [Gucci Joins Paris Fashion Week for Its Spring...
2018-08-09 [Gucci Revamped London Flagship Store | The Im...
Name: name, dtype: object
So when I try to apply Stael's answer:
sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
that is, normalizing each item in the Series, I get the following error:
---------------------------------------------------------------------------
IndexingError Traceback (most recent call last)
<ipython-input-173-1bc93a0a065c> in <module>
5 try:
6 #sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
----> 7 sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
8 ss = sid.polarity_scores(str(sentence))
9 df_news.set_value(date, 'compound', ss['compound'])
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1470 except (KeyError, IndexError):
1471 pass
-> 1472 return self._getitem_tuple(key)
1473 else:
1474 # we by definition only have the 0th axis
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
873
874 # no multi-index, so validate all of the indexers
--> 875 self._has_valid_tuple(tup)
876
877 # ugly hack for GH #836
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
218 for i, k in enumerate(key):
219 if i >= self.obj.ndim:
--> 220 raise IndexingError('Too many indexers')
221 try:
222 self._validate_key(k, i)
IndexingError: Too many indexers
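One possible reading of the IndexingError above, sketched on made-up data: after groupby(...)['name'].apply(list), df_news is a Series rather than a DataFrame, so it has a single axis and .loc[date, 'name'] passes one indexer too many. The names frame, grouped, error_name and same_day are illustrative only.

```python
import pandas as pd

# Made-up stand-in for df_news before grouping.
frame = pd.DataFrame({
    'Date': ['2017-10-20', '2018-01-12', '2017-10-20'],
    'name': ['headline A', 'headline B', 'headline C'],
})

# Same grouping as in the edit above: the result is a Series of lists.
grouped = frame.groupby('Date')['name'].apply(list)

# A Series has only one axis, so a (row, column) pair is one indexer too many.
try:
    grouped.loc['2017-10-20', 'name']
    error_name = None
except Exception as exc:  # the exception class has moved between pandas versions
    error_name = type(exc).__name__

# With a single label the lookup succeeds and returns that day's list.
same_day = grouped.loc['2017-10-20']
```

If this reading is right, indexing the grouped Series with the date alone returns the list of headlines for that day, which then has to be iterated item by item.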
It looks to me as though, with this line:
sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
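you are trying to call normalize on every item in the df_news.loc[…] Series,
but pandas does not apply the function across the whole Series for you. I think what you want is something like this: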
sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
This is a way of applying a function (normalize) to each item in a Series.
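As a minimal sketch of that idea on made-up headlines (the helper to_ascii and the sample strings are assumptions, not part of the question), Series.apply calls the function once per element:

```python
import unicodedata

import pandas as pd


def to_ascii(text):
    """NFKD-normalize a string, then drop every character that is not ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


# Made-up headlines containing accented characters.
headlines = pd.Series(['Café opens in Florence', 'Gucci Décor launch'])

# apply() passes each element (a str) to to_ascii, never the whole Series.
cleaned = headlines.apply(to_ascii)
```

This only works when the object on the left really is a Series; a plain str or a list has no apply method.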
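Edit: Theory 2. When you call df_news.loc[date, 'name'], you are selecting the items where index == date and column == 'name'. From your question, though, some dates are duplicated in your index, which means that sometimes, instead of getting a single record on which to call unicodedata.normalize, you get a Series, and that causes the error.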
You will notice that the records left unfilled when you use the try/except clause are exactly the ones with duplicate dates.
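The behaviour described here can be reproduced on a small made-up frame (df_demo and its headlines are illustrative, not the real data): a label that occurs once returns a str, while a label that occurs more than once returns a Series.

```python
import pandas as pd

# Made-up frame whose date index contains a duplicate, as in the question.
df_demo = pd.DataFrame(
    {'name': ['headline A', 'headline B', 'headline C']},
    index=['2017-10-20', '2018-01-12', '2017-10-20'],
)

# A unique label selects a single cell, so the result is a plain str.
unique_hit = df_demo.loc['2018-01-12', 'name']

# A duplicated label selects several rows, so the result is a Series.
duplicate_hit = df_demo.loc['2017-10-20', 'name']

# Iterating over the column itself always yields one str per row,
# regardless of how often a date repeats.
per_row = [type(value).__name__ for value in df_demo['name']]
```

Iterating over the values rather than looking rows up by date sidesteps the ambiguity entirely.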
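You need to handle this somehow, perhaps by using row from iteritems instead of date, but that is for you to work out.
Seeing that you have now run into a third error on the same post, I think I need to have another go at this.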
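First and foremost, I get the impression that you do not fully understand your own code. An error like AttributeError: 'list' object has no attribute 'apply' tells me that you do not know what your variables are while you are operating on them, so I think you need to go more slowly and work out carefully what each part of the code does before moving on to the next.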
That said, your problem is not as complicated as you are making it. You are trying to apply these two lines of code
sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
ss = sid.polarity_scores(str(sentence))
to every entry in the 'name' column of your DataFrame, which is not hard.
You can do that quite easily:
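scores = []
# iterate over the column values, so each entry is a single string
for entry in df['name']:
    sentence = unicodedata.normalize('NFKD', entry).encode('ascii','ignore')
    # one dict of scores per headline, in row order
    scores.append(sid.polarity_scores(str(sentence)))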
This gives you a list of the dictionaries you were calling ss.
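One detail worth checking in the loop above: .encode('ascii', 'ignore') returns bytes, and calling str() on a bytes object keeps the b'...' wrapper in the resulting text. A minimal sketch, assuming plain text is what the scorer should receive, decodes the bytes instead (the sample headline is made up):

```python
import unicodedata

entry = 'Gucci Décor'  # made-up headline with an accented character

# encode() produces a bytes object, not a str.
raw = unicodedata.normalize('NFKD', entry).encode('ascii', 'ignore')

# str() on bytes gives the repr, including the leading b and the quotes.
as_str = str(raw)

# decode() turns the bytes back into clean text.
decoded = raw.decode('ascii')
```

Whether the extra characters change the scores in practice depends on the analyzer, so treat this as something to verify rather than a confirmed cause of any result in the question.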
You can add these scores as columns in your DataFrame like this:
df['new_column'] = [i['example_key'] for i in scores]
This is not the best or the most efficient way to do it, but it is a very simple way to achieve what you are trying to do.
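To make the column step concrete, here is a sketch that fills all four score columns from the list of dictionaries. The function stand_in_scores is a hypothetical placeholder for sid.polarity_scores so that the example runs without NLTK; its numbers are invented and carry no meaning.

```python
import pandas as pd


def stand_in_scores(text):
    """Hypothetical placeholder for sid.polarity_scores; values are invented."""
    pos = 0.5 if 'great' in text.lower() else 0.0
    return {'neg': 0.0, 'neu': 1.0 - pos, 'pos': pos, 'compound': pos}


# Made-up frame with a duplicated date in the index.
df = pd.DataFrame(
    {'name': ['A great new store', 'Store opens']},
    index=['2017-10-20', '2017-10-20'],
)

# One dict per row, in row order.
scores = [stand_in_scores(entry) for entry in df['name']]

# Assigning a list is positional, so duplicate index labels are not a problem.
for key in ('compound', 'neg', 'neu', 'pos'):
    df[key] = [score[key] for score in scores]
```

Because the lists are assigned by position rather than by label, every row gets its own scores even when several rows share a date.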
Good luck.
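If you have already grouped by day and collected the strings into lists (which, by the way, I do not think you should do), then you need another layer of iteration: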
scores = []
for sentence_list in df['name']:
    for entry in sentence_list:
        sentence = unicodedata.normalize('NFKD', entry).encode('ascii','ignore')
        scores.append(sid.polarity_scores(str(sentence)))
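Hmm, then it raises SyntaxError: invalid syntax on ss = sid.polarity_scores(str(sentence)).
What do you expect the variable sentence to be? In the first case you are operating on a Series, so you presumably want something like a Series to come out of it. You cannot accept either a str or a Series; that does not make sense.
Sorry!! I was missing a parenthesis. The actual error is an AttributeError on the 'str' object at df_news.loc[date,'name'].apply(lambda…
OK, I think I am starting to understand. I think df_news.loc[date,'name'] sometimes gives you a string and sometimes gives you a Series. I can see from your question that the date '2017-10-20' appears more than once in the index. In that case you get a Series rather than a string, and you need to handle that somehow before you can normalize it.
@乘客: I have edited my answer to try to make it clearer.
Thanks for your help. But I still get a TypeError with sentence = unicodedata.normalize('NFKD', entry).encode('ascii','ignore'), because entry is a list. When I instead do sentence = df_news.loc[date,'name'], I can apply ss = sid.polarity_scores(str(sentence)). For some sentences, where the returned neu score is 1, the other one does not seem to work.
I do not see any reason to group the strings into lists. I assume you did that because of the duplicates in the date index, but that is not really a problem: this approach should handle duplicate dates fine.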
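For the original goal of one overall sentiment per day, a possible follow-up is to score each headline on its own row and only then aggregate by date. The sketch below uses made-up compound values, and taking the mean is an assumption; a different aggregation may suit the analysis better.

```python
import pandas as pd

# Made-up per-headline scores; the duplicated date stays in the index.
scored = pd.DataFrame(
    {
        'name': ['headline A', 'headline B', 'headline C'],
        'compound': [0.5, 0.0, 0.25],
    },
    index=pd.Index(['2017-10-20', '2018-01-12', '2017-10-20'], name='Date'),
)

# Group on the index level and average the compound score per day.
daily = scored.groupby(level='Date')['compound'].mean()
```

This keeps one string per row during scoring, so neither the TypeError nor the IndexingError can occur, and the duplicate dates are resolved only at the aggregation step.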