Python TypeError: normalize() argument 2 must be str, not Series, with a DataFrame of strings

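I have a DataFrame containing news headlines for each day, and I am trying to analyze the sentiment intensity of each day, that is, whether the overall sentiment of a day's news is positive, negative, or neutral. Here is the df_news DataFrame: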

    Date    name
0   2017-10-20  Gucci debuts art installation at its Ginza sto...
1   2018-08-01  Gucci Joins Paris Fashion Week for Its Spring ...
2   2018-04-20  Gucci launches its new creative hub Gucci ArtL...
3   2017-10-20  Gucci to launch homeware line Gucci Decor - CP...
4   2017-12-07  GUCCI opens new store at Miami Design District...
5   2018-01-12  Gucci opens Gucci Garden in Florence - LUXUO
6   2018-02-26  GUCCI's wild experiment with the Fall Winter 2...
7   2018-08-09  Gucci Revamped London Flagship Store | The Imp...
8   2018-08-01  Alessandro Michele Announces new Gucci Home co...
9   2017-10-20  Before He Picks Up the CFDA’s International Aw...

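I am trying to get the sentiment intensity with the following code:

However, for some dates I get a TypeError. Thanks to the try/except, those rows are skipped, and I end up with the following table: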

    name    compound    neg neu pos
Date                    
2017-10-20  Gucci debuts art installation at its Ginza sto...               
2018-08-01  Gucci Joins Paris Fashion Week for Its Spring ...               
2018-04-20  Gucci launches its new creative hub Gucci ArtL...   0.4404  0   0.756   0.244
2017-10-20  Gucci to launch homeware line Gucci Decor - CP...               
2017-12-07  GUCCI opens new store at Miami Design District...   0   0   1   0
2018-01-12  Gucci opens Gucci Garden in Florence - LUXUO    0   0   1   0
2018-02-26  GUCCI's wild experiment with the Fall Winter 2...   0   0   1   0
2018-08-09  Gucci Revamped London Flagship Store | The Imp...   0.3182  0   0.602   0.398
2018-08-01  Alessandro Michele Announces new Gucci Home co...               
2017-10-20  Before He Picks Up the CFDA’s International Aw...

However, when I remove the try/except to find out why it fails, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-2e9dbfc62bce> in <module>
      4 for date, row in df_news.T.iteritems():
      5 #    try:
----> 6     sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
      7     #print((sentence))
      8     ss = sid.polarity_scores(str(sentence))

TypeError: normalize() argument 2 must be str, not Series

Loading the data should give back:

    Date    name
0   2017-10-20  Gucci debuts art installation at its Ginza sto...
1   2018-08-01  Gucci Joins Paris Fashion Week for Its Spring ...
2   2018-04-20  Gucci launches its new creative hub Gucci ArtL...
3   2017-10-20  Gucci to launch homeware line Gucci Decor - CP...
4   2017-12-07  GUCCI opens new store at Miami Design District...
5   2018-01-12  Gucci opens Gucci Garden in Florence - LUXUO
6   2018-02-26  GUCCI's wild experiment with the Fall Winter 2...
7   2018-08-09  Gucci Revamped London Flagship Store | The Imp...
8   2018-08-01  Alessandro Michele Announces new Gucci Home co...
9   2017-10-20  Before He Picks Up the CFDA’s International Aw...

Edit: I grouped the articles that appeared on the same day and put them in lists:

# get date out of the index to column    
df_news = df_news.reset_index()
# optional
df_news['Date'] = pd.to_datetime(df_news['Date'])
# groupby and output group rows as list
df_news = df_news.groupby('Date')['name'].apply(list)
df_news.head()

This gives me:

Date
2017-10-20    [Gucci debuts art installation at its Ginza st...
2017-12-07    [GUCCI opens new store at Miami Design Distric...
2018-01-12       [Gucci opens Gucci Garden in Florence - LUXUO]
2018-02-26    [GUCCI's wild experiment with the Fall Winter ...
2018-04-20    [Gucci launches its new creative hub Gucci Art...
2018-08-01    [Gucci Joins Paris Fashion Week for Its Spring...
2018-08-09    [Gucci Revamped London Flagship Store | The Im...
Name: name, dtype: object

So, when I try to apply Stael's answer:

sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))

that is, normalizing each item in the Series, I get the following error:

---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-173-1bc93a0a065c> in <module>
      5     try:
      6         #sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
----> 7         sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
      8         ss = sid.polarity_scores(str(sentence))
      9         df_news.set_value(date, 'compound', ss['compound'])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             # we by definition only have the 0th axis

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
    873 
    874         # no multi-index, so validate all of the indexers
--> 875         self._has_valid_tuple(tup)
    876 
    877         # ugly hack for GH #836

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    218         for i, k in enumerate(key):
    219             if i >= self.obj.ndim:
--> 220                 raise IndexingError('Too many indexers')
    221             try:
    222                 self._validate_key(k, i)

IndexingError: Too many indexers

It looks to me as if, with this line:

sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')

you are trying to call normalize on every item in the Series returned by df_news.loc[...], but pandas does not apply the function across the whole Series for you. I think what you want to do is something like this:

sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))

This is a way of applying a function (normalize) to every item in a Series.
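As a minimal sketch of that per-item application, the helper below (to_ascii is a name introduced here for illustration, as are the sample headlines) wraps the normalize/encode step and decodes the result back to str. The decode step is an addition to the line above: without it each item is a bytes object, and str() on bytes keeps the b'...' wrapper in the text.

```python
import unicodedata

import pandas as pd


def to_ascii(text: str) -> str:
    """Decompose characters with NFKD, then drop everything that is not ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


# Hypothetical sample headlines, not taken from the real data set.
names = pd.Series(["Café Gucci opens", "Before He Picks Up the CFDA’s Award"])

# Series.apply calls to_ascii once per item, which normalize() cannot do by itself.
cleaned = names.apply(to_ascii)
print(cleaned.tolist())
# ['Cafe Gucci opens', 'Before He Picks Up the CFDAs Award']

# str() on undecoded bytes keeps the bytes literal syntax in the string:
wrapped = str("Gucci".encode("ascii", "ignore"))
print(wrapped)  # b'Gucci'
```

This only works when every item in the Series is a str; it does not help when the object returned by .loc is sometimes a str and sometimes a Series.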

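Edit:

Theory 2: when you call df_news.loc[date, 'name'], you are selecting the items with index == date and column == 'name'. Judging from your question, some dates are duplicated in your index, which means that sometimes, instead of getting a single record on which to call unicodedata.normalize, you get a Series, and that causes the error.

You will notice that the records left unfilled by the try/except clause are exactly the ones with duplicated dates.

You need to handle this somehow, perhaps by using the row from iteritems instead of the date, but that is for you to work out.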
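The behaviour described in this theory can be checked on a small frame. The sketch below uses a hypothetical three-row miniature of df_news with one duplicated date; the headlines are shortened and the variable names are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical miniature of df_news: indexed by date, with one date used twice.
df_news = pd.DataFrame(
    {
        "name": [
            "Gucci debuts art installation",
            "Gucci opens Gucci Garden in Florence",
            "Gucci to launch homeware line",
        ]
    },
    index=pd.Index(["2017-10-20", "2018-01-12", "2017-10-20"], name="Date"),
)

unique_hit = df_news.loc["2018-01-12", "name"]     # label occurs once
duplicate_hit = df_news.loc["2017-10-20", "name"]  # label occurs twice

print(type(unique_hit).__name__)     # str
print(type(duplicate_hit).__name__)  # Series

# Iterating over the rows themselves avoids the label lookup altogether,
# so every headline arrives as a plain str regardless of duplicate dates.
headlines = [row["name"] for _, row in df_news.iterrows()]
```

Taking the text from the row rather than looking it up again by date is one way to act on the suggestion above; it is not the only one.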

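Seeing that you have now hit a third error in a single post, I think I need to have another go at this.

First and foremost, I get the feeling that you do not fully understand your own code. An error like AttributeError: 'list' object has no attribute 'apply' tells me that you do not know what your variables are at the point where you operate on them, so I think you need to go more slowly and carefully understand what each part of the code does before moving on to the next.

That said, your problem is not as complicated as you are making it. You are trying to apply these two lines of code

    sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')

    ss = sid.polarity_scores(str(sentence))

to every entry in the 'name' column of your DataFrame, which is not hard.

You can do that quite easily: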

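scores = []
for entry in df['name']:
    # Strip non-ASCII characters, then decode the bytes back to str;
    # str() on a bytes object would pass "b'...'" to the analyzer instead of the text.
    sentence = unicodedata.normalize('NFKD', entry).encode('ascii','ignore').decode('ascii')
    scores.append(sid.polarity_scores(sentence))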

This gives you a list of the dictionaries you are calling ss. You can add their values as columns in your DataFrame like this:

df['new_column'] = [i['example_key'] for i in scores]

This is not the best or most efficient way, but it is a very simple way of achieving what you are trying to do.

Good luck.
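As a sketch of the same idea for all four score keys at once: a list of dictionaries can be turned into a DataFrame and joined back on the row index. To keep the example runnable without NLTK, fake_polarity_scores below is a made-up stand-in that only mimics the shape of what sid.polarity_scores returns; the sample rows are hypothetical too.

```python
import pandas as pd


def fake_polarity_scores(text):
    """Hypothetical stand-in for sid.polarity_scores: same four keys, toy values."""
    hits = sum(word in {"new", "wild"} for word in text.lower().split())
    pos = min(1.0, 0.25 * hits)
    return {"neg": 0.0, "neu": 1.0 - pos, "pos": pos, "compound": pos}


df = pd.DataFrame(
    {
        "Date": ["2017-10-20", "2018-01-12", "2017-10-20"],
        "name": [
            "Gucci debuts art installation",
            "Gucci opens new store",
            "Gucci to launch homeware line",
        ],
    }
)

scores = [fake_polarity_scores(entry) for entry in df["name"]]

# A list of dicts becomes one column per key; reusing df.index keeps rows aligned.
score_columns = pd.DataFrame(scores, index=df.index)
df = df.join(score_columns)
print(df.columns.tolist())
# ['Date', 'name', 'neg', 'neu', 'pos', 'compound']
```

With the real analyzer, the only change would be to build scores with sid.polarity_scores instead of the stand-in. Duplicate dates are harmless here because the rows are matched by position in the index, not by date.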

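If you have already grouped by day and built lists of strings (which, by the way, I do not think you should do), then you need another level of iteration: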

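scores = []
for sentence_list in df['name']:
    # Each entry of the grouped column is a list of headlines for one day
    for entry in sentence_list:
        # Decode back to str so the analyzer does not receive "b'...'"
        sentence = unicodedata.normalize('NFKD', entry).encode('ascii','ignore').decode('ascii')
        scores.append(sid.polarity_scores(sentence))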
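An alternative to the nested loop, sketched here under the assumption that the grouped object is a Series of lists indexed by date (as in the question's edit), is to undo the grouping with Series.explode, which yields one row per headline again. The data below is a hypothetical two-day sample.

```python
import pandas as pd

# Hypothetical grouped Series: one list of headlines per date.
grouped = pd.Series(
    [
        ["Gucci debuts art installation", "Gucci to launch homeware line"],
        ["Gucci opens Gucci Garden in Florence"],
    ],
    index=pd.Index(["2017-10-20", "2018-01-12"], name="Date"),
    name="name",
)

# explode() emits one row per list element and repeats the date for each of them.
flat = grouped.explode()
print(flat.index.tolist())
# ['2017-10-20', '2017-10-20', '2018-01-12']
```

After this, each item of flat is a plain string again, so the single-level loop from earlier in the answer applies unchanged.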

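Hmm, then it raises a SyntaxError: invalid syntax on ss = sid.polarity_scores(str(sentence))

What do you expect the variable sentence to be? In the first case you are operating on a Series, so you probably want something like a Series to come out of it. You cannot accept either a str or a Series; that does not make sense.

Sorry!! I was missing a parenthesis. The actual error is an AttributeError on the 'str' object at df_news.loc[date, 'name'].apply(lambda…

OK, I think I am starting to understand. I think df_news.loc[date, 'name'] sometimes gives you a string and sometimes gives you a Series. I can see from your question that the date '2017-10-20' appears more than once in the index. In that case you get a Series instead of a string, and you need to handle that somehow before you can normalize it.

@Passenger: I have edited my answer to try to make it clearer.

Thanks for your help. But I still get a TypeError with sentence = unicodedata.normalize('NFKD', entry).encode('ascii', 'ignore'), because entry is a list. When I use sentence = df_news.loc[date, 'name'] instead, I can apply ss = sid.polarity_scores(str(sentence)). For some sentences it returns a neu score of 1; for the others it does not seem to work.

I do not see any reason to group the strings into lists. I think you did it because there are duplicates in the date index, but that is not really a problem: this approach should handle duplicate dates fine.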
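For the original goal of one overall sentiment per day, one option is to score every headline separately and only then aggregate by date. The sketch below assumes the per-headline compound scores already exist (the values are hypothetical), averages them per day, and labels the result with a ±0.05 cut-off, a commonly cited convention for VADER compound scores. Both the averaging and the threshold are assumptions, not something stated in the thread.

```python
import pandas as pd

# Hypothetical per-headline scores; duplicate dates are expected and fine here.
scored = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2017-10-20", "2017-10-20", "2018-04-20", "2018-01-12"]
        ),
        "compound": [0.5, -0.1, 0.4404, 0.0],
    }
)

# One value per day: the mean compound score of that day's headlines.
daily = scored.groupby("Date")["compound"].mean()


def label(compound, threshold=0.05):
    """Map a compound score to a coarse label (threshold is an assumed convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


daily_label = daily.apply(label)
print(daily_label)
```

Grouping the numeric scores, instead of grouping the headline strings into lists before scoring, keeps every intermediate value a plain str or float.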