Pandas: splitting a Series that contains lists of strings into multiple columns


I'm using pandas to do some string matching on a Twitter dataset.

I've imported a CSV of tweets and indexed it by date. I then created a new column containing the text matches:

In [1]:
import pandas as pd
indata = pd.read_csv('tweets.csv')
indata.index = pd.to_datetime(indata["Date"])
indata["matches"] = indata.Tweet.str.findall("rudd|abbott")
only_results = pd.Series(indata["matches"])
only_results.head(10)

Out[1]:
Date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object

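What I'd like to end up with is a DataFrame, grouped by day/month, with the different search terms as columns that I can then plot.

I came across what looked like the perfect solution in another SO answer, but when I tried to apply it to this Series I got an exception: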

In [2]: only_results.apply(lambda x: pd.Series(1,index=x)).fillna(0)
Out [2]: Exception - Traceback (most recent call last)
...
Exception: Reindexing only valid with uniquely valued Index objects

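I'd really like to be able to make these changes within the DataFrame, so that I can apply and re-apply groupby conditions and plot efficiently. I'd also love to learn more about how the .apply() method works.

Thanks in advance.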

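Update after a successful answer

The problem was duplicates in the "matches" column, which I hadn't spotted. I iterated over the column to remove the duplicates and then used the original solution that @Jeff linked above. That worked, and I can now use .groupby() on the result to look at daily, hourly, etc. trends. Here's an example of the resulting plot: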

In [3]: successful_run = only_results.apply(lambda x: pd.Series(1,index=x)).fillna(0)
In [4]: successful_run.groupby([successful_run.index.day,successful_run.index.hour]).sum().plot()

Out [4]: <matplotlib.axes.AxesSubplot at 0x110b51650>
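
The de-duplication step mentioned in the update is not shown. A minimal sketch of one way it could be done, assuming a small made-up stand-in for only_results (the sample data and the names deduped, indicators and hourly are illustrative, not from the original post):

```python
import pandas as pd

# Made-up stand-in for only_results: lists of matches, one with a repeated term.
only_results = pd.Series(
    [[], ["abbott"], ["rudd", "rudd"], ["rudd", "abbott"]],
    index=pd.to_datetime([
        "2013-08-06 16:03:17",
        "2013-08-06 16:03:07",
        "2013-08-06 16:03:07",
        "2013-08-06 15:59:00",
    ]),
    name="matches",
)

# Drop repeated terms within each tweet, keeping first-seen order.
deduped = only_results.apply(lambda terms: list(dict.fromkeys(terms)))

# One indicator column per term; terms missing from a tweet become 0.
indicators = deduped.apply(lambda terms: pd.Series(1, index=terms)).fillna(0)

# Totals per (day, hour), ready for .plot().
hourly = indicators.groupby([indicators.index.day, indicators.index.hour]).sum()
print(hourly)
```

Using dict.fromkeys rather than set keeps the terms in the order they were matched, which is only a cosmetic choice here.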


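You have some duplicated results (e.g. rudd is mentioned more than once in a single tweet), hence the exception (see below).

I think it would be better to count the occurrences rather than keep the lists that findall returns (pandas data structures aren't designed to hold lists, even though str.findall produces them).
I'd suggest something like the following: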

In [1]: s = pd.Series(['aa', 'aba', 'b'])

In [2]: pd.DataFrame({key: s.str.count(key) for key in ['a', 'b']})
Out[2]: 
   a  b
0  2  0
1  2  1
2  0  1
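
Applied to the tweet data from the question, the same idea might look like the sketch below. The sample rows are made up to stand in for tweets.csv, and terms, counts and hourly_counts are illustrative names:

```python
import pandas as pd

# Made-up sample standing in for tweets.csv (column names follow the question).
indata = pd.DataFrame({
    "Date": ["2013-08-06 16:03:07", "2013-08-06 16:03:00", "2013-08-06 15:59:30"],
    "Tweet": ["abbott vs rudd", "rudd, rudd and more rudd", "nothing relevant"],
})
indata.index = pd.to_datetime(indata["Date"])

# One integer column per search term: how often it occurs in each tweet.
terms = ["rudd", "abbott"]
counts = pd.DataFrame({term: indata["Tweet"].str.count(term) for term in terms})

# Sum the counts per hour of the day.
hourly_counts = counts.groupby(counts.index.hour).sum()
print(hourly_counts)
```

Because every cell is a plain integer, repeated mentions within one tweet simply add up instead of producing duplicate labels.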

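Note that with this example the exception is caused by the duplicated 'a' found in each of the first two rows.

First reset the index, then use the solution you mentioned: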
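
A small sketch of how to spot the rows that would cause the problem, assuming the same example Series s as above (found and labels_unique are illustrative names):

```python
import pandas as pd

s = pd.Series(['aa', 'aba', 'b'])

# findall returns one list per row; repeated matches appear more than once.
found = s.str.findall('a')

# Combining the per-row Series into one frame needs unique labels in each row,
# so flag the rows whose match lists contain repeats.
labels_unique = found.apply(lambda labels: pd.Index(labels).is_unique)

print(found.tolist())
print(labels_unique.tolist())
```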

In [28]: s
Out[28]:
Date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object

In [29]: df = s.reset_index()

In [30]: df.join(df.matches.apply(lambda x: pd.Series(1, index=x)).fillna(0))
Out[30]:
                 Date   matches  abbott  rudd
0 2013-08-06 16:03:17        []       0     0
1 2013-08-06 16:03:12        []       0     0
2 2013-08-06 16:03:10        []       0     0
3 2013-08-06 16:03:09        []       0     0
4 2013-08-06 16:03:08        []       0     0
5 2013-08-06 16:03:07        []       0     0
6 2013-08-06 16:03:07  [abbott]       1     0
7 2013-08-06 16:03:06        []       0     0
8 2013-08-06 16:03:02        []       0     0
9 2013-08-06 16:03:00    [rudd]       0     1

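Unless you have an explicit use case for a DatetimeIndex (usually one that involves some kind of resampling and has no duplicates), it's better to keep the dates in a column, because that is more flexible than having them as the index, especially if that index contains duplicates.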
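
As a rough illustration of that flexibility, here is a sketch that groups on a Date column through the .dt accessor. The frame below is made-up data shaped like Out[30], and by_hour and by_minute are illustrative names:

```python
import pandas as pd

# Made-up frame shaped like Out[30]: dates in a column, one column per term.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-08-06 16:03:07",
        "2013-08-06 16:03:07",  # duplicate timestamps are harmless in a column
        "2013-08-06 15:58:00",
    ]),
    "abbott": [1, 0, 0],
    "rudd": [0, 1, 1],
})
terms = ["abbott", "rudd"]

# Group by any derived key without touching the index.
by_hour = df.groupby(df["Date"].dt.hour)[terms].sum()
by_minute = df.groupby(df["Date"].dt.floor("min"))[terms].sum()

print(by_hour)
print(by_minute)
```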

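As for the apply method, it does slightly different things on different objects. For example, DataFrame.apply() calls the passed-in callable on each column by default, but you can pass axis=1 to call it on each row instead. Series.apply() calls the passed-in callable on each element of the Series instance. In the very clever solution that @Jeff provided, the following happens: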

In [12]: s
Out[12]:
Date
2013-08-06 16:03:17          []
2013-08-06 16:03:12          []
2013-08-06 16:03:10          []
2013-08-06 16:03:09          []
2013-08-06 16:03:08          []
2013-08-06 16:03:07          []
2013-08-06 16:03:07    [abbott]
2013-08-06 16:03:06          []
2013-08-06 16:03:02          []
2013-08-06 16:03:00      [rudd]
Name: matches, dtype: object

In [13]: pd.lib.map_infer(s.values, lambda x: Series(1, index=x)).tolist()
Out[13]:
[Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 Series([], dtype: int64),
 abbott    1
dtype: int64,
 Series([], dtype: int64),
 Series([], dtype: int64),
 rudd    1
dtype: int64]

In [14]: pd.core.frame._to_arrays(_13, columns=None)
Out[14]:
(array([[ nan,  nan,  nan,  nan,  nan,  nan,   1.,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,   1.]]),
 Index([u'abbott', u'rudd'], dtype=object))

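Each empty Series in Out[13] is given a nan value to indicate that it has no value at either of our two column labels. In this example, that column index is Index([u'abbott', u'rudd'], dtype=object). Where there is a value at a column label, that value is kept.

Keep in mind that these are low-level details that users normally don't have to worry about. I was curious, so I followed the trail through the code.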
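
The difference between DataFrame.apply() and Series.apply() described earlier can be seen in a small sketch. The frame and the names col_sums, row_sums and lengths are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Default (axis=0): the callable receives each column as a Series.
col_sums = df.apply(lambda col: col.sum())

# axis=1: the callable receives each row as a Series.
row_sums = df.apply(lambda row: row.sum(), axis=1)

# Series.apply: the callable receives each element in turn.
lengths = pd.Series(["rudd", "abbott"]).apply(len)

print(col_sums.tolist(), row_sums.tolist(), lengths.tolist())
```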

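I don't think the exception is caused by duplicates in the DatetimeIndex, but rather by duplicates in the x of Series(1, index=x) (I could be wrong)...

@Andy I think you're right. In fact, I can't reproduce the exception using the OP's data.

You can reproduce the exception with dups in the findall result (see my answer). I think Jeff's apply solution should work fine with dups in the DatetimeIndex.

@Andy Yes, this fails when your data contains something like ['rudd', 'rudd'] (as in your example). That's because an indexer can't be created for a duplicated Index. Try i = Index(['a', 'a']); i.get_indexer(i)

Oh, this is more of an elaboration on Jeff's witty answer. (I was confused!)

Thanks @Andy - I'll run the code this morning (I'm in Sydney) and mark it as accepted once it's sorted.

I've marked this as accepted because it points out the main problem: the duplicates passed to the lambda function. I don't actually need those duplicates; they are only "tags" for information. I didn't really follow @Andy's approach. Instead I removed the duplicates from the "matches" column and then used Jeff's solution from the earlier answer successfully.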
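
The one-liner suggested in the comments can be tried as below. This is only a sketch: the exact exception class is an assumption (recent pandas versions appear to raise InvalidIndexError), so the code catches Exception broadly, and the names raised, unique and positions are illustrative.

```python
import pandas as pd

i = pd.Index(['a', 'a'])
print(i.is_unique)

# Asking a non-unique Index for an indexer is ambiguous, so pandas refuses.
try:
    i.get_indexer(i)
    raised = False
except Exception as exc:
    raised = True
    print(type(exc).__name__, exc)

# After dropping the repeats, the lookup is well defined.
unique = i.unique()
positions = unique.get_indexer(['a'])
print(positions)
```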