Python: How do I merge categories of a categorical column in a DataFrame?
I have a DataFrame:
Date Open High Low Close Struct Trend
2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
The data has two categorical columns, "Struct" and "Trend".
I want to group the data by these two columns.
When I do this:
groups = data.groupby(['Struct', 'Trend'])
pandas can produce up to 6 different combinations of "Struct" and "Trend":
[('ohlc','D'),('ohlc','U'),('ohlc','U/D'),('olhc','D'),('olhc','U'),('olhc','U/D')]
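How can I merge the groups so that every "Trend" category containing "D" as a substring of its value ends up together?
I expect only 4 groups: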
Date Open High Low Close Struct Trend
2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
Date Open High Low Close Struct Trend
2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U/D
2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
2009-12-31 903.25 1130.38 666.79 1115.10 olhc U/D
This is what I tried, but it only gives me a DataFrame, and I need groups:
trend_dtype = pd.api.types.CategoricalDtype(categories=['D', 'U/D'], ordered=False)
data['Trend'] = data['Trend'].astype(trend_dtype)
print(data.dropna())
You can think of the problem as duplicating the rows where Trend is U/D. So here is one approach:
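# Assumes Trend is the last column of df.
# Split Trend on '/', then melt so each U/D row appears once per part.
df = (df.iloc[:, :-1]
        .join(df.Trend.str.split('/', expand=True))
        .melt(id_vars=df.columns[:-1], value_name='Trend')
        .dropna()                  # drop the empty second part of single-valued rows
        .drop('variable', axis=1)  # drop the helper column created by melt
     )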
Your df is then:
Date Open High Low Close Struct Trend
0 2000-12-31 1477.87 1553.10 1254.19 1320.28 ohlc D
1 2001-12-31 1321.62 1383.37 944.07 1148.08 ohlc D
2 2002-12-31 1148.08 1176.97 768.58 879.82 ohlc D
3 2003-12-31 881.69 1112.52 788.90 1111.92 olhc U
4 2004-12-31 1112.61 1217.33 1060.74 1211.92 olhc U
5 2005-12-31 1213.43 1275.80 1136.22 1248.29 olhc U
6 2006-12-31 1252.03 1431.81 1219.29 1418.30 olhc U
7 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc U
8 2008-12-31 1468.36 1471.77 741.02 903.25 ohlc D
9 2009-12-31 903.25 1130.38 666.79 1115.10 olhc U
10 2010-12-31 1115.10 1262.60 1010.91 1257.64 olhc U
11 2011-12-31 1257.62 1370.58 1074.77 1257.60 ohlc U
12 2012-12-31 1258.86 1474.51 1258.86 1426.19 olhc U
13 2013-12-31 1426.19 1849.44 1426.19 1848.36 olhc U
14 2014-12-31 1845.86 2093.55 1737.92 2058.90 olhc U
15 2015-12-31 2058.90 2134.72 1867.01 2043.94 ohlc U
16 2016-12-31 2038.20 2277.53 1810.10 2238.83 olhc U
17 2017-12-31 2251.57 2694.97 2245.13 2673.61 olhc U
18 2018-12-31 2683.73 2940.91 2346.58 2506.85 ohlc U
26 2007-12-31 1418.03 1576.09 1364.14 1468.36 olhc D
28 2009-12-31 903.25 1130.38 666.79 1115.10 olhc D
Note the row pairs (7, 26) and (9, 28).
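Once the U/D rows are duplicated, grouping by both columns should give the four groups the question asks for. The following is a minimal sketch of that last step, not part of the original answer: it uses a six-row subset of the question's data and swaps the join/melt chain for `str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later), which also produces one row per part of Trend. The names `expanded` and `group_sizes` are only illustrative.

```python
import pandas as pd

# A six-row subset of the question's data (only some columns kept).
df = pd.DataFrame({
    "Date": ["2000-12-31", "2003-12-31", "2007-12-31",
             "2008-12-31", "2009-12-31", "2011-12-31"],
    "Close": [1320.28, 1111.92, 1468.36, 903.25, 1115.10, 1257.60],
    "Struct": ["ohlc", "olhc", "olhc", "ohlc", "olhc", "ohlc"],
    "Trend": ["D", "U", "U/D", "D", "U/D", "U"],
})

# Turn 'U/D' into the list ['U', 'D'], then emit one row per list element.
expanded = (
    df.assign(Trend=df["Trend"].str.split("/"))
      .explode("Trend")
      .reset_index(drop=True)
)

# Group on the expanded frame; each U/D row now belongs to both U and D.
groups = expanded.groupby(["Struct", "Trend"])
group_sizes = groups.size().to_dict()
print(group_sizes)

# Individual groups can be pulled out with get_group.
print(groups.get_group(("olhc", "D")))
```

With this approach the rows originally marked U/D are counted in both the U group and the D group of their Struct, so aggregates over the groups will include those rows twice.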
Can you provide an example of the final dataset you want? Aren't groups 1 and 3 of your desired groups the same? And 2 and 4?
@BrianJoseph No, you misread the text: ohlc vs. olhc.
@Trenton_M, oh wow, that is confusing, thanks. It took me several reads to get it right too.