Python: How do I merge categories of a categorical column in a DataFrame?

I have a DataFrame:

Date        Open     High      Low     Close     Struct  Trend                                           
2000-12-31  1477.87  1553.10  1254.19  1320.28   ohlc     D
2001-12-31  1321.62  1383.37   944.07  1148.08   ohlc     D
2002-12-31  1148.08  1176.97   768.58   879.82   ohlc     D
2003-12-31   881.69  1112.52   788.90  1111.92   olhc     U
2004-12-31  1112.61  1217.33  1060.74  1211.92   olhc     U
2005-12-31  1213.43  1275.80  1136.22  1248.29   olhc     U
2006-12-31  1252.03  1431.81  1219.29  1418.30   olhc     U
2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
2008-12-31  1468.36  1471.77   741.02   903.25   ohlc     D
2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D
2010-12-31  1115.10  1262.60  1010.91  1257.64   olhc     U
2011-12-31  1257.62  1370.58  1074.77  1257.60   ohlc     U
2012-12-31  1258.86  1474.51  1258.86  1426.19   olhc     U
2013-12-31  1426.19  1849.44  1426.19  1848.36   olhc     U
2014-12-31  1845.86  2093.55  1737.92  2058.90   olhc     U
2015-12-31  2058.90  2134.72  1867.01  2043.94   ohlc     U
2016-12-31  2038.20  2277.53  1810.10  2238.83   olhc     U
2017-12-31  2251.57  2694.97  2245.13  2673.61   olhc     U
2018-12-31  2683.73  2940.91  2346.58  2506.85   ohlc     U
The data has two categorical columns, "Struct" and "Trend".

I want to group the data by these two columns.

When I do this:

groups = data.groupby(['Struct', 'Trend'])
pandas may come up with 6 different combinations of "Struct" and "Trend": [('ohlc', 'D'), ('ohlc', 'U'), ('ohlc', 'U/D'), ('olhc', 'D'), ('olhc', 'U'), ('olhc', 'U/D')]
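To illustrate where the 6 combinations come from, here is a minimal sketch. It assumes both columns really use the pandas `category` dtype and uses a tiny made-up frame instead of the full data. With `observed=False`, `groupby` reports every pairing of the declared categories, even pairings that never occur in the rows.

```python
import pandas as pd

# Tiny stand-in for the real data (assumption: both columns are categorical)
data = pd.DataFrame({
    "Struct": pd.Categorical(["ohlc", "olhc", "olhc"], categories=["ohlc", "olhc"]),
    "Trend": pd.Categorical(["D", "U", "U/D"], categories=["D", "U", "U/D"]),
})

# observed=False: one entry per pairing of declared categories (2 x 3 = 6)
all_pairs = data.groupby(["Struct", "Trend"], observed=False).size()

# observed=True: only the pairings that actually appear in the rows
seen_pairs = data.groupby(["Struct", "Trend"], observed=True).size()

print(len(all_pairs), len(seen_pairs))
```

Passing `observed` explicitly avoids depending on the default, which differs between pandas versions.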

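How can I merge the groups so that a row counts toward a group whenever its "Trend" category contains 'D' as a substring of the value?

I expect only 4 groups: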

  • ('ohlc', 'D') + ('ohlc', 'U/D') --> ('ohlc', 'D')
  • ('ohlc', 'U') + ('ohlc', 'U/D') --> ('ohlc', 'U')
  • ('olhc', 'D') + ('ohlc', 'U/D') --> ('olhc', 'D')
  • ('olhc', 'U') + ('ohlc', 'U/D') --> ('olhc', 'U')

Simply put, every "D" group must include all "D" and "U/D" data, and every "U" group must include all "U" and "U/D" data.
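The rule above can be sketched directly with boolean masks. This is only an illustration on a small made-up frame: the `groups` dictionary and the loop are assumptions of the sketch, not part of the original code. Each label is matched against the parts of `Trend` split on `/`, so a `U/D` row lands in both the "U" and the "D" group of its `Struct`.

```python
import pandas as pd

# Small illustrative frame with the same column names as the question
data = pd.DataFrame({
    "Struct": ["ohlc", "olhc", "olhc", "ohlc", "olhc"],
    "Trend": ["D", "U", "U/D", "U", "U/D"],
    "Close": [1320.28, 1111.92, 1468.36, 1257.60, 1115.10],
})

groups = {}  # hypothetical container: (Struct, label) -> sub-frame
for struct, frame in data.groupby("Struct"):
    for label in ("D", "U"):
        # Split on "/" so that "U/D" matches both "U" and "D"
        mask = frame["Trend"].str.split("/").apply(lambda parts: label in parts)
        if mask.any():
            groups[(struct, label)] = frame[mask]

for key, frame in groups.items():
    print(key, list(frame.index))
```

Splitting on `/` rather than calling `str.contains` keeps the match exact, which matters if a label could ever be a substring of another label.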

    Edit:

    Expected result for the sample above:

    Date        Open     High      Low     Close     Struct  Trend                                           
    2003-12-31   881.69  1112.52   788.90  1111.92   olhc     U
    2004-12-31  1112.61  1217.33  1060.74  1211.92   olhc     U
    2005-12-31  1213.43  1275.80  1136.22  1248.29   olhc     U
    2006-12-31  1252.03  1431.81  1219.29  1418.30   olhc     U
    2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
    2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D
    2010-12-31  1115.10  1262.60  1010.91  1257.64   olhc     U
    2011-12-31  1257.62  1370.58  1074.77  1257.60   ohlc     U
    2012-12-31  1258.86  1474.51  1258.86  1426.19   olhc     U
    2013-12-31  1426.19  1849.44  1426.19  1848.36   olhc     U
    2014-12-31  1845.86  2093.55  1737.92  2058.90   olhc     U
    2015-12-31  2058.90  2134.72  1867.01  2043.94   ohlc     U
    2016-12-31  2038.20  2277.53  1810.10  2238.83   olhc     U
    2017-12-31  2251.57  2694.97  2245.13  2673.61   olhc     U
    2018-12-31  2683.73  2940.91  2346.58  2506.85   ohlc     U
    
    
    
    Date        Open     High      Low     Close     Struct  Trend                                           
    2000-12-31  1477.87  1553.10  1254.19  1320.28   ohlc     D
    2001-12-31  1321.62  1383.37   944.07  1148.08   ohlc     D
    2002-12-31  1148.08  1176.97   768.58   879.82   ohlc     D
    2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc   U/D
    2008-12-31  1468.36  1471.77   741.02   903.25   ohlc     D
    2009-12-31   903.25  1130.38   666.79  1115.10   olhc   U/D
    
    I did it like this, but I only get a DataFrame and I need groups:

    trend_dtype = pd.api.types.CategoricalDtype(categories=['D', 'U/D'], ordered=False)
    data['Trend'] = data['Trend'].astype(trend_dtype)
    print(data.dropna())
    
    You can treat the problem as duplicating the rows whose Trend is U/D. So here is one approach:

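    # Turn each 'U/D' row into one 'U' row and one 'D' row (Trend must be the last column)
    df = (df.iloc[:, :-1]                                      # every column except Trend
       .join(df.Trend.str.split('/', expand=True))             # split 'U/D' into columns 0 and 1
       .melt(id_vars=df.columns[:-1], value_name='Trend')      # stack columns 0 and 1 into Trend
       .dropna()                                               # plain 'U'/'D' rows have no second part
       .drop('variable', axis=1)                               # drop the helper column added by melt
    )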
    
    Your df is then:

              Date     Open     High      Low    Close Struct Trend
    0   2000-12-31  1477.87  1553.10  1254.19  1320.28   ohlc     D
    1   2001-12-31  1321.62  1383.37   944.07  1148.08   ohlc     D
    2   2002-12-31  1148.08  1176.97   768.58   879.82   ohlc     D
    3   2003-12-31   881.69  1112.52   788.90  1111.92   olhc     U
    4   2004-12-31  1112.61  1217.33  1060.74  1211.92   olhc     U
    5   2005-12-31  1213.43  1275.80  1136.22  1248.29   olhc     U
    6   2006-12-31  1252.03  1431.81  1219.29  1418.30   olhc     U
    7   2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc     U
    8   2008-12-31  1468.36  1471.77   741.02   903.25   ohlc     D
    9   2009-12-31   903.25  1130.38   666.79  1115.10   olhc     U
    10  2010-12-31  1115.10  1262.60  1010.91  1257.64   olhc     U
    11  2011-12-31  1257.62  1370.58  1074.77  1257.60   ohlc     U
    12  2012-12-31  1258.86  1474.51  1258.86  1426.19   olhc     U
    13  2013-12-31  1426.19  1849.44  1426.19  1848.36   olhc     U
    14  2014-12-31  1845.86  2093.55  1737.92  2058.90   olhc     U
    15  2015-12-31  2058.90  2134.72  1867.01  2043.94   ohlc     U
    16  2016-12-31  2038.20  2277.53  1810.10  2238.83   olhc     U
    17  2017-12-31  2251.57  2694.97  2245.13  2673.61   olhc     U
    18  2018-12-31  2683.73  2940.91  2346.58  2506.85   ohlc     U
    26  2007-12-31  1418.03  1576.09  1364.14  1468.36   olhc     D
    28  2009-12-31   903.25  1130.38   666.79  1115.10   olhc     D
    

    Note the row pairs (7, 26) and (9, 28): each original U/D row now appears once as U and once as D.
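Since the question asks for groups rather than a single frame, the expanded frame can then be passed to `groupby`. The sketch below is a hedged alternative that swaps in `Series.str.split` plus `DataFrame.explode` (available from pandas 0.25) in place of the `melt` chain above; the small frame and the name `expanded` are made up for illustration.

```python
import pandas as pd

# Small illustrative frame with the same column names as the question
df = pd.DataFrame({
    "Struct": ["ohlc", "olhc", "olhc", "ohlc", "olhc"],
    "Trend": ["D", "U", "U/D", "U", "U/D"],
    "Close": [1320.28, 1111.92, 1468.36, 1257.60, 1115.10],
})

# Replace "U/D" with the list ["U", "D"], then emit one row per list element
expanded = df.assign(Trend=df["Trend"].str.split("/")).explode("Trend")

# Grouping the expanded frame gives at most 4 (Struct, Trend) groups
groups = expanded.groupby(["Struct", "Trend"])
sizes = groups.size()

for key, frame in groups:
    print(key, list(frame.index))
```

Unlike the `melt` version, `explode` keeps the original index on the duplicated rows, which makes it easy to trace each copy back to its source row.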
    Can you provide a sample of the final dataset you want?
    Aren't your desired groups 1 and 3 the same? And 2 and 4?
    @BrianJoseph No, you misread the text: ohlc vs. olhc.
    @Trenton_M, oh wow, that is confusing, thanks. It took me several reads to get it right too.