Python: Sankey plot has self-loops because labels are identical across multiple columns

Consider a DataFrame:

animal  animal2  size    count
dog     dog  small     2
dog     cat  large     3
cat     dog  small     1
dog     pig  large     5
cat     cat  large     3
pig     dog  small     9
pig     cat  large     2 
cat     pig  large     3

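I want to build a three-column Sankey diagram that shows the flows between categories (the width of each link in the Sankey is the number of times a pair of elements from adjacent columns occurs together).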
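To make the example reproducible, the table above can be built as a DataFrame and the link values computed for each pair of adjacent columns. This is only a minimal sketch: the column names are copied from the table header, and the helper name `pair_counts` is hypothetical.

```python
import pandas as pd

# The data from the table above, with column names taken from its header.
df = pd.DataFrame({
    "animal":  ["dog", "dog", "cat", "dog", "cat", "pig", "pig", "cat"],
    "animal2": ["dog", "cat", "dog", "pig", "cat", "dog", "cat", "pig"],
    "size":    ["small", "large", "small", "large", "large", "small", "large", "large"],
    "count":   [2, 3, 1, 5, 3, 9, 2, 3],
})

def pair_counts(frame, left, right, value="count"):
    """Sum `value` for every (left, right) combination: one row per Sankey link."""
    return frame.groupby([left, right], as_index=False)[value].sum()

first_links = pair_counts(df, "animal", "animal2")   # animal -> animal2
second_links = pair_counts(df, "animal2", "size")    # animal2 -> size
print(first_links)
print(second_links)
```

Each row of `first_links` and `second_links` corresponds to one link in the intended diagram.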

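The following code, which I found, seems to work, but it produces many self-loops because the same categories appear in more than one column: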

def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):
    # one color per category column; the palette holds 5 colors
    colorPalette = ['#4B8BBE','#306998','#FFE873','#FFD43B','#646464']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp =  list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp

    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))

    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum

    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()

    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))

    # creating the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(
            color = "black",
            width = 1.5   # was 0.5
          ),
          label = labelList,
          color = colorList
        ),
        link = dict(
          source = sourceTargetDf['sourceID'],
          target = sourceTargetDf['targetID'],
          value = sourceTargetDf['count']
        )
      )

    layout =  dict(
        title = title,
        font = dict(
          size = 20    # was 10
        )
    )

    fig = dict(data=[data], layout=layout)
    return fig

It can be run as follows:

import pandas as pd
import plotly
import chart_studio.plotly as py


fig = genSankey(df,cat_cols=['animal','animal2','size'],value_cols='count',title='Animal List')
plotly.offline.plot(fig, validate=False)

Is there something in this function that I can simply change to stop getting self-loops?
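The self-loops appear to come from the de-duplication step: once `labelList` holds each name only once, `dog` in the first column and `dog` in the second column resolve to the same node index, so a `dog` to `dog` row becomes a link from a node to itself. A small illustration of that lookup, using only the two animal columns from the table (the variable names here are made up for the example):

```python
first = ["dog", "dog", "cat", "dog", "cat", "pig", "pig", "cat"]    # column animal
second = ["dog", "cat", "dog", "pig", "cat", "dog", "cat", "pig"]   # column animal2

# Same effect as the de-duplication in genSankey: one node per distinct name.
labels = list(dict.fromkeys(first + second))

# Map every row to (source index, target index), as genSankey does with labelList.index.
links = [(labels.index(s), labels.index(t)) for s, t in zip(first, second)]

# A link whose source and target index coincide is a self-loop.
self_loops = [(labels[s], labels[t]) for s, t in links if s == t]
print(labels)      # ['dog', 'cat', 'pig']
print(self_loops)  # [('dog', 'dog'), ('cat', 'cat')]
```

The remaining links also share nodes between the two columns (for example `pig` to `dog` and `dog` to `pig`), so the diagram contains cycles in addition to the direct self-loops.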

Did you find a solution?

Only removing the elements that have the same name in two different columns (the sets represented in the plot). Unfortunately, nothing else.
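One alternative to deleting the shared names, offered as a sketch rather than a confirmed fix for the function above, is to give every (column, label) pair its own node, so that `dog` in `animal` and `dog` in `animal2` are distinct nodes that happen to display the same label. The helper name `build_sankey_links` and its return layout are assumptions made for illustration.

```python
import pandas as pd

def build_sankey_links(df, cat_cols, value_col):
    """Return (labels, source, target, value) with one node per (column, label) pair."""
    node_ids = {}   # (column name, label) -> node index
    labels = []     # display label for each node index
    for col in cat_cols:
        for name in pd.unique(df[col]):
            node_ids[(col, name)] = len(labels)
            labels.append(name)

    source, target, value = [], [], []
    # Walk over adjacent column pairs and aggregate the value for each link.
    for left, right in zip(cat_cols[:-1], cat_cols[1:]):
        grouped = df.groupby([left, right], as_index=False)[value_col].sum()
        totals = grouped[value_col].tolist()
        for src, dst, total in zip(grouped[left], grouped[right], totals):
            source.append(node_ids[(left, src)])
            target.append(node_ids[(right, dst)])
            value.append(total)
    return labels, source, target, value

df = pd.DataFrame({
    "animal":  ["dog", "dog", "cat", "dog", "cat", "pig", "pig", "cat"],
    "animal2": ["dog", "cat", "dog", "pig", "cat", "dog", "cat", "pig"],
    "size":    ["small", "large", "small", "large", "large", "small", "large", "large"],
    "count":   [2, 3, 1, 5, 3, 9, 2, 3],
})

labels, source, target, value = build_sankey_links(
    df, ["animal", "animal2", "size"], "count"
)
print(labels)
print(list(zip(source, target, value)))
```

The returned lists can be used as `label`, `source`, `target` and `value` in the `node` and `link` dictionaries that `genSankey` builds. Because nodes are numbered column by column, every link points from an earlier column to a later one, so no link can start and end on the same node. The color list would then need one entry per node rather than per distinct name.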