Python: Sankey plot has self-loops because the same labels appear in multiple columns
Consider a DataFrame:
animal animal2 size count
dog dog small 2
dog cat large 3
cat dog small 1
dog pig large 5
cat cat large 3
pig dog small 9
pig cat large 2
cat pig large 3
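I want to build a 3-column Sankey diagram that shows the flows between categories (each link in the Sankey is the number of times a pair of elements from two adjacent columns occurs together).
This code I found seems to work, but it produces a lot of self-loops, because the same categories appear in more than one column: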
def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram'):
    # the palette has 5 entries, so at most 5 category columns -> 5 colors
    colorPalette = ['#4B8BBE','#306998','#FFE873','#FFD43B','#646464']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp = list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp

    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))

    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum

    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()

    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))

    # creating the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
            pad = 15,
            thickness = 20,
            line = dict(
                color = "black",
                width = 1.5 # was 0.5
            ),
            label = labelList,
            color = colorList
        ),
        link = dict(
            source = sourceTargetDf['sourceID'],
            target = sourceTargetDf['targetID'],
            value = sourceTargetDf['count']
        )
    )

    layout = dict(
        title = title,
        font = dict(
            size = 20 # was 10
        )
    )

    fig = dict(data=[data], layout=layout)
    return fig
It can be run as follows:
import pandas as pd
import plotly
import chart_studio.plotly as py
fig = genSankey(df,cat_cols=['animal','animal2','size'],value_cols='count',title='Animal List')
plotly.offline.plot(fig, validate=False)
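Is there something I can simply change in this function to stop getting the self-loops?

Comments:
Did you find a solution?
Removing the elements that have the same name in two different columns (the sets represented in the plot). Unfortunately, nothing else.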
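The self-loops seem to come from the node lookup in `genSankey`: `labelList` is deduplicated and `labelList.index(x)` is keyed on the label string alone, so `dog` in the first column and `dog` in the second column resolve to the same node, and the `dog -> dog` row becomes a link from a node to itself. One possible workaround, as a rough sketch rather than a tested drop-in fix, is to key each node on its column position as well as its label. The helper name `sankey_links` and the plain-dict `records` input are assumptions made for this illustration; the rows are copied from the table in the question.

```python
from collections import defaultdict

def sankey_links(records, cat_cols, value_col):
    """Build Sankey labels and links with one node per (column position, label)."""
    node_ids = {}  # (column position, label) -> node index
    labels = []    # display label of each node; repeats across columns are kept
    for pos, col in enumerate(cat_cols):
        for rec in records:
            key = (pos, rec[col])
            if key not in node_ids:
                node_ids[key] = len(labels)
                labels.append(rec[col])

    totals = defaultdict(int)  # (source index, target index) -> summed value
    for pos in range(len(cat_cols) - 1):
        left, right = cat_cols[pos], cat_cols[pos + 1]
        for rec in records:
            src = node_ids[(pos, rec[left])]
            tgt = node_ids[(pos + 1, rec[right])]
            totals[(src, tgt)] += rec[value_col]

    sources = [s for s, _ in totals]
    targets = [t for _, t in totals]
    values = list(totals.values())
    return labels, sources, targets, values

# Rows from the question's table, as plain dicts
# (with pandas, df.to_dict('records') gives the same shape).
columns = ("animal", "animal2", "size", "count")
rows = [
    ("dog", "dog", "small", 2),
    ("dog", "cat", "large", 3),
    ("cat", "dog", "small", 1),
    ("dog", "pig", "large", 5),
    ("cat", "cat", "large", 3),
    ("pig", "dog", "small", 9),
    ("pig", "cat", "large", 2),
    ("cat", "pig", "large", 3),
]
records = [dict(zip(columns, row)) for row in rows]

labels, sources, targets, values = sankey_links(
    records, ["animal", "animal2", "size"], "count"
)
```

With this in place, `labels`, `sources`, `targets` and `values` could be used for `node.label`, `link.source`, `link.target` and `link.value` in the `data` dict instead of `labelList` and the `sourceTargetDf` columns. `colorList` would then need one entry per node in `labels`, and the sketch assumes the Sankey trace accepts repeated display labels, which I have not verified against every plotly version.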