Python: improving the readability of chained data operations

python pandas

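I tend to write every operation or transformation as a one-liner, and when I come back to it later I find it hard to read and understand (much like a long SQL statement). What are some ways to make chained operations more readable? Right now I am trying something like this: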

res = (
    # (1) we filter on new__status_group == 'UNKNOWN'
    df[df['new__status_group'] == 'UNKNOWN']

    # (2) we only care about these two columns
    [['new__status', 'file_name']]

    # (3) group by the new status
    .groupby('new__status')

    # (4) we want to get the count and value of file_name
    .agg({'file_name': 'first', 'new__status': 'size'})

    # (5) rename the dummy column we used to grab the count
    .rename(columns={'new__status': 'count'})

    # (6) sort the values by count desc
    .sort_values('count', ascending=False)

    # (7) now that we're all good, reset the index so it's like a normal data frame with all the fields
    .reset_index()

    # (8) limit to the top ten
    .head(10)

    # (9) and finally we want to pass it as a list of records (dict) for the end usage
    .to_dict('records')

)

Is this a good style, or is it too heavy and verbose? What other ways are there to improve the readability of chained pandas calls?

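Two improvements: first, use loc instead of chained slicing, so the row filter and the column selection happen in a single indexing step; second, agg accepts named aggregations, so you don't need the rename step.
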
res = (
    df.loc[df['new__status_group'] == 'UNKNOWN', ['new__status', 'file_name']]
    .groupby('new__status')
    .agg(file_name=('file_name','first'), count=('new__status', 'size'))
    .sort_values('count', ascending=False)
    .reset_index()
    .head(10)
    .to_dict('records')
)
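
To see what the shortened chain returns, here is a minimal sketch run on a tiny, made-up data frame. The sample rows and the output column name `first_file` are assumptions for illustration only, and this sketch counts rows through the `file_name` column rather than `new__status`, which is a choice made here and not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "new__status_group": ["UNKNOWN", "UNKNOWN", "UNKNOWN", "KNOWN"],
    "new__status": ["a", "b", "a", "c"],
    "file_name": ["f1.txt", "f2.txt", "f3.txt", "f4.txt"],
})

res = (
    # filter rows and pick columns in one .loc call
    df.loc[df["new__status_group"] == "UNKNOWN", ["new__status", "file_name"]]
    .groupby("new__status")
    # named aggregation: output_name=(input_column, function)
    .agg(first_file=("file_name", "first"), count=("file_name", "size"))
    .sort_values("count", ascending=False)
    .reset_index()
    .head(10)
    .to_dict("records")
)

print(res)
```

Because the output names are given directly in agg, each step of the chain says what it produces, and most of the per-step comments become unnecessary.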

Comments:

For example, your comments (3) and (6) add no new information.

@PaulH Agreed. I was more or less just adding a comment to every chained operation, useful or not, for demonstration purposes.

Cool. Can you explain why you use df.loc above rather than what I had?

@samuelbrody1249 In your case, where you are only filtering, your version works fine. But if you want to assign values, the chained slice df[df['x']==10]['value']=1000 will fail, whereas df.loc[df['x']==10, 'value']=1000 works.
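
The difference described in the last comment can be checked with a short sketch. The data frame below is made up for illustration; the column names `x` and `value` follow the comment. Warnings are silenced only so the sketch runs quietly, since pandas normally warns about the chained form.

```python
import warnings

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"x": [10, 20, 10], "value": [1, 2, 3]})

# Chained indexing: df[mask] builds a new object, so the assignment
# lands on that temporary copy and df itself is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about chained assignment here
    df[df["x"] == 10]["value"] = 1000
chained_result = df["value"].tolist()

# Single .loc call: rows and column are selected in one indexing step,
# so the assignment is applied to df directly.
df.loc[df["x"] == 10, "value"] = 1000
loc_result = df["value"].tolist()

print(chained_result)
print(loc_result)
```

In other words, "fail" here means the chained assignment silently leaves the original data frame untouched, while the loc form updates it.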