Python, pandas: two-part DataFrame question: (1) how to add columns to a list based on a condition, and (2) how to drop those columns from the df

Tags: python, excel, pandas, dataframe

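The real question is further down; this first part is background.

My end goal is a general-purpose Python script (a .py file) that opens an Excel file, works out how the data is organized, "cleans" the unusable data, and then runs a multiple linear regression. I'm stuck on the "cleaning" part: I have ideas, but I don't know how to code them.

This is what the data looks like (in the Excel file):

Here is my code so far (with comments):

So far, all I've done is determine the orientation of the data. Now I need to "clean" it.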

        ######this is the real work indent...##########################

        # count how many empty values are in each column
        # https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
        num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
        # sort the columns by how many empty values they have
        # https://data-flair.training/blogs/sort-pandas-dataframes-series-array/
        columns_pd_sorted = num_of_empty_cells_in_columns.sort_values(ascending=True)

Now that the columns are "sorted" (by how many empty cells each column has), I just need to take the first one as the "lowest value". That means it is either the Y value (for the multiple linear regression later) or a data field with as much data as the Y value.

        # find the lowest value (the first entry of the already sorted Series);
        # .iloc[0] selects by position, since the Series is indexed by column name
        lowest_value = columns_pd_sorted.iloc[0]

I also want to take the mean of the empty-cell counts across all columns. That mean will be used later (I think).
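As a minimal sketch of the step just described, the per-column counts can be averaged directly, and a boolean mask on the counts Series returns the matching column names through its index. The small frame below is a hypothetical stand-in for the Excel data (the real file is not shown), and `lowest_columns` and `above_mean_columns` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel data (the real file is not shown).
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31],
    "data_1": [1.0, 0.0, 1.0, 0.0],
    "data_2": [215.0, np.nan, 191.0, 32.0],
    "data_4": [np.nan, np.nan, np.nan, np.nan],
})

# Series indexed by column name: name -> number of empty cells.
num_of_empty_cells_in_columns = df.isnull().sum(axis=0)

lowest_value = num_of_empty_cells_in_columns.min()
mean_empty_cells = num_of_empty_cells_in_columns.mean()

# Filtering the Series keeps its index, so the column names come along.
lowest_columns = num_of_empty_cells_in_columns[
    num_of_empty_cells_in_columns == lowest_value
].index.tolist()
above_mean_columns = num_of_empty_cells_in_columns[
    num_of_empty_cells_in_columns > mean_empty_cells
].index.tolist()

print(lowest_columns)
print(above_mean_columns)
```

Because the counts are a Series keyed by column name, no explicit loop is needed to recover the names.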

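My end goals are:

  • Ask the user to verify (or select) the y variable
  • Eliminate columns in which most fields are empty
  • Eliminate rows that don't have complete data (across all columns that contain data)
  • **How I think I can solve it (this is the real question)**

    I think I can solve everything with the steps below, but I don't know how to code them. Ultimately, my problem is that I can't find a way to make pandas give me the column names along with the data.

  • List the columns whose number of empty cells equals lowest_value. I assume some kind of iteration? The code below is really a stab in the dark
  • Again, list all the columns with more empty cells than the mean (mean_empty_cells). I think I can do this by iterating
  • Set y_variable to the first "lowest value" column, but ask the user to confirm
  • Go back to the df and delete all the ...
  • Desired final data: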

    y value  data 1  data 2  data 3  data 5  data 6
    282      1       215     169     14      147
    148      0       250     307     232     134
    351      1       191     343     189     9
    31       0       32      327     8       201
    33       0       503     484     85      166
    
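    I'd like to run a multiple linear regression on the desired final data above.

    Any help is greatly appreciated, thank you.

    Here is one approach using the dropna() function. First, we have the initial data frame: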

    print(df)   # initial data frame
        y_value  data_1  data_2  data_3  data_4  data_5  data_6
    0       282       1   215.0   169.0     NaN    14.0   147.0
    1       148       0   250.0   307.0     NaN   232.0   134.0
    2       351       1   191.0   343.0     NaN   189.0     9.0
    3        31       0    32.0   327.0     NaN     8.0   201.0
    4        33       0   503.0   484.0     NaN    85.0   166.0
    5       973       0   651.0   134.0     NaN   128.0     NaN
    6       329       0   300.0   186.0     NaN   195.0     NaN
    7       271       1   543.0    18.0     NaN     NaN     NaN
    8       814       1   544.0   123.0     NaN     NaN     NaN
    9       274       1   349.0   209.0     NaN     NaN     NaN
    10      425       1     NaN     NaN     NaN     NaN     NaN
    
    Next, (a) drop every column whose values are all NaN, and (b) drop every row that contains any NaN:

    # un-comment the next line to transpose the data frame (e.g., based on user input / user confirmation)
    # df = df.transpose()
    
    # delete columns with all NaN
    df = df.dropna(axis=1, how='all')
    
    # delete rows with 1 or more NaN
    df = df.dropna(axis=0, how='any')
    
    print(df)
    
       y_value  data_1  data_2  data_3  data_5  data_6
    0      282       1   215.0   169.0    14.0   147.0
    1      148       0   250.0   307.0   232.0   134.0
    2      351       1   191.0   343.0   189.0     9.0
    3       31       0    32.0   327.0     8.0   201.0
    4       33       0   503.0   484.0    85.0   166.0
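The answer above only removes columns that are entirely NaN. For the question's goal of removing columns that are mostly empty, `dropna` also accepts a `thresh` argument (the minimum number of non-NaN values needed to keep a column or row). The sketch below is one possible way to use it; `clean_frame` and `min_fraction` are hypothetical names, and the 0.4 cutoff is an assumption chosen for this sample, not something the source specifies.

```python
import math

import numpy as np
import pandas as pd


def clean_frame(df, min_fraction=0.4):
    """Drop sparse columns, then drop rows that still contain NaN.

    min_fraction is an assumed cutoff: a column is kept only if at
    least this fraction of its cells is non-NaN.
    """
    min_non_nan = math.ceil(min_fraction * len(df))
    kept = df.dropna(axis=1, thresh=min_non_nan)
    return kept.dropna(axis=0, how="any")


nan = np.nan
# Same values as the initial data frame printed in the answer.
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33, 973, 329, 271, 814, 274, 425],
    "data_1": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "data_2": [215, 250, 191, 32, 503, 651, 300, 543, 544, 349, nan],
    "data_3": [169, 307, 343, 327, 484, 134, 186, 18, 123, 209, nan],
    "data_4": [nan] * 11,
    "data_5": [14, 232, 189, 8, 85, 128, 195, nan, nan, nan, nan],
    "data_6": [147, 134, 9, 201, 166, nan, nan, nan, nan, nan, nan],
})

cleaned = clean_frame(df)
print(cleaned)
```

The cutoff matters: with `min_fraction=0.5` this frame would also lose `data_6`, which has only 5 of 11 values present, so the result would no longer match the desired final data.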
    

            # collect the names of the columns whose empty-cell count equals lowest_value;
            # .items() yields (column name, count) pairs, so the names are available
            y_variable_candidates = []
            for col, n_empty in num_of_empty_cells_in_columns.items():
                if n_empty == lowest_value:
                    y_variable_candidates.append(col)

            # default to the first candidate
            y_variable = y_variable_candidates[0]

            y_variable_confirmation = input('currently your y variable is ' + str(y_variable) + '; it appears that there are several y variable candidates, such as ' + str(y_variable_candidates) + '; press enter if the current y variable is okay, otherwise enter a number to indicate which column should be the y variable: ')
            #... more code later on

            # collect the names of the columns with more empty cells than the mean
            mostly_empty_columns = []
            for col, n_empty in num_of_empty_cells_in_columns.items():
                if n_empty > mean_empty_cells:
                    mostly_empty_columns.append(col)

            #some code to get user to confirm to delete all the selected columns
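The two confirmation steps left as comments in the code above could be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: `resolve_y_choice` and `drop_columns_if_confirmed` are hypothetical helpers, and reading a typed number as a 1-based position in the candidate list is an assumed convention. Only `DataFrame.drop(columns=...)` is the actual pandas call for removing a list of columns.

```python
import pandas as pd


def resolve_y_choice(candidates, response):
    """Pick the y column from the user's reply.

    An empty reply keeps the first candidate; otherwise the reply is
    read as a 1-based position in `candidates` (an assumed convention).
    """
    response = response.strip()
    if response == "":
        return candidates[0]
    position = int(response)  # raises ValueError for non-numeric input
    if not 1 <= position <= len(candidates):
        raise ValueError(f"choose a number from 1 to {len(candidates)}")
    return candidates[position - 1]


def drop_columns_if_confirmed(df, columns, confirmed):
    """Return a new frame without `columns` if confirmed, else df unchanged."""
    if not confirmed:
        return df
    return df.drop(columns=columns)


# Hypothetical miniature frame for demonstration only.
df = pd.DataFrame({"y_value": [1, 2], "data_1": [0, 1], "data_4": [None, None]})

y_variable = resolve_y_choice(["y_value", "data_1"], "")
trimmed = drop_columns_if_confirmed(df, ["data_4"], confirmed=True)
print(y_variable, list(trimmed.columns))
```

In an interactive script the reply would come from `input(...)`, for example `resolve_y_choice(y_variable_candidates, y_variable_confirmation)`. Keeping the parsing in a plain function makes it testable without a keyboard.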
    