Python, pandas, df two-part question: 1. How to add columns to a list based on a specific condition; 2. How to delete those columns from the df

python excel pandas dataframe

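The real question is further down, but here is the background.

My end goal is to create a general-purpose Python file (a script, i.e. a .py file) that opens an Excel file, determines how the data is organized, "cleans" the unusable data, and then runs a multiple linear regression analysis. I'm stuck on the "cleaning" part; I have ideas, but I don't know how to carry them out.

This is what the data looks like (in the Excel file):

Here is my code so far (with comments).

So far, all I have done is determine the orientation of the data. Now I need to "clean" the data.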

        ######this is the real work indent...##########################

        # count how many empty values are in each column..............https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
        num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
        # sort the columns by how many empty values they have............https://data-flair.training/blogs/sort-pandas-dataframes-series-array/
        columns_pd_sorted = num_of_empty_cells_in_columns.sort_values(ascending=True)

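Now that I have the columns "sorted" (by how many empty cells each whole column has), I just need to pick the first one as the "lowest value". That column is either the Y value (for the later multiple linear regression analysis) or a data field that has as much data as the Y value.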

        # find the lowest value (the first entry of the already sorted Series);
        # .iloc[0] selects by position, since the index holds the column names
        lowest_value = columns_pd_sorted.iloc[0]

I also want to take the mean of the empty-cell counts across all columns. That average will be used later (I think).
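As a minimal sketch of the point above: the Series returned by `df.isnull().sum(axis=0)` already carries the column names in its index, so the name and the count can be read together. The toy frame and its column names (`y`, `a`, `b`) are placeholders invented for illustration, not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are placeholders.
df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
})

# Series of empty-cell counts; its index holds the column names.
empty_counts = df.isnull().sum(axis=0)
columns_sorted = empty_counts.sort_values(ascending=True)

lowest_name = columns_sorted.index[0]    # label of the least-empty column
lowest_value = columns_sorted.iloc[0]    # its count, selected by position
mean_empty_cells = empty_counts.mean()   # average empty cells per column

# .items() yields (column name, count) pairs.
for name, count in columns_sorted.items():
    print(name, count)
```

Iterating over the Series directly yields only the counts, which is probably why the column names seemed unreachable; `.items()` or `.index` gives access to the labels.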

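My end goals are:

Ask the user to verify (or choose) the y variable

Eliminate the columns in which most fields are empty

Eliminate the rows that do not have complete data (across all the columns that contain data)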

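**How I think I can solve it (this is the real question)**

I think I can solve everything with the following steps, but I don't know how to write the code. Ultimately, my problem is that I can't find a way to make pandas give me both the column name and the data.

List the columns that have the same number of empty cells as lowest_value. I assume some kind of iteration? The code below is really a stab in the dark.

Likewise, list all the columns that have more empty cells than the average (mean_empty_cells). I think I can do this by iterating.

Set y_variable to the first "lowest value" column, but ask the user to confirm.

Go back to the df and delete all the columns that ...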
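The steps above can be sketched without an explicit loop by filtering the Series of counts with a boolean mask. This is only one possible reading of the steps: the data is the sample frame shown in the answer, the user-confirmation step is left out so the snippet runs non-interactively, and the final `drop` plus `dropna` is an assumption about what the truncated last step intended.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data copied from the answer's initial data frame.
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33, 973, 329, 271, 814, 274, 425],
    "data_1": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    "data_2": [215, 250, 191, 32, 503, 651, 300, 543, 544, 349, nan],
    "data_3": [169, 307, 343, 327, 484, 134, 186, 18, 123, 209, nan],
    "data_4": [nan] * 11,
    "data_5": [14, 232, 189, 8, 85, 128, 195, nan, nan, nan, nan],
    "data_6": [147, 134, 9, 201, 166, nan, nan, nan, nan, nan, nan],
})

empty_counts = df.isnull().sum(axis=0)
lowest_value = empty_counts.min()
mean_empty_cells = empty_counts.mean()

# Column names whose empty-cell count equals the lowest count.
y_variable_candidates = empty_counts[empty_counts == lowest_value].index.tolist()
# Column names with more empty cells than the average.
mostly_empty_columns = empty_counts[empty_counts > mean_empty_cells].index.tolist()

y_variable = y_variable_candidates[0]  # would still need user confirmation

# Assumed final step: drop the flagged columns, then any incomplete rows.
cleaned = df.drop(columns=mostly_empty_columns).dropna(axis=0, how="any")
print(y_variable_candidates, mostly_empty_columns, cleaned.shape)
```

On this sample the above-average rule flags `data_4`, `data_5` and `data_6`, whereas the desired output keeps `data_5` and `data_6`, so the mean may be too aggressive a cutoff for this data.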

The desired final data:

y value data 1  data 2  data 3  data 5  data 6
282     1       215     169     14      147
148     0       250     307     232     134
351     1       191     343     189     9
31      0       32      327     8       201
33      0       503     484     85      166

I want to run a multiple linear regression analysis on the desired final data above.
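For orientation, here is a rough sketch of what the regression step could look like once the data is clean, using ordinary least squares via `numpy.linalg.lstsq`. The question does not say which library the asker plans to use, so the NumPy choice is an assumption. Only two predictors are used because five rows cannot determine an intercept plus five coefficients.

```python
import numpy as np
import pandas as pd

# The desired final data from the question.
df = pd.DataFrame({
    "y_value": [282, 148, 351, 31, 33],
    "data_1": [1, 0, 1, 0, 0],
    "data_2": [215, 250, 191, 32, 503],
    "data_3": [169, 307, 343, 327, 484],
    "data_5": [14, 232, 189, 8, 85],
    "data_6": [147, 134, 9, 201, 166],
})

# Illustrative subset of predictors (an assumption, see the note above).
predictors = ["data_1", "data_2"]

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
y = df["y_value"].to_numpy(dtype=float)

# Least-squares fit: coef[0] is the intercept, coef[1:] the slopes.
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
print(dict(zip(["intercept"] + predictors, coef)))
```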

Any help is much appreciated, thank you.

Here is one approach that uses the dropna() function. First, we have the initial data frame:

print(df)   # initial data frame
    y_value  data_1  data_2  data_3  data_4  data_5  data_6
0       282       1   215.0   169.0     NaN    14.0   147.0
1       148       0   250.0   307.0     NaN   232.0   134.0
2       351       1   191.0   343.0     NaN   189.0     9.0
3        31       0    32.0   327.0     NaN     8.0   201.0
4        33       0   503.0   484.0     NaN    85.0   166.0
5       973       0   651.0   134.0     NaN   128.0     NaN
6       329       0   300.0   186.0     NaN   195.0     NaN
7       271       1   543.0    18.0     NaN     NaN     NaN
8       814       1   544.0   123.0     NaN     NaN     NaN
9       274       1   349.0   209.0     NaN     NaN     NaN
10      425       1     NaN     NaN     NaN     NaN     NaN

Next, (a) drop a column if every value in it is NaN, and (b) drop a row if any value in it is NaN:

# un-comment the next line to transpose the data frame (e.g., based on user input / user confirmation)
# df = df.transpose()

# delete columns with all NaN
df = df.dropna(axis=1, how='all')

# delete rows with 1 or more NaN
df = df.dropna(axis=0, how='any')

print(df)

   y_value  data_1  data_2  data_3  data_5  data_6
0      282       1   215.0   169.0    14.0   147.0
1      148       0   250.0   307.0   232.0   134.0
2      351       1   191.0   343.0   189.0     9.0
3       31       0    32.0   327.0     8.0   201.0
4       33       0   503.0   484.0    85.0   166.0
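The `how='all'` call above only removes columns that are entirely NaN. If "mostly empty" columns should go too, as the question asks, one option is the `thresh` parameter of `dropna()`, which keeps a column only if it has at least that many non-NaN values. The sketch below uses a made-up toy frame, and the "at least half filled" cutoff is an assumption, not something stated in the question.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are placeholders.
df = pd.DataFrame({
    "target": [10.0, 20.0, 30.0, 40.0],
    "mostly_full": [1.0, np.nan, 3.0, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
})

# Assumed cutoff: keep columns that are at least half filled.
min_non_null = math.ceil(len(df) / 2)

# Drop columns with fewer than min_non_null non-NaN values.
kept = df.dropna(axis=1, thresh=min_non_null)

# Then drop any row that still contains a NaN.
cleaned = kept.dropna(axis=0, how="any")
print(cleaned)
```

Note that `thresh` counts filled cells, not empty ones, so a rule like "more than N empty cells" has to be converted to `len(df) - N`.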

        # the loops from the question, corrected: .items() yields (column name, count)
        # pairs, so the column names can be collected while comparing the counts
        y_variable_candidates = []
        for col, n_empty in num_of_empty_cells_in_columns.items():
           if n_empty == lowest_value:
              y_variable_candidates.append(col)

        # the first candidate is the default y variable
        y_variable = y_variable_candidates[0]

        y_variable_confirmation = input('currently your y variable is ' + str(y_variable) + '. it appears that there are many y variable candidates, such as ' + str(y_variable_candidates) + '. press enter if the current y variable is okay, otherwise press a number key to indicate which column should be the y variable')
        #... more code later on

        # average number of empty cells per column
        mean_empty_cells = num_of_empty_cells_in_columns.mean()
        mostly_empty_columns = []
        for col, n_empty in num_of_empty_cells_in_columns.items():
           if n_empty > mean_empty_cells:
              mostly_empty_columns.append(col)

        #some code to get user to confirm to delete all the selected columns
