Python: Problems using (>=, &, <=) when trying to operate on/filter a DataFrame created by a groupby operation


python pandas


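So I'm trying to remove outliers from my dataset. It's real estate data, so I used groupby to group by "Zona/Area" (it actually appears as "Zona" in the code) and computed the IQR from the prices in each "Zona/Area", but now when I try to use >=, & and <= to filter the data I run into a problem.

In case this is useful, I was able to achieve the goal of removing outliers with the following function: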

def remove_outliers(df, column):
    "This function takes in a dataframe and removes the outliers using the values of the specified column"
    #Use the describe() method to identify the statistics of interest
    describe = df[column].describe()

    #Extract quartiles (Q1, Q3) by label from the describe() output
    Q1 = describe['25%']
    Q3 = describe['75%']

    #Calculate IQR
    IQR = Q3-Q1

    #Define bounds
    lb = Q1-1.5*IQR
    ub = Q3+1.5*IQR
    print("(IQR = {})A point outside of the following range can be considered an outlier: ({},{})".format(IQR,lb,ub))

    calc_df = df[(df[column] < lb) | (df[column] > ub)]
    print("The number of outliers that will be removed out of {} observations are {}.".format(df[column].size,len(calc_df[column])))

    #remove the outliers from the dataframe
    no_outliers = df[~df[column].isin(calc_df[column])]
    return no_outliers


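You just need to pass it a DataFrame and specify the column to use as the basis for identifying and removing outliers. On my GitHub you can find a notebook with a quick tutorial: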
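As a usage sketch, a call might look like the following. The sample listings are made up, and the function body is a condensed stand-in for the one above (the same 1.5 * IQR rule, computed with quantile() instead of describe()), included only so the block runs on its own.

```python
import pandas as pd


def remove_outliers(df, column):
    """Condensed stand-in: keep rows whose value lies within 1.5 * IQR.

    Unlike the longer version, rows with a missing value in `column`
    are dropped here, because NaN fails both comparisons.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lb) & (df[column] <= ub)]


# Hypothetical listings; the column names follow the question
listings = pd.DataFrame({
    "Zona": ["Zona 10", "Zona 10", "Zona 14", "Zona 14", "Zona 15", "Zona 15"],
    "Precio USD": [150000, 160000, 170000, 180000, 190000, 5000000],
})

cleaned = remove_outliers(listings, "Precio USD")
print(len(listings), "rows before,", len(cleaned), "rows after")
```

Note that this applies a single set of bounds to the whole column, so it does not by itself account for price differences between zones.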

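There may be a problem with the parentheses. Can you try putting the parts after the comparison operators in parentheses?

I think you can iterate over the groups using for key, val in groupby(), rather than iterating over the keys and then accessing the values.

Forgot to link this useful post:

Is one of the values a string? If it is a number, use int(value).

Hi @AlexanderCécile, thank you very much for your reply. Is there a specific reason I shouldn't use ".values"? I'm using it because the .reshape method didn't work on the Q variables. I read in another post that I had to use .values, and it worked... Also, I've added my complete code to the post! Thanks again for your support.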
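Two of the suggestions in these comments, checking whether the price column contains strings and iterating over the groups directly, can be sketched roughly as follows. The data is hypothetical, and pd.to_numeric is used here as an assumption in place of the int(value) mentioned in the comment, since it converts a whole column at once.

```python
import pandas as pd

# Hypothetical data in which one price was read in as text
df = pd.DataFrame({
    "Zona": ["Zona 10", "Zona 10", "Zona 10", "Zona 14", "Zona 14", "Zona 14"],
    "Precio USD": [100000, "120000", 140000, 200000, 220000, 240000],
})

# A column mixing numbers and strings is stored with dtype 'object'
print(df["Precio USD"].dtype)

# Convert to numbers; values that cannot be parsed become NaN
df["Precio USD"] = pd.to_numeric(df["Precio USD"], errors="coerce")

# Iterate over (key, group) pairs instead of looking up each key with get_group()
bounds = {}
for zona, group in df.groupby("Zona"):
    q1 = group["Precio USD"].quantile(0.25)
    q3 = group["Precio USD"].quantile(0.75)
    iqr = q3 - q1
    bounds[zona] = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(bounds)
```

Each entry in bounds holds the lower and upper limit for one zone, which can then be compared against that zone's prices.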
Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)

#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)
grp.loc[filter]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-117-09dffe5671dd> in <module>
      9 
     10 #Let's filter the dataset based on the IQR * +- 1.5
---> 11 filter = (grp['Precio USD'] >= Q1 - 1.5 * IQR) & (grp['Precio USD'] <= Q3 + 1.5 *IQR)
     12 grp.loc[filter]

TypeError: '<=' not supported between instances of 'float' and 'str'
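The error likely comes from comparing the groupby object itself, rather than a column of values, against the reshaped quartile arrays: a groupby object is not an element-wise container of prices. One possible way around it, sketched here with made-up data and not taken from the answers in this thread, is to broadcast each group's quartiles back onto the original rows with transform() and then build an ordinary boolean mask on the DataFrame.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the ones in the question
df = pd.DataFrame({
    "Zona": ["Zona 10"] * 5 + ["Zona 14"] * 5,
    "Precio USD": [100, 110, 120, 130, 1000, 200, 210, 220, 230, 5],
})

prices = df.groupby("Zona")["Precio USD"]

# transform() returns a Series aligned with the original rows, so every row
# carries the quartiles of its own group
q1 = prices.transform(lambda s: s.quantile(0.25))
q3 = prices.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Element-wise comparison between Series that share the same index
mask = (df["Precio USD"] >= q1 - 1.5 * iqr) & (df["Precio USD"] <= q3 + 1.5 * iqr)
filtered = df.loc[mask]
print(filtered)
```

Because q1, q3 and iqr share the DataFrame's index, no .values or reshape step is needed, and the mask is applied with .loc on the DataFrame rather than on the groupby object.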

# Let's start by loading the dataset

# In[1]:


#Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
get_ipython().run_line_magic('matplotlib', 'inline')

# Read the CSV file into a DataFrame: df
gt_df = pd.read_csv('RE_Data_GT.csv')
gt_df.tail()


# Let's do some simple statistical analysis to understand the variables we have and their behaviour a little better.

# In[2]:


#Fill in NaNs in the "Banos" column with the mean of that column
gt_df['Banos'] = gt_df['Banos'].fillna(gt_df['Banos'].mean())
gt_df.info()


# In[3]:


gt_df.describe()


# From the table above we can see that a few of the columns have widely spread data (high standard deviation). This is not necessarily bad: knowing the dataset, we understand that it could be caused by the two types of listings ('Venta' and 'Alquiler'). It makes sense to see high variance if we look at rental and sale prices at the same time. 
# 
# Now let's move to one of the most exciting parts, which is some exploratory data analysis (EDA). But before we do that, I think that with the information above it would make sense to have two different dataframes, one for rentals and the other for home sales. 

# In[4]:


gt_alquiler = gt_df[gt_df['Tipo de listing'] == 'Alquiler']
gt_venta = gt_df[gt_df['Tipo de listing'] == 'Venta']
gt_alquiler.info()
gt_venta.info()


# Excellent, it seems like we have 2128 data points for 'Alquiler' (rental) and 3004 for 'Venta' (sales). Now that we have our 2 dataframes, we can actually start to do some EDA. We'll start by looking at home sales (Tipo de listing == 'Venta').

# In[5]:


_ = gt_venta['Precio USD'].plot.hist(title = 'Distribucion de Precios de Venta', colormap='Pastel2')
_ = plt.xlabel('Price in USD')


# In[6]:


#Declare a function to compute the ECDF of an array
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, len(data)+1) / n

    return x, y


# In[7]:


#Create Variable to pass to the ECDF function
gt_venta_precio = gt_venta['Precio USD']

#Compute the ECDF for the sale prices
x, y = ecdf(gt_venta_precio)

# Generate plot
_ = plt.plot(x, y, marker='.', linestyle='none')

# Add title and label the axes
_ = plt.title('ECDF de Precio en USD')
_ = plt.xlabel('Precio en USD')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()


# Apparently there are a few outliers that require our attention. To better understand these points, it's better if we group them by a 'zona' (zona/area) to see which listing has such a high price. 
# 
# Let's start to understand the specific outliers by grouping the listings by "Zona" and then using a box plot for each to review each in more detail.

# In[8]:


#Create a new dataframe with only "Precio USD" & "Zona"
gt_venta_precio_zona = gt_venta[['Precio USD','Zona']]
#Group by "Zona"
grp = gt_venta_precio_zona.groupby('Zona')
#Iterate through the groups to get the keys (titles) of each "Zona" and plot the results
for key in grp.groups.keys():
    grp.get_group(key).plot.box(title=key)


# In[14]:


Q1 = grp['Precio USD'].quantile(0.25)
Q3 = grp['Precio USD'].quantile(0.75)
IQR = Q3 - Q1
#Let's reshape the quartiles to be able to operate them
Q1 = Q1.values.reshape(-1,1)
Q3 = Q3.values.reshape(-1,1)
IQR = IQR.values.reshape(-1,1)
print(Q1.shape, Q3.shape, IQR.shape)

#Let's filter the dataset based on the IQR * +- 1.5
filter = (grp['Precio USD'] >= (Q1 - 1.5 * IQR)) & (grp['Precio USD'] <= (Q3 + 1.5 *IQR))
grp.loc[filter]
