Python: How to one-hot encode a feature with multiple values?

python pandas dataframe

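I have the following DataFrame df, whose route column holds the names of the cities travelled through for a flight with a given ticket_price. I want to extract the individual city names from route and one-hot encode them.

DataFrame (df)

Desired DataFrame (df_encoded)

Code

I have already done some preprocessing on the route column with the following code, but I can't figure out how to one-hot encode it: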

def location_preprocessing(text):

  """
  Function to Preprocess the features having location names.
  """

  text = text.replace(" ", "")    # Remove whitespaces
  text = text.split("|")          # Obtain individual cities

  lst_text = [x.lower() for x in text]    # Lowercase city names

  text = " ".join(lst_text)               # Convert to string from list

  return text

df['route'] = df['route'].apply(lambda x: location_preprocessing(x))
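To make the effect of this preprocessing concrete, here is a small sketch of what the function returns for one input. The sample string is hypothetical, and it assumes routes are pipe-separated, as the split("|") call suggests. Note that removing every space also glues multi-word city names together; the second helper, split_cities, is an illustrative alternative (not part of the original code) that strips only the whitespace around each name.

```python
def location_preprocessing(text):
    """Same logic as the function above: lowercase, space-joined city names."""
    cities = text.replace(" ", "").split("|")   # removes ALL spaces, then splits
    return " ".join(city.lower() for city in cities)


def split_cities(text, sep="|"):
    """Hypothetical alternative: keep multi-word names intact, return a list."""
    return [city.strip().lower() for city in text.split(sep)]


sample = "Mumbai | New Delhi | Pune"   # hypothetical input, not from the question
result = location_preprocessing(sample)
cities = split_cities(sample)

print(result)   # mumbai newdelhi pune
print(cities)   # ['mumbai', 'new delhi', 'pune']
```

Returning a list rather than a joined string tends to be more convenient for encoding, because each city stays a separate element.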

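If I apply one-hot encoding directly with the code below, every route is treated as a unique value and gets its own one-hot column, which is not what I want. I want each city to be one-hot encoded, not each route.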

df = pd.get_dummies(df, columns = ['route'])    # One-hot Encoding `route`
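The behaviour described here can be reproduced on a tiny frame. This is a minimal sketch using three illustrative routes with " - " as the separator (an assumption; the real data may use a different one). pd.get_dummies treats each whole route string as a single category, so it produces one column per distinct route rather than one per city.

```python
import pandas as pd

# Illustrative data; the separator " - " is an assumption.
df = pd.DataFrame(
    {
        "route": ["Mumbai - Pune - Bangalore", "Pune - Delhi", "Delhi - Pune"],
        "ticket_price": [10000, 7000, 6500],
    }
)

per_route = pd.get_dummies(df, columns=["route"])

# Each distinct route string becomes its own column, e.g. "route_Pune - Delhi".
route_columns = [c for c in per_route.columns if c.startswith("route_")]
print(route_columns)
print(len(route_columns))   # 3 distinct routes, so 3 columns (not 4 cities)
```

"Pune - Delhi" and "Delhi - Pune" end up in different columns even though they contain the same cities, which is why the route has to be split into cities before encoding.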

How can I get the desired DataFrame?

If you have this DataFrame:

   id                      route  ticket_price
0   1  Mumbai - Pune - Bangalore         10000
1   2               Pune - Delhi          7000
2   3               Delhi - Pune          6500

Then:

import pandas as pd

# Split each route string into a list of city names.
df.route = df.route.str.split(" - ")
df_out = pd.concat(
    [
        # One row per (id, city), then count occurrences of each city per id.
        df.explode("route")
        .pivot_table(index="id", columns="route", aggfunc="size", fill_value=0)
        .add_prefix("Route_"),
        # Re-attach the price, aligned on id.
        df.set_index("id").ticket_price,
    ],
    axis=1,
)
print(df_out)

Prints:

    Route_Bangalore  Route_Delhi  Route_Mumbai  Route_Pune  ticket_price
id                                                                      
1                 1            0             1           1         10000
2                 0            1             0           1          7000
3                 0            1             0           1          6500

This is correct, but the original DataFrame is much larger and contains several columns besides route. It would be great if you could show me how to do this in place, preferably without resetting the index (the train and validation data are combined, and resetting the index would make them hard to separate).
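One possible way to address the follow-up request (keep every other column, leave the index alone) is pandas' Series.str.get_dummies, which splits each string on a separator and returns one indicator column per token. The sketch below is an alternative to the pivot_table approach, not the answerer's method; the non-default index values are hypothetical and only there to show that the index survives.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "route": ["Mumbai - Pune - Bangalore", "Pune - Delhi", "Delhi - Pune"],
        "ticket_price": [10000, 7000, 6500],
    },
    index=[10, 20, 30],  # hypothetical non-default index, to show it is preserved
)

# Split each route on " - " and build one indicator column per city.
city_dummies = df["route"].str.get_dummies(sep=" - ").add_prefix("Route_")

# join aligns on the existing index, so no reset_index is needed and
# all other columns of df are carried over unchanged.
df_encoded = df.drop(columns="route").join(city_dummies)
print(df_encoded)
```

Because the indicator frame shares the index of df, the combined train and validation rows can still be separated afterwards by their original index labels. Unlike the size-based pivot, this yields a 0/1 indicator even if a city appears more than once in a route.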