Pandas: how to create multiple flag columns from list values in a dataframe column?
Tags: pandas, dataframe, hive, pyspark, data-manipulation
The table looks like this:
ID |CITY
----------------------------------
1 |London|Paris|Tokyo
2 |Tokyo|Barcelona|Mumbai|London
3 |Vienna|Paris|Seattle
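The CITY column contains roughly 1000+ distinct values, separated by |.
I want to create flag columns that indicate whether a person has visited each city of interest:
city_of_interest=['Paris','Seattle','Tokyo']
The real list has 20 such values.
The output should look like this: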
ID |Paris | Seattle | Tokyo
-------------------------------------------
1 |1 |0 |1
2 |0 |0 |1
3 |1 |1 |0
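The solution can be in either pandas or pyspark.
pandas solution: first convert the CITY column into lists of cities.
Then build the flags in one of two ways (Method 1 or Method 2). Either way, the output new_df is: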
ID Paris Seattle Tokyo
0 1 1 0 1
1 2 0 0 1
2 3 1 1 0
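As a hedged sketch of one way to reach this output in pandas: the version below assumes a split, explode, and cross-tabulate approach, which is an assumption and not necessarily the code this answer originally used. The sample frame mirrors the table in the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "CITY": [
        "London|Paris|Tokyo",
        "Tokyo|Barcelona|Mumbai|London",
        "Vienna|Paris|Seattle",
    ],
})
city_of_interest = ["Paris", "Seattle", "Tokyo"]

# Turn each '|'-separated string into a list, then give every city its own row
exploded = df.assign(CITY=df["CITY"].str.split("|")).explode("CITY")

# Count (ID, city) pairs, keep only the cities of interest, and cap counts at 1
flags = pd.crosstab(exploded["ID"], exploded["CITY"])
flags = flags.reindex(columns=city_of_interest, fill_value=0).clip(upper=1)

new_df = flags.reset_index()
new_df.columns.name = None  # drop the leftover "CITY" axis label
print(new_df)
```

Using reindex rather than plain column selection means a city of interest that never occurs in the data still gets an all-zero column instead of raising a KeyError.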
Use a UDF to check whether each city of interest is present in the delimited column:
from pyspark.sql.functions import udf, array, lit
from pyspark.sql.types import IntegerType
# Input list
city_of_interest = ['Paris', 'Seattle', 'Tokyo']
# UDF: returns 1 if city_name is one of the '|'-separated entries, else 0
def city_present(city_name, city_list):
    return len(set([city_name]) & set(city_list.split('|')))
city_present_udf = udf(city_present, IntegerType())
# Put the cities of interest in an array column so each one can be passed to the UDF
city_array = array(*[lit(city) for city in city_of_interest])
l = len(city_of_interest)
col_names = df.columns + [city for city in city_of_interest]
result = df.select(df.columns + [city_present_udf(city_array[i], df['CITY']) for i in range(l)])
# Rename the generated columns to the city names
result = result.toDF(*col_names)
result.show()
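To see what the UDF body computes without starting a Spark session, the same function can be called as plain Python. This is only an illustration of the per-row logic; the sample row is taken from ID 2 in the question.

```python
def city_present(city_name, city_list):
    """Return 1 if city_name is one of the '|'-separated entries in city_list, else 0."""
    return len({city_name} & set(city_list.split("|")))

# Row for ID 2 in the question
row = "Tokyo|Barcelona|Mumbai|London"
flags = {city: city_present(city, row) for city in ["Paris", "Seattle", "Tokyo"]}
print(flags)  # {'Paris': 0, 'Seattle': 0, 'Tokyo': 1}
```

Because the check is a set intersection on whole entries, it matches exact city names only, so a substring such as "Paris" would not match a longer entry that merely contains it.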
For pyspark, use split + array_contains:
For pandas, use str.get_dummies:
Could you include your attempt as well?
from pyspark.sql.functions import split, array_contains
# Split CITY on the literal '|' (escaped, since split takes a regex), then flag each city of interest
df.withColumn('cities', split('CITY', r'\|')) \
  .select('ID', *[array_contains('cities', c).astype('int').alias(c) for c in city_of_interest]) \
  .show()
+---+-----+-------+-----+
| ID|Paris|Seattle|Tokyo|
+---+-----+-------+-----+
| 1| 1| 0| 1|
| 2| 0| 0| 1|
| 3| 1| 1| 0|
+---+-----+-------+-----+
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
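One caveat with the two pandas lines above: selecting `[city_of_interest]` raises a KeyError if any city of interest never appears in the CITY column. A minimal self-contained sketch that tolerates this follows, assuming the sample data from the question; "Oslo" is a hypothetical extra city added only to show the missing-city case.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "CITY": [
        "London|Paris|Tokyo",
        "Tokyo|Barcelona|Mumbai|London",
        "Vienna|Paris|Seattle",
    ],
})
# "Oslo" is hypothetical: it does not occur in the data
city_of_interest = ["Paris", "Seattle", "Tokyo", "Oslo"]

# One 0/1 column per distinct city found in the '|'-separated strings
dummies = df["CITY"].str.get_dummies(sep="|")

# Keep only the cities of interest; absent ones become all-zero columns
flags = dummies.reindex(columns=city_of_interest, fill_value=0)

result = pd.concat([df[["ID"]], flags], axis=1)
print(result)
```

With around 1000 distinct cities, get_dummies builds a wide intermediate frame before the 20 columns of interest are selected, which may be worth keeping in mind for large inputs.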