Python 如何将数据帧的pyspark列中包含的unicode列表转换为浮点列表?
我已经创建了一个dataframe,如图所示Python 如何将数据帧的pyspark列中包含的unicode列表转换为浮点列表?,python,apache-spark,pyspark,apache-spark-sql,pyspark-sql,Python,Apache Spark,Pyspark,Apache Spark Sql,Pyspark Sql,我已经创建了一个dataframe,如图所示 import ast from pyspark.sql.functions import udf values = [(u'['2','4','713',10),(u'['12','245']',20),(u'['101','12']',30)] df = sqlContext.createDataFrame(values,['list','A']) df.show() +-----------------+---
import ast
from pyspark.sql.functions import udf
values = [(u'['2','4','713',10),(u'['12','245']',20),(u'['101','12']',30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|u'['2','4','713']| 10|
| u' ['12','245']| 20|
| u'['101','12',]| 30|
+-----------------+---+
**How can I convert the above dataframe such that each element in the list is a float and is within a proper list**
I tried the below one :
def df_amp_conversion(df_modelamp):
string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
df_modelamp = df_modelamp.withColumn('float_list',string_list_to_list(col("list")))
df2 = amp_conversion(df)
但数据保持不变,没有变化。
我不想将数据帧转换为pandas或使用collect,因为它是内存密集型的。
如果可能的话,试着给我一个最佳的解决方案。我使用pyspark,那是因为你忘记了类型
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
虽然这样做会更有效率:
from pyspark.sql.functions import rtrim, ltrim, split
df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# | list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
我可以在Python3中创建真正的结果,只需对函数df_amp_conversion的定义稍加修改。您没有返回df_modelamp的值!此代码适用于我:
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
def df_amp_conversion(df_modelamp):
string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
df_modelamp = df_modelamp.withColumn('float_list',string_list_to_list(col("list")))
return df_modelamp
df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# | list| A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# | ['12','245']| 20| [12, 245]|
# | ['101','12']| 30| [101, 12]|
# +---------------+---+-----------+