How to store a numpy.ndarray in a DataFrame column in Python

Tags: python, numpy, apache-spark, pyspark, spark-structured-streaming

In Structured Streaming, how can I use a UDF that returns a two-element numpy.ndarray to create two new columns?

This is what I have so far:
from pyspark.sql.types import StructType, StructField, LongType, ArrayType

schema = StructType([
    StructField("host_id", LongType()),
    StructField("fence_id", LongType()),
    StructField("policy_id", LongType()),
    StructField("timestamp", LongType()),
    StructField("distances", ArrayType(LongType()))
])
from pyspark.sql.functions import udf, col

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

ds.printSchema()

pa = PosAlgorithm()
get_distance_udf = udf(lambda y: pa.getLocation(y), ArrayType(LongType()))

dfnew = ds.withColumn("location", get_distance_udf(col("distances")))

query = dfnew \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()
The function pa.getLocation returns a numpy.ndarray, for example [42.15999863, 2.08498164]. I want to store these numbers in two new columns of the dataframe dfnew, called latitude and longitude.
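As background to the answer below: a numpy.ndarray cannot be returned from a Spark UDF directly, because Spark's serializers only understand native Python types, not numpy scalars. A minimal sketch (using a hard-coded array as a hypothetical stand-in for pa.getLocation's output) shows the conversion that is needed:

```python
import numpy as np

# Hypothetical stand-in for the value pa.getLocation() returns:
# a two-element ndarray of float64 values.
location = np.array([42.15999863, 2.08498164])

# Indexing an ndarray yields numpy scalar types (np.float64),
# which Spark cannot serialize as a UDF return value.
print(type(location[0]).__name__)

# tolist() converts the array to a plain Python list of native
# floats, which Spark can handle.
as_list = location.tolist()
print(type(as_list[0]).__name__)
print(as_list)
```

This is why the corrected UDF below converts the ndarray before returning it.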
Replace

get_distance_udf = udf(lambda y: pa.getLocation(y), ArrayType(LongType()))

with a UDF that declares a struct return type with latitude and longitude fields (note the coordinates are floats, not longs, and the ndarray must be converted to a plain Python list so Spark can serialize it):

from pyspark.sql.types import StructType, StructField, DoubleType

get_distance_udf = udf(
    lambda y: pa.getLocation(y).tolist(),
    StructType([
        StructField("latitude", DoubleType()),
        StructField("longitude", DoubleType())
    ])
)

Then, if needed, expand the result:
from pyspark.sql.functions import col
(ds
.withColumn("location", get_distance_udf(col("distances")))
.withColumn("latitude", col("location.latitude"))
.withColumn("longitude", col("location.longitude")))
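The field accesses col("location.latitude") and col("location.longitude") work because the UDF's declared StructType makes the location column a struct with named fields. In plain Python terms the behavior is analogous to a namedtuple, as this hypothetical illustration (no Spark required) sketches:

```python
from collections import namedtuple

# Hypothetical analogy: a StructType value is a row with named
# fields, much like a namedtuple in ordinary Python.
Location = namedtuple("Location", ["latitude", "longitude"])

def get_location(distances):
    # Stand-in for pa.getLocation(distances).tolist(); returns
    # fixed coordinates purely for illustration.
    return Location(42.15999863, 2.08498164)

loc = get_location([10, 20, 30])

# Field access by name, mirroring col("location.latitude") in Spark.
print(loc.latitude)
print(loc.longitude)
```

In Spark the same expansion can also be written in one step as ds.select("*", "location.*"), which splits every struct field into its own column.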