
Python: How to store a numpy.ndarray in a DataFrame column


In Structured Streaming, how can I create two new columns from a UDF that returns a two-element numpy.ndarray?

This is what I have so far:

from pyspark.sql.types import StructType, StructField, LongType, ArrayType
from pyspark.sql.functions import udf, col

schema = StructType([
    StructField("host_id", LongType()),
    StructField("fence_id", LongType()),
    StructField("policy_id", LongType()),
    StructField("timestamp", LongType()),
    StructField("distances", ArrayType(LongType()))
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

ds.printSchema()
pa = PosAlgorithm()
get_distance_udf = udf(lambda y: pa.getLocation(y), ArrayType(LongType()))
dfnew = ds.withColumn("location", get_distance_udf(col("distances")))

query = dfnew \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()
The function pa.getLocation returns a numpy.ndarray, e.g. [42.15999863, 2.08498164]. I want to store these numbers in two new columns of the dataframe dfnew, called latitude and longitude.

Replace

get_distance_udf = udf(lambda y: pa.getLocation(y), ArrayType(LongType()))

with a UDF whose return type is a struct with latitude and longitude fields, and then, if needed, expand the result:

from pyspark.sql.functions import col

(ds
    .withColumn("location", get_distance_udf(col("distances")))
    .withColumn("latitude", col("location.latitude"))
    .withColumn("longitude", col("location.longitude")))
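The replacement UDF itself is not preserved in this copy of the answer. For the location.latitude and location.longitude references above to resolve, the UDF must return a struct rather than an array. Below is a minimal, hypothetical sketch of the Python function such a UDF would wrap; get_location_stub is a stand-in for pa.getLocation, since PosAlgorithm is not shown in the question:

```python
import numpy as np

# Stand-in for pa.getLocation (PosAlgorithm is not shown in the
# question); it returns a two-element ndarray of [latitude, longitude].
def get_location_stub(distances):
    return np.array([42.15999863, 2.08498164])

def to_lat_lon(distances):
    # Spark cannot serialize numpy scalars, so convert each element
    # to a plain Python float before returning the pair.
    lat, lon = (float(v) for v in get_location_stub(distances))
    return lat, lon
```

In Spark this would then be registered along the lines of udf(to_lat_lon, StructType([StructField("latitude", DoubleType()), StructField("longitude", DoubleType())])), after which the struct fields can be expanded into columns as shown above.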