Pyspark: error in the haversine formula


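I am trying to implement a haversine distance calculator in pyspark. I am reusing Python code that I previously used for the same purpose. This is what I did:

1. Implemented the haversine_distance function as a UDF.
2. Used it on my DataFrame to compute the distance between two lat/long points.
3. When I run a spot check on the values, they look correct.
4. But when I try to add a where clause on dist_km, I get an error: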

File "<stdin>", line 13, in haversine_distance
TypeError: a float is required
Code:

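from math import radians, cos, sin, asin, sqrt, atan2, pi

# Needed for udf(...) and FloatType() used further down
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType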

def haversine_distance(lat1, lon1, lat2, lon2):

    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """

    deg2rad = pi/180.0

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = (lon2 - lon1) 
    dlat = (lat2 - lat1) 

    a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
    c = 2.0 * asin(sqrt(a))

    #c = 2.0 * atan2(sqrt(a), sqrt(1.0-a))

    r = 6372.8 # Radius of earth in kilometers. Use 3956 for miles #No rounding R = 3959.87433 (miles), 6372.8(km)

    return c * r

haversine_distance_udf = udf(haversine_distance, FloatType())

upd_join_final66_df = upd_join_final66_df.withColumn('dist_km', \
                haversine_distance_udf(upd_join_final66_df['LATITUDE'],upd_join_final66_df['LONGITUDE']\
                    ,upd_join_final66_df['delv_lat_upd'],upd_join_final66_df['delv_lng_upd'])\
                )

upd_join_final66_df.registerTempTable("fac66")
When I run the command below as a spot check, there is no error:

spark.sql("select delv_lat_upd, delv_lng_upd, LATITUDE, LONGITUDE, dist_km \
from fac66 \
").show()
When I try to filter the query on dist_km, I get an error:

An error occurred while calling o4882.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1041.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1041.0 (TID 81684, 10.0.0.15, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
    process()
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 220, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 138, in dump_stream
    for obj in iterator:
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 209, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<stdin>", line 13, in haversine_distance
TypeError: a float is required
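The problem is caused by null data in the coordinate columns, which breaks the formula. Guard against None before doing the math.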
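To see why null data produces this error, here is a minimal sketch in plain Python with no Spark involved. A null column value reaches the UDF as None, and math.radians rejects it with a TypeError (the exact message depends on the Python version). The helper name try_radians is made up for this illustration. The unfiltered show() probably succeeded because it only has to compute the first rows, which presumably contained no nulls, while the filtered query has to evaluate the UDF on more of the data.

```python
from math import radians


def try_radians(value):
    """Return radians(value), or the exception's type name if the call fails."""
    try:
        return radians(value)
    except TypeError as exc:
        return type(exc).__name__


print(try_radians(180.0))  # a plain float converts normally
print(try_radians(None))   # None is rejected with a TypeError
```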

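from math import radians, cos, sin, asin, sqrt, atan2, pi

def haversine_distance(lat1, lon1, lat2, lon2):
    # Null column values arrive as None; skip the calculation for them
    if lon1 is None or lat1 is None or lon2 is None or lat2 is None:
        return None
    else:
        deg2rad = pi/180.0
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = (lon2 - lon1)
        dlat = (lat2 - lat1)
        a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
        c = 2.0 * asin(sqrt(a))
        #c = 2.0 * atan2(sqrt(a), sqrt(1.0-a))
        r = 6372.8 # Radius of earth in kilometers. Use 3956 for miles #No rounding R = 3959.87433 (miles), 6372.8(km)
        return c * r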
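As a variation on the same idea, the null check can be pulled out into a small decorator so the formula itself stays free of None handling. This is only a sketch: none_if_any_null is a hypothetical helper name, not part of pyspark, and the sample coordinates are chosen purely for illustration.

```python
from functools import wraps
from math import asin, cos, radians, sin, sqrt


def none_if_any_null(func):
    """Hypothetical helper: return None without calling func if any argument is None."""
    @wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@none_if_any_null
def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    return 2.0 * asin(sqrt(a)) * 6372.8  # Earth radius in km, same value as in the question


# Sample coordinates for illustration only
print(haversine_distance(36.12, -86.67, 33.94, -118.40))  # a distance in km
print(haversine_distance(None, -86.67, 33.94, -118.40))   # None, no exception
```

The decorated function should be usable with udf(haversine_distance, FloatType()) in the same way as in the question; rows with a missing coordinate then get a null dist_km instead of failing the job.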

@eb, did this solve the problem?