PySpark approxSimilarityJoin() not returning any results


I am trying to find similar users by vectorizing user features in PySpark and sorting by the distance between the user vectors. I am running this in Databricks on a Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3).
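The post does not show how the scaledFeatures column is produced. As a minimal sketch of a typical preprocessing step, assuming hypothetical raw feature columns f1 through f4 and an input dataframe raw_df, one could use VectorAssembler followed by StandardScaler:

from pyspark.ml.feature import StandardScaler, VectorAssembler

# raw_df and the column names f1..f4 are hypothetical placeholders
assembler = VectorAssembler(inputCols=["f1", "f2", "f3", "f4"], outputCol="features")
assembled = assembler.transform(raw_df)

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=False, withStd=True)
scaled = scaler.fit(assembled).transform(assembled)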

Following the code in the linked example, I used the approxSimilarityJoin() method of the pyspark.ml.feature.BucketedRandomProjectionLSH model.

I have successfully found similar users with approxSimilarityJoin(), but occasionally I come across a user of interest for whom it apparently finds no similar users at all.

Usually when approxSimilarityJoin() returns nothing, I assume it is because the threshold parameter is set too low. That sometimes fixes the problem, but this time I have tried a threshold of 100000 and still get nothing back.

I define the model as

brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)

I am not sure whether changing bucketLength or numHashTables would help produce results (one way to check is the parameter sweep sketched after the example below).

The example below shows a pair of users for which approxSimilarityJoin() returned something (dataA and dataB), and a pair of users for which it returned nothing (dataC and dataD).

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col


dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]

dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])

brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
                                  numHashTables=3)
model = brp.fit(dfA)

# returns
# a threshold of 100000 is clearly overkill
# a dataframe with the dfA and dfB feature vectors and an EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()



dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]

dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])

brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
                                  numHashTables=3)
model = brp.fit(dfC)

# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
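Not part of the original post: a quick way to probe whether bucketLength is what suppresses the results is to sweep it and count the returned pairs, reusing dfC and dfD from above. The candidate values here are arbitrary:

for bl in [1.0, 2.0, 5.0, 15.0]:
    brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                      bucketLength=bl, numHashTables=3)
    model = brp.fit(dfC)
    # count() makes the outcome easy to compare across parameter values
    n = model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").count()
    print("bucketLength=%.1f -> %d pair(s)" % (bl, n))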

By increasing the bucketLength parameter value to 15, I was able to get results for the second half of the example above. Since the Euclidean distance there is roughly 34, the threshold could have been lowered.
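Concretely, a minimal sketch of that fix, reusing dfC and dfD from the question (the lowered threshold of 50 is an arbitrary value above the ~34 distance):

brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=15.0, numHashTables=3)
model = brp.fit(dfC)

# with bucketLength=15 the join now returns the dfC/dfD pair;
# any threshold above the ~34 Euclidean distance is enough
model.approxSimilarityJoin(dfC, dfD, 50, distCol="EuclideanDistance").show()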

According to the documentation:

bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate
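The same documentation also notes that the number of buckets will be (max L2 norm of input vectors) / bucketLength, and that if the input vectors are normalized, 1 to 10 times pow(numRecords, -1/dim) is a reasonable bucketLength. A quick sketch of that arithmetic, using the toy example's row count and vector dimension purely for illustration:

import math

num_records = 2   # dfC and dfD each contribute one row in the toy example
dim = 4           # length of the feature vectors
base = math.pow(num_records, -1.0 / dim)
print("suggested bucketLength between %.3f and %.3f" % (base, 10 * base))

Note that the vectors in the question are not normalized (one component is around 78), so their max L2 norm is large; that is consistent with needing a much larger bucketLength than the default to avoid false negatives.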