Why are some correlation values in PySpark outside the range [-1, 1]?


python-3.x pyspark


I have a DataFrame df_corr with a single column, where each row holds a list of prices:

+--------------------+
|             prices |
+--------------------+
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
|[101.5,101.0,99.3...|
+--------------------+

I want to find the correlation between each pair of price columns (for example, between [101.5, 101.5, 101.5, 101.5, …] and [101.0, 101.0, 101.0, …]).

To do this I use PySpark's Correlation function, but for some pairs I get values outside the range [-1, 1]. Here is my code:

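from pyspark.ml.stat import Correlation
# Pearson correlation matrix of the values in the "prices" column
pcorr_matrix = Correlation.corr(df_corr, "prices").head()
print(str(pcorr_matrix[0]))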

The output I get is:

DenseMatrix([[ 1.        ,  0.        , -0.5       , ...,         nan,
                      nan,  2.12132034],
             [ 0.        ,  1.        ,  1.5       , ...,         nan,
                      nan, -2.12132034],
             [-0.5       ,  1.5       ,  1.        , ...,         nan,
                      nan,  1.76776695],
             ..., 
             [        nan,         nan,         nan, ...,  1.        ,
                      nan,         nan],
             [        nan,         nan,         nan, ...,         nan,
               1.        ,         nan],
             [ 2.12132034, -2.12132034,  1.76776695, ...,         nan,
                      nan,  1.        ]])

Does anyone know why this happens?

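Edit: the PySpark documentation says the corr function is experimental.

I also computed the values by hand and found that some of them should be NaN but are not, so it looks like there is a bug in the library function.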
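
A hand check of this kind can be sketched in plain Python. This is only a minimal sketch, not the calculation actually used for the question: the helper names pearson and correlation_matrix and the sample rows are hypothetical, and it assumes each row is one observation and each list position is one variable. A position whose value never changes has zero variance, so its Pearson correlation with anything else is undefined and is returned as NaN here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns NaN when either sequence has zero variance, because the
    denominator of the Pearson formula is then zero.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        return float("nan")
    return cov / denom


def correlation_matrix(rows):
    """Correlate every pair of list positions across the rows.

    Note: in this sketch the diagonal entry of a constant position is
    also NaN, whereas the matrix printed in the question shows 1 there.
    """
    columns = list(zip(*rows))  # transpose: one tuple per list position
    return [[pearson(a, b) for b in columns] for a in columns]


if __name__ == "__main__":
    # Hypothetical sample data: position 0 is constant, so every
    # correlation that involves it is NaN.
    rows = [
        [101.5, 101.0, 99.3],
        [101.5, 102.0, 99.1],
        [101.5, 103.0, 98.7],
    ]
    for line in correlation_matrix(rows):
        print(line)
```

Comparing such a reference matrix with the DenseMatrix entry by entry shows which pairs disagree. Any finite value the reference computes stays within [-1, 1] up to floating-point rounding.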
