
Searching for an instance in a Pyspark dataframe with filter() takes too much time


I have a dataframe with N attributes (Atr1, Atr2, Atr3, ..., AtrN) and a single instance that has values for the first [1..N-1] of those attributes, but not the Nth.

I want to check whether the dataframe contains a row whose attributes [1..N-1] have the same values as the instance; if such a match exists, my goal is to retrieve that row from the dataframe with all attributes [1..N].

For example, if I have:

Instance:

[Row(Atr1=u'A', Atr2=u'B', Atr3=24)]

Dataframe:

+------+------+------+------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+------+------+------+------+
|  'C' |  'B' |  21  |  'H' |
+------+------+------+------+
|  'D' |  'B' |  21  |  'J' |
+------+------+------+------+
|  'E' |  'B' |  21  |  'K' |
+------+------+------+------+
|  'A' |  'B' |  24  |  'I' |
+------+------+------+------+
I would like to get the fourth row of the dataframe, together with its Atr4 value.

I tried it with the filter() method, like this:

df.filter("Atr1 = 'A' and Atr2 = 'B' and Atr3 = 24").take(1)
I got the result I wanted, but it took a lot of time.
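For reference, the same lookup can also be written with column expressions built from the instance Row instead of a hand-written SQL string (just a sketch, assuming a SparkSession and the dataframe df above; the predicate-building snippet is my own, not part of any Spark API):

from functools import reduce
from pyspark.sql import Row
from pyspark.sql.functions import col, lit

instance = Row(Atr1=u'A', Atr2=u'B', Atr3=24)

# Combine one equality condition per known attribute into a single predicate
predicate = reduce(lambda a, b: a & b,
                   [col(name) == lit(value) for name, value in instance.asDict().items()])

df.filter(predicate).take(1)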

So, my question is: is there any way to do the same thing in less time?


Thanks!

You can use locality-sensitive hashing (MinHashLSH) to find the nearest neighbour and check whether it is the same instance.

Since your data contains strings, it needs to be processed before applying LSH. We will use the feature module of pyspark.ml.

Start with StringIndexer and OneHotEncoder:

from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = spark.createDataFrame(
    [('C', 'B', 21, 'H'), ('D', 'B', 21, 'J'), ('E', 'c', 21, 'K'), ('A', 'B', 24, 'J')],
    ["attr1", "attr2", "attr3", "attr4"])

# Index each string column and one-hot encode it into a sparse vector
for col_ in ["attr1", "attr2", "attr4"]:
    stringIndexer = StringIndexer(inputCol=col_, outputCol=col_ + "_")
    model = stringIndexer.fit(df)
    df = model.transform(df)
    encoder = OneHotEncoder(inputCol=col_ + "_", outputCol="features_" + col_, dropLast=False)
    df = encoder.transform(df)

# Keep only the numeric column and the encoded feature vectors
df = df.drop("attr1", "attr2", "attr4", "attr1_", "attr2_", "attr4_")
df.show()


+-----+--------------+--------------+--------------+
|attr3|features_attr1|features_attr2|features_attr4|
+-----+--------------+--------------+--------------+
|   21| (4,[2],[1.0])| (2,[0],[1.0])| (3,[1],[1.0])|
|   21| (4,[0],[1.0])| (2,[0],[1.0])| (3,[0],[1.0])|
|   21| (4,[3],[1.0])| (2,[1],[1.0])| (3,[2],[1.0])|
|   24| (4,[1],[1.0])| (2,[0],[1.0])| (3,[0],[1.0])|
+-----+--------------+--------------+--------------+
Add an id and assemble all the feature vectors:

from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import monotonically_increasing_id

# Add a unique row id so matches can be traced back to their rows
df = df.withColumn("id", monotonically_increasing_id())
df.show()

# Combine attr3 and the encoded columns into a single feature vector
assembler = VectorAssembler(inputCols=["features_attr1", "features_attr2", "features_attr4", "attr3"],
                            outputCol="features")
df_ = assembler.transform(df)
df_ = df_.select("id", "features")
df_.show()


+----------+--------------------+
|        id|            features|
+----------+--------------------+
|         0|(10,[2,4,7,9],[1....|
|         1|(10,[0,4,6,9],[1....|
|8589934592|(10,[3,5,8,9],[1....|
|8589934593|(10,[1,4,6,9],[1....|
+----------+--------------------+
Create the MinHashLSH model and search for the nearest neighbour:

from pyspark.ml.feature import MinHashLSH

mh = MinHashLSH(inputCol="features", outputCol="hashes", seed=12345)
model = mh.fit(df_)
model.transform(df_)

# Use the first row's feature vector as the search key
key = df_.select("features").collect()[0]["features"]
model.approxNearestNeighbors(df_, key, 1).collect()

Output:

[Row(id=0, features=SparseVector(10, {2: 1.0, 4: 1.0, 7: 1.0, 9: 21.0}), hashes=[DenseVector([-1272095496.0])], distCol=0.0)]
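If you also need the original attributes (such as attr4) of the match, one option is to keep a copy of the dataframe that still has the original columns plus the same id, and join the nearest-neighbour result back to it by id. A minimal sketch; df_with_id here is a hypothetical dataframe built by adding monotonically_increasing_id() before the original columns are dropped:

# Hypothetical: df_with_id still holds attr1..attr4 and the same `id` values
neighbors = model.approxNearestNeighbors(df_, key, 1).select("id", "distCol")
neighbors.join(df_with_id, on="id").show()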

A bit more information here would help. In particular: how long does it take, how long do you want it to take, and how big is the cluster/hardware running this? In general, no matter how big the cluster is, there is some overhead on every Spark action, because Spark has to distribute the data and code across the cluster and then collect the results. Doing something this simple in pyspark will never be as fast as doing the same simple thing in plain Python on a local machine.
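As a rough way to answer the "how long does it take" question, the action itself can be timed directly (a sketch, assuming the dataframe df from the question):

import time

# Time the filter + take action end to end, including Spark's scheduling overhead
start = time.time()
result = df.filter("Atr1 = 'A' and Atr2 = 'B' and Atr3 = 24").take(1)
print(result)
print("elapsed seconds:", time.time() - start)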