
Pyspark (Databricks) performance issue with an NLP task


I'm running into a performance problem with an NLP task in Pyspark on Databricks:

Context:
I have two pyspark dataframes, each with an 'ID' column and a 'TEXT' column, for example:

Table A              |   Table B
ID_A  TEXT_A         |   ID_B    TEXT_B
0     text_A0        |   0       text_B0
1     text_A1        |   1       text_B1
2     text_A2        |   2       text_B2
To find similar texts, I want to measure the cosine similarity between every A record and every B record (something like a Cartesian product of similarities), so I'm using a Word2Vec model.
The first step is to train the Word2Vec model from the ml lib, as shown below:

from pyspark.ml.feature import Word2Vec

# Train a Word2Vec model: 5-dimensional vectors, context window of 2 words
word2vec = Word2Vec(vectorSize=5, windowSize=2, inputCol='TEXT', outputCol='COORDINATES')
model_Word2Vec = word2vec.fit(Data_train)
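
Note that Spark ML's Word2Vec expects its input column to contain arrays of words rather than raw strings, so TEXT is assumed to hold tokenized text already. If it did not, a preprocessing step along these lines (hypothetical, writing to a new TOKENS column) would be needed first:

from pyspark.ml.feature import Tokenizer

# Hypothetical tokenization step, only needed if TEXT is a raw string column;
# Word2Vec's inputCol would then point at 'TOKENS' instead of 'TEXT'.
tokenizer = Tokenizer(inputCol='TEXT', outputCol='TOKENS')
Data_train_tokens = tokenizer.transform(Data_train)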
Then I apply the model to data A and data B (model_Word2Vec.transform(A) and model_Word2Vec.transform(B)) to obtain the coordinates of each text (the average of its word vectors). For example:

display(model_Word2Vec.transform(A))

ID_A    TEXT_A    COORDINATES_A     
0       text A0   [1, 5, [], [0.05, 0.1, -1.5, 0.2, -0.7]]      
1       text A1   [1, 5, [], [0.15, -1.1, 0.5, 0.27, -0.1]]     
2       text A2   [1, 5, [], [1.05, 1.2, -0.55, 0.2, -1.7]]             
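
For reference, a minimal sketch of applying the fitted model to both tables (the TEXT_A/TEXT_B to TEXT renaming and the A_vec/B_vec names are assumptions, since the exact column handling is not shown above):

# Sketch (assumed column handling): Word2Vec reads from its inputCol 'TEXT',
# so each table's text column is renamed to 'TEXT' before the transform, and
# the output column is renamed so the two results can be told apart later.
A_vec = (model_Word2Vec.transform(A.withColumnRenamed('TEXT_A', 'TEXT'))
         .withColumnRenamed('COORDINATES', 'COORDINATES_A'))
B_vec = (model_Word2Vec.transform(B.withColumnRenamed('TEXT_B', 'TEXT'))
         .withColumnRenamed('COORDINATES', 'COORDINATES_B'))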
I should mention that the dataframes are distributed and immutable.

Example of cosine similarity in pyspark:

from pyspark.ml.linalg import Vectors

X = Vectors.dense([0.05, 0.1, -1.5, 0.2, -0.7]); Y = Vectors.dense([1.0, 0.003, 2.12, 0.22, 1.3])
cos_xy = X.dot(Y) / (X.norm(2) * Y.norm(2))   # dot product over the product of L2 norms
Problem:
I would like to obtain something like this:

Crossjoin
ID_A  COORDINATES_A                 ID_B COORDINATES_B                 Cosine
0     [0.05, 0.1, -1.5, 0.2, -0.7]  0    [1.0, 0.003, 2.12, 0.22, 1.3]  -0.89
0     [0.05, 0.1, -1.5, 0.2, -0.7]  1    [0.13, 1.1, 0.5,1.27, 1.99]    -0.4
1     [0.15, -1.1, 0.5, 0.27, -0.1] 0    [1.0, 0.003, 2.12, 0.22, 1.3]  -0.34
1     [0.15, -1.1, 0.5, 0.27, -0.1] 1    [0.13, 1.1, 0.5,1.27, 1.99]    -0.24
2     [1.05, 1.2, -0.55, 0.2, -1.7] 0    [1.0, 0.003, 2.12, 0.22, 1.3]  -0.35
2     [1.05, 1.2, -0.55, 0.2, -1.7] 1    [0.13, 1.1, 0.5,1.27, 1.99]    -0.31
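
In other words, the computation I'm describing is a cross join of the two coordinate tables with a cosine column. A fully distributed sketch of that computation (the cosine_similarity UDF and the A_vec/B_vec names are assumptions, not code from the notebook) would be:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def cosine_similarity(x, y):
    # x and y arrive as ml DenseVectors (the Word2Vec output columns)
    denom = float(x.norm(2) * y.norm(2))
    return float(x.dot(y)) / denom if denom != 0.0 else 0.0

crossed = (A_vec.select('ID_A', 'COORDINATES_A')
           .crossJoin(B_vec.select('ID_B', 'COORDINATES_B'))
           .withColumn('Cosine', cosine_similarity('COORDINATES_A', 'COORDINATES_B')))

A sketch like this keeps the |A| x |B| pairwise work on the executors rather than the driver, unlike the two approaches below.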
However, I'm facing performance problems; here is what I got:

1° First approach:

I think the two loops cause the following error (a hypothetical reconstruction of the loop is sketched after the error message):

(15) Spark Jobs
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
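
The first-approach code was posted as an image and is not reproduced here; a hypothetical reconstruction of a driver-side double loop of this kind (not the original code) shows why it would overload the driver:

# Hypothetical reconstruction, NOT the original notebook code.
# collect() pulls every row onto the driver, and the nested loop then performs
# all |A| x |B| cosine computations in a single Python process on the driver.
results = []
for row_a in A_vec.collect():
    for row_b in B_vec.collect():   # B is re-collected for every row of A
        x, y = row_a['COORDINATES_A'], row_b['COORDINATES_B']
        results.append((row_a['ID_A'], row_b['ID_B'],
                        float(x.dot(y) / (x.norm(2) * y.norm(2)))))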
2° Second approach:

But I got the following error:

(1) Spark Jobs
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 (TID 54, 10.139.64.6, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
--------------------------------------------------------------------------- 
Py4JJavaError                             Traceback (most recent call last)
<command-2001347748501300> in <module>()
----> 1 columna_que_como_c = [(i[0].values) for i in cross_que.select(col("result_que_como_c")).collect()]

/databricks/spark/python/pyspark/sql/dataframe.py in collect(self)
    546         # Default path used in OSS Spark / for non-DF-ACL clusters:
    547         with SCCallSiteSync(self._sc) as css:
--> 548             sock_info = self._jdf.collectToPython()
    549         return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
    550 

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):

I would really appreciate it if someone could help me with this.
