
Connection refused error on PySpark in standalone mode


This Spark script prints out the top 10 words in a txt file:

from pyspark import SparkContext
from pyspark import SparkConf

conf = SparkConf().setAppName("read text file in pyspark")
sc = SparkContext(conf=conf)

path_to_file = "C:/Users/Admin/Desktop/beeline_project/lorem2.txt"
path_to_save_result = "C:/Users/Admin/Desktop/beeline_project/output/"

# Split every line of the input file into individual words
words = sc.textFile(path_to_file).flatMap(lambda line: line.split(" "))

# Count occurrences per word, then keep the 10 most frequent
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
output = wordCounts.collect()
output = sorted(output, key=lambda tup: tup[1], reverse=True)
output = output[:10]
for (word, count) in output:
    print("%s: %i" % (word, count))

sc.stop()
But it gives the error below, and I just want the program to stop cleanly. At the end it says "No connection could be made - the target machine actively refused it", but I am not opening a connection anywhere.

Could it be caused by running PySpark without a main()? Does that affect the output? (See the sketch after the error log below.)

SUCCESS: The process with PID 13904 (child process of PID 6048) has been terminated.
SUCCESS: The process with PID 6048 (child process of PID 13948) has been terminated.
SUCCESS: The process with PID 13948 (child process of PID 8892) has been terminated.
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "C:\python37\lib\site-packages\py4j\java_gateway.py", line 1152, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "C:\python37\lib\socket.py", line 589, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\python37\lib\site-packages\py4j\java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "C:\python37\lib\site-packages\py4j\java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:62315)
Traceback (most recent call last):
  File "C:\python37\lib\site-packages\py4j\java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\python37\lib\site-packages\py4j\java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:62315)
Traceback (most recent call last):
  File "C:\python37\lib\site-packages\py4j\java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\python37\lib\site-packages\py4j\java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
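
On the main() question: PySpark scripts on Windows are commonly wrapped in an if __name__ == "__main__" guard, and stopping the context in a finally block ensures the Py4J gateway is shut down even when the job fails partway. Below is a minimal sketch of that structure; whether it cures this particular shutdown error is an assumption, not something the thread confirms.

from pyspark import SparkConf, SparkContext

def main():
    conf = SparkConf().setAppName("read text file in pyspark")
    sc = SparkContext(conf=conf)
    try:
        # ... the word-count logic from above goes here ...
        pass
    finally:
        # Always stop the context so the JVM gateway shuts down cleanly,
        # even if the job raised an exception
        sc.stop()

if __name__ == "__main__":
    main()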
Edit:

With Spark 2.4.6, Scala 2.11.12, Java 8 and Python 3.7, I tried the following:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path_to_file = "C:/Users/Admin/Desktop/beeline_project/lorem2.txt"
path_to_save_result = "C:/Users/Admin/Desktop/beeline_project/output/"

df = spark.read.text(path_to_file)
words = df.rdd.flatMap(lambda row: row[0].split(" ")).collect()
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
output = wordCounts.collect()
output = sorted(output, key=lambda tup: tup[1], reverse=True)
output = output[:10]
for (word, count) in output:
    print("%s: %i" % (word, count))

Note that this approach only uses Spark to read the file and turn it into a list of strings. Collecting all of the data onto the driver node is slow, and it will error out if the dataset is large. This code may work, but it is definitely not the "Spark way" of doing the analysis. You should consider performing this analysis with Spark DataFrames / native Spark functions instead, so it can run in parallel across the nodes of a cluster and take advantage of the Spark engine; see the sketch below.
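
For reference, here is a minimal sketch of the DataFrame-native approach the answer has in mind, using built-in Spark SQL functions so the aggregation runs on the executors rather than the driver. The column name "value" is what spark.read.text produces; the variable names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, desc

spark = SparkSession.builder.getOrCreate()

# Each line of the text file becomes a row with a single "value" column
df = spark.read.text("C:/Users/Admin/Desktop/beeline_project/lorem2.txt")

# Split lines into words, count per word, keep the 10 most frequent
top10 = (df.select(explode(split(df.value, " ")).alias("word"))
           .groupBy("word")
           .count()
           .orderBy(desc("count"))
           .limit(10))

top10.show()

Only the final show() pulls results back to the driver, so the heavy lifting stays distributed.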

Can you add your Spark version to the question? The best answer depends on your Spark version.

@Powers, thanks for your help! I should add those to every question... Spark 2.4.6, Python 3.7, Scala 2.11.12, Java 8, Win 10 64-bit.

Thx, but it doesn't work; .collect() is the problem. It says there is no such function: spark.read.text returns RDD-type data. I tried running flatMap without collect - still no luck.
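
As the last comment hints, the mid-chain .collect() in the edited snippet turns the RDD into a plain Python list, which has no .map method, hence the "no such function" error. A sketch of the same chain with the collect deferred to the very end follows; this fixes that specific error, though the thread does not confirm it resolved the asker's setup.

df = spark.read.text(path_to_file)

# Keep everything as an RDD; collect only once, after the aggregation
words = df.rdd.flatMap(lambda row: row[0].split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

output = sorted(wordCounts.collect(), key=lambda tup: tup[1], reverse=True)[:10]
for (word, count) in output:
    print("%s: %i" % (word, count))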