Python SparkContext parallelize splits at the wrong place

I downloaded a file and am now trying to write it to HDFS as a DataFrame:

import requests
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('Write Data').setMaster('local')
sc = SparkContext(conf=conf)

file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')

data = sc.parallelize(file)
When I print the contents of the file, I see the following output:

print(file.text)
":@computed_region_cbhk_fwbd",":@computed_region_nnqa_25f4","fall","geolocation","geolocation_address","geolocation_city","geolocation_state","geolocation_zip","id","mass","name","nametype","recclass","reclat","reclong","year"
,,"Fell","POINT (6.08333 50.775)",,,,,"1","21","Aachen","Valid","L5","50.775000","6.083330","1880-01-01T00:00:00.000"
,,"Fell","POINT (10.23333 56.18333)",,,,,"2","720","Aarhus","Valid","H6","56.183330","10.233330","1951-01-01T00:00:00.000"
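That is exactly what I want to see. Now I am trying to read the first line from the RDD I create with
data = sc.parallelize(file)

Why don't I get the first line, as I expected from my first print? It stops somewhere partway through, and I don't see the rest of the header fields.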

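It doesn't work because Response.__iter__ is not format-aware. It just iterates over the body in fixed-size chunks of bytes, so the pieces do not line up with the lines of the file.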
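To make the difference concrete, here is a minimal sketch that needs neither Spark nor a network connection. The `iter_chunks` helper, the sample `body`, and the chunk size of 16 are all made up for illustration; the helper only mimics what iterating a response in fixed-size chunks looks like and is not the code requests itself runs.

```python
# Illustration only: mimic iterating a response body in fixed-size chunks.
# `body` is a shortened, made-up stand-in for the downloaded CSV.
body = (
    b'"fall","id","mass","name"\n'
    b'"Fell","1","21","Aachen"\n'
    b'"Fell","2","720","Aarhus"\n'
)


def iter_chunks(data, chunk_size):
    """Yield consecutive slices of `data`, ignoring line boundaries."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# A small chunk size (arbitrary) makes the effect easy to see.
chunks = list(iter_chunks(body, chunk_size=16))

# Splitting on line breaks gives one element per CSV line instead.
lines = body.decode("utf-8").splitlines()

print(chunks[0])  # stops partway through the header
print(lines[0])   # the complete header line
```

Passing something like `chunks` to `sc.parallelize` would reproduce the symptom from the question: the first element ends in the middle of the header. Passing `lines` gives one RDD element per line.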

If you really need to read the data this way, use text.splitlines:

sc.parallelize(file.text.splitlines())
Or better:

import csv
import io

sc.parallelize(csv.reader(io.StringIO(file.text)))
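As a rough sketch of why the csv variant is nicer, the snippet below parses a few sample lines without Spark. The `text` value is a shortened, made-up stand-in for `file.text`, and the names `rows`, `header`, `records`, and `as_dicts` are chosen here for illustration only.

```python
import csv
import io

# Shortened stand-in for file.text (assumption: same quoting style as the real file).
text = (
    '"fall","geolocation","id","name"\n'
    '"Fell","POINT (6.08333 50.775)","1","Aachen"\n'
    '"Fell","POINT (10.23333 56.18333)","2","Aarhus"\n'
)

# csv.reader yields one list of fields per line, with the quotes removed.
rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

# Pair every record with the header to get named fields.
as_dicts = [dict(zip(header, record)) for record in records]

print(header)
print(as_dicts[0]["name"])
```

With `splitlines` each RDD element would be a raw string that still has to be split and unquoted. With `csv.reader` each element is already a list of field values.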

The answer is simple. To parallelize a Python object, you need to give Spark a list. In this case, you are giving it a Response:

>>> file = requests.get('https://data.nasa.gov/resource/y77d-th95.csv')
>>> file
<Response [200]>
When the file lives on a file system such as HDFS, Hadoop does the splitting for you, and the splits are read in such a way that records break on newlines.
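The idea of reading byte-range splits so that records break on newlines can be sketched in plain Python. This is a simplified illustration, not Hadoop's actual implementation: the `read_split` function, the sample `data`, and the split size of 32 bytes are assumptions made for the demo. The rule used here is that a split owns every line that begins inside its byte range, even if that line ends beyond the range.

```python
def read_split(data, start, end):
    """Return the lines that begin inside the byte range [start, end)."""
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # The range starts mid-line: that line belongs to the previous
        # split, so skip ahead to the start of the next line.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        # May read past `end` to finish a line that started inside the range.
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


# Made-up sample data in the same style as the file from the question.
data = (
    b'"fall","id","mass","name"\n'
    b'"Fell","1","21","Aachen"\n'
    b'"Fell","2","720","Aarhus"\n'
)

split_size = 32  # arbitrary; real block sizes are far larger
splits = [
    read_split(data, start, start + split_size)
    for start in range(0, len(data), split_size)
]

for index, part in enumerate(splits):
    print(index, part)
```

Although the byte ranges cut through the middle of lines, every line comes out whole and exactly once, which is the behaviour you get when Spark reads a text file from HDFS instead of a hand-built Python object.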

Hope this helps.


Cheers, Fokko