NumPy: how to load a DataFrame from a Python requests stream that downloads a CSV file?
I want to create a DataFrame from a CSV file that I will retrieve via streaming:
import requests

url = "https://{0}:8443/gateway/default/webhdfs/v1/{1}?op=OPEN".format(host, filepath)
r = requests.get(url,
                 auth=(username, password),
                 verify=False,
                 allow_redirects=True,
                 stream=True)
chunk_size = 1024
for chunk in r.iter_content(chunk_size):
    # how to load the data?
How can I load the data from the HTTP stream into Spark?
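If the file fits in memory, one option (a sketch of my own, not from the answers below) is to accumulate the streamed chunks into a buffer and parse it with the standard csv module before handing the rows to Spark. The generator here stands in for `r.iter_content(chunk_size)`:

```python
import csv
import io

def load_csv_from_chunks(chunks):
    # Accumulate the streamed byte chunks into a single buffer,
    # then parse the whole buffer as CSV (assumes it fits in memory).
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)
    text = io.TextIOWrapper(buf, encoding="utf-8")
    return list(csv.reader(text))

# Stand-in for r.iter_content(chunk_size): yields the payload in pieces.
payload = b"a,b\n1,2\n3,4\n"
chunks = (payload[i:i + 4] for i in range(0, len(payload), 4))
rows = load_csv_from_chunks(chunks)
# rows == [['a', 'b'], ['1', '2'], ['3', '4']]
```

The resulting list of rows could then be passed to `sqlContext.createDataFrame`, but note this downloads everything on the driver, which is exactly what the partitioned answer below avoids.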
Note that the data cannot be retrieved over the HDFS protocol; WebHDFS must be used. You can pre-generate an RDD of chunk boundaries, then use it to process the file inside the workers. For example:
def process(start, finish):
    # Download the file
    # Process the downloaded content in the range [start, finish)
    # Return a list of items
    ...

partition_size = file_size // num_partitions
boundaries = [(i, i + partition_size - 1) for i in range(0, file_size, partition_size)]
rdd = sc.parallelize(boundaries).flatMap(lambda b: process(*b))
df = sqlContext.createDataFrame(rdd)
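To make the boundary computation concrete, here is a self-contained sketch with hypothetical names (`make_boundaries`, an in-memory `data` standing in for the remote file). It builds half-open byte ranges that cover the file exactly; in a real job, `process` would instead issue a ranged WebHDFS read (the `OPEN` operation accepts `offset` and `length` parameters) for its slice:

```python
def make_boundaries(file_size, num_partitions):
    # Half-open byte ranges [start, finish) that exactly cover the file.
    partition_size = max(1, file_size // num_partitions)
    return [(start, min(start + partition_size, file_size))
            for start in range(0, file_size, partition_size)]

def process(start, finish, data):
    # Stand-in for the worker-side download: a real implementation would
    # fetch bytes [start, finish) via WebHDFS (op=OPEN&offset=...&length=...)
    # and parse just that slice into records.
    return [data[start:finish]]

data = b"0123456789"
boundaries = make_boundaries(len(data), 3)
parts = [piece for s, f in boundaries for piece in process(s, f, data)]
# The ranges tile the file with no gaps or overlaps:
assert b"".join(parts) == data
```

With half-open ranges the `- 1` in the answer's list comprehension is unnecessary, and the last partition is clamped to `file_size` so a remainder chunk is not lost when `file_size` is not a multiple of `partition_size`.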
You want to create a DataFrame while the received results are still streaming in, right? You may want to look at Spark Streaming. Once the file has been imported, you can analyze the data with core Spark. That would be much better than my solution, which causes a stack overflow: