Apache Spark can't load a big file?

I want to load a large CSV file, so I tried PySpark, but the Jupyter notebook returns the following error:

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
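The message itself names the setting to raise: the notebook's IOPub output rate limit. If you genuinely want the notebook to stream that much output, a minimal sketch of raising it is below, either in jupyter_notebook_config.py or on the command line (the 1e10 value is just an arbitrarily large example, not a recommendation):

# jupyter_notebook_config.py -- raise the IOPub output rate limit
# named in the error above (default is 1e6 bytes/sec)
c.NotebookApp.iopub_data_rate_limit = 1e10

# equivalent command-line form:
#   jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10

That only raises the ceiling, though; as the discussion further down shows, the underlying issue is collecting and printing the entire dataframe.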
Here is my code:

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Read the CSV, treating the first row as the header
df = spark.read.csv("Desktop/train/train.csv", header=True)

# Keep only the pickup time and coordinates
Pickup_locations = df.select("pickup_datetime", "Pickup_latitude",
                             "Pickup_longitude")

# Pull every row back to the driver and print it -- this is what
# triggers the IOPub data rate error on a large file
print(Pickup_locations.collect())
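The last line is what trips the limit: printing Pickup_locations.collect() pulls every row back to the driver and then streams all of it through the notebook's output channel (the comments below confirm this). A minimal sketch of inspecting the data without doing that, using only standard DataFrame methods:

# Print the schema and a handful of rows instead of everything
Pickup_locations.printSchema()
Pickup_locations.show(5)

# Or bring just a small slice back to the driver
print(Pickup_locations.limit(5).collect())

# Counting is safe as well: only a single number is returned
print(Pickup_locations.count())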

How big is this file? It has 1,048,576 rows; I downloaded it from Kaggle. Apparently the problem is in collect(): Jupyter cannot display all of the dataframe's data. When I change collect() to count(), I get a row count of 1,458,644.

There is no reason to use collect on raw, unprocessed data the way it is used here; and if you really need to collect everything, why use Spark in the first place?

In fact, my main problem is iterating over the dataframe to get the latitude and longitude of each row and then passing them to a folium map, so I was planning to do: for row in df.collect(): extract lat and lon.
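For that use case there is no need to collect the whole dataframe: the coordinates can be sampled (or streamed with toLocalIterator()) before being handed to folium. A minimal sketch under that assumption; the column names come from the question, while the sample fraction, the map's starting location, and the output file name are placeholders:

import folium

# Take a small random sample of coordinates instead of collect()-ing
# every row (the fraction is illustrative only)
coords = (df.select("Pickup_latitude", "Pickup_longitude")
            .dropna()
            .sample(fraction=0.001)
            .collect())

# Build the folium map and add one small marker per sampled row
# (starting location and zoom level are placeholders)
m = folium.Map(location=[40.75, -73.97], zoom_start=11)
for row in coords:
    folium.CircleMarker(
        location=[float(row["Pickup_latitude"]),
                  float(row["Pickup_longitude"])],
        radius=2).add_to(m)

m.save("pickups.html")  # hypothetical output file name

This keeps only a small, bounded amount of data on the driver, so neither the notebook's output limit nor driver memory becomes a problem.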