PySpark: saveAsTable fails on Windows paths
I am trying to save a CSV file using a Windows path (with "\" instead of "/"). I suspect it fails because of the Windows path. Is that why the code doesn't work? Is there a way around this? The code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
    spark = SparkSession.builder.appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark():
    spark, sc = init_spark(appname="bucket_analysis")
    p_configs_RDD = sc.parallelize([1, 4, 5])
    p_configs_RDD = p_configs_RDD.map(mul)
    schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
    df = spark.createDataFrame(p_configs_RDD, schema)
    df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv", format="csv")

def mul(x):
    return (x, x**2)

run_on_configs_spark()
The error:
Traceback (most recent call last):
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 426, in <module>
analysis()
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 408, in analysis
run_CDH()
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 420, in run_CDH
max_prob_for_extension=None, max_base_size_B=4096,OP_arr=[0.2],
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 173, in settings_print
dic=get_map_of_worst_seq(params)
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 245, in get_map_of_worst_seq
run_over_settings_spark_test(info_obj)
File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 239, in run_over_settings_spark_test
run_on_configs_spark(configs)
File "C:\Users\yuvalr\Desktop\Git_folder\algo_sim\Bucket_analysis\Set_multiple_configurations\spark_parallelized_configs.py", line 17, in run_on_configs_spark
df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv",format="csv")
File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\readwriter.py", line 868, in saveAsTable
self._jwrite.saveAsTable(name)
File "C:\Users\yuvalr\venv\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\utils.py", line 137, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input ':' expecting {<EOF>, '.', '-'}(line 1, pos 1)
== SQL ==
C:\Users\yuvalr\Desktop\example_csv
-^^^
It looks to me like the problem is in your output line. Try something like this:

df.write.csv("file:///C:/Users/yuvalr/Desktop/example_csv.csv")

- Yes, I know you're on Windows and therefore expect backslashes, but PySpark doesn't.
- Windows is very picky about file extensions; without .csv you may just end up with a directory named example_csv.
- You don't need a raw string r"" for this.
- Use the file:// scheme to make it doubly clear that it's a file we're talking about.

saveAsTable() expects to be given a table name, and the table's data is stored under the directory set by spark.sql.warehouse.dir.
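The file:/// URI form suggested above can be derived from a Windows path using only the standard library; a minimal sketch (the helper name to_file_uri is mine, not part of the original code):

```python
from pathlib import PureWindowsPath

def to_file_uri(win_path: str) -> str:
    # PureWindowsPath understands backslash paths on any platform
    # and as_uri() produces the file:///C:/... form Spark accepts.
    return PureWindowsPath(win_path).as_uri()

print(to_file_uri(r"C:\Users\yuvalr\Desktop\example_csv"))
# file:///C:/Users/yuvalr/Desktop/example_csv
```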
saveAsTable(name, format=None, mode=None, partitionBy=None, **options)
Parameters
- name – the table name
- format – the format used to save
- mode – one of append, overwrite, error, errorifexists, ignore (default: error)
- partitionBy – names of partitioning columns
- options – all other string options
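This also explains the ParseException in the traceback: saveAsTable() parses its first argument as a SQL table identifier (optionally db.table), not as a filesystem path, so the ':' in C:\... is rejected at position 1. A rough illustration (the regex below is a simplification I wrote for this sketch, not Spark's actual grammar):

```python
import re

# Simplified table-identifier shape: name or db.name, where each
# part is letters/digits/underscores starting with a letter or '_'.
IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")

print(bool(IDENT.match(r"C:\Users\yuvalr\Desktop\example_csv")))  # False: ':' is not a valid identifier character
print(bool(IDENT.match("example_csv")))                          # True: a plain table name parses fine
```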
Set spark.sql.warehouse.dir to point at the target directory (e.g. somewhere under C:\\), as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
    spark = SparkSession.builder\
        .config("spark.sql.warehouse.dir", r"C:\Users\yuvalr\Desktop")\
        .appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark():
    spark, sc = init_spark(appname="bucket_analysis")
    p_configs_RDD = sc.parallelize([1, 4, 5])
    p_configs_RDD = p_configs_RDD.map(mul)
    schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
    df = spark.createDataFrame(p_configs_RDD, schema)
    df.write.saveAsTable("example_csv", format="csv", mode="overwrite")

def mul(x):
    return (x, x**2)

run_on_configs_spark()
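A note on the path literal passed to spark.sql.warehouse.dir: in a regular Python string, backslashes must be escaped (or a raw string used), and a half-escaped literal like "C:\\Users\yuvalr\Desktop" only happens to work because \y and \D are not recognized escape sequences (newer Pythons warn about this). A quick illustration:

```python
# Two equivalent ways to write the same Windows path:
p_escaped = "C:\\Users\\yuvalr\\Desktop"   # explicit escaping
p_raw = r"C:\Users\yuvalr\Desktop"         # raw string, no escape processing
print(p_escaped == p_raw)  # True

# But a recognized escape sequence IS swallowed silently:
p_bad = "C:\temp"   # \t becomes a TAB character, not backslash-t
print(p_bad == r"C:\temp")  # False
```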
Edit 1:
If it is an external table (with an external path where the underlying files are stored), you can use the following:
#df.write.option("path", r"C:\Users\yuvalr\Desktop").saveAsTable("example_csv", format="csv", mode="overwrite")
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row

def init_spark(appname):
    spark = SparkSession.builder\
        .appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark():
    spark, sc = init_spark(appname="bucket_analysis")
    p_configs_RDD = sc.parallelize([1, 4, 5])
    p_configs_RDD = p_configs_RDD.map(mul)
    schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
    df = spark.createDataFrame(p_configs_RDD, schema)
    df.write.option("path", r"C:\Users\yuvalr\Desktop").saveAsTable("example_csv", format="csv", mode="overwrite")

def mul(x):
    return (x, x**2)

run_on_configs_spark()