PySpark: saveAsTable on Windows can't handle Windows paths

I'm trying to save a CSV file using a Windows path (with "\" instead of "/"). I think it isn't working because of the Windows path.

  • Is that why the code doesn't work?
  • Is there a way around it?
  • The code:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import *
    from pyspark.sql import Row
    
    def init_spark(appname):
      spark = SparkSession.builder.appName(appname).getOrCreate()
      sc = spark.sparkContext
      return spark,sc
    
    def run_on_configs_spark():
      spark,sc = init_spark(appname="bucket_analysis")
      p_configs_RDD = sc.parallelize([1,4,5])
      p_configs_RDD=p_configs_RDD.map(mul)
      schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
      df=spark.createDataFrame(p_configs_RDD,schema)
      df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv",format="csv")
    
    
    def mul(x):
      return (x,x**2)
    
    run_on_configs_spark()
    
The error:

    Traceback (most recent call last):
      File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 426, in <module>
        analysis()
      File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 408, in analysis
        run_CDH()
      File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 420, in run_CDH
        max_prob_for_extension=None, max_base_size_B=4096,OP_arr=[0.2],
      File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 173, in settings_print
        dic=get_map_of_worst_seq(params)
      File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 245, in get_map_of_worst_seq
        run_over_settings_spark_test(info_obj)
      File "C:/Users/yuvalr/Desktop/Git_folder/algo_sim/Bucket_analysis/Set_multiple_configurations/run_multiple_configurations.py", line 239, in run_over_settings_spark_test
        run_on_configs_spark(configs)
      File "C:\Users\yuvalr\Desktop\Git_folder\algo_sim\Bucket_analysis\Set_multiple_configurations\spark_parallelized_configs.py", line 17, in run_on_configs_spark
        df.write.saveAsTable(r"C:\Users\yuvalr\Desktop\example_csv",format="csv")
      File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\readwriter.py", line 868, in saveAsTable
        self._jwrite.saveAsTable(name)
      File "C:\Users\yuvalr\venv\lib\site-packages\py4j\java_gateway.py", line 1305, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "C:\Users\yuvalr\Desktop\spark\Spark\python\pyspark\sql\utils.py", line 137, in deco
        raise_from(converted)
      File "<string>", line 3, in raise_from
    pyspark.sql.utils.ParseException: 
    mismatched input ':' expecting {<EOF>, '.', '-'}(line 1, pos 1)
    
    == SQL ==
    C:\Users\yuvalr\Desktop\example_csv
    -^^^
    
    
It looks to me like the problem is your output line. Try the following instead:

    df.write.csv("file:///C:/Users/yuvalr/Desktop/example_csv.csv")
    
  • Yes, I know you're on Windows, so you expect backslashes, but PySpark doesn't.
  • Windows is very particular about file extensions - without the .csv you'd probably just create a folder named example_csv.
  • You don't need a raw r"" string for this.
  • Using file:// makes doubly sure that we're talking about a file (see the sketch after this list).
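
A minimal sketch of the suggested fix, assuming the DataFrame df from the question (the mode and header options are my additions, not part of the original suggestion):

    # Write df as CSV via a file:// URI with forward slashes.
    # Note: Spark writes a directory of part files, not a single .csv file.
    df.write \
      .mode("overwrite") \
      .option("header", True) \
      .csv("file:///C:/Users/yuvalr/Desktop/example_csv.csv")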

As you can see, saveAsTable() expects a table name rather than a path; the table is created under the warehouse directory configured by spark.sql.warehouse.dir:

    saveAsTable(name, format=None, mode=None, partitionBy=None, **options)

Parameters:

  • name – the table name
  • format – the format used to save
  • mode – one of append, overwrite, error, errorifexists, ignore (default: error)
  • partitionBy – names of partitioning columns
  • options – all other string options

Source: the PySpark documentation for DataFrameWriter.saveAsTable.
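
In other words, the first argument is parsed as a SQL table identifier, which is why the ':' in C:\Users\... triggers the ParseException above. A short illustration (a sketch, reusing the names from the question):

    # Pass a table name, not a filesystem path; the underlying files are
    # created under the warehouse directory.
    df.write.saveAsTable("example_csv", format="csv")
    print(spark.conf.get("spark.sql.warehouse.dir"))  # where the table lives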

Workaround (note the escaped Windows C:\\ prefix): set spark.sql.warehouse.dir to point at the target directory, as shown below.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import *
    from pyspark.sql import Row
    
    def init_spark(appname):
      spark = SparkSession.builder\
        .config("spark.sql.warehouse.dir", "C:\\Users\\yuvalr\\Desktop")\
        .appName(appname).getOrCreate()
      sc = spark.sparkContext
      return spark,sc
    
    def run_on_configs_spark():
      spark,sc = init_spark(appname="bucket_analysis")
      p_configs_RDD = sc.parallelize([1,4,5])
      p_configs_RDD=p_configs_RDD.map(mul)
      schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
      df=spark.createDataFrame(p_configs_RDD,schema)
      df.write.saveAsTable("example_csv",format="csv",mode="overwrite")
    
    
    def mul(x):
      return (x,x**2)
    
    run_on_configs_spark()
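
With this configuration, the managed table's CSV part files should land under C:\Users\yuvalr\Desktop\example_csv. A quick check (my addition; the path is an assumption derived from the config above):

    import os
    # List the part files Spark wrote for the managed table "example_csv".
    print(os.listdir(r"C:\Users\yuvalr\Desktop\example_csv"))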
    
Edit 1: if you want an external table instead (with the underlying files stored at an external path), you can use the following:

    #df.write.option("path","C:\\Users\\yuvalr\\Desktop").saveAsTable("example_csv",format="csv",mode="overwrite")
    
    
    from pyspark.sql import SparkSession
    from pyspark.sql.types import *
    from pyspark.sql import Row
    
    
    def init_spark(appname):
      spark = SparkSession.builder\
        .appName(appname).getOrCreate()
      sc = spark.sparkContext
      return spark,sc
    
    def run_on_configs_spark():
      spark,sc = init_spark(appname="bucket_analysis")
      p_configs_RDD = sc.parallelize([1,4,5])
      p_configs_RDD=p_configs_RDD.map(mul)
      schema = StructType([StructField('a', IntegerType()), StructField('b', IntegerType())])
      df=spark.createDataFrame(p_configs_RDD,schema)
      df.write.option("path","C:\\Users\\yuvalr\\Desktop").saveAsTable("example_csv",format="csv",mode="overwrite")
    
    
    def mul(x):
      return (x,x**2)
    
    run_on_configs_spark()
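
As a quick sanity check (my addition, not part of the original answer), the saved table can then be read back by name from the catalog:

    # Read the registered table back; equivalent to
    # spark.sql("SELECT * FROM example_csv").show()
    spark.table("example_csv").show()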