Python PySpark - loop through a file and substitute the values into a DataFrame

I need to read a file and substitute its values into an S3 path. I can loop through the file, but I am unable to substitute the values.

File contents (each line ends with \n):
/MTD_avg_cust_bal1
/MTDSumOfCustomerInitiatedTrxns1
/MTDCountOfCustomerInitiatedTrxns1
Code:

metrics = open("Output.txt","r")
line = metrics.readline()

while line:
    print line
    line = metrics.readline()
    s3path = ("SELECT * FROM parquet.`s3n://bucket{}/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv`") .format(line).strip('\n')

    print s3path
    df1 = sqlContext.sql(s3path)
Error:

pyspark.sql.utils.AnalysisException: u'Path does not exist: s3n://omniscience1/MTDSumOfCustomerInitiatedTrxns1\n/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv;; line 1 pos 14'

The problem is that, when the value is substituted, it still includes the \n, and I also need a separate DataFrame for each line.
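
For reference, a minimal illustration of why the newline survives (the line value below is made up, not from my actual file):

# Illustration only: strip('\n') is applied to the already-formatted SQL string,
# which starts with "SELECT" and ends with a backtick, so the newline that came
# from readline() stays embedded in the middle of the path.
line = "/MTDSumOfCustomerInitiatedTrxns1\n"
s3path = ("SELECT * FROM parquet.`s3n://bucket{}/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv`").format(line).strip('\n')
print(s3path)  # the embedded \n is what makes Spark report "Path does not exist"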

It would help if you could let us know what your output should look like.

I am not a Python expert, but below is what I came up with based on my understanding. Let me know if this is what you are looking for.

with open("Output.txt", 'r') as file:
    for line in file:
        line = line.strip('\n')  # drop the trailing newline before formatting
        s3path = ("SELECT * FROM parquet.`s3n://bucket{}/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv`").format(line)
        print (s3path)
Below is the output of the above script:

SELECT * FROM parquet.`s3n://bucket/MTD_avg_cust_bal1/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv`
SELECT * FROM parquet.`s3n://bucket/MTDSumOfCustomerInitiatedTrxns1/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv`
SELECT * FROM parquet.`s3n://bucket/MTDCountOfCustomerInitiatedTrxns1/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv`

Yes, it should be like df1 = SELECT * FROM parquet.`s3n://bucket/MTD_avg_cust_bal1/loaddate=20170406/part-r-00000-d60b633d-ff49-4515-8cff-ace9faf1b267.csv`

Are you planning to generate the DataFrame names (like df1, df2) dynamically? If so, may I know why you want to do that?

Thanks Raju for the input. I can read the file, but it only reads the last line of the file. Yes, I need to create the DataFrames dynamically, something like: with open("Output.txt", "r") as file: for line in file: line = line.strip('\n') s3path = "SELECT * FROM parquet.`s3n://omniscience1{}/loaddate=20170406/part*.parquet`".format(line) df = sqlContext.sql(s3path)
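
For what it's worth, a minimal sketch of that idea, assuming the same sqlContext and the s3n://omniscience1 layout mentioned above; instead of dynamically named variables (df1, df2, ...) it keeps one DataFrame per metric in a dict:

# Sketch only: assumes sqlContext and the bucket layout from the comments above.
dataframes = {}
with open("Output.txt", "r") as metrics_file:
    for line in metrics_file:
        metric = line.strip()   # drop the trailing \n before formatting
        if not metric:
            continue            # skip blank lines
        s3path = ("SELECT * FROM parquet."
                  "`s3n://omniscience1{}/loaddate=20170406/part*.parquet`").format(metric)
        dataframes[metric] = sqlContext.sql(s3path)

Each entry, e.g. dataframes["/MTD_avg_cust_bal1"], is then a separate DataFrame for that metric.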