Python: adding a new row to a PySpark DataFrame
I am new to PySpark but familiar with pandas. I have a PySpark DataFrame:
from pyspark.sql import SparkSession
# instantiate Spark
spark = SparkSession.builder.getOrCreate()
# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
I would like to add a new row (4, 5, 7) so that the output is:
df.show()
+---+----+----+
| id|dogs|cats|
+---+----+----+
|  1|   2|   0|
|  2|   0|   1|
|  4|   5|   7|
+---+----+----+
Based on something I did using union, here is part of the code (in Scala); you will of course need to adapt it to your own situation:
val dummySchema = StructType(
  StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)
for (i <- i_grams_Cols) {
  val nameCol = col(i)
  dfPostsNGrams2 = dfPostsNGrams2.union(dfPostsNGrams.select(explode(nameCol).as("phrase")).toDF)
}
As already said, union is the way to go. I am just answering your question to give you a PySpark example:
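from pyspark.sql import SparkSession
# if not already created automatically, instantiate the SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]
df = spark.createDataFrame(vals, columns)
newRow = spark.createDataFrame([(4, 5, 7)], columns)
appended = df.union(newRow)
appended.show()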
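Please also have a look at the Databricks FAQ.
Another option is to use the partitioned Parquet format and add an extra Parquet file for each DataFrame you want to append. This way you can create hundreds, thousands, or millions of Parquet files, and Spark will read them as a union when you later read the directory.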
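This example uses pyarrow.
Note: I also show how to write a single unpartitioned Parquet file (example.parquet), in case you already know where you want to put that single file.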
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
headers=['A', 'B', 'C']
row1 = ['a1', 'b1', 'c1']
row2 = ['a2', 'b2', 'c2']
df1 = pd.DataFrame([row1], columns=headers)
df2 = pd.DataFrame([row2], columns=headers)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df3 = pd.concat([df1, df2], ignore_index=True)
table = pa.Table.from_pandas(df3)
pq.write_table(table, 'example.parquet', flavor='spark')
pq.write_to_dataset(table, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
# Adding a new partition (B=b3/C=c3)
row3 = ['a3', 'b3', 'c3']
df4 = pd.DataFrame([row3], columns=headers)
table2 = pa.Table.from_pandas(df4)
pq.write_to_dataset(table2, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
# Add another parquet file to the B=b2/C=c2 partition
# Note this does not overwrite existing partitions, it just appends a new .parquet file.
# If files already exist, then you will get a union result of the two (or multiple) files when you read the partition
row5 = ['a5', 'b2', 'c2']
df5 = pd.DataFrame([row5], columns=headers)
table3 = pa.Table.from_pandas(df5)
pq.write_to_dataset(table3, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
Then read the output:
from pyspark.sql import SparkSession
spark = (SparkSession
         .builder
         .appName("testing parquet read")
         .getOrCreate())
df_spark = spark.read.parquet('test_part_file')
df_spark.show(25, False)
You should see something like this:
+---+---+---+
|A |B |C |
+---+---+---+
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
+---+---+---+
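If you run the same thing end to end again, you should see duplicates like this (all of the previous Parquet files are still there, so delete them first if you want to avoid this):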
+---+---+---+
|A |B |C |
+---+---+---+
|a2 |b2 |c2 |
|a5 |b2 |c2 |
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
|a3 |b3 |c3 |
+---+---+---+