How do I get a random row from a PySpark DataFrame?

python apache-spark dataframe pyspark

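How can I get a random row from a PySpark DataFrame? The only method I can find is sample(), which takes a fraction as a parameter. Setting that fraction to 1/numberOfRows gives unpredictable results: sometimes I get no rows at all.

On an RDD there is a method takeSample() that takes the number of elements you want the sample to contain as a parameter. I understand this may be slow, because every partition has to be evaluated, but is there a way to get something similar on a DataFrame?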
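
The empty results described above can be illustrated with a little arithmetic. The sketch below assumes that sample() without replacement keeps each row independently with probability equal to the fraction (Bernoulli sampling); under that assumption the chance of getting no rows at all with fraction 1/numberOfRows is about 0.32 for four rows and tends to 1/e, roughly 0.37, as the DataFrame grows. The function name prob_no_rows is made up for this illustration.

```python
import math


def prob_no_rows(n_rows: int, fraction: float) -> float:
    """Probability that independent per-row sampling keeps zero rows.

    Assumes every row is kept independently with probability `fraction`.
    """
    return (1.0 - fraction) ** n_rows


# Chance of an empty sample when fraction = 1 / numberOfRows
for n in (4, 100, 1_000_000):
    p_empty = prob_no_rows(n, 1.0 / n)
    print(f"n={n}: P(empty sample) = {p_empty:.3f}")

# For large n the probability approaches 1/e
print(f"1/e = {math.exp(-1):.3f}")
```

So under this assumption roughly one run in three returns nothing, which matches the behaviour in the question.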

You can simply call takeSample on the RDD:
df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.rdd.takeSample(False, 1, seed=0)
## [Row(k=3, v='c')]

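If you would rather not collect the data to the driver, sample with a larger fraction and apply limit. The result is a DataFrame, and it can still be empty if the fraction is too close to 1/numberOfRows: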
df.sample(False, 0.1, seed=0).limit(1)

Different types of samples
Randomly sample a percentage of the data (with and without replacement)
import pyspark.sql.functions as F

# The snippets below assume df has an 'ID' column and a 'state' column.

# Randomly sample about 50% of the data without replacement
sample1 = df.sample(False, 0.5, seed=0)

# Randomly sample about 50% of the data with replacement
sample1 = df.sample(True, 0.5, seed=0)

# Take another sample excluding records from the previous sample, using an anti join
sample2 = df.join(sample1, on='ID', how='left_anti').sample(False, 0.5, seed=0)

# Take another sample excluding records from the previous sample, using where
# (this collects the sampled IDs to the driver, so it only suits small samples)
sample1_ids = [row['ID'] for row in sample1.select('ID').collect()]
sample2 = df.where(~F.col('ID').isin(sample1_ids)).sample(False, 0.5, seed=0)

# Generate a stratified sample of the data across a column
# Sampling is probabilistic and thus cannot guarantee an exact number of rows
fractions = {
    'NJ': 0.5,   # Take about 50% of records where state = NJ
    'NY': 0.25,  # Take about 25% of records where state = NY
    'VA': 0.1,   # Take about 10% of records where state = VA
}
# Pass the column name as a string, which sampleBy accepts in all Spark versions
stratified_sample = df.sampleBy('state', fractions, seed=0)

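Is there a way to get random values? In the case above, the same DataFrame is generated every time I run the query.
Don't pass a seed, and you should get a different DataFrame each time.
Great tip, @LateCoder! (In Spark 2.3.1, leaving seed=None seems to work only for df.rdd.takeSample, not for df.sample.)
Why would someone not want to collect?
Oh, because when you collect, the result may not fit in the driver's memory.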