How to add a new column with random values to an existing DataFrame in Scala
I have a DataFrame backed by Parquet files, and I have to add a new column containing some random data, but I need those random values to differ from one another. This is my actual code; the current Spark version is 1.5.1-cdh-5.5.2:
val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686
mydf.cache
val r = scala.util.Random
import org.apache.spark.sql.functions.{lit, udf}
def myNextPositiveNumber: String = (r.nextInt(Integer.MAX_VALUE) + 1).toString.concat("D")
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn",lit(myNextPositiveNumber))
With this code I get the following data:
scala> myNewDF.select("myNewColumn").show(10,false)
+-----------+
|myNewColumn|
+-----------+
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
+-----------+
It looks like the udf myNextPositiveNumber is only invoked once, doesn't it?
UPDATE
After checking, there is indeed only one distinct value:
scala> myNewDF.select("myNewColumn").distinct.show(50,false)
17/02/21 13:23:11 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
...
+-----------+
|myNewColumn|
+-----------+
|889488717D |
+-----------+
What am I doing wrong?
UPDATE 2: finally, with the help of @user6910411, I have this code:
val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686
mydf.cache
val r = scala.util.Random
import org.apache.spark.sql.functions.{lit, udf}
val accum = sc.accumulator(1)
def myNextPositiveNumber(): String = {
  accum += 1
  accum.value.toString.concat("D")
}
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn",lit(myNextPositiveNumber))
myNewDF.select("myNewColumn").count
// 63385686
UPDATE 3
The actual code generates data like this:
scala> myNewDF.select("myNewColumn").show(5,false)
17/02/22 11:01:57 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+-----------+
|myNewColumn|
+-----------+
|2D |
|2D |
|2D |
|2D |
|2D |
+-----------+
only showing top 5 rows
It looks like the udf function is only invoked once, doesn't it? I need a new random element in every row of that column.
UPDATE 4 (@user6910411)
I have actual code that generates an increasing id, but it does not concatenate the trailing character, which is strange. This is my code:
import org.apache.spark.sql.functions.{expr, monotonically_increasing_id, udf}
val mydf = sqlContext.read.parquet("some.parquet")
mydf.cache
def myNextPositiveNumber(): String = monotonically_increasing_id().toString().concat("D")
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn",expr(myNextPositiveNumber))
scala> myNewDF.select("myNewColumn").show(5,false)
17/02/22 12:00:02 WARN Executor: 1 block locks were not released by TID = 1:
[rdd_4_0]
+-----------+
|myNewColumn|
+-----------+
|0 |
|1 |
|2 |
|3 |
|4 |
+-----------+
I need something like this:
+-----------+
|myNewColumn|
+-----------+
|1D |
|2D |
|3D |
|4D |
+-----------+
Spark >= 2.3

It is possible to disable some optimizations using the asNondeterministic method:
import org.apache.spark.sql.expressions.UserDefinedFunction
val f: UserDefinedFunction = ???
val fNonDeterministic: UserDefinedFunction = f.asNondeterministic
Make sure you understand the guarantees before using this option.
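A minimal sketch of this approach, assuming Spark >= 2.3 (the names randomId and myNewColumn are illustrative):

import org.apache.spark.sql.functions.udf

// A nullary random UDF marked nondeterministic, so Catalyst does not
// collapse the call into a single constant; uniqueness is still not guaranteed.
val randomId = udf(() => (scala.util.Random.nextInt(Integer.MAX_VALUE) + 1).toString.concat("D")).asNondeterministic()
val myNewDF = mydf.withColumn("myNewColumn", randomId())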
Spark < 2.3
A function passed to udf should be deterministic (with possible exceptions), and calls of nullary functions can be replaced by constants. If you want to generate random numbers, use one of the built-in functions:
- rand - generate a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0]
- randn - generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution

and transform the output to obtain the required distribution, for example:
(rand * Integer.MAX_VALUE).cast("bigint").cast("string")
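For example, a sketch combining rand with the question's "D" suffix via the built-in concat and lit (no UDF involved; rand is evaluated per row, so every row gets its own value):

import org.apache.spark.sql.functions.{concat, lit, rand}

// rand() yields a new sample for every row; cast it to a string and append "D".
val withRandom = mydf.withColumn(
  "myNewColumn",
  concat((rand() * Integer.MAX_VALUE).cast("bigint").cast("string"), lit("D"))
)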
You can use monotonically_increasing_id to generate random values.

Then you can define a UDF to append any string to it after casting it to String, as monotonically_increasing_id returns a Long by default:
scala> var df = Seq(("Ron"), ("John"), ("Steve"), ("Brawn"), ("Rock"), ("Rick")).toDF("names")
scala> df.show
+-----+
|names|
+-----+
| Ron|
| John|
|Steve|
|Brawn|
| Rock|
| Rick|
+-----+
scala> val appendD = spark.sqlContext.udf.register("appendD", (s: String) => s.concat("D"))
scala> df = df.withColumn("ID",monotonically_increasing_id).selectExpr("names","cast(ID as String) ID").withColumn("ID",appendD($"ID"))
scala> df.show
+-----+---+
|names| ID|
+-----+---+
| Ron| 0D|
| John| 1D|
|Steve| 2D|
|Brawn| 3D|
| Rock| 4D|
| Rick| 5D|
+-----+---+
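As a side note, the same result can be obtained without registering a UDF, by building the suffix with built-in functions only (a sketch; withIds is an illustrative name):

import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

// Cast the generated id to string and append "D" entirely with built-ins.
val withIds = df.withColumn("ID", concat(monotonically_increasing_id().cast("string"), lit("D")))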
Note: you should really remove the first line ("You can use monotonically_increasing_id to generate random values"): monotonically_increasing_id is anything but random; given the distribution, it is strictly deterministic. Also, monotonicallyIncreasingId is deprecated since 2.0; you should use monotonically_increasing_id instead.

What do you mean by "Make sure you understand the guarantees before using this option"? I used .asNondeterministic in Spark 2.4 and it does not work with java.util.UUID.randomUUID(): the udf that generates random UUIDs gets re-executed.
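For reference, a minimal sketch of the situation described in that last comment (Spark >= 2.3 assumed); caching the result is one commonly suggested way to pin the generated values, if that is acceptable for the pipeline:

import org.apache.spark.sql.functions.udf

// Even a nondeterministic UUID UDF may be re-evaluated when the plan is
// recomputed; caching materializes one result so later actions agree.
val uuid = udf(() => java.util.UUID.randomUUID().toString).asNondeterministic()
val withUuid = df.withColumn("uuid", uuid()).cache()
withUuid.count() // force materialization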