How do I replace a period with zero in PySpark?
I am trying to replace periods in my raw data with the value 0 in PySpark.
I tried a .when / .otherwise expression, and I tried using regexp_replace to change '.' to 0.
Code attempted so far:
from pyspark.sql import functions as F
# For item 1 above:
dataframe2 = dataframe1.withColumn('test_col', F.when(F.col('test_col') == F.lit('.'), 0).otherwise(F.col('test_col')))
# For item 2 above:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', '0'))
Instead of '.', the column should end up containing only numbers (i.e., a row that is not a period already holds a number; otherwise it is a period and should be replaced with 0).

This sample code performs the query correctly:
package otz.scalaspark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object ValueReplacement {
  def main(args: Array[String]) {
    val sparkConfig = new SparkConf().setAppName("Value-Replacement").setMaster("local[*]").set("spark.executor.memory", "1g")
    val sparkContext = new SparkContext(sparkConfig)

    val someData = Seq(
      Row(3, "r1"),
      Row(9, "r2"),
      Row(27, "r3"),
      Row(81, "r4")
    )
    val someSchema = List(
      StructField("number", IntegerType, true),
      StructField("word", StringType, true)
    )

    val sqlContext = new SQLContext(sparkContext)
    val dataFrame = sqlContext.createDataFrame(
      sparkContext.parallelize(someData),
      StructType(someSchema)
    )

    val filteredDataFrame = dataFrame.withColumn("number", when(col("number") === 3, -3).otherwise(col("number")))
    filteredDataFrame.show()
  }
}
Output:
+------+----+
|number|word|
+------+----+
| -3| r1|
| 9| r2|
| 27| r3|
| 81| r4|
+------+----+
PySpark version:
from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, StructField, StructType)
from pyspark.sql import functions

column_schema = StructType([StructField("num", IntegerType()), StructField("text", StringType())])
data = [[3, 'r1'], [9, 'r2.'], [27, '.']]

spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.executor.memory", '1g')
spark.conf.set('spark.executor.cores', '1')
spark.conf.set('spark.cores.max', '2')
spark.conf.set("spark.driver.memory", '1g')
spark_context = spark.sparkContext

data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.show()

filtered_data_frame = data_frame.withColumn(
    'num', functions.when(data_frame['num'] == 3, -3).otherwise(data_frame['num']))
filtered_data_frame.show()

filtered_data_frame = data_frame.withColumn(
    'text', functions.when(data_frame['text'] == '.', '0').otherwise(data_frame['text']))
filtered_data_frame.show()
Output:
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| -3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| 0|
+---+----+
If your dataframe1 looks like this:
+--------+
|test_col|
+--------+
| 1.0|
| 2.0|
| 2|
+--------+
your attempt would yield:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', '0'))
dataframe2.show()
+--------+
|test_col|
+--------+
| 000|
| 000|
| 0|
+--------+
Here the `.` is a regex metacharacter that matches any character, so every character gets replaced, not just the period. But if you add an escape before the dot (`\.`), everything works:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, r'\.', '0'))
dataframe2.show()
+--------+
|test_col|
+--------+
| 100|
| 200|
| 2|
+--------+
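The escaping behavior can be checked outside Spark: regexp_replace follows Java regex syntax, but `.` and `\.` behave the same way in Python's re module, so a quick sketch (an illustration, not Spark itself) shows why the unescaped attempt zeroes out every character:

```python
import re

# An unescaped '.' matches ANY single character, so every character is replaced:
print(re.sub('.', '0', '1.0'))      # -> 000

# Escaping the dot matches only a literal period:
print(re.sub(r'\.', '0', '1.0'))    # -> 100

# Anchoring the pattern replaces only cells that are exactly '.':
print(re.sub(r'^\.$', '0', '.'))    # -> 0
print(re.sub(r'^\.$', '0', 'r2.'))  # -> r2. (unchanged)
```

The anchored form (`^\.$`) matches the `F.when(F.col('test_col') == F.lit('.'), ...)` approach: only cells that consist of a single dot are touched, while embedded dots such as in "r2." survive.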
Comments:
— If test_col is exactly equal to a dot ('.'), then the above should work fine. Is that what you want, or do you want it to replace '.' wherever it appears inside test_col? (This applies to both approaches above.)
— Yes, I want it to replace '.'. I checked the data and each value seems to be either a number or a single dot; some entries may have whitespace or something else around them, so maybe I can try that logic on test_col? Otherwise the value is just a full number, e.g. 425, with no decimals. By the way, I found syntax errors in both code samples; withColumn should take a dataframe index such as id or timestamp. Please check this.
— Can you provide a sample dataset via createDataFrame? Are you trying to replace all dots in the dataframe, or only those in a specific column?
— Unfortunately, I am using PySpark.
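Following up on the whitespace concern from the comments: one plausible cleaning rule is to trim the cell first and then compare the result to '.', which in PySpark would be `F.when(F.trim(F.col('test_col')) == '.', '0').otherwise(F.col('test_col'))`. The trim-then-compare rule is an assumption about the data, not something confirmed by the asker; its logic can be sketched in plain Python:

```python
def replace_dot(cell: str) -> str:
    # Mirrors F.when(F.trim(col) == '.', '0').otherwise(col):
    # strip surrounding whitespace, turn an exact '.' into '0',
    # and leave every other value (full numbers like '425') alone.
    return "0" if cell.strip() == "." else cell

print([replace_dot(c) for c in ["425", ".", " . ", "r2."]])  # -> ['425', '0', '0', 'r2.']
```

Note that "r2." is untouched: this rule only rewrites cells that are nothing but a (possibly padded) period.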