PySpark: Filtering a DataFrame with a condition
I have the following sample DataFrame:
l = [('Alice went to wonderland',), ('qwertyuiopqwert some text',), ('hello world',), ('ThisGetsFilteredToo',)]
df = spark.createDataFrame(l)
| Alice went to wonderland |
| qwertyuiopqwert some text |
| hello world |
| ThisGetsFilteredToo |
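Given this DataFrame, I want to filter out every row that contains a word longer than 15 characters. In this example, row 2 contains the word "qwertyuiopqwert", whose length is greater than 15, so it should be dropped.
Likewise, row 4 should also be dropped.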
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType, IntegerType, ArrayType

data = ['athshgthsc asl', 'sdf sdfdsadf sdf', 'arasdfa sdf', 'aa bb', 'aaa bbb ccc', 'dd aa bbb']
df = sqlContext.createDataFrame(data, StringType())

def getLenghts(lst):
    # Length of each word in the list
    tempLst = []
    for ele in lst:
        tempLst.append(len(ele))
    return tempLst

# split() returns a list of words, so the UDF is declared with an array return type
getList = udf(lambda data: data.split(), ArrayType(StringType()))
getListLen = udf(getLenghts, ArrayType(IntegerType()))
getMaxLen = udf(lambda data: max(data), IntegerType())

df = (df.withColumn('splitWords', getList(df.value))
        .withColumn('lengthList', getListLen(col('splitWords')))
        .withColumn('maxLen', getMaxLen('lengthList')))

# Keep only the rows whose longest word is shorter than 5 characters
df.filter(df.maxLen < 5).select('value').show()
Input data:
+----------------+
|           value|
+----------------+
|  athshgthsc asl|
|sdf sdfdsadf sdf|
|     arasdfa sdf|
|           aa bb|
|     aaa bbb ccc|
|       dd aa bbb|
+----------------+
With the helper columns added:
+----------------+--------------------+----------+------+
|           value|          splitWords|lengthList|maxLen|
+----------------+--------------------+----------+------+
|  athshgthsc asl|   [athshgthsc, asl]|   [10, 3]|    10|
|sdf sdfdsadf sdf|[sdf, sdfdsadf, sdf]| [3, 8, 3]|     8|
|     arasdfa sdf|      [arasdfa, sdf]|    [7, 3]|     7|
|           aa bb|            [aa, bb]|    [2, 2]|     2|
|     aaa bbb ccc|     [aaa, bbb, ccc]| [3, 3, 3]|     3|
|       dd aa bbb|       [dd, aa, bbb]| [2, 2, 3]|     3|
+----------------+--------------------+----------+------+
After filtering:
+-----------+
|      value|
+-----------+
|      aa bb|
|aaa bbb ccc|
|  dd aa bbb|
+-----------+
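To match the question, change the filter condition so that it uses 15 as the threshold. You can also do more preprocessing before splitting the text. In my example I used a threshold of 5 instead, so only rows whose longest word is shorter than 5 characters are kept.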
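The three UDFs in this answer amount to one per-row computation: the length of the longest whitespace-separated word. As a quick sanity check outside Spark, that logic can be sketched in plain Python. The helper name `max_word_len` is hypothetical (it is not part of the answer), and returning 0 for an empty string is an assumption.

```python
def max_word_len(text):
    """Return the length of the longest whitespace-separated word in text."""
    # default=0 covers empty or whitespace-only strings, where split() yields [].
    return max((len(word) for word in text.split()), default=0)

data = ['athshgthsc asl', 'sdf sdfdsadf sdf', 'arasdfa sdf',
        'aa bb', 'aaa bbb ccc', 'dd aa bbb']

# Same threshold as the answer's filter: keep rows whose longest word is under 5.
kept = [row for row in data if max_word_len(row) < 5]

print([max_word_len(row) for row in data])  # [10, 8, 7, 2, 3, 3]
print(kept)  # ['aa bb', 'aaa bbb ccc', 'dd aa bbb']
```

The printed lengths agree with the maxLen column shown in the answer. A function like this could presumably be wrapped in a single udf with an IntegerType return type instead of chaining three UDFs, though that variant is not tested here.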
While the previous answer seems correct, I think you can do this with a simple user-defined function. Create a function that splits the string and looks for any word longer than 15 characters:
def no_long_words(s):
    for word in s.split():
        if len(word) > 15:
            return False
    return True
Create the UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
no_long_words_udf = udf(no_long_words, BooleanType())
Use the UDF to run the filter on the DataFrame:
df2 = df.filter(no_long_words_udf('col1'))
df2.show()
+--------------------+
|                col1|
+--------------------+
|Alice went to won...|
|qwertyuiopqwert s...|
|         hello world|
+--------------------+
Note: qwertyuiopqwert is actually exactly 15 characters long, so it is included in the result.
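The note can be confirmed in plain Python by running the same predicate over the four sample rows from the question. This is only a local check and assumes the column holds plain, non-null strings; `sample_rows` is an illustrative name.

```python
def no_long_words(s):
    # Same predicate as in the answer: reject a row if any word is longer than 15.
    for word in s.split():
        if len(word) > 15:
            return False
    return True

sample_rows = ['Alice went to wonderland', 'qwertyuiopqwert some text',
               'hello world', 'ThisGetsFilteredToo']

print(len('qwertyuiopqwert'))      # 15, so this row passes the "> 15" check
print(len('ThisGetsFilteredToo'))  # 19, so this row is dropped
print([row for row in sample_rows if no_long_words(row)])
```

If rows with 15-character words should be dropped as well, the comparison would need to be `>= 15`. Also note that a null value in the column reaches a Python UDF as None, so `s.split()` would raise; a guard would be needed if the column can contain nulls.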