PySpark: filtering a dataframe on a condition


I have the following example dataframe:

l = [('Alice went to wonderland',), ('qwertyuiopqwert some text',), ('hello world',), ('ThisGetsFilteredToo',)]
df = spark.createDataFrame(l)


| Alice went to wonderland  |
| qwertyuiopqwert some text |
| hello world               |
| ThisGetsFilteredToo       |
Given this dataframe, I want to filter out any row that contains a word longer than 15 characters. In this example, row 2 has the word "qwertyuiopqwert", whose length is greater than 15, so it should be dropped. Likewise, row 4 should be removed.

from pyspark.sql.functions import udf,col
from pyspark.sql.types import StringType, IntegerType, ArrayType
data = ['athshgthsc asl','sdf sdfdsadf sdf', 'arasdfa sdf','aa bb','aaa bbb ccc','dd aa bbb']
df = sqlContext.createDataFrame(data,StringType())

def getLengths(lst):
    # Length of every word in the list
    return [len(ele) for ele in lst]

# Note: the split UDF must declare ArrayType(StringType()), not
# StringType(), because it returns a list of words
getList = udf(lambda data: data.split(), ArrayType(StringType()))
getListLen = udf(getLengths, ArrayType(IntegerType()))
getMaxLen = udf(lambda data: max(data), IntegerType())

df = (df.withColumn('splitWords', getList(df.value))
        .withColumn('lengthList', getListLen(col('splitWords')))
        .withColumn('maxLen', getMaxLen('lengthList')))
df.filter(df.maxLen < 5).select('value').show()




+----------------+
|           value|
+----------------+
|  athshgthsc asl|
|sdf sdfdsadf sdf|
|     arasdfa sdf|
|           aa bb|
|     aaa bbb ccc|
|       dd aa bbb|
+----------------+

+----------------+--------------------+----------+------+
|           value|          splitWords|lengthList|maxLen|
+----------------+--------------------+----------+------+
|  athshgthsc asl|   [athshgthsc, asl]|   [10, 3]|    10|
|sdf sdfdsadf sdf|[sdf, sdfdsadf, sdf]| [3, 8, 3]|     8|
|     arasdfa sdf|      [arasdfa, sdf]|    [7, 3]|     7|
|           aa bb|            [aa, bb]|    [2, 2]|     2|
|     aaa bbb ccc|     [aaa, bbb, ccc]| [3, 3, 3]|     3|
|       dd aa bbb|       [dd, aa, bbb]| [2, 2, 3]|     3|
+----------------+--------------------+----------+------+

+-----------+
|      value|
+-----------+
|      aa bb|
|aaa bbb ccc|
|  dd aa bbb|
+-----------+
For your data, replace the 5 in the filter with 15, i.e. df.filter(df.maxLen < 15). You can also do more preprocessing before splitting the dataset; for my sample data I filtered out every row containing a word of 5 or more characters.


While the previous answer appears correct, I think you can do this with a single, simpler user-defined function. Create a function that splits the string and looks for any word with length > 15:

def no_long_words(s):
    for word in s.split():
        if len(word) > 15:
            return False
    return True
Create the UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
no_long_words_udf = udf(no_long_words, BooleanType())
Run the filter on the dataframe using the UDF:

df2 = df.filter(no_long_words_udf('col1'))
df2.show()

+--------------------+
|                col1|
+--------------------+
|Alice went to won...|
|qwertyuiopqwert s...|
|         hello world|
+--------------------+

Note: qwertyuiopqwert is actually exactly 15 characters long, so it is included in the results.
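That note highlights the boundary condition: with > 15, a 15-character word survives the filter. If such words should be dropped as well, the comparison can be tightened to >= 15; a small variation of the function above (the name no_long_words_strict is mine):

```python
def no_long_words_strict(s):
    # Reject rows containing any word of 15 characters or more
    for word in s.split():
        if len(word) >= 15:
            return False
    return True
```

Wrapped in a BooleanType UDF exactly as above, this would also filter out the "qwertyuiopqwert some text" row.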
