PySpark: filtering a dataframe on a condition


I have the following example dataframe:

l = [('Alice went to wonderland',), ('qwertyuiopqwert some text',), ('hello world',), ('ThisGetsFilteredToo',)]
df = spark.createDataFrame(l)


| Alice went to wonderland  |
| qwertyuiopqwert some text |
| hello world               |
| ThisGetsFilteredToo       |
Given this dataframe, I want to filter out any row that contains a word longer than 15 characters. In this example, row 2 has the word "qwertyuiopqwert", whose length is greater than 15, so it should be dropped. Likewise, row 4 should be removed.

from pyspark.sql.functions import udf,col
from pyspark.sql.types import StringType, IntegerType, ArrayType
data = ['athshgthsc asl','sdf sdfdsadf sdf', 'arasdfa sdf','aa bb','aaa bbb ccc','dd aa bbb']
df = sqlContext.createDataFrame(data,StringType())

def getLengths(lst):
    # Length of every word in the list
    return [len(ele) for ele in lst]

# Note: the split UDF must declare ArrayType(StringType()), not
# StringType(), because it returns a list of words
getList = udf(lambda data: data.split(), ArrayType(StringType()))
getListLen = udf(getLengths, ArrayType(IntegerType()))
getMaxLen = udf(lambda data: max(data), IntegerType())

df = (df.withColumn('splitWords', getList(df.value))
        .withColumn('lengthList', getListLen(col('splitWords')))
        .withColumn('maxLen', getMaxLen('lengthList')))
df.filter(df.maxLen < 5).select('value').show()




+----------------+
|           value|
+----------------+
|  athshgthsc asl|
|sdf sdfdsadf sdf|
|     arasdfa sdf|
|           aa bb|
|     aaa bbb ccc|
|       dd aa bbb|
+----------------+

+----------------+--------------------+----------+------+
|           value|          splitWords|lengthList|maxLen|
+----------------+--------------------+----------+------+
|  athshgthsc asl|   [athshgthsc, asl]|   [10, 3]|    10|
|sdf sdfdsadf sdf|[sdf, sdfdsadf, sdf]| [3, 8, 3]|     8|
|     arasdfa sdf|      [arasdfa, sdf]|    [7, 3]|     7|
|           aa bb|            [aa, bb]|    [2, 2]|     2|
|     aaa bbb ccc|     [aaa, bbb, ccc]| [3, 3, 3]|     3|
|       dd aa bbb|       [dd, aa, bbb]| [2, 2, 3]|     3|
+----------------+--------------------+----------+------+

+-----------+
|      value|
+-----------+
|      aa bb|
|aaa bbb ccc|
|  dd aa bbb|
+-----------+
For your data, replace the 5 in the filter with 15, i.e. df.filter(df.maxLen < 15). You can also do more preprocessing before splitting the dataset; for my sample data I filtered out every row containing a word of 5 or more characters.


While the previous answer appears correct, I think you can do this with a single, simpler user-defined function. Create a function that splits the string and looks for any word with length > 15:

def no_long_words(s):
    for word in s.split():
        if len(word) > 15:
            return False
    return True
Create the UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
no_long_words_udf = udf(no_long_words, BooleanType())
Run the filter on the dataframe using the UDF:

df2 = df.filter(no_long_words_udf('col1'))
df2.show()

+--------------------+
|                col1|
+--------------------+
|Alice went to won...|
|qwertyuiopqwert s...|
|         hello world|
+--------------------+

Note: qwertyuiopqwert is actually exactly 15 characters long, so it is included in the results.
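That note highlights the boundary condition: with > 15, a 15-character word survives the filter. If such words should be dropped as well, the comparison can be tightened to >= 15; a small variation of the function above (the name no_long_words_strict is mine):

```python
def no_long_words_strict(s):
    # Reject rows containing any word of 15 characters or more
    for word in s.split():
        if len(word) >= 15:
            return False
    return True
```

Wrapped in a BooleanType UDF exactly as above, this would also filter out the "qwertyuiopqwert some text" row.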
