How to create a PySpark UDF that calls a class function from another class function in the same file?


django python-3.x apache-spark pyspark


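I create a PySpark UDF inside one class-based view, and the function I want to call lives in another class. Both are in the same file (api.py), but when I inspect the contents of the resulting DataFrame I get the following error:

ModuleNotFoundError: No module named 'api'

I don't understand why this happens. I tried similar code in the pyspark console and it works fine. A similar question has been asked before, but the difference is that I am trying to do this within a single file.

This is part of my full code: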

api.py

class TextMiningMethods():
    def clean_tweet(self,tweet):
        '''
        some logic here
        '''
        return "Hello: "+tweet


class BigDataViewSet(TextMiningMethods,viewsets.ViewSet):

    @action(methods=['post'], detail=False)
    def word_cloud(self, request, *args, **kwargs): 
        '''
        some previous logic here
        '''
        spark=SparkSession \
            .builder \
            .master("spark://"+SPARK_WORKERS) \
            .appName('word_cloud') \
            .config("spark.executor.memory", '2g') \
            .config('spark.executor.cores', '2') \
            .config('spark.cores.max', '2') \
            .config("spark.driver.memory",'2g') \
            .getOrCreate()

        sc.sparkContext.addPyFile('path/to/udfFile.py')
        cols = ['text']
        rows = []

        for tweet_account_index, tweet_account_data in enumerate(tweets_list):

            tweet_data_aux_pandas_df = pd.Series(tweet_account_data['tweet']).dropna()
            for tweet_index,tweet in enumerate(tweet_data_aux_pandas_df):
                row= [tweet['text']]
                rows.append(row)

        # Create a Pandas Dataframe of tweets
        tweet_pandas_df = pd.DataFrame(rows, columns = cols)

        schema = StructType([
            StructField("text", StringType(),True)
        ])

        # Converts to Spark DataFrame
        df = spark.createDataFrame(tweet_pandas_df,schema=schema)
        clean_tweet_udf = udf(TextMiningMethods().clean_tweet, StringType())
        clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["text"]))
        clean_tweet_df.show()   # This line produces the error

A similar test in the pyspark console works fine:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf
def clean_tweet(name):
    return "This is " + name

schema = StructType([StructField("Id", IntegerType(),True),StructField("tweet", StringType(),True)])

data = [[ 1, "tweet 1"],[2,"tweet 2"],[3,"tweet 3"]]
df = spark.createDataFrame(data,schema=schema)

clean_tweet_udf = udf(clean_tweet,StringType())
clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["tweet"]))
clean_tweet_df.show()

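Here are my questions:

What is this error related to, and how can I fix it?

What is the proper way to create a PySpark UDF when working with class-based views? Is it bad practice to write the functions in the same file that calls the PySpark UDF (in my case, the file with all my API endpoints, built with Django REST framework)?

Any help is greatly appreciated. Thanks in advance.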

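Update:

The linked post explains how to use custom classes with PySpark through a SparkContext rather than a SparkSession (which is my case), but I used the following:

sc.sparkContext.addPyFile('path/to/udfFile.py')

The problem is that the class is defined in the same file where I create the UDF for the DataFrame, and some of its functions are used as PySpark UDFs (as shown in the code). I could not find how to make this work when the path passed to addPyFile() is the very file that calls it. Nevertheless, I moved my code and followed these steps (which fixed another error I had):

- Create a folder named udf.
- Create a new, empty __init__.py file in it so that the directory becomes a package.
- Create a .py file for my UDF functions.

In that file I tried importing the dependencies both at the top of the module and inside the function. In every case I get:

ModuleNotFoundError: No module named 'udf'

pyspark_udf.py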

import re
import string
import unidecode
from nltk.corpus import stopwords


class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self, tweet):
        # some logic here
        return "Hello: " + tweet

At the top of my api.py file, I tried all of these:

from udf.pyspark_udf import TextMiningMethods

# or

from udf.pyspark_udf import *

and, inside the word_cloud function:

class BigDataViewSet(viewsets.ViewSet):
    def word_cloud(self, request, *args, **kwargs):
        from udf.pyspark_udf import TextMiningMethods

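In the Python debugger, this line works:

from udf.pyspark_udf import TextMiningMethods

But when I show the DataFrame, I get the error:

clean_tweet_df.show()

ModuleNotFoundError: No module named 'udf'

Clearly the original problem has turned into a different one, and my issue is now closer to that other question. Even so, I still have not found a satisfactory way to import the file and create a PySpark UDF that calls a class function from another class function.

What am I missing?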
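A likely explanation for both errors can be sketched with the standard library alone. A function that lives in an importable module is pickled as a reference (module name plus attribute name), so whichever process unpickles it must be able to import that module. Spark executors are in exactly that position when the api module or the udf package exists only on the driver. The console test behaves differently because the cloudpickle copy bundled with PySpark serializes functions defined in the interactive `__main__` session by value. The sketch below uses plain `pickle` and a subprocess as a stand-in for an executor, which is not what Spark itself does; `api_demo`, `run_worker`, and the file contents are hypothetical names chosen for illustration.

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Hypothetical stand-in for api.py.
MODULE_SOURCE = (
    "class TextMiningMethods:\n"
    "    def clean_tweet(self, tweet):\n"
    "        return 'Hello: ' + tweet\n"
)

# What the stand-in "executor" does: unpickle the function and call it.
WORKER_CODE = (
    "import pickle, sys\n"
    "func = pickle.loads(sys.stdin.buffer.read())\n"
    "print(func('tweet 1'))\n"
)


def run_worker(payload, cwd, module_dir=None):
    """Unpickle and call `payload` in a fresh interpreter."""
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    if module_dir is not None:
        env["PYTHONPATH"] = module_dir
    return subprocess.run(
        [sys.executable, "-c", WORKER_CODE],
        input=payload, capture_output=True, env=env, cwd=cwd,
    )


with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as work_dir:
    with open(os.path.join(src_dir, "api_demo.py"), "w") as fh:
        fh.write(MODULE_SOURCE)

    sys.path.insert(0, src_dir)  # the "driver" can import the module
    import api_demo

    payload = pickle.dumps(api_demo.TextMiningMethods().clean_tweet)
    # The pickle carries the module and class names, not the method's code.
    names_only = b"api_demo" in payload and b"Hello" not in payload

    missing = run_worker(payload, work_dir)           # module not shipped
    shipped = run_worker(payload, work_dir, src_dir)  # module on the worker's path

# Last stderr line of the failing worker, then the output of the working one.
print(missing.stderr.decode().strip().splitlines()[-1])
print(shipped.stdout.decode().strip())
```

The first worker fails with the same kind of ModuleNotFoundError as in the question, and the second succeeds once the module's directory is on its import path, which is the role addPyFile() plays for executors.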

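After many attempts, I could not find a solution by pointing addPyFile() at a path, whether that path is the same file where I create the UDF (I wonder whether that is bad practice) or a separate file. The documentation says:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

So what I describe should be possible. Based on that, I had to use the .zip option and compress the whole udf folder from its top level with:

zip -r udf.zip udf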
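To show what the archive has to look like, here is a standard-library sketch that builds the same layout in a temporary directory, zips it from the top level, and imports from the archive. Putting the zip on `sys.path` by hand stands in, roughly, for what addPyFile() arranges on the executors. The file contents are a hypothetical placeholder, not the real cleaning logic.

```python
import importlib
import os
import sys
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as project_dir:
    # Lay out the package: udf/__init__.py and udf/pyspark_udf.py.
    package_dir = os.path.join(project_dir, "udf")
    os.mkdir(package_dir)
    open(os.path.join(package_dir, "__init__.py"), "w").close()
    with open(os.path.join(package_dir, "pyspark_udf.py"), "w") as fh:
        fh.write(
            "class TextMiningMethods:\n"
            "    def clean_tweet(self, tweet):\n"
            "        return 'Hello: ' + tweet\n"
        )

    # Python equivalent of running `zip -r udf.zip udf` inside project_dir:
    # entry names are relative to the top level, so they start with "udf/".
    zip_path = os.path.join(project_dir, "udf.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, project_dir))
        archive_names = sorted(archive.namelist())

    # With the archive on the import path, the package resolves from the zip.
    sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    from udf.pyspark_udf import TextMiningMethods

    result = TextMiningMethods().clean_tweet("tweet 1")
    loaded_from = sys.modules["udf"].__file__

print(archive_names)
print(result)
```

If the archive were built from inside the udf folder instead, the entries would lack the `udf/` prefix and `from udf.pyspark_udf import ...` would not resolve.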

In addition, in pyspark_udf.py I had to import the dependencies differently to avoid this problem, rather than at the top of the module like this:

import re
import string
import unidecode
from nltk.corpus import stopwords

class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self,tweet):
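A minimal sketch of the alternative, assuming it is the function-level import mentioned earlier in the thread: the dependencies are imported inside the method, so the import runs in the process that actually executes the UDF. The cleaning steps are hypothetical, and only standard-library modules are used so that the sketch runs anywhere. `unidecode` and `nltk` would be imported at the same spot, and they would still need to be installed on the executors.

```python
class TextMiningMethods:
    """Sketch: dependencies are imported inside the method that needs them."""

    def clean_tweet(self, tweet):
        # Imported at call time rather than when the module is loaded.
        # Third-party imports (unidecode, nltk) would go here as well.
        import re
        import string

        # Hypothetical cleaning logic: drop URLs and punctuation, then
        # lowercase and collapse whitespace.
        no_urls = re.sub(r"https?://\S+", "", tweet)
        no_punct = no_urls.translate(str.maketrans("", "", string.punctuation))
        return " ".join(no_punct.lower().split())


cleaned = TextMiningMethods().clean_tweet("Hello, World! https://example.com")
print(cleaned)
```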

After that, this line finally worked:

clean_tweet_df.show()

I hope this is useful to someone else.


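Thanks! Your approach worked for me.

To clarify my steps:

I created the udf module, with pyspark_udfs.py in it.

First I made a bash file that zips the UDFs and then runs my script from the top level: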


runner.sh

echo "zipping udfs..."
zip -r udf.zip udf
echo "udfs zipped"

echo "running script..."
/opt/conda/bin/python runner.py
echo "script ended."


In practice, the code imports my UDFs from the udf.pyspark_udfs module and initializes them inside the Python function where I need them, as follows:
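A minimal sketch of that pattern, not the poster's actual code: the archive is registered with addPyFile(), and the UDF is imported and wrapped inside the function that uses it. `add_clean_tweet_column` and its parameters are hypothetical, and `TextMiningMethods`, `clean_tweet`, and the `text` column are borrowed from the question on the assumption that udf/pyspark_udfs.py defines the same class.

```python
def add_clean_tweet_column(spark, df, archive_path="udf.zip"):
    """Return `df` with a clean_tweet column computed by the zipped UDF."""
    # PySpark is imported lazily so this module can be loaded without it.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Ship the archive to the executors before any task needs it.
    spark.sparkContext.addPyFile(archive_path)

    # Assumed layout: udf/pyspark_udfs.py defines TextMiningMethods.
    # The driver imports it from the local udf folder; executors resolve
    # the same module name from the shipped archive.
    from udf.pyspark_udfs import TextMiningMethods

    clean_tweet_udf = F.udf(TextMiningMethods().clean_tweet, StringType())
    return df.withColumn("clean_tweet", clean_tweet_udf(df["text"]))
```

Called as `add_clean_tweet_column(spark, df).show()` from a script started in the directory that contains both udf/ and udf.zip, this follows the same order as runner.sh: zip first, then run.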

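Comments:

What is a "class-based view"?

Does this answer your question?

The link below explains what class-based views are: docs.djangoproject.com/en/3.0/topics/class-based-views

@user10938362 I updated my answer with everything I tried, starting from the link you gave me. It is similar, but not the same case.

I would try a different approach: can you extract the logic from the CBV and make it independent of Django? Your code sample mentions text mining, so I would guess that the core functionality has nothing to do with hosting (and in fact it does not).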