How do I run SQL queries on PySpark using Python?
Hi, I'm new to PySpark and haven't written any PySpark code yet, so I need help running a SQL query on PySpark using Python. Could you show me how to create a DataFrame, then view it and run SQL queries against it? Which modules are needed to run the query? Could you also help me run it? The data comes from a TERR.txt file. The SQL query is:
select a.id as nmitory_id, a.dscrptn as nmitory_desc, a.nm as terr_nm, a.pstn_type, a.parnt_terr as parnt_nm_id, b.nm as parnt_terr_nm, a.start_dt, a.type,
CASE
WHEN substr (a.nm, 1, 6) IN ('105-30',
'105-31',
'105-32',
'105-41',
'105-42',
'105-43',
'200-CD',
'200-CG',
'200-CO',
'200-CP',
'200-CR',
'200-DG'
)
THEN
'JBI'
WHEN substr (a.nm, 1, 6) IN ('100-SC',
'105-05',
'105-06',
'105-07',
'105-08',
'105-13',
'105-71',
'105-72',
'105-73'
)
THEN
'JP'
WHEN substr (a.nm, 1, 6) IN ('103-16')
THEN
'JT'
WHEN substr (a.nm, 1, 6) IN ('105-51',
'200-HA',
'200-HF',
'200-HT',
'105-HT')
THEN
'JSA'
WHEN substr (a.nm, 1, 6) IN ('105-61',
'200-PR')
THEN
'PR'
WHEN substr (a.nm, 1, 3) IN ('302')
THEN
'Canada - MEM'
WHEN substr (a.nm, 1, 3) IN ('301')
THEN
'Canada - MSL'
ELSE
'Unspecified'
END
AS DEPARTMENT,
CASE
WHEN substr (a.nm, 1, 6) IN ('105-06',
'105-07',
'105-08'
)
THEN
'CVM MSL'
WHEN substr (a.nm, 1, 6) IN ('100-SC',
'105-13'
)
THEN
'CVM CSS'
WHEN substr (a.nm, 1, 6) IN ('105-41',
'200-CD'
)
THEN
'Derm MSL'
WHEN substr (a.nm, 1, 6) IN ('105-42',
'200-CG'
)
THEN
'Gastro MSL'
WHEN substr (a.nm, 1, 6) IN ('105-31')
THEN
'Heme Onc MSL'
WHEN substr (a.nm, 1, 6) IN ('200-DG')
THEN
'Imm MD'
WHEN substr (a.nm, 1, 6) IN ('103-16')
THEN
'ID MSL'
WHEN substr (a.nm, 1, 6) IN ('200-CP')
THEN
'Imm Ops'
WHEN substr (a.nm, 1, 6) IN ('105-05',
'105-71',
'105-72',
'105-73'
)
THEN
'Neuro MSL'
WHEN substr (a.nm, 1, 6) IN ('105-30',
'200-CO'
)
THEN
'Onc MSL'
WHEN substr (a.nm, 1, 6) IN ('105-61',
'200-PR'
)
THEN
'Puerto Rico MSL'
WHEN substr (a.nm, 1, 6) IN ('105-43',
'200-CR'
)
THEN
'Rheum MSL'
WHEN substr (a.nm, 1, 6) IN ('105-51',
'200-HF'
)
THEN
'RWVE Field'
WHEN substr (a.nm, 1, 6) IN ('105-32')
THEN
'Solid Tumor MSL'
WHEN substr (a.nm, 1, 6) IN ('200-HT',
'105-HT')
THEN
'RWVE Pop Health'
WHEN substr (a.nm, 1, 6) IN ('301-PC')
THEN
'Canada - PC MSL'
WHEN substr (a.nm, 1, 6) IN ('301-VR')
THEN
'Canada - VR/ONC MSL'
WHEN substr (a.nm, 1, 6) IN ('301-SO')
THEN
'Canada - Hematology (Myeloid) MSL'
WHEN substr (a.nm, 1, 6) IN ('301-ON')
THEN
'Canada - Hematology (Lymphoid) MSL'
WHEN substr (a.nm, 1, 6) IN ('301-IP')
THEN
'Canada - CNS MSL'
WHEN substr (a.nm, 1, 6) IN ('301-RD')
THEN
'Canada - Rheum MSL'
WHEN substr (a.nm, 1, 6) IN ('301-IB')
THEN
'Canada - Gastro MSL'
WHEN substr (a.nm, 1, 6) IN ('301-DE')
THEN
'Canada - Derm MSL'
WHEN substr (a.nm, 1, 6) IN ('301-SE')
THEN
'Canada - Biologics MSL'
WHEN substr (a.nm, 1, 6) IN ('302-PC')
THEN
'Canada - PC MEM'
WHEN substr (a.nm, 1, 6) IN ('302-VR')
THEN
'Canada - VR/ONC MEM'
WHEN substr (a.nm, 1, 6) IN ('302-SO')
THEN
'Canada - Hematology (Myeloid) MEM'
WHEN substr (a.nm, 1, 6) IN ('302-ON')
THEN
'Canada - Hematology (Lymphoid) MEM'
WHEN substr (a.nm, 1, 6) IN ('302-IP')
THEN
'Canada - CNS MEM'
WHEN substr (a.nm, 1, 6) IN ('302-RD')
THEN
'Canada - Rheum MEM'
WHEN substr (a.nm, 1, 6) IN ('302-IB')
THEN
'Canada - Gastro MEM'
WHEN substr (a.nm, 1, 6) IN ('302-DE')
THEN
'Canada - Derm MEM'
WHEN substr (a.nm, 1, 6) IN ('302-SE')
THEN
'Canada - Biologics MEM'
ELSE
'Unspecified'
END
AS FRANCHISE
from outbound.terr a left outer join outbound.terr b on a.parnt_terr = b.id
Save the query in a variable as a string, and assuming you have a SparkSession object, you can run the query against a table with SparkSession.sql:
df.createTempView('TABLE_X')
query = "SELECT * FROM TABLE_X"
df = spark.sql(query)
To read a CSV into Spark:
def read_csv_spark(spark, file_path):
    df = (
        spark.read.format("com.databricks.spark.csv")
        .options(header="true", inferSchema="true")
        .load(file_path)
    )
    return df
df = read_csv_spark(spark, "/path/to/file.csv")
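The `com.databricks.spark.csv` format above is the external spark-csv package that Spark 1.x required; on Spark 2.0+ a CSV source is built in, so the same helper can be written without any extra package (a sketch, same options assumed):

```python
def read_csv_spark(spark, file_path):
    # Spark 2.0+ ships a built-in CSV data source, so no external
    # com.databricks.spark.csv package is needed.
    return (
        spark.read.format("csv")
        .options(header="true", inferSchema="true")
        .load(file_path)
    )
```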
You should create a temporary view and run your query against it. For example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("sample").getOrCreate()
df = spark.read.csv("TERR.txt", header=True, inferSchema=True)  # load() defaults to parquet; use csv for a delimited text file
df.createTempView("example")
df2 = spark.sql("SELECT * FROM example")
Do I need to create the DataFrame first? If so, could you guide me through it?
@abhishek I've updated my answer. Let me know if you run into any issues.
Could you write out the complete code with all module imports?
@abhishek I create the Spark session in my answer. I have to create the DataFrame from the TERR.txt file, then create the view and run the SQL query on top of it.
What is TABLE_X?
@abhishek It's the temporary name we give the DataFrame's SQL table.
Could you write out the entire code with all the modules?
@abhishek Sure, if you think StackOverflow is a code-writing service.