Is there a way to speed up Hive queries with Python?
I am currently using PyHive to connect to a Kerberos-authenticated HiveServer2. I mainly read time-series data for several hundred tags from one table, sampled at 10-second intervals, with the following query:
sql = """
select dt,
       max(case when tag_id = 'tag1' then val end) as tag1,
       max(case when tag_id = 'tag2' then val end) as tag2,
       max(case when tag_id = 'tag3' then val end) as tag3,
       max(case when tag_id = 'tag4' then val end) as tag4,
       .....
       .....
from (
    -- rebuild a timestamp from the separate date/time columns
    select tag_id,
           cast(concat(year_mon_day, ' ', lpad(hour_val, 2, '0'), ':', lpad(minute_val, 2, '0'),
                ':', lpad(second_val, 2, '0')) as timestamp) dt,
           val
    from tag_data
    where cast(year_mon_day as date) >= '2018-09-01'
      and cast(year_mon_day as date) < '2018-09-15'
) sub
group by dt
"""
cur = conn.cursor()
cur.execute(sql)
results = cur.fetchall()  -- pulls the whole pivoted result set into memory
This query takes about 40 minutes to return roughly 100,000 rows with 400+ columns, and that is only 15 days of data. Eventually we will need to pull months or years of data at a time for machine learning.
I would like to know whether there is a way to speed up PyHive queries. I know PyHive can be used through SQLAlchemy, but does the query actually run in parallel according to the available resources? We have fairly good HPC resources at our disposal. How can I make this run faster?
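One workaround (not specific to PyHive, and not an answer from the original post) is to split the date range into smaller windows and issue one query per window from a thread pool, since HiveServer2 can serve several sessions concurrently. A minimal sketch under that assumption; the helper names `split_date_range` and `fetch_window`, and the `sql_template` variable, are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def split_date_range(start, end, days):
    """Split [start, end) into consecutive windows of at most `days` days."""
    windows = []
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        windows.append((cur.isoformat(), nxt.isoformat()))
        cur = nxt
    return windows

def fetch_window(conn, lo, hi):
    # One cursor (i.e. one HiveServer2 operation) per window.
    cur = conn.cursor()
    cur.execute(sql_template.format(lo=lo, hi=hi))  # hypothetical per-window query
    return cur.fetchall()

windows = split_date_range(date(2018, 9, 1), date(2018, 9, 15), days=5)
# Requires a live kerberized connection, so shown commented out:
# with ThreadPoolExecutor(max_workers=4) as pool:
#     parts = pool.map(lambda w: fetch_window(conn, *w), windows)
# results = [row for part in parts for row in part]
```

Note that each thread should use its own cursor, and the per-window results still have to be concatenated and re-sorted by `dt` at the end.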
Should I use PySpark and run the query through Spark SQL instead? However, I can't seem to find how to connect to HiveServer2 with Kerberos from PySpark.
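For reference, PySpark on a kerberized YARN cluster is usually authenticated at submit time rather than in code: `spark-submit` accepts a principal and keytab, and a `SparkSession` with `enableHiveSupport()` then reads the Hive tables directly, bypassing HiveServer2. A hedged sketch; the paths, principal, and script name are placeholders:

```shell
# Submit-time Kerberos credentials; Spark renews the ticket itself.
spark-submit \
  --master yarn \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  extract_tags.py

# extract_tags.py would then build a Hive-enabled session, e.g.:
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   df = spark.sql("select ... from tag_data where ...")
```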