Is there a way to speed up Hive queries with Python?
I am currently using PyHive to connect to a Kerberos-authenticated HiveServer2. I mainly read time-series data for several hundred tags from one table, sampled at 10-second intervals, with the following query:
sql = """
select dt,
       max(case when tag_id = 'tag1' then val end) as tag1,
       max(case when tag_id = 'tag2' then val end) as tag2,
       max(case when tag_id = 'tag3' then val end) as tag3,
       max(case when tag_id = 'tag4' then val end) as tag4,
       .....
       .....
from (
    -- rebuild a timestamp from the separate date/time columns
    select tag_id,
           cast(concat(year_mon_day, ' ', lpad(hour_val, 2, '0'), ':', lpad(minute_val, 2, '0'),
                ':', lpad(second_val, 2, '0')) as timestamp) dt,
           val
    from tag_data
    where cast(year_mon_day as date) >= '2018-09-01'
      and cast(year_mon_day as date) < '2018-09-15'
) sub
group by dt
"""
cur = conn.cursor()
cur.execute(sql)
results = cur.fetchall()  -- pulls the whole pivoted result set into memory
This query takes about 40 minutes to return roughly 100,000 rows with 400+ columns, and that is only 15 days of data. Eventually we will need to pull months or years of data at a time for machine learning.
I would like to know whether there is a way to speed up PyHive queries. I know PyHive can be used through SQLAlchemy, but does the query actually run in parallel according to the available resources? We have fairly good HPC resources at our disposal. How can I make this run faster?
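One workaround (not specific to PyHive, and not an answer from the original post) is to split the date range into smaller windows and issue one query per window from a thread pool, since HiveServer2 can serve several sessions concurrently. A minimal sketch under that assumption; the helper names `split_date_range` and `fetch_window`, and the `sql_template` variable, are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def split_date_range(start, end, days):
    """Split [start, end) into consecutive windows of at most `days` days."""
    windows = []
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        windows.append((cur.isoformat(), nxt.isoformat()))
        cur = nxt
    return windows

def fetch_window(conn, lo, hi):
    # One cursor (i.e. one HiveServer2 operation) per window.
    cur = conn.cursor()
    cur.execute(sql_template.format(lo=lo, hi=hi))  # hypothetical per-window query
    return cur.fetchall()

windows = split_date_range(date(2018, 9, 1), date(2018, 9, 15), days=5)
# Requires a live kerberized connection, so shown commented out:
# with ThreadPoolExecutor(max_workers=4) as pool:
#     parts = pool.map(lambda w: fetch_window(conn, *w), windows)
# results = [row for part in parts for row in part]
```

Note that each thread should use its own cursor, and the per-window results still have to be concatenated and re-sorted by `dt` at the end.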
Should I use PySpark and run the query through Spark SQL instead? However, I can't seem to find how to connect to HiveServer2 with Kerberos from PySpark.
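For reference, PySpark on a kerberized YARN cluster is usually authenticated at submit time rather than in code: `spark-submit` accepts a principal and keytab, and a `SparkSession` with `enableHiveSupport()` then reads the Hive tables directly, bypassing HiveServer2. A hedged sketch; the paths, principal, and script name are placeholders:

```shell
# Submit-time Kerberos credentials; Spark renews the ticket itself.
spark-submit \
  --master yarn \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  extract_tags.py

# extract_tags.py would then build a Hive-enabled session, e.g.:
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   df = spark.sql("select ... from tag_data where ...")
```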