SQL: How to join on the most recent record?
Tags: sql, performance, sas
I have two tables. Table A contains daily information on corporate bond transactions from 2004 to 2012, and Table B contains bond rating information as of specific dates. I need to join the two tables so that every transaction in Table A carries the most recent rating for that bond.
Table A: daily_transactions
--------------------------------------------
DATE |BOND |PRICE
--------------------------------------------
20110401 |AES |100
20110402 |AES |101
20110403 |AES |102
20110404 |AES |103
20110401 |BPP |99
20110402 |BPP |98
Table B: bond_ratings
--------------------------------------------
DATE |BOND |RATING
--------------------------------------------
20110401 |AES |AAA
20110403 |AES |BB
20110401 |BPP |CCC
Table C: joined_data
--------------------------------------------
DATE |BOND |PRICE |RATING
--------------------------------------------
20110401 |AES |100 |AAA
20110402 |AES |101 |AAA
20110403 |AES |102 |BB
20110404 |AES |103 |BB
20110401 |BPP |99 |CCC
20110402 |BPP |98 |CCC
Table A has about 1,000,000 records and Table B has about 14,000 records.
Update:
Here is what I have so far:
create table test_merge as
SELECT a.date, b.date, a.bond, a.price, b.rating
FROM daily_transactions a
LEFT JOIN bond_ratings b ON a.bond = b.bond AND b.date <= a.date
WHERE NOT EXISTS (
SELECT 1 FROM bond_ratings b1
WHERE b1.bond = a.bond
AND b1.date <= a.date
AND b1.date > b.date
);
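It seems to work fine, but because of the amount of data I have, it runs very slowly: it takes about 2 hours. Is there any way to optimize it so that it runs faster?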
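I am new to SQL and would greatly appreciate your help.

I suspect that in your case the subquery is what is killing performance. The approach below avoids the subquery, which makes the join more efficient.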
/*sample data:*/
DATA daily_transactions;
input date bond $ price;
informat date yymmdd8.;
format date yymmddn8.;
infile datalines dsd delimiter = '|';
datalines;
20110401|AES|100
20110402|AES|101
20110403|AES|102
20110404|AES|103
20110401|BPP|99
20110402|BPP|98
;
run;
DATA bond_ratings;
input date bond $ rating $;
informat date yymmdd8.;
format date yymmddn8.;
infile datalines dsd delimiter = '|';
datalines;
20110401|AES |AAA
20110403|AES |BB
20110401|BPP |CCC
;
run;
/*Modify the bond_ratings dataset such that for each record we can specify up till when that rating is valid*/
/*essentially we will have two date fields (from_date, to_date)
from_date bond rating to_date
20110401 AES AAA 20110402
20110403 AES BB .
20110401 BPP CCC .
*/
/*since there is no LEAD function in SAS, we sort in descending order by date and apply the LAG function - in effect getting the leading value*/
PROC SORT DATA = bond_ratings OUT = bond_ratings_sorted;
by bond descending date;
run;
/*capture the to_date by using lag function on the date.*/
data bond_ratings_lookup(rename = (date=from_date));
set bond_ratings_sorted;
by bond descending date;
format to_date yymmddn8.;
lag_date = lag(date);/*note: LAG is called outside the if-else group below because it returns values from a queue that is updated only when LAG executes, so calling it conditionally would return the wrong value*/
if first.bond and first.date then to_date =.;
else to_date=lag_date-1;/*-1, so that to_date is 1 day before the next available bond rating date*/
drop lag_date;
run;
/*this sort is not necessary, but it is useful if you just want to verify the output*/
proc sort data = bond_ratings_lookup out = bond_ratings_lookup_sorted;
by bond from_date;
run;
/*final query:*/
proc sql;
create table joined as
select a.*, b.rating, b.from_date as bond_rating_start_period, b.to_date as bond_rating_end_period
from daily_transactions as a
left join bond_ratings_lookup_sorted as b
on a.bond = b.bond and
(
b.to_date ne . and (a.date >=b.from_date and a.date<= b.to_date )
or
b.to_date = . and (a.date >=b.from_date )
)
order by a.bond, a.date, b.from_date
;
quit;
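The idea behind this answer (give every rating a validity interval, then match each transaction to the interval that contains its date) can be checked on the sample data outside SAS. The following is a minimal Python sketch, not a translation of the SAS code: the function names `build_intervals` and `join_latest` are made up for illustration, dates are kept as plain YYYYMMDD integers, and the upper bound is the next rating date (exclusive) instead of "next date minus one day", which avoids date arithmetic.

```python
from collections import defaultdict

transactions = [
    (20110401, "AES", 100), (20110402, "AES", 101),
    (20110403, "AES", 102), (20110404, "AES", 103),
    (20110401, "BPP", 99), (20110402, "BPP", 98),
]
ratings = [
    (20110401, "AES", "AAA"), (20110403, "AES", "BB"),
    (20110401, "BPP", "CCC"),
]

def build_intervals(ratings):
    """Return {bond: [(from_date, next_date_or_None, rating), ...]}."""
    by_bond = defaultdict(list)
    for date, bond, rating in sorted(ratings, key=lambda r: (r[1], r[0])):
        by_bond[bond].append((date, rating))
    intervals = {}
    for bond, rows in by_bond.items():
        out = []
        for i, (date, rating) in enumerate(rows):
            # A rating stays valid until the next rating date (exclusive);
            # None marks the open-ended, most recent rating.
            next_date = rows[i + 1][0] if i + 1 < len(rows) else None
            out.append((date, next_date, rating))
        intervals[bond] = out
    return intervals

def join_latest(transactions, ratings):
    """Left join: every transaction is kept, rating is None if none applies."""
    intervals = build_intervals(ratings)
    joined = []
    for date, bond, price in transactions:
        rating = None
        for start, next_date, label in intervals.get(bond, []):
            if start <= date and (next_date is None or date < next_date):
                rating = label
                break
        joined.append((date, bond, price, rating))
    return joined

joined = join_latest(transactions, ratings)
for row in joined:
    print(row)
```

On the sample data this should reproduce Table C from the question, which makes it a convenient way to sanity-check the interval logic before running it on the full tables.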
I cut the run time down to 5 minutes by creating an index on the bond column.
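To experiment with the effect of an index outside SAS, here is a minimal sketch that uses Python's built-in `sqlite3` module, not PROC SQL. Several things in it are assumptions for illustration only: the index covers both bond and date (a composite index, one step beyond the single-column index described above), the index name `idx_ratings_bond_date` is invented, and the `ORDER BY ... LIMIT 1` subquery is SQLite syntax that may not be available in other SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_transactions (date INTEGER, bond TEXT, price INTEGER);
    CREATE TABLE bond_ratings (date INTEGER, bond TEXT, rating TEXT);
    -- Composite index: each lookup can seek by bond, then by date.
    CREATE INDEX idx_ratings_bond_date ON bond_ratings (bond, date);
""")
conn.executemany(
    "INSERT INTO daily_transactions VALUES (?, ?, ?)",
    [(20110401, "AES", 100), (20110402, "AES", 101),
     (20110403, "AES", 102), (20110404, "AES", 103),
     (20110401, "BPP", 99), (20110402, "BPP", 98)],
)
conn.executemany(
    "INSERT INTO bond_ratings VALUES (?, ?, ?)",
    [(20110401, "AES", "AAA"), (20110403, "AES", "BB"),
     (20110401, "BPP", "CCC")],
)

# For each transaction, pick the rating with the latest date that is
# not after the transaction date.
rows = conn.execute("""
    SELECT a.date, a.bond, a.price,
           (SELECT b.rating
              FROM bond_ratings b
             WHERE b.bond = a.bond AND b.date <= a.date
             ORDER BY b.date DESC
             LIMIT 1) AS rating
      FROM daily_transactions a
     ORDER BY a.bond, a.date
""").fetchall()

for row in rows:
    print(row)
```

Whether a composite index helps more than a single-column one depends on the engine and the data, so it is worth timing both on the real tables.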
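For an approach that is more SAS-based than SQL-based, you could build a SAS format from Table B, which may speed things up. A format is just a lookup table that maps anything between START and END to a LABEL. For example, load this table as a format: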
fmtname | START | END | LABEL
-----------------------------------------------------------
$bondRate | AES20110401 | AES20110403 | AAA
This maps any text string between START and END to the label, so AES20110402 -> AAA.
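The range lookup that such a format performs can be imitated in a few lines of Python, which may help when reasoning about how the keys behave. This is only a sketch of the idea, not how PROC FORMAT is implemented: the `lookup` function and the `other` fallback argument are invented names, and the END values below are assumed to be one day before the next rating date so that the ranges do not overlap.

```python
from bisect import bisect_right

# (START, END, LABEL) rows, analogous to the rows of a CNTLIN data set.
ranges = sorted([
    ("AES20110401", "AES20110402", "AAA"),
    ("AES20110403", "AES20991231", "BB"),
    ("BPP20110401", "BPP20991231", "CCC"),
])
starts = [r[0] for r in ranges]

def lookup(key, other=None):
    """Return the label whose [START, END] range contains key, else `other`."""
    i = bisect_right(starts, key) - 1  # last range that starts at or before key
    if i >= 0 and key <= ranges[i][1]:
        return ranges[i][2]
    return other

print(lookup("AES20110402"))
print(lookup("AES20110404"))
print(lookup("XYZ20110401", other="UNRATED"))
```

Because the key is bond name plus date as one string, plain string comparison is enough to find the right range, and the `other` argument plays the role of a catch-all entry for keys that fall in no range.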
Here is the full code, using Table B above. It assumes DATE is a numeric field; if it is not, use input(DATE, yymmdd8.) to convert it to a number first:
PROC SORT DATA = TABLE_B;
by bond descending date;
run;
/*Use lag function to get the start and end date on one line*/
data bond_ratings_fmt;
set TABLE_B;
by bond descending date;
START_DT = put(date,yymmddn8.);*Character date like '20110401' (assumes DATE is a SAS date value);
END_DT = put(lag(date)-1,yymmddn8.);* 1 day before the start of the next rating;
*first.bond is the most recent rating for each bond;
*setting the END_DT to some future date in this case.;
if first.bond then END_DT= '20991231';
START = cats(BOND,START_DT);*Cats concatenates and trims spaces, makes AES20110401;
END = cats(BOND,END_DT);
LABEL = Rating;
fmtName='$bondRate';
run;
*Load the format, using CNTLIN (Control Table In);
proc format cntlin=bond_ratings_fmt;
run;
*Apply the format;
data TableC_withRating (drop=_:);
set TableA;
_DateChar = put(DATE,yymmddn8.);
Rating = put(cats(BOND,_DateChar),$bondRate.);*cats trims the trailing blanks of BOND so the key matches START/END;
run;
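You can get fancier by adding an "other" case to the format; there are many good examples online of using CNTLIN with PROC FORMAT.

Comments:
Suggestions on what query? It would be great to know what you have tried.
@LearningNeverStops, please see the updated question. Thanks.
@Danielfries what is the database?
@hashbrown. As mentioned, I am new to databases and SQL, but the query is executed in SAS.
Good stuff. I don't think the subquery is helping. Creating a composite index on bond and date might work better.
Thanks @sashikanthdardy. I saw your answer after I had found this solution. Thank you for your input and time.
Is there a way to define START and END automatically, so that START is set to the rating date and END to the date of the next rating?
Yes, edited to add more code. The first step is very similar to Sashikanth Dareddy's code.
I definitely recommend an "other" record, hlo='o'. Every format should have an "other" record unless you explicitly want other incoming values to pass through unchanged.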