Stata spell data management: months spent in a given state in the past 24 months


I am working with a spell dataset in the following format:

    clear all

input persid    start   end t_start t_end   spell_type  year    spell_number    event
    1   8   9   44  45  1   1999    1   0
    1   12  12  60  60  1   2000    1   0
    1   1   1   61  61  1   2001    1   0
    1   7   11  67  71  1   2001    2   0
    1   1   4   85  88  2   2003    1   0
    1   5   7   89  91  1   2003    2   1
    1   8   11  92  95  2   2003    3   0
    1   1   1   97  97  2   2004    1   0
    1   1   3   121 123 1   2006    1   1
    1   4   5   124 125 2   2006    2   0
    1   6   9   126 129 1   2006    3   1
    1   10  11  130 131 2   2006    4   0
    1   12  12  132 132 1   2006    5   1
    1   1   12  157 168 1   2009    1   0
    1   1   12  169 180 1   2010    1   0
    1   1   12  181 192 1   2011    1   0
    1   1   12  193 204 1   2012    1   0
    1   1   12  205 216 1   2013    1   0
end

lab define lab_spelltype 1 "unemployment spell" 2 "employment spell"
lab val spell_type lab_spelltype

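where `persid` is the person's id; `start` and `end` are the months in which the yearly unemployment/employment spell starts and ends, respectively; `t_start` and `t_end` are the same measures, but counted from 1 January 1996; and `event` equals 1 for employment entries whose previous row is an unemployment spell.

The data are such that no spells overlap within a given year, and consecutive spells of the same type within a year have been merged.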

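My goal is, for each row where `event` is 1, to compute the number of months spent in employment in the past 6 and 24 months. In this specific example, what I would like to obtain is: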

clear all
input persid    start   end t_start t_end   spell_type  year    spell_number    event   empl_6  empl_24
    1   8   9   44  45  1   1999    1   0   .   .
    1   12  12  60  60  1   2000    1   0   .   .
    1   1   1   61  61  1   2001    1   0   .   .
    1   7   11  67  71  1   2001    2   0   .   .
    1   1   4   85  88  2   2003    1   0   .   .
    1   5   7   89  91  1   2003    2   1   0   5
    1   8   11  92  95  2   2003    3   0   .   .
    1   1   1   97  97  2   2004    1   0   .   .
    1   1   3   121 123 1   2006    1   1   0   0
    1   4   5   124 125 2   2006    2   0   .   .
    1   6   9   126 129 1   2006    3   1   3   3
    1   10  11  130 131 2   2006    4   0   .   .
    1   12  12  132 132 1   2006    5   1   4   7
    1   1   12  157 168 1   2009    1   0   .   .
    1   1   12  169 180 1   2010    1   0   .   .
    1   1   12  181 192 1   2011    1   0   .   .
    1   1   12  193 204 1   2012    1   0   .   .
    1   1   12  205 216 1   2013    1   0   .   .
end

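So the idea is that I have to go back to the rows preceding each `event==1` entry and count the number of months the individual spent in employment.

Can you suggest a way to obtain this final result? Some have suggested expanding the dataset, but perhaps there is a better way to tackle the problem (especially since the dataset is quite large).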

EDIT

The correct labels for the employment status are:

lab define lab_spelltype 1 "employment spell" 2 "unemployment spell"

With these labels, the past months spent in employment (`empl_6` and `empl_24`) and the definition of `event` are now correct.

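The posted example is not very useful for developing and testing a solution, so I made up fake data with the same properties. Using 1 and 2 as values for an indicator is bad practice, so I replaced it with an `employed` indicator that equals 1 if employed and 0 otherwise. Separate month and year variables are not useful either, so I use Stata monthly dates.

The first solution uses `tsegen` (from SSC) after expanding each spell to one observation per month. With panel data, you just sum the employment indicator over the desired time window.

The second solution uses `rangestat` (also from SSC) and performs the same computation without expanding the data. The idea is simple: add up the durations of previous employment spells whose end falls within the desired window. Of course, if a spell ends inside the window but starts before it, the months that fall outside the window must be subtracted.
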
* fake data for 100 persons, up to 10 spells with no overlap
clear
set seed 123423
set obs 100
gen long persid = _n
gen spell_start = ym(runiformint(1990,2013),1)
expand runiformint(1,10)
bysort persid: gen spellid = _n
by persid: gen employed = runiformint(0,1)
by persid: gen spell_avg = int((ym(2015,12) - spell_start) / _N) + 1
by persid: replace spell_start = spell_start[_n-1] + ///
    runiformint(1,spell_avg) if _n > 1
by persid: gen spell_end = runiformint(spell_start, spell_start[_n+1]-1)
replace spell_end = spell_start + runiformint(1,12) if mi(spell_end)
format %tm spell_start spell_end

* an event is an employment spell that immediately follows an unemployment spell
by persid: gen event = employed & employed[_n-1] == 0

* expand to one obs per month and declare as panel data
expand spell_end - spell_start + 1
bysort persid spellid: gen ym = spell_start + _n - 1
format %tm ym
tsset persid ym

* only count employment months; limit results to first month event obs
tsegen m6 = rowtotal(L(1/6).employed)
tsegen m24 = rowtotal(L(1/24).employed)
bysort persid spellid (ym): replace m6 = . if _n > 1 | !event
bysort persid spellid (ym): replace m24 = . if _n > 1 | !event

* --------- redo using rangestat, without any monthly expansion ----------------

* return to original obs but keep first month results
bysort persid spellid: keep if _n == 1

* employment end and duration for employed observations only
gen e_end = spell_end if employed
gen e_len = spell_end - spell_start + 1 if employed

foreach target in 6 24 {

    // define interval bounds but only for event observations
    // an out-of-sample [0,0] interval will yield no results for non-events
    gen low`target' = cond(event, spell_start-`target', 0)
    gen high`target' = cond(event, spell_start-1, 0)

    // sum employment lengths and save earliest employment spell info
    rangestat (sum) empl`target'=e_len ///
        (firstnm) firste`target'=e_end firste`target'len=e_len, ///
        by(persid) interval(spell_end low`target' high`target')

    // remove from the count months that occur before lower bound
    gen e_start = firste`target' - firste`target'len + 1
    gen outside = low`target' - e_start
    gen empl`target'final = cond(outside > 0, empl`target'-outside, empl`target')
    replace empl`target'final = 0 if mi(empl`target'final) & event
    drop e_start outside
}

* confirm that we match the -tsegen- results
assert m24 == empl24final
assert m6 == empl6final
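
The overlap idea behind the second solution can also be checked directly on the question's example. What follows is only a minimal sketch, not the method used above: it assumes the corrected coding (`spell_type == 1` is employment), uses only built-in commands, and pairs each event with every employment spell of the same person through `joinby`, which is practical only when each person has a modest number of spells. The names `e_start`, `e_end`, `ov6`, and `ov24` are illustrative.

```stata
* sketch: months employed in the 6 and 24 months before each event,
* computed as interval overlap on the question's example (1 = employment)
clear
input persid t_start t_end spell_type event
1  44  45 1 0
1  60  60 1 0
1  61  61 1 0
1  67  71 1 0
1  85  88 2 0
1  89  91 1 1
1  92  95 2 0
1  97  97 2 0
1 121 123 1 1
1 124 125 2 0
1 126 129 1 1
1 130 131 2 0
1 132 132 1 1
1 157 168 1 0
1 169 180 1 0
1 181 192 1 0
1 193 204 1 0
1 205 216 1 0
end

* save the employment spells under new (illustrative) names
preserve
keep if spell_type == 1
keep persid t_start t_end
rename (t_start t_end) (e_start e_end)
tempfile emp
save `emp'
restore

* pair every event row with every employment spell of the same person
keep if event == 1
joinby persid using `emp'

* overlap of [e_start, e_end] with the window [t_start - w, t_start - 1]
foreach w in 6 24 {
    gen ov`w' = max(0, min(e_end, t_start - 1) - max(e_start, t_start - `w') + 1)
}
collapse (sum) empl_6 = ov6 empl_24 = ov24, by(persid t_start)
list

* compare with the values reported in the question
assert empl_6 == 0 & empl_24 == 5 if t_start == 89
assert empl_6 == 0 & empl_24 == 0 if t_start == 121
assert empl_6 == 3 & empl_24 == 3 if t_start == 126
assert empl_6 == 4 & empl_24 == 7 if t_start == 132
```

Clipping each spell to the window with `min()` and `max()` handles spells that start before the window in one step, so no separate correction for months outside the window is needed. The pairwise join grows with the square of the number of spells per person, which is why `rangestat` is likely the better fit for very large data.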


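The way I solved this is to:

1. expand the data so that it is monthly,
2. fill in the gap months with `tsfill`, and finally
3. use `sum()` and lag operators to count the past months in employment.

See Robert's solution for some of the ideas I borrowed.

Important: this is almost certainly not an efficient way to solve the problem, especially if the data are large (as in my case). The upside, however, is that one actually "sees" what is happening in the background, which helps confirm that the final result is the desired one.

Equally important, this solution handles the case where two (or more) events occur within 6 (or 24) months of each other.
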
clear all

input persid    start   end t_start t_end   spell_type  year    spell_number    event
    1   8   9   44  45  1   1999    1   0
    1   12  12  60  60  1   2000    1   0
    1   1   1   61  61  1   2001    1   0
    1   7   11  67  71  1   2001    2   0
    1   1   4   85  88  2   2003    1   0
    1   5   7   89  91  1   2003    2   1
    1   8   11  92  95  2   2003    3   0
    1   1   1   97  97  2   2004    1   0
    1   1   3   121 123 1   2006    1   1
    1   4   5   124 125 2   2006    2   0
    1   6   9   126 129 1   2006    3   1
    1   10  11  130 131 2   2006    4   0
    1   12  12  132 132 1   2006    5   1
    1   1   12  157 168 1   2009    1   0
    1   1   12  169 180 1   2010    1   0
    1   1   12  181 192 1   2011    1   0
    1   1   12  193 204 1   2012    1   0
    1   1   12  205 216 1   2013    1   0
end

lab define lab_spelltype 1 "employment" 2 "unemployment"
lab val spell_type lab_spelltype
list

* generate Stata monthly dates
gen spell_start = ym(year,start)
gen spell_end = ym(year,end)
format %tm spell_start spell_end
list

* expand to monthly data
gen n = spell_end - spell_start + 1
expand n, gen(expanded)
sort persid year spell_number (expanded)
bysort persid year spell_number: gen month = spell_start + _n - 1
by persid year spell_number: replace event = 0 if _n > 1
format %tm month

* xtset, fill months gaps with "empty" rows, use lags and cumsum to count past months in employment
xtset persid month, monthly // %tm format
tsfill
bysort persid (month): gen cumsum = sum(spell_type) if spell_type==1
bysort persid (month): replace cumsum = cumsum[_n-1] if cumsum==.
bysort persid (month): gen m6  = cumsum-1 - L7.cumsum if event==1  // "-1" otherwise it sums also current empl month
bysort persid (month): gen m24 = cumsum-1 - L25.cumsum if event==1
drop if event==.
list persid start end year m* if event
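
The running-sum trick used above can be seen in isolation on a toy series. This is a minimal sketch with made-up data for a single person; the variable names `t`, `employed`, `cum`, and `past3` are illustrative, and a 3-month window stands in for the 6- and 24-month ones.

```stata
* toy monthly series for one person; names are illustrative
clear
set obs 12
gen t = _n
gen byte employed = inlist(t, 2, 3, 4, 8, 9)
tsset t

* running count of employed months up to and including month t
gen cum = sum(employed)

* employed months in the 3 months before t, that is t-3 .. t-1
gen past3 = L1.cum - L4.cum

list t employed cum past3

assert past3 == 3 if t == 5
assert past3 == 2 if t == 10
assert mi(past3) if t <= 4
```

In an employed month, `cum - 1` equals `L1.cum`, so `cumsum-1 - L7.cumsum` in the code above is the same difference taken over a 6-month window. The result is missing whenever the window reaches back before the first observation, as the last `assert` shows.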


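Thanks for your reply. Unfortunately, there was a typo in my question: 1 is employment, not unemployment; otherwise neither the event of interest nor the (correct) past months in employment that I reported at the end of the question would make sense. I have edited the question and proposed a solution partly based on your answer. Also, I think that if a new "event" occurs within 6 (24) months of the previous one, this is not taken into account when computing the past months in employment.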