Creating a dataset of all possible pairs of identifiers within each group in Stata


My dataset looks like this:

country1   country2      group
China      Philippines   68a
China      Thailand      68a
Bahamas    Jamaica       176a
Bahamas    Grenada       176a

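I think the problem arises because your unique identifier is actually the combination of two columns (country1 and country2), whereas the example you are following has a single unique id column. If your dataset is not incredibly large, here is how I would do it using your example: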

clear
input str40(country1 country2 group)
"China" "Philippines"   "68a"
"China" "Thailand"  "68a"
"Bahamas"   "Jamaica"   "176a"
"Bahamas"   "Grenada"   "176a"
end

egen pair_id = group(country1 country2) // Create unique pair id
reshape long country, i(group pair_id) j(j) // reshape all countries long
drop pair_id j
rename country country1

* create a duplicate dataset for the full join
preserve
    rename country1 country2 // the variable is country1 here; do not rely on abbreviation
    keep country2 group
    tempfile cross
    save `cross', replace
restore

joinby group using `cross' // full join
drop if country1 == country2

* Some tidying to match example output
order country1 country2 group
gsort -group country1 country2
duplicates drop
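
As a quick check on the approach above, the following sketch rebuilds the example data, forms the pairs, and asserts that each group with n distinct countries ends up with n*(n-1) ordered pairs. It differs slightly from the code above in that it removes duplicate countries before the join rather than after. The names n and members are introduced here purely for illustration and are not from the original post.

```stata
* Sketch: rebuild the example, form the pairs, and check the row counts
clear
input str40(country1 country2 group)
"China"   "Philippines" "68a"
"China"   "Thailand"    "68a"
"Bahamas" "Jamaica"     "176a"
"Bahamas" "Grenada"     "176a"
end

* Stack both country columns into one and keep each country once per group
egen pair_id = group(country1 country2)
reshape long country, i(group pair_id) j(j)
keep group country
duplicates drop

* n = number of distinct countries in the group (illustrative helper variable)
bysort group: gen n = _N

* Self-join within group, then remove self-pairs
rename country country1
tempfile members
preserve
    rename country1 country2
    drop n
    save `members'
restore
joinby group using `members'
drop if country1 == country2

* Each group should now hold exactly n*(n-1) ordered pairs
bysort group: assert _N == n * (n - 1)
drop n
```

With the four example rows, each group contains three distinct countries, so the assertion expects six ordered pairs per group. If only unordered pairs are wanted, replacing the drop condition with country1 >= country2 would keep each pair once.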

I was able to generate the dataset I want with the code below, but I am hoping to find a more direct way to code it:

use "original dataset", clear
drop country2
save "temp_country1", replace

use "original dataset", clear
drop country1
ren country2 country1 
append using "temp_country1"

//drop duplicates//
sort number country1
quietly by number country1:  gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
save "temp_country1_final", replace

use "temp_country1_final", clear
ren country1 country2
save "temp_country2.dta", replace

use "temp_country1_final", clear
joinby number using "temp_country2.dta"
order country1 country2  number name
drop if country1==country2

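The last command, drop if country1=="country2", looks very wrong to me. The double quotes make no sense; the intent clearly calls for drop if country1==country2. That is trivial, but what else did you edit casually? More importantly, @JR96 has already posted a solution, so a good answer would compare yours with theirs and add some commentary.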