Regex: How can I extract only the uppercase part of a string in Stata?

Here is a sample of the data:

part1
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
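Each string either has an uppercase part and a lowercase part, or is entirely uppercase. I have been trying to use regular expressions to extract only the uppercase part of each string, without much luck. The best I have managed is to identify when a string starts and ends with a certain number of uppercase characters: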

generate title = regexs(0) if regexm(part1, "^[A-Z][A-Z][A-Z].*[A-Z][A-Z][A-Z]$")
I also tried the following, which I took from another question on the forum:

generate title = regexs(0) if(regexm(part1, "\b[A-Z]{2,}\b"))

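It is supposed to find words made up of at least two consecutive uppercase letters, but it returns only missing values for me. I am using Stata 13.1 for Mac.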

As @Stribizev points out, negation may be one way to do it:

clear
set more off

input ///
str70 myvar
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"A TOWN MOVES"
"U.S. SAVINGS BONDS RALLY"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Odd Bits In To-day's News"
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING"
"Risk Death in Daring Race"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
"806 DECORATIONS"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end

gen title = trim(regexs(2)) if regexm(myvar, "([,.]*)([^a-z]*$)")

list title
The result is:

. list title

     +-----------------------------------------------+
     |                                         title |
     |-----------------------------------------------|
  1. |                           TEST MODEL SEADROME |
  2. |                            L.B. MAYER HONORED |
  3. |                                  A TOWN MOVES |
  4. |                      U.S. SAVINGS BONDS RALLY |
  5. |             N.D. NOSES OUT S.M.U. BY 27 TO 20 |
     |-----------------------------------------------|
  6. |                          BURN 2,300 SQUEALERS |
  7. |                                               |
  8. | N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                               |
 10. |                     PA. IT'S HIGHER EDUCATION |
     |-----------------------------------------------|
 11. |                               806 DECORATIONS |
 12. |                                               |
 13. |                    F.D.R. ASKS VICTORY EFFORT |
     +-----------------------------------------------+
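I think this is close to what you want, but it is not perfect. If the strings have no regular structure, it is hard to think of a simple way to clean them. For example, compare the input and output for observations 6 and 10: in both, the state abbreviation sits between the city and the title, but "Pa." is dropped while "PA." is kept because it contains no lowercase letter.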


If you have a database of titles, you could compare and match against it after the initial cleaning. See, for example, ssc describe strgroup.

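The implication of this question seems to be that you want a regular expression specification to extract all instances. However reasonable that may be, it is not how regular expressions work in Stata. You need to loop over the instances. moss (ssc install moss) does this; it is its main purpose. (The allusion to gathering moss is a feeble pun typical of the second-named author of that program, if he is reading this.)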


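I assume you want spaces between the results; otherwise they would be hard to read. No punctuation is allowed for between the uppercase letters; if you need it, modify the regular expression accordingly.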
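A minimal sketch of this looping approach, assuming moss is installed from SSC and the string variable is named part1 as in the question. The two-letter minimum in the pattern and the variable name wanted are illustrative choices, not taken from the original post:

```stata
* assumes -moss- has been installed: ssc install moss
* find every run of two or more uppercase letters in part1;
* with the regex option, the text to extract must be in parentheses
moss part1, match("([A-Z][A-Z]+)") regex

* -moss- creates _count plus _match1, _match2, ... and _pos1, _pos2, ...
* join the matches with single spaces (wanted is an illustrative name)
egen wanted = concat(_match*), punct(" ")
replace wanted = trim(wanted)

list part1 wanted
```

With this pattern, initials such as "L.B." and numbers such as "806" are not picked up, because periods and digits are not part of the character class.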

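I can't think of a single rule that would cleanly parse this type of data in one command. Usually the best strategy is to target the simple cases first and then move on to the harder ones, until diminishing returns make further effort unattractive.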

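When using regular expressions, especially with a large number of observations, it is important to watch for unintended matches. I use listsome (from SSC) for this kind of work.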

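It looks like part1 often starts with a city name followed by a state name or abbreviation. Here is code that handles the simple cases and the city/state cases: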

clear
input str60 part1
"Cambridge, Maryland TEST MODEL SEADROME" 
"L.B. MAYER HONORED" 
"A TOWN MOVES" 
"U.S. SAVINGS BONDS RALLY" 
"N.D. NOSES OUT S.M.U. BY 27 TO 20" 
"Philadelphia, Pa. BURN 2,300 SQUEALERS" 
"Odd Bits In To-day's News" 
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPEN" 
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 
end

* take care of the easy cases where there are no lowercase letters
gen title = part1 if !regexm(part1,"[a-z]")

* this type of string work is easier if text is aligned to the left
leftalign   // (from SSC)

* target cases of City, State at the start of part1.
* with complex patterns, it's easy to miss unintended matches when
* lots of obs are involved, so use -listsome- (from SSC) to track changes
gen title0 = title
replace title = trim(regexs(3)) if regexm(part1,"^([A-Z][a-z ]*)+, ([A-Z][a-z]*\.?)+([^a-z]+$)")
listsome if title != title0

list part1 title

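Not sure what you want: to get the segments that are entirely uppercase? Try ^[^a-z]+$. However, negated classes may not be supported. If that does not work, you will have to try a workaround such as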
^[A-Z][0-9A-Z~`@$%^&*(\)\+'=\]\[{}\\\\\\\''”;:/?,。>
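As a quick check of the simpler pattern: regexm() does accept negated character classes (the class [^a-z] is used in other code on this page), so the workaround may not be needed. A minimal sketch, assuming the variable is named part1 as in the question; allcaps is a hypothetical variable name:

```stata
* allcaps is an illustrative variable name, not from the original post
* keep the whole string only when it contains no lowercase letter at all
gen allcaps = part1 if regexm(part1, "^[^a-z]+$")

list part1 allcaps
```

Because the pattern is anchored at both ends, it only picks out strings that are uppercase throughout; mixed-case rows such as "Cambridge, Maryland TEST MODEL SEADROME" are left missing and need a separate step.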