Python: align words to form a bracketed string


Suppose I have the following bracketed string, which includes punctuation:

s = "(S (NP-SBJ (NP (NP (NNP Ambassador) (NNP Paul) (NNP Nitze) (POS 's)) (NN statement)) (PRN (-LRB- -LRB-) (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (, ,) (NP-TMP (NNP Sept.) (CD 20))) (-RRB- -RRB-) (, ,) (`` ``)) (S (SBAR-ADV (IN If) (S (NP-SBJ (PRP you)) (VP (VBP have) (NP (NP (DT a) (CD million) (NNS people)) (VP (VBG working) (PP (IN for) (NP (PRP you)))))))) (, ,) (NP-SBJ (NP (DT every) (JJ bad) (NN thing)) (SBAR (WHNP-1 (WDT that)) (S (VP (VBZ has) (NP (NP (CD one) (NN chance)) (PP (IN in) (NP (DT a) (CD million))) (PP (IN of) (S-NOM (VP (VBG going) (ADVP-CLR (NN wrong)))))))))) (VP (MD will) (VP (VB go) (ADVP-CLR (JJ wrong)) (ADVP-TMP (ADVP (IN at) (JJS least)) (IN once) (NP-ADV (DT a) (NN year)))))) (, ,) ('' '')) (VP (VBZ is) (NP-PRD (NP (DT a) (ADJP (RB pretty) (JJ negative)) (NN way)) (PP (IN of) (S-NOM (VP (VBG looking) (PP-CLR (IN at) (NP (NNS things)))))))) (. .))"
I also have a reference list of the punctuation I need to remove:

punctuation_words = ['.', ',', ':', '-LRB-', '-RRB-', '\'\'', '``', '--', ';',
                     '-', '?', '!', '...', '-LCB-', '-RCB-']
currency_tags_words = ['#', '$', 'C$', 'A$', 'US$']
filterwords = punctuation_words + currency_tags_words
I would like to get output like this:

out = "(S (NP-SBJ (NP (NP (NNP Ambassador) (NNP Paul) (NNP Nitze) (POS 's)) (NN statement)) (PRN (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (NP-TMP (NNP Sept.) (CD 20)))) (S (SBAR-ADV (IN If) (S (NP-SBJ (PRP you)) (VP (VBP have) (NP (NP (DT a) (CD million) (NNS people)) (VP (VBG working) (PP (IN for) (NP (PRP you)))))))) (NP-SBJ (NP (DT every) (JJ bad) (NN thing)) (SBAR (WHNP-1 (WDT that)) (S (VP (VBZ has) (NP (NP (CD one) (NN chance)) (PP (IN in) (NP (DT a) (CD million))) (PP (IN of) (S-NOM (VP (VBG going) (ADVP-CLR (NN wrong)))))))))) (VP (MD will) (VP (VB go) (ADVP-CLR (JJ wrong)) (ADVP-TMP (ADVP (IN at) (JJS least)) (IN once) (NP-ADV (DT a) (NN year))))))) (VP (VBZ is) (NP-PRD (NP (DT a) (ADJP (RB pretty) (JJ negative)) (NN way)) (PP (IN of) (S-NOM (VP (VBG looking) (PP-CLR (IN at) (NP (NNS things)))))))))"
Here is what I have tried so far:

import nltk

t = nltk.Tree.fromstring(s)
sent = " ".join(item[0] for item in t.pos())
sent_without_punct = " ".join([item for item in sent.split() if item not in filterwords])
print(sent_without_punct)
# "Ambassador Paul Nitze 's statement Notable & Quotable Sept. 20 If you have a million people working for you every bad thing that has one chance in a million of going wrong will go wrong at least once a year is a pretty negative way of looking at things"
This gives me the correct words without the punctuation. However, I am struggling to merge them back into a bracketed string like out.

Edit: The POS tags are not relevant here. So, if it helps, we can replace every tag with the start symbol "S", like this:

"(S (S (S (S (S Ambassador) (S Paul) (S Nitze) (S 's)) (S statement)) (S (S -LRB-) (S (S (S Notable) (S &) (S Quotable)) (S ,) .... "
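
The tag replacement mentioned in the edit can be sketched with a single substitution. This is only an illustration: the helper name replace_tags_with_start_symbol is made up here, and it assumes that a tag is always the run of non-whitespace characters directly after an opening parenthesis, which holds for the strings shown above.

```python
import re


def replace_tags_with_start_symbol(bracketed, symbol="S"):
    """Replace every tag (the token right after "(") with `symbol`.

    Hypothetical helper, not part of the original post. Words are left
    untouched because they never directly follow an opening parenthesis.
    """
    return re.sub(r"\(\S+", lambda m: "(" + symbol, bracketed)


# Short fragment in the same style as the question's string.
sample = "(S (NP-SBJ (NNP Ambassador) (POS 's)) (, ,))"
print(replace_tags_with_start_symbol(sample))
# -> (S (S (S Ambassador) (S 's)) (S ,))
```
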

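You want to remove patterns like (A A), where the text next to the ( and the text next to its matching ) are the same and come from your filter strings.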

You can use

import re

punctuation_words = ['.', ',', ':', '-LRB-', '-RRB-', '\'\'', '``', '--', ';',
                     '-', '?', '!', '...', '-LCB-', '-RCB-']
currency_tags_words = ['#', '$', 'C$', 'A$', 'US$']
filterwords = punctuation_words + currency_tags_words
# Escape each filter word and put the longer alternatives first.
filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))
# Match "(A A)", where both A's are the same filter word, plus any leading whitespace.
rx = r"\s*\(({0}) \1\)".format(filter_rx)
text = s  # the bracketed string from the question
print(re.sub(rx, "", text))

The regex has the form \s*\((A) \1\) and matches:

  • \s* - zero or more whitespace characters
  • \( - a ( character
  • (\-LRB\-|\-RRB\-|...) - Group 1: any one of the escaped filter words
  • a space
  • \1 - a backreference to Group 1 that matches the same text Group 1 captured
  • \) - a ) character
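
To see the pattern at work on something short, here is a minimal sketch that applies it to a fragment of the question's string. The variable names sample and cleaned are only illustrative, and the second check uses a made-up bracket whose tag and word differ.

```python
import re

# Filter words copied from the question.
punctuation_words = ['.', ',', ':', '-LRB-', '-RRB-', "''", '``', '--', ';',
                     '-', '?', '!', '...', '-LCB-', '-RCB-']
currency_tags_words = ['#', '$', 'C$', 'A$', 'US$']
filterwords = punctuation_words + currency_tags_words

filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))
rx = re.compile(r"\s*\(({0}) \1\)".format(filter_rx))

# A fragment of the question's string.
sample = "(NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (, ,) (NP-TMP (NNP Sept.) (CD 20)))"
cleaned = rx.sub("", sample)
print(cleaned)
# -> (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (NP-TMP (NNP Sept.) (CD 20)))

# The backreference requires tag and word to be identical, so a bracket
# such as "(: --)" (made-up example) is left alone.
print(rx.sub("", "(X (: --))"))
# -> (X (: --))
```

Because \s* is part of the match, the space in front of the removed bracket goes with it, so no double spaces are left behind.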

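Thanks for the answer. It fails for (PP (ADVP (RB Yet)) (IN on) (NP (NP (NNS matters)) (ADJP (JJ close) (PP (TO) (INTJ (UH er)) (NP (NN home))))) (:)). But, as I just mentioned, I want to remove every bracket that contains a filter word; the tag in front of it may or may not be the same string.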
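@nikinlpds If the second occurrence of A does not have to be the same as the first one, use rx = r"\s*\((?:{0}) (?:{0})\)".format(filter_rx).

Getting a ValueError, because some brackets have the shape (S word)) and some have only one element. I think it should check whether the second element is in filterwords and, if so, remove the bracket that encloses it entirely.

Yes, it works perfectly. Thanks for getting back :)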
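
The difference between the variants discussed in these comments can be sketched on a small input. The sample string is invented for illustration, and word_only_rx is one possible reading of the last suggestion (check only the second element), not something given in the thread.

```python
import re

# Filter words copied from the question.
filterwords = ['.', ',', ':', '-LRB-', '-RRB-', "''", '``', '--', ';',
               '-', '?', '!', '...', '-LCB-', '-RCB-',
               '#', '$', 'C$', 'A$', 'US$']
filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))

# Tag and word must be the same filter word.
strict_rx = re.compile(r"\s*\(({0}) \1\)".format(filter_rx))
# Tag and word must both be filter words, but may differ.
relaxed_rx = re.compile(r"\s*\((?:{0}) (?:{0})\)".format(filter_rx))
# Assumption: any tag is accepted, only the word must be a filter word.
word_only_rx = re.compile(r"\s*\(\S+ (?:{0})\)".format(filter_rx))

# Made-up input covering the three cases.
sample = "(NP (NN home) (: --) (, ,) (SYM #))"

strict_out = strict_rx.sub("", sample)
relaxed_out = relaxed_rx.sub("", sample)
word_only_out = word_only_rx.sub("", sample)

print(strict_out)     # -> (NP (NN home) (: --) (SYM #))
print(relaxed_out)    # -> (NP (NN home) (SYM #))
print(word_only_out)  # -> (NP (NN home))
```

None of these patterns prunes a parent bracket that ends up with no children, so a node that contained only punctuation would still need separate handling.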