Python: align words to form a bracketed string

Tags: python, regex, string, nltk

Suppose I have the following bracketed string that contains punctuation:

s = "(S (NP-SBJ (NP (NP (NNP Ambassador) (NNP Paul) (NNP Nitze) (POS 's)) (NN statement)) (PRN (-LRB- -LRB-) (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (, ,) (NP-TMP (NNP Sept.) (CD 20))) (-RRB- -RRB-) (, ,) (`` ``)) (S (SBAR-ADV (IN If) (S (NP-SBJ (PRP you)) (VP (VBP have) (NP (NP (DT a) (CD million) (NNS people)) (VP (VBG working) (PP (IN for) (NP (PRP you)))))))) (, ,) (NP-SBJ (NP (DT every) (JJ bad) (NN thing)) (SBAR (WHNP-1 (WDT that)) (S (VP (VBZ has) (NP (NP (CD one) (NN chance)) (PP (IN in) (NP (DT a) (CD million))) (PP (IN of) (S-NOM (VP (VBG going) (ADVP-CLR (NN wrong)))))))))) (VP (MD will) (VP (VB go) (ADVP-CLR (JJ wrong)) (ADVP-TMP (ADVP (IN at) (JJS least)) (IN once) (NP-ADV (DT a) (NN year)))))) (, ,) ('' '')) (VP (VBZ is) (NP-PRD (NP (DT a) (ADJP (RB pretty) (JJ negative)) (NN way)) (PP (IN of) (S-NOM (VP (VBG looking) (PP-CLR (IN at) (NP (NNS things)))))))) (. .))"

I also have a reference list of the punctuation tokens that I need to remove:

punctuation_words = ['.', ',', ':', '-LRB-', '-RRB-', '\'\'', '``', '--', ';',
                     '-', '?', '!', '...', '-LCB-', '-RCB-']
currency_tags_words = ['#', '$', 'C$', 'A$', 'US$']
filterwords = punctuation_words + currency_tags_words

I would like to get the following output:

out = "(S (NP-SBJ (NP (NP (NNP Ambassador) (NNP Paul) (NNP Nitze) (POS 's)) (NN statement)) (PRN (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (NP-TMP (NNP Sept.) (CD 20)))) (S (SBAR-ADV (IN If) (S (NP-SBJ (PRP you)) (VP (VBP have) (NP (NP (DT a) (CD million) (NNS people)) (VP (VBG working) (PP (IN for) (NP (PRP you)))))))) (NP-SBJ (NP (DT every) (JJ bad) (NN thing)) (SBAR (WHNP-1 (WDT that)) (S (VP (VBZ has) (NP (NP (CD one) (NN chance)) (PP (IN in) (NP (DT a) (CD million))) (PP (IN of) (S-NOM (VP (VBG going) (ADVP-CLR (NN wrong)))))))))) (VP (MD will) (VP (VB go) (ADVP-CLR (JJ wrong)) (ADVP-TMP (ADVP (IN at) (JJS least)) (IN once) (NP-ADV (DT a) (NN year))))))) (VP (VBZ is) (NP-PRD (NP (DT a) (ADJP (RB pretty) (JJ negative)) (NN way)) (PP (IN of) (S-NOM (VP (VBG looking) (PP-CLR (IN at) (NP (NNS things)))))))))"

Here is what I have tried so far:

import nltk

t = nltk.Tree.fromstring(s)
sent = " ".join(item[0] for item in t.pos())
sent_without_punct = " ".join([item for item in sent.split() if item not in filterwords])
print(sent_without_punct)
# "Ambassador Paul Nitze 's statement Notable & Quotable Sept. 20 If you have a million people working for you every bad thing that has one chance in a million of going wrong will go wrong at least once a year is a pretty negative way of looking at things"

This gives me the correct words without the punctuation. However, I am struggling to merge them back into a bracketed string like out.

Edit: the POS tags are not relevant here. So, if it helps, we can replace every tag with the start symbol "S", like this:

"(S (S (S (S (S Ambassador) (S Paul) (S Nitze) (S 's)) (S statement)) (S (S -LRB-) (S (S (S Notable) (S &) (S Quotable)) (S ,) .... "

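The tag replacement mentioned in the edit can be sketched with a single substitution. This is only an illustration: `normalize_tags` and `sample` are hypothetical names, and the sketch assumes every label follows its opening bracket directly, with no whitespace in between.

```python
import re


def normalize_tags(bracketed: str) -> str:
    """Replace the label that follows each "(" with the start symbol "S"."""
    # "\(\S+" matches an opening bracket plus the label attached to it.
    return re.sub(r"\(\S+", "(S", bracketed)


sample = "(NP (NNP Paul) (, ,) (-LRB- -LRB-))"
print(normalize_tags(sample))
```

After this step a punctuation leaf looks like `(S ,)`, so its tag no longer repeats the word.
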
You want to remove patterns like (A A), where the text next to the ( and the text next to its matching ) are the same and come from your filter strings.

You can use

import re

punctuation_words = ['.', ',', ':', '-LRB-', '-RRB-', '\'\'', '``', '--', ';',
                     '-', '?', '!', '...', '-LCB-', '-RCB-']
currency_tags_words = ['#', '$', 'C$', 'A$', 'US$']
filterwords = punctuation_words + currency_tags_words
# Escape each filter word and put longer words first so that, for example,
# '...' is tried before '.'.
filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))
# Optional whitespace, "(", a filter word, a space, the same word again, ")".
rx = r"\s*\(({0}) \1\)".format(filter_rx)
text = s  # the bracketed string s from the question
print(re.sub(rx, "", text))

The regex has the form

\s*\((...) \1\)

and it matches:

- `\s*` - zero or more whitespace characters
- `\(` - a `(` character
- `(\-LRB\-|\-RRB\-|...)` - Group 1: any one of the escaped filter words, longest first
- ` ` - a space
- `\1` - a backreference to Group 1, matching the same text that Group 1 captured
- `\)` - a `)` character
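To see the pattern at work without the full sentence, here is a minimal sketch on a short fragment. The `fragment` string and the `cleaned` variable are made up for illustration; the filter list is the one from the question.

```python
import re

filterwords = ['.', ',', ':', '-LRB-', '-RRB-', "''", '``', '--', ';',
               '-', '?', '!', '...', '-LCB-', '-RCB-',
               '#', '$', 'C$', 'A$', 'US$']

# Longest alternatives first, each one escaped for literal matching.
filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))
rx = re.compile(r"\s*\(({0}) \1\)".format(filter_rx))

# Hypothetical fragment: three punctuation leaves around one NP.
fragment = "(PRN (-LRB- -LRB-) (NP (NNP Sept.) (CD 20)) (-RRB- -RRB-) (, ,))"
cleaned = rx.sub("", fragment)
print(cleaned)
```

Because of the `\1` backreference, only leaves whose tag and word are identical are removed. A leaf such as `(: --)` is left untouched.
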

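Thanks for the answer. It fails for "(PP (ADVP (RB Yet)) (IN on) (NP (NP (NNS matters)) (ADJP (JJ close) (PP (TO …) (INTJ (UH er)) (NP (NN home) (…))))) (: …))". But, as I mentioned, I want to remove the bracket that contains the punctuation word, whether or not the same token precedes it.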

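@nikinlpds If the second occurrence is not required to be the same as the first, use

rx = r"\s*\((?:{0}) (?:{0})\)".format(filter_rx)

Getting a ValueError because some brackets have the shape (S word) and some brackets contain only one element. I think it should check whether the second element is in filterwords and then remove the enclosing bracket entirely.

Yes, it works perfectly. Thanks for coming back :)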
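The idea from the comments, checking only whether the second element of a leaf is in filterwords, can be sketched as follows. This is not the code from the answer: `leaf_rx`, `empty_rx`, `strip_filtered`, and `fragment` are hypothetical names, and dropping constituents that end up with no children is an extra assumption about the desired output.

```python
import re

filterwords = ['.', ',', ':', '-LRB-', '-RRB-', "''", '``', '--', ';',
               '-', '?', '!', '...', '-LCB-', '-RCB-',
               '#', '$', 'C$', 'A$', 'US$']
filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))

# A leaf "(TAG word)" whose word is a filter word, whatever the tag is.
leaf_rx = re.compile(r"\s*\([^\s()]+ (?:{0})\)".format(filter_rx))
# A constituent left without children, such as "(PRN)".
empty_rx = re.compile(r"\s*\([^\s()]+\s*\)")


def strip_filtered(bracketed: str) -> str:
    """Remove filtered leaves, then prune constituents that became empty."""
    result = leaf_rx.sub("", bracketed)
    previous = None
    # Pruning one empty node can empty its parent, so repeat until stable.
    while previous != result:
        previous = result
        result = empty_rx.sub("", result)
    return result


fragment = "(PP (IN on) (NP (NN home) (. .)) (: --) (PRN (-LRB- -LRB-) (-RRB- -RRB-)))"
print(strip_filtered(fragment))
```

The loop is there because a parent such as `(PRN ...)` only becomes empty after its punctuation children are gone. If empty constituents should be kept, the `empty_rx` step can simply be skipped.
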