Python: aligning words to form a bracketed string
Suppose I have the following bracketed string with punctuation:
s = "(S (NP-SBJ (NP (NP (NNP Ambassador) (NNP Paul) (NNP Nitze) (POS 's)) (NN statement)) (PRN (-LRB- -LRB-) (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (, ,) (NP-TMP (NNP Sept.) (CD 20))) (-RRB- -RRB-) (, ,) (`` ``)) (S (SBAR-ADV (IN If) (S (NP-SBJ (PRP you)) (VP (VBP have) (NP (NP (DT a) (CD million) (NNS people)) (VP (VBG working) (PP (IN for) (NP (PRP you)))))))) (, ,) (NP-SBJ (NP (DT every) (JJ bad) (NN thing)) (SBAR (WHNP-1 (WDT that)) (S (VP (VBZ has) (NP (NP (CD one) (NN chance)) (PP (IN in) (NP (DT a) (CD million))) (PP (IN of) (S-NOM (VP (VBG going) (ADVP-CLR (NN wrong)))))))))) (VP (MD will) (VP (VB go) (ADVP-CLR (JJ wrong)) (ADVP-TMP (ADVP (IN at) (JJS least)) (IN once) (NP-ADV (DT a) (NN year)))))) (, ,) ('' '')) (VP (VBZ is) (NP-PRD (NP (DT a) (ADJP (RB pretty) (JJ negative)) (NN way)) (PP (IN of) (S-NOM (VP (VBG looking) (PP-CLR (IN at) (NP (NNS things)))))))) (. .))"
And a reference list of the punctuation I need to remove:
punctuation_words = ['.', ',', ':', '-LRB-', '-RRB-', '\'\'', '``', '--', ';',
'-', '?', '!', '...', '-LCB-', '-RCB-']
currency_tags_words = ['#', '$', 'C$', 'A$', 'US$']
filterwords = punctuation_words + currency_tags_words
I would like to get the following output:
out = "(S (NP-SBJ (NP (NP (NNP Ambassador) (NNP Paul) (NNP Nitze) (POS 's)) (NN statement)) (PRN (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (NP-TMP (NNP Sept.) (CD 20)))) (S (SBAR-ADV (IN If) (S (NP-SBJ (PRP you)) (VP (VBP have) (NP (NP (DT a) (CD million) (NNS people)) (VP (VBG working) (PP (IN for) (NP (PRP you)))))))) (NP-SBJ (NP (DT every) (JJ bad) (NN thing)) (SBAR (WHNP-1 (WDT that)) (S (VP (VBZ has) (NP (NP (CD one) (NN chance)) (PP (IN in) (NP (DT a) (CD million))) (PP (IN of) (S-NOM (VP (VBG going) (ADVP-CLR (NN wrong)))))))))) (VP (MD will) (VP (VB go) (ADVP-CLR (JJ wrong)) (ADVP-TMP (ADVP (IN at) (JJS least)) (IN once) (NP-ADV (DT a) (NN year))))))) (VP (VBZ is) (NP-PRD (NP (DT a) (ADJP (RB pretty) (JJ negative)) (NN way)) (PP (IN of) (S-NOM (VP (VBG looking) (PP-CLR (IN at) (NP (NNS things)))))))))"
Here is what I have tried so far:
import nltk
t = nltk.Tree.fromstring(s)
sent = " ".join(item[0] for item in t.pos())
sent_without_punct = " ".join([item for item in sent.split() if item not in filterwords])
print(sent_without_punct)
# "Ambassador Paul Nitze 's statement Notable & Quotable Sept. 20 If you have a million people working for you every bad thing that has one chance in a million of going wrong will go wrong at least once a year is a pretty negative way of looking at things"
This gives me the correct output without punctuation. But I am having trouble merging it back to get a bracketed string like out.
Edit:
The POS tags are not relevant here. So, if it helps, we can replace them all with the start symbol "S", like this:
"(S (S (S (S (S Ambassador) (S Paul) (S Nitze) (S 's)) (S statement)) (S (S -LRB-) (S (S (S Notable) (S &) (S Quotable)) (S ,) .... "
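You want to remove patterns like (A A), where the text right after ( and the text right before its matching ) are identical and come from your filter list.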
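You can use
import re
punctuation_words = ['.', ',', ':', '-LRB-', '-RRB-', '\'\'', '``', '--', ';',
                     '-', '?', '!', '...', '-LCB-', '-RCB-']
currency_tags_words = ['#', '$', 'C$', 'A$', 'US$']
filterwords = punctuation_words + currency_tags_words
# Escape each filter word and try longer alternatives first
filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))
rx = r"\s*\(({0}) \1\)".format(filter_rx)
text = "(S (NP-SBJ (NP (NP (NNP Ambassador) (NNP Paul) (NNP Nitze) (POS 's)) (NN statement)) (PRN (-LRB- -LRB-) (NP (NP-TTL (NNP Notable) (CC &) (NNP Quotable)) (, ,) (NP-TMP (NNP Sept.) (CD 20))) (-RRB- -RRB-) (, ,) (`` ``)) (S (SBAR-ADV (IN If) (S (NP-SBJ (PRP you)) (VP (VBP have) (NP (NP (DT a) (CD million) (NNS people)) (VP (VBG working) (PP (IN for) (NP (PRP you)))))))) (, ,) (NP-SBJ (NP (DT every) (JJ bad) (NN thing)) (SBAR (WHNP-1 (WDT that)) (S (VP (VBZ has) (NP (NP (CD one) (NN chance)) (PP (IN in) (NP (DT a) (CD million))) (PP (IN of) (S-NOM (VP (VBG going) (ADVP-CLR (NN wrong)))))))))) (VP (MD will) (VP (VB go) (ADVP-CLR (JJ wrong)) (ADVP-TMP (ADVP (IN at) (JJS least)) (IN once) (NP-ADV (DT a) (NN year)))))) (, ,) ('' '')) (VP (VBZ is) (NP-PRD (NP (DT a) (ADJP (RB pretty) (JJ negative)) (NN way)) (PP (IN of) (S-NOM (VP (VBG looking) (PP-CLR (IN at) (NP (NNS things)))))))) (. .))"
print(re.sub(rx, "", text))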
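The regex has the form \s*\((filter_rx) \1\), where filter_rx is the alternation of escaped filter words, and it matches:
- \s* - zero or more whitespace characters
- \( - a ( character
- (filter_rx) - capturing group 1: any one of the escaped filter words, longest alternatives first
- a single space
- \1 - a backreference to group 1, matching the same text that group 1 captured
- \) - a ) character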
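To see the backreference at work on a small input, here is a minimal sketch. The sample fragment is made up for illustration and is not taken from the question; the filter list is the one given above.

```python
import re

filterwords = ['.', ',', ':', '-LRB-', '-RRB-', "''", '``', '--', ';',
               '-', '?', '!', '...', '-LCB-', '-RCB-',
               '#', '$', 'C$', 'A$', 'US$']

# Longest alternatives first, so '...' is tried before '.'
filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))
rx = re.compile(r"\s*\(({0}) \1\)".format(filter_rx))

# Illustrative fragment: two leaves whose tag equals the word, one where it does not
sample = "(NP (NNP Sept.) (, ,) (CD 20) (-RRB- -RRB-) (: --))"
cleaned = rx.sub("", sample)
print(cleaned)
```

Because \1 must repeat exactly what group 1 captured, (, ,) and (-RRB- -RRB-) are removed, while (: --) is left alone: its tag and its word are both filter words, but they are not the same text.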
It fails for "(PP (ADVP (RB Yet)) (IN on) (NP (NP (NNS matters)) (ADJP (JJ close) (PP (TO …) (INTJ (UH er)) (NP (NN home) (…))) (: …)".
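However, as I mentioned, I want to remove the brackets that contain "…". It may or may not have "…" before it.
@nikinlpds If the second occurrence of A does not have to be identical to the first one, use rx = r"\s*\((?:{0}) (?:{0})\)".format(filter_rx).
I am getting a ValueError because some brackets have the shape (S word)), and some have only one element. I think it should check whether the second element is in filterwords and then remove the enclosing bracket completely.
Yes, it works perfectly. Thanks for getting back :)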
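The idea raised in the comments (check only whether the second element is in filterwords, then drop the enclosing bracket) could be sketched with a further relaxed pattern. This is an assumption-laden illustration, not the answerer's code: relaxed_rx and strip_filtered_leaves are hypothetical names, and the tag is taken to be any run of characters without whitespace or parentheses.

```python
import re

filterwords = ['.', ',', ':', '-LRB-', '-RRB-', "''", '``', '--', ';',
               '-', '?', '!', '...', '-LCB-', '-RCB-',
               '#', '$', 'C$', 'A$', 'US$']

filter_rx = "|".join(sorted(map(re.escape, filterwords), key=len, reverse=True))

# Hypothetical relaxed pattern: "(TAG WORD)" where TAG is arbitrary
# and WORD is exactly one of the filter words.
relaxed_rx = re.compile(r"\s*\([^\s()]+ (?:{0})\)".format(filter_rx))


def strip_filtered_leaves(tree_str):
    """Remove every leaf bracket whose word is a filter word."""
    return relaxed_rx.sub("", tree_str)


sample = "(NP (NN home) (: --) (, ,) (NNP Sept.))"
result = strip_filtered_leaves(sample)
print(result)
```

One caveat of this sketch: a parent whose only children were filtered leaves is left behind as an empty bracket, e.g. (X (: --)) becomes (X), so a follow-up pass would be needed if such nodes can occur.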