Regex Python 3: handling hyphenated words by combining and splitting

regex python-3.x replace nlp


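I want to handle hyphenated words. For example, I want to process the word "well-known" in two different ways.

First, combine the word, i.e. ("wellknown"); second, split the word, i.e. ("well", "known").

The input is "well-known", and the expected output is:

--wellknown

--well

--known

However, I can only produce one form at a time, not both. When I loop over my text file looking for hyphenated words, I combine them first.

After combining them, I do not know how to get back to the original word to perform the split. Below is a short snippet from my code. (Let me know if you need more details.)

I know why I cannot do both operations: once I have replaced the hyphen, the hyphenated word is gone, so I can no longer find it to perform the split (called "separate" in the code). Does anyone know how to do this, or how to fix the logic?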

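Why not use a tuple containing both the separated words and the combined word?

Split first, then join:

Sample code

text = 'well-known'
separate = text.split('-')    # ['well', 'known']
combined = ''.join(separate)  # 'wellknown'
words = (combined, separate[0], separate[1])  # assumes exactly one hyphen
print(words)

Output

('wellknown', 'well', 'known')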
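
The snippet above handles a single word with exactly one hyphen. As a rough sketch of how the same split-then-join idea might be applied to a whole line of text, including words with several hyphens, something like the following could work. The function name `hyphen_variants` and the pattern `HYPHENATED` are hypothetical names chosen for illustration, and the pattern assumes a hyphenated word is a run of word characters joined by single hyphens. Because the variants are derived from each match and the text itself is never modified, the original hyphenated word is still available afterwards.

```python
import re

# Assumed definition of a hyphenated word: word characters joined by single hyphens.
HYPHENATED = re.compile(r"\w+(?:-\w+)+")


def hyphen_variants(text):
    """Return one tuple (combined, part1, part2, ...) per hyphenated word in text."""
    variants = []
    for match in HYPHENATED.finditer(text):
        parts = match.group(0).split("-")   # split first ...
        combined = "".join(parts)           # ... then join
        variants.append((combined, *parts))
    return variants


print(hyphen_variants("a well-known state-of-the-art example"))
# [('wellknown', 'well', 'known'), ('stateoftheart', 'state', 'of', 'the', 'art')]
```

When reading a file, the function could be called once per line.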

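Treat a token as an object rather than a string; you can then create a token with multiple attributes.

For example, we can use a collections.namedtuple container as a simple object to hold each token: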

from collections import namedtuple

from nltk import word_tokenize  # requires NLTK's tokenizer data to be downloaded

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])

text = "This is a well-known example of a small-business grant of $123,456."

tokenized_text = []
for token in word_tokenize(text):
    if '-' in token:
        # Hyphenated: keep the original, the parts as a tuple, and the joined form.
        this_token = Token(token, tuple(token.split('-')), token.replace('-', ''))
    else:
        # No hyphen: all three attributes hold the same string.
        this_token = Token(token, token, token)
    tokenized_text.append(this_token)

You can then iterate over tokenized_text as a list of Token namedtuples, e.g. if we only need the surface strings:

for token in tokenized_text:
    print(token.surface)

[out]:

This
is
a
well-known
example
of
a
small-business
grant
of
$
123,456
.

If you need to access the combined tokens:

for token in tokenized_text:
    print(token.combined)

[out]:

This
is
a
wellknown
example
of
a
smallbusiness
grant
of
$
123,456
.

If you want to access the split tokens, use the same loop, but note that hyphenated words give you a tuple instead of a string, e.g.

for token in tokenized_text:
    print(token.splitup)

[out]:

This
is
a
('well', 'known')
example
of
a
('small', 'business')
grant
of
$
123,456
.

You can also use a list comprehension to access the attributes of the Token namedtuples, e.g.
>>> [token.splitup for token in tokenized_text]
['This', 'is', 'a', ('well', 'known'), 'example', 'of', 'a', ('small', 'business'), 'grant', 'of', '$', '123,456', '.']

To identify the tokens that contained a hyphen and were split, you can simply check the type, e.g.
>>> [type(token.splitup) for token in tokenized_text]
[str, str, str, tuple, str, str, str, tuple, str, str, str, str, str]
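
To get the combined and split forms together in a single flat list, which is what the question's expected output suggests, one option is to expand each hyphenated token into its combined form followed by its parts. The following is only a sketch under some assumptions: `simple_tokenize` is a hypothetical regex-based stand-in for `word_tokenize` (so the example runs without NLTK and will not tokenize everything the same way), and `make_token` and `flatten` are illustrative helper names that are not part of the answer above.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ["surface", "splitup", "combined"])


def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize:
    # words (optionally with inner hyphens) or single punctuation marks.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def make_token(word):
    # Same logic as the loop in the answer: a tuple of parts only for hyphenated words.
    if "-" in word:
        return Token(word, tuple(word.split("-")), word.replace("-", ""))
    return Token(word, word, word)


def flatten(tokens):
    """Emit the combined form plus each part for hyphenated tokens, else the surface."""
    flat = []
    for token in tokens:
        if isinstance(token.splitup, tuple):
            flat.append(token.combined)
            flat.extend(token.splitup)
        else:
            flat.append(token.surface)
    return flat


tokens = [make_token(w) for w in simple_tokenize("This is a well-known example.")]
print(flatten(tokens))
# ['This', 'is', 'a', 'wellknown', 'well', 'known', 'example', '.']
```

Using `isinstance` rather than comparing `type(...)` directly is a small design choice; both distinguish the tuple case here.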

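Something like this?

Do you have a larger sample of the text input, and the expected output for it?

Think of a token as an object rather than a string; then you can create a token with multiple attributes.

Are there any words with more than one hyphen? More input data would help.

If a word contains more than one hyphen, the third line of the code has to change. I think the first two lines are enough to answer the question.

Thank you all for the help! Luckily, in my case there is only one hyphen between words. But it is worth mentioning, @keyur pottar. Thanks @codekaizer.

Thank you @alvas for the long solution. It works separately, but I just want them to work together. Thanks for your answer anyway! :)

It works together too. Treat the token as an object instead of a string. An object can hold anything or any function you want.

I can just add the elements to a list.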