Python: fixing misplaced separators in a string

python regex

Given an incorrect string:

s="rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

I want to output the corrected string, like this:

s="rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed"

If I try to remove all the separators with the following command:

re.sub("\\s*","",s)

it gives me:

"rateimpliesdepreciation.Thestraightlinesshoweffectivelineartimetrendsinthenominal(dashed", which is not what I want.
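
For reference, here is a minimal sketch of why that call collapses everything. The pattern \s* matches every run of whitespace, so the legitimate spaces between words disappear together with the stray ones. The second call is only an illustration of what a regex alone can safely do, not a fix for the split words.

```python
import re

s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

# \s* matches every run of whitespace (and the empty string), so the
# spaces between real words are removed along with the stray ones.
collapsed = re.sub(r"\s*", "", s)
print(collapsed)

# Collapsing each run of whitespace to a single space repairs doubled and
# trailing spaces, but it cannot tell "Th e" apart from two real words.
normalized = re.sub(r"\s+", " ", s).strip()
print(normalized)
```

A regex sees only characters, so deciding which spaces sit inside a word needs extra knowledge such as a dictionary.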

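You can try using pyspellchecker to check the spelling of each word, for example

(pip install pyspellchecker)

Then check whether a word is unknown while previous word + word is known: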

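    from spellchecker import SpellChecker

    spell = SpellChecker()
    splitted_s = s.split()  # split on any run of whitespace

    valid_s = [splitted_s[0]]
    for i in range(1, len(splitted_s)):
        word = splitted_s[i]
        previous_word = splitted_s[i-1]
        valid_s.append(word)
        # If the word is unknown but previous_word + word is known, join them.
        if spell.unknown([word]) and len(word) > 0:
            if not spell.unknown([(previous_word + word).lower()]):
                valid_s.pop()  # drop the word just appended
                valid_s.pop()  # drop the previous word
                valid_s.append(previous_word + word)

    print(' '.join(valid_s))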

 >>>rate implies depreciation. Th e straight lines show effective linear time trends in the nominal (dashed

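But here, because "e" exists in the dictionary as a word, it does not join "Th" and "e".

So you can also compare word frequencies, and join the previous word with the word when previous word + word is used (much) more frequently in the dictionary than the word on its own: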

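    valid_s = [splitted_s[0]]
    for i in range(1, len(splitted_s)):
        word = splitted_s[i]
        previous_word = splitted_s[i-1]
        valid_s.append(word)
        # Join when the combined form is more frequent than the word alone.
        # (If word_probability is unavailable in your pyspellchecker version,
        # word_usage_frequency is the newer name for it.)
        if spell.word_probability(word.lower()) < spell.word_probability((previous_word + word).lower()):
            valid_s.pop()  # drop the word just appended
            valid_s.pop()  # drop the previous word
            valid_s.append(previous_word + word)

    print(' '.join(valid_s))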

 >>>rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed
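
The frequency comparison above can also be sketched without any third-party package. This is an illustration only: TOY_FREQ, freq and merge_fragments are hypothetical names, and the frequencies are made-up numbers standing in for a real dictionary. Unlike the loop above, this version compares against the last token already written to the output, so a word split into three pieces can still be joined step by step.

```python
# Hypothetical relative frequencies standing in for a real dictionary;
# the numbers are invented for illustration only.
TOY_FREQ = {
    "the": 0.05, "e": 0.0001, "effective": 0.0002,
    "in": 0.02, "show": 0.0005, "lines": 0.0004,
    "straight": 0.0003, "rate": 0.0006,
}


def freq(word):
    """Return the toy frequency of a word, or 0.0 if it is unknown."""
    return TOY_FREQ.get(word.lower(), 0.0)


def merge_fragments(text):
    """Join a token to the previous one when the joined form is more frequent."""
    merged = []
    for token in text.split():
        if merged:
            candidate = merged[-1] + token
            if freq(candidate) > freq(token):
                merged[-1] = candidate  # replace the previous token with the joined word
                continue
        merged.append(token)
    return " ".join(merged)


s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "
print(merge_fragments(s))
```

With a real frequency list this rule can also join words that were meant to stay apart (for example two short words whose concatenation happens to be common), so the result is worth reviewing by hand.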

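You are removing all whitespace. I don't know how to achieve what you want, because there is no way to distinguish the spaces you want to remove (e.g. the one in "eff ective") from the spaces between words.

I agree with @Ollie. I can't see any unique pattern that separates the spaces you are after from the other spaces... I could be wrong.

In addition to what the other comments mention, you could try the following: find out exactly which characters show up as spaces. If you are lucky, it may turn out that the character between "eff" and "ective" is not a space but a tab or some other whitespace character. You could then call re.sub() for those specific characters only, instead of \s.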
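
A minimal sketch of that last suggestion, using only the standard library. The sample string is hypothetical: it assumes the stray separators are a no-break space and a tab, which is not known to be true of the string in the question. whitespace_kinds is an illustrative helper name.

```python
import re
import unicodedata


def whitespace_kinds(text):
    """Map each distinct whitespace character in text to a readable name."""
    return {
        ch: unicodedata.name(ch, "U+%04X" % ord(ch))  # control chars have no name
        for ch in text
        if ch.isspace()
    }


# Hypothetical input: the stray separators are U+00A0 and a tab,
# while the genuine word separators are ordinary spaces.
sample = "Th\u00a0e straight lines show eff\tective"

kinds = whitespace_kinds(sample)
for ch, name in sorted(kinds.items()):
    print(repr(ch), name)

# Remove only the unusual characters and leave ordinary spaces alone.
fixed = re.sub("[\t\u00a0]", "", sample)
print(fixed)
```

If the report lists nothing but ordinary spaces, this approach does not apply and a dictionary-based method is needed instead.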