Python: how to split a string using word length as the delimiter


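I am using Python 3 to prepare strings that hold document titles so they can be used as search terms on the US patent website.

1) Keeping long phrases together is beneficial, but

2) searches perform poorly when they include many words of 3 characters or fewer, so I need to eliminate those words.

I tried to use the regex "\b\w[1:3}\b*" to split on one- to three-letter words, with or without a trailing space, but without success. I am not proficient with regular expressions, though.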

for pubtitle in df_tpdownloads['PublicationTitleSplit']:
    pubtitle = pubtitle.lower() # make lower case
    pubtitle = re.split("[?:.,;\"\'\-()]+", pubtitle) # tokenize and remove punctuation
    #print(pubtitle)

    for subArray in pubtitle:
        print(subArray)
        subArray = subArray.strip()
        subArray = re.split("(\b\w{1:3}\b) *", subArray) # split on words that are < 4 letters
        print(subArray)

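The goal is that, for example,

"and training requirements for selected salt applications"

becomes

['training requirements', 'selected salt applications']

and

"december 31"

becomes

['december']

and

"experimental system for salt in an emergence research and applications in process heat"

becomes

['experimental system', 'salt', 'emergence research', 'applications', 'process heat']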

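But the split does not catch the short words, and I cannot tell whether the problem is the regex, the re.split call, or both.

I could probably brute-force this, but I would like an elegant solution. Any help would be greatly appreciated.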

You can use

list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))

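to get the desired result.

The r'\s*\b\w{1,3}\b\s*|[^\w\s]+' regex splits the lowercased string, already stripped of leading and trailing whitespace by .strip(), into chunks that contain no punctuation ([^\w\s]+ takes care of that) and no words of 1 to 3 characters (\s*\b\w{1,3}\b\s* takes care of that). Because re.split returns an empty string wherever a delimiter sits at the start or end of the input, or directly next to another delimiter, filter(None, ...) is used to drop those empty items.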
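To make the role of filter(None, ...) concrete, here is a minimal sketch that compares the raw re.split result with the filtered one. The variable names (pattern, title, raw_parts, parts) are chosen for illustration only and are not part of the original answer.

```python
import re

# Same pattern as in the answer: short words (with surrounding spaces) or punctuation.
pattern = r'\s*\b\w{1,3}\b\s*|[^\w\s]+'
title = "and training requirements for selected salt applications"

# "and " is matched at the very start, so re.split yields a leading empty string.
raw_parts = re.split(pattern, title)
print(raw_parts)  # ['', 'training requirements', 'selected salt applications']

# filter(None, ...) keeps only truthy items, which drops the empty strings.
parts = list(filter(None, raw_parts))
print(parts)      # ['training requirements', 'selected salt applications']
```

A list comprehension such as [p for p in raw_parts if p] would do the same job; which one to use is a matter of taste.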
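Pattern details

\s* - 0+ whitespace characters
\b - a word boundary
\w{1,3} - 1, 2, or 3 word characters (if you do not want to match _, use [^\W_] instead of \w, i.e. [^\W_]{1,3})
\b - a word boundary
\s* - 0+ whitespace characters
| - or
[^\w\s]+ - 1 or more characters that are neither word nor whitespace characters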
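One way to check the breakdown above is to list what the pattern actually matches, since those matches are exactly the delimiters that re.split removes. The sample text and variable names below are made up for illustration.

```python
import re

pattern = re.compile(r'\s*\b\w{1,3}\b\s*|[^\w\s]+')
text = "process heat: an overview for salt"

# findall returns every delimiter the split would consume:
# ":" comes from [^\w\s]+, the short words come from \s*\b\w{1,3}\b\s*.
delimiters = pattern.findall(text)
print(delimiters)  # [':', ' an ', ' for ']

# ":" and " an " are adjacent, which leaves an empty string between them.
pieces = pattern.split(text)
print(pieces)      # ['process heat', '', 'overview', 'salt']
```

Note that "heat" and "salt" are left alone: they have 4 characters, and the word boundaries on both sides of \w{1,3} prevent a partial match inside a longer word.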

See the Python demo at the end of this answer.

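Note that {1,3} is the correct quantifier syntax (not {1:3}), and that you need to use a raw string literal. If you need to keep the short words in the output, you can use a capturing group around the pattern; otherwise remove the ( and ). Try re.split(r"\s*\b\w{1,3}\b\s*", subArray).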
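The three points in that note can be checked with a short sketch. The sample string and variable names are illustrative assumptions; the behaviour shown is that of Python's standard re module.

```python
import re

text = "salt in process heat"

# 1) Without the r prefix, "\b" is the backspace character, not a word boundary.
backspace_is_not_boundary = ("\b" == "\x08")
print(backspace_is_not_boundary)  # True

# 2) {1:3} is not a quantifier, so it is treated as the literal text "{1:3}".
#    Nothing in the sample contains that text, so nothing is split.
broken = re.split(r"\b\w{1:3}\b", text)
print(broken)      # ['salt in process heat']

# 3) With a capturing group, re.split keeps the captured short words in the result.
with_group = re.split(r"\s*(\b\w{1,3}\b)\s*", text)
print(with_group)  # ['salt', 'in', 'process heat']

# Without the group, the short words are dropped.
without_group = re.split(r"\s*\b\w{1,3}\b\s*", text)
print(without_group)  # ['salt', 'process heat']
```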
import re

df_tpdownloads = [" and training requirements for selected salt applications",
                  "december 31",
                  "experimental system for salt in an emergence research and applications in process heat"]

#for pubtitle in df_tpdownloads['PublicationTitleSplit']:
for pubtitle in df_tpdownloads:
    result = list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))
    print(result)

Output:

['training requirements', 'selected salt applications']
['december']
['experimental system', 'salt', 'emergence research', 'applications', 'process heat']
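If the length threshold needs to vary, the same idea can be wrapped in a small helper. This is only a sketch: the function name split_title and the parameter max_short_len are hypothetical and do not appear in the original answer.

```python
import re

def split_title(title, max_short_len=3):
    """Split a title on punctuation and on words of at most max_short_len characters."""
    # Doubled braces produce literal { and } in the f-string, e.g. \w{1,3}.
    pattern = rf'\s*\b\w{{1,{max_short_len}}}\b\s*|[^\w\s]+'
    # Drop the empty strings that re.split leaves around adjacent delimiters.
    return [part for part in re.split(pattern, title.strip().lower()) if part]

titles = [" and training requirements for selected salt applications",
          "december 31",
          "experimental system for salt in an emergence research and applications in process heat"]

results = [split_title(t) for t in titles]
for result in results:
    print(result)
```

With the default threshold of 3 this should reproduce the three output lists shown above. For a pandas column such as df_tpdownloads['PublicationTitleSplit'], the helper could presumably be applied element by element, for example with Series.apply.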