Python regex for the pattern 2 digits to 2 digits, like 26 to 40


python-3.x regex


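Please help me out, regex has blown my mind.

I am cleaning data in a pandas dataframe (Python 3).

I have tried many regex combinations for digits that I found on the web, but none of them work for my case. I can't figure out how to write my own regex for the pattern 2 digits, space, "to", space, 2 digits (for example, 26 to 40).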

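My challenge is to extract the number of petals from a pandas column (scraped data). Petals are often specified as "dd to dd petals". I understand that two digits in regex are \d\d or \d{2}, but how do I incorporate the "to" that separates the two numbers? It would also be good to have a condition that the pattern is followed by the word "petals".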

I am surely not the first person who needs a Python regex for the pattern \d\d to \d\d.

Edit:

I realized that the question is a bit confusing without a sample dataframe. Here is one:

import pandas as pd 
import re

# initialize list of lists 
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks.  Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Every Good Gift', 'Red.  Flowers velvety red.  Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
    ['Evghenya', 'Orange-pink.  75 petals.  Large, very double bloom form.  Blooms in flushes throughout the season.'], 
    ['Evita', 'White or white blend.  None to mild fragrance.  35 petals.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
    ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['NAME', 'BLOOM']) 

# print dataframe. 
df

This works for me:

import re

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
re.compile(r"\d{2}\sto\s\d{2}\spetals").findall(sample)

Output:

['26 to 40 petals', '16 to 43 petals']

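As you noted, \d{2} matches a two-digit number, \sto\s matches the word "to" surrounded by whitespace, and the second \d{2} matches the second two-digit number, which is followed by a whitespace character (\s) and the word "petals".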
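If the two numbers are needed separately rather than as one string, capture groups can be added around each \d{2}. The following is a minimal sketch, not part of the original answer: the helper name petal_ranges is hypothetical, and the (?<!\d) guard is borrowed from the other answer in this thread.

```python
import re

# Hypothetical helper; the pattern mirrors the one explained above,
# with a capture group around each two-digit number.
RANGE_RE = re.compile(r"(?<!\d)(\d{2})\s+to\s+(\d{2})\s+petals")

def petal_ranges(text):
    """Return (low, high) integer pairs for every 'dd to dd petals' match."""
    return [(int(low), int(high)) for low, high in RANGE_RE.findall(text)]

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
print(petal_ranges(sample))  # [(26, 40), (16, 43)]
```

With two groups in the pattern, findall returns a tuple per match, which makes the conversion to integers straightforward.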

You can use

df['res_col'] = df['src_col'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)
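The (?<!\d) at the start is a negative lookbehind: it rejects a match whose first digit is itself preceded by a digit, so the tail of a longer number is not picked up. A small sketch with plain re illustrates the difference; the helper first_match and the test string are made up for illustration.

```python
import re

with_guard = re.compile(r"(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal")
without_guard = re.compile(r"(\d{2}\s+to\s+\d{2})\s*petal")

def first_match(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

text = "135 to 40 petals"  # made-up input with a three-digit number
print(first_match(without_guard, text))  # 35 to 40
print(first_match(with_guard, text))     # None
```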

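I am posting an answer to explain how I solved the problem of extracting petal data from the BLOOM column. I had to use several regular expressions to get all the data I wanted; this question covers only one of them.

Before I ran into the problem that led to this post, I had already created the PETALS and ALL_PETALS_BRACKETS columns. My initial approach was to grab all the data inside brackets: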
# copy the content of the first pair of brackets in column BLOOM into a new column PETALS
# (the capture group sits inside the escaped brackets, so no "(" is left to strip afterwards)
df['PETALS'] = df['BLOOM'].str.extract(r'\((.*?)\)', expand=False).str.strip()

# copy the content of every pair of brackets in column BLOOM into a new column ALL_PETALS_BRACKETS
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall(r'\((.*?)\)')
df[['NAME','BLOOM','PETALS', 'ALL_PETALS_BRACKETS']]
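The bracket patterns above rely on the lazy quantifier .*? so that each match stops at the nearest closing bracket. As a rough illustration with plain re (the text is shortened from one of the sample rows), a greedy .* would instead run to the last closing bracket in the string:

```python
import re

text = 'Medium, double (17-25 petals), full (26-40 petals), cluster-flowered.'

# Lazy: each match ends at the first ")" that follows its "(".
lazy = re.findall(r'\((.*?)\)', text)

# Greedy: a single match runs from the first "(" to the last ")".
greedy = re.findall(r'\((.*)\)', text)

print(lazy)    # ['17-25 petals', '26-40 petals']
print(greedy)  # ['17-25 petals), full (26-40 petals']
```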


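Later I realized that this approach only captures the petal values for some rows. Petals can be specified in several ways in the BLOOM column. Another common pattern is "2 digits to 2 digits", and there is also the pattern "2 digits petals".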
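If the dot after "petals" is left unescaped, as in \d{2}\s+petals., the regex captures the word "petals" followed by any character, not only a period.

Did you mean df['source_col'].str.extract(r'\b(\d{2}\s+to\s+\d{2})\s*petal', expand=False)?

Yes, I misunderstood the question.

That did not work in my case, but thanks for the regex. With its help I was able to figure it out.

All the examples in this post helped me, thank you. I had to modify the regex a bit to make it work for me and add .str.strip() after the extract: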
# solution provided by Wiktor Stribiżew
df['PETALS_Wiktor_S'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

# my modification, which worked on the main df and not only on the test one
# copy the part of column BLOOM that matches the pattern "two digits to two digits"
df['PETALS5'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})', expand=False).str.strip()

# I also came across cases where the pattern is two digits followed by the word "petals"
# copy the part of column BLOOM that matches two digits followed by "petals."
df['PETALS6'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals\.)', expand=False).str.strip()
df
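The three ways petals appear in BLOOM (a "dd to dd petals" range, a plain "dd petals" count, and a bracketed "(dd-dd petals)" class) could also be handled by one helper that tries the patterns in a fixed order. This is only a sketch under assumptions: the function name extract_petals, the priority order, and the tuple return value are illustrative choices, not what the original answer did.

```python
import re

# Hypothetical patterns, tried in this order of preference.
PATTERNS = [
    re.compile(r"(?<!\d)(\d{2})\s+to\s+(\d{2})\s+petals"),  # "35 to 40 petals"
    re.compile(r"(?<![\d-])(\d{2})\s+petals"),              # "75 petals"
    re.compile(r"\((\d{2})-(\d{2})\s+petals\)"),            # "(26-40 petals)"
]

def extract_petals(text):
    """Return the numbers of the first pattern that matches, as a tuple of ints."""
    for pattern in PATTERNS:
        m = pattern.search(text)
        if m:
            return tuple(int(g) for g in m.groups())
    return None

print(extract_petals('Mild fragrance.  35 to 40 petals.  Medium, double (17-25 petals).'))  # (35, 40)
print(extract_petals('Orange-pink.  75 petals.  Large, very double bloom form.'))          # (75,)
print(extract_petals('Medium-large, full (26-40 petals), borne mostly solitary.'))         # (26, 40)
```

The lookbehind (?<![\d-]) in the second pattern keeps "40 petals" inside "(26-40 petals)" from being read as a single count. A function like this could be applied to a column with Series.apply.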