Python 3.x: Python regex for the pattern 2 digits to 2 digits, e.g. 26 to 40
python-3.x, regex, data-cleaning, data-wrangling
Please help me, regex blows my mind.
I am cleaning data in a pandas DataFrame (Python 3).
I have tried many regex combinations for numbers that I found online, but none of them works for my case. I can't figure out how to write my own regex for the pattern "2 digits, space, to, space, 2 digits" (for example, 26 to 40).
My challenge is to extract the number of petals from a pandas column (scraped data). The petals are often specified as "dd to dd petals". I know that two digits in a regex are \d\d or \d{2}, but how do I incorporate the "to" that separates the two numbers? It would also be good to have a condition that the pattern is followed by the word "petals".
I am surely not the first person who needs a Python regex for the pattern \d\d to \d\d.
Edit:
I realize that the question is a bit confusing without a sample DataFrame. Here is a sample DataFrame:
import pandas as pd
import re
# initialize list of lists
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.'],
['Every Good Gift', 'Red. Flowers velvety red. Moderate fragrance. Average diameter 4". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.'],
['Evghenya', 'Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.'],
['Evita', 'White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.'],
['Evrathin', 'Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.'],
['Evita 2', 'White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['NAME', 'BLOOM'])
# print dataframe.
df
This works for me:
import re
sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
re.compile(r"\d{2}\sto\s\d{2}\spetals").findall(sample)
Output:
['26 to 40 petals', '16 to 43 petals']
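If the two numbers are needed separately rather than as one string, a small variation on the snippet above is to wrap each \d{2} in a capture group. This is only a sketch; the names `pattern` and `ranges` are made up for illustration.

```python
import re

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'

# Each pair of parentheses is a capture group, so findall returns
# one (low, high) tuple per match instead of the whole matched text.
pattern = re.compile(r"(\d{2})\sto\s(\d{2})\spetals")
ranges = [(int(low), int(high)) for low, high in pattern.findall(sample)]
print(ranges)  # [(26, 40), (16, 43)]
```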
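As you noted, \d{2} matches a two-digit number, \sto\s matches the word "to" surrounded by whitespace, and the second \d{2} matches the second two-digit number, followed by a whitespace character (\s) and the word "petals".
You can use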
df['res_col'] = df['src_col'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)
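To see what the lookbehind and the \s+ / \s* parts do, here is a minimal sketch on a few made-up strings (the Series `s` and its contents are assumptions, not data from the question). `(?<!\d)` rejects a match whose first digit is preceded by another digit, and only the parenthesised part of the pattern is returned.

```python
import pandas as pd

# Hypothetical sample strings, not taken from the real data set
s = pd.Series([
    '35 to 40 petals. Average diameter 2.5".',  # plain case: matches
    'Medium-large, full (26-40 petals)',         # hyphen instead of "to": no match
    '135 to 40 petals',                          # three digits before "to": no match
    '20  to 25petals',                           # irregular spacing: still matches
])

res = s.str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)
print(res.tolist())
```

Rows without a match come back as NaN, so they can be filled later from another pattern.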
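I am posting an answer to show how I solved extracting the petals data from the BLOOM column. I had to use several regexes to get all the data I wanted; this question covers only one of them.
The sample DataFrame looks like this when printed: [printed output not included]
I created these columns before I ran into the problem that led to this post. My initial approach was to grab all the data inside the brackets.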
# copying the content of the first pair of brackets in column BLOOM into a new column PETALS
df['PETALS'] = df['BLOOM'].str.extract(r'(\(.*?)\)', expand=False).str.strip()
# regex=False treats "(" as a literal character rather than as a regex pattern
df['PETALS'] = df['PETALS'].str.replace("(", "", regex=False)
# copying the content of all brackets in column BLOOM into a new column ALL_PETALS_BRACKETS
# (each captured item keeps its leading "(")
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall(r'(\(.*?)\)')
df[['NAME','BLOOM','PETALS', 'ALL_PETALS_BRACKETS']]
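Later I realized that this approach only gets the petal values for some rows. Petals can be specified in several ways in the BLOOM column. Another common pattern is "2 digits to 2 digits". There is also the pattern "2 digits petals".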
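If the regex is written as r'(\d{2}\s+petals [...] it captures the word "petals" followed by [...] and "("
df['source_col'].str.extract(r'\b(\d{2}\s+to\s+\d{2})\s*petal', expand=False)?
Yes, I misunderstood the question. This does not work in my case, but thanks for the regex. With its help I was able to figure it out. All the examples in this post helped me. Thank you.
I had to modify the regex a bit to make it work for me, and add .str.strip() after the extract.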
# solution provided by Wiktor Stribiżew
df['PETALS_Wiktor_S'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)
# my modification that worked on the main df and not only on the test one.
# now let's copy the part of column BLOOM that matches the pattern "two digits to two digits"
df['PETALS5'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})', expand=False).str.strip()
# I also came across cases where the pattern is two digits followed by the word "petals"
# now let's copy the part of column BLOOM that matches the pattern "two digits followed by the word petals"
df['PETALS6'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals+\.)', expand=False).str.strip()
df
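The three ways of writing petal counts described above ("dd-dd petals" in brackets, "dd to dd petals", and "dd petals") could also be handled in one pass. The following is one possible consolidation, not the approach used above: the pattern, the group names `low` and `high`, and the shortened sample strings are all assumptions made for illustration.

```python
import pandas as pd

# Shortened BLOOM texts based on the sample DataFrame above
bloom = pd.Series([
    'Carmine-pink. Large, very double bloom form.',
    'Red. Average diameter 4". Medium-large, full (26-40 petals).',
    'Orange-pink. 75 petals. Large, very double bloom form.',
    'Light pink. Outer petals white. 35 to 40 petals. Medium, double (17-25 petals).',
])

# low: first number; high: optional second number after "to" or "-"
pattern = (r'(?<!\d)(?P<low>\d{1,3})'
           r'(?:\s*(?:to|-)\s*(?P<high>\d{1,3}))?'
           r'\s*petals?\b')

# str.extract keeps only the first match in each row;
# named groups become the column names of the result
petals = bloom.str.extract(pattern).astype(float)
print(petals)
```

Rows with a single count get a value in `low` only, and rows without any petal count stay NaN in both columns. Because only the first match per row is kept, a row that lists several ranges would need `str.extractall` instead.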