Python regex for the pattern 2 digits to 2 digits, e.g. 26 to 40

Tags: python-3.x, regex, data-cleaning, data-wrangling

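Please help me; regex has me stumped.

I am cleaning data in a pandas DataFrame (Python 3).

I have tried many regex combinations for digits that I found online, but none of them works for my case. I can't figure out how to write my own regex for the pattern "2 digits, space, to, space, 2 digits" (for example, 26 to 40).

My challenge is to extract the number of petals from a pandas column of scraped data. Petals are often specified as "dd to dd petals". I know that two digits in a regex are \d\d or \d{2}, but how do I include the "to" that separates the two numbers? It would also be good to have a condition that the pattern is followed by the word "petals".

I am surely not the first person who needs a Python regex for the pattern \d\d to \d\d.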

Edit:

I realize that the question is a bit confusing without a sample data frame. Here is one:

import pandas as pd 
import re

# initialize list of lists 
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks.  Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Every Good Gift', 'Red.  Flowers velvety red.  Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
    ['Evghenya', 'Orange-pink.  75 petals.  Large, very double bloom form.  Blooms in flushes throughout the season.'], 
    ['Evita', 'White or white blend.  None to mild fragrance.  35 petals.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
    ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
    ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['NAME', 'BLOOM']) 

# print dataframe. 
df 
This works for me:

import re

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'
re.compile(r"\d{2}\sto\s\d{2}\spetals").findall(sample)
Output:

['26 to 40 petals', '16 to 43 petals']
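As you noted, \d{2} matches two digits. \sto\s matches the word "to" surrounded by whitespace, the second \d{2} matches the second two-digit number, and the final \s and "petals" require a whitespace character followed by the word "petals".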
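If the two numbers are needed separately, one option is to wrap each \d{2} in a capture group. The sketch below reuses the same sample string; the names RANGE_RE and ranges are made up for illustration.

```python
import re

# Same pattern as above, with a capture group around each number.
RANGE_RE = re.compile(r"(\d{2})\sto\s(\d{2})\spetals")

sample = '2 digits (example 26 to 40 petals) and 16 to 43 petals.'

# With two groups, findall returns one (low, high) tuple of strings per match.
ranges = [(int(low), int(high)) for low, high in RANGE_RE.findall(sample)]
print(ranges)  # [(26, 40), (16, 43)]
```

Converting to int at this point makes later filtering or averaging straightforward.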

You can use

df['res_col'] = df['src_col'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)
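To see what the (?<!\d) lookbehind in that pattern contributes, here is a small comparison using plain re. The test string is invented for illustration and does not come from the data set.

```python
import re

# Pattern from the answer above: the lookbehind rejects a match
# that starts in the middle of a longer number.
with_guard = re.compile(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal')
# Same pattern without the lookbehind, for comparison.
without_guard = re.compile(r'(\d{2}\s+to\s+\d{2})\s*petal')

# Made-up text: the first range starts with a three-digit number.
text = 'Hybrid A: 135 to 40 petals. Hybrid B: 35 to 40 petals.'

print(without_guard.findall(text))  # ['35 to 40', '35 to 40']
print(with_guard.findall(text))     # ['35 to 40']
```

Without the guard, "35 to 40" is picked out of "135 to 40". With it, only the genuine two-digit range matches. Note also that Series.str.extract returns only the first match in each row.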

I am posting an answer to show how I solved the problem of extracting the petals data from the BLOOM column. I had to use several regexes to get all the data I wanted. This question covers only one of the regexes I used.

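When printed, the sample data frame looks like this:

I created these columns before I ran into the problem that led to this post. My initial approach was to grab all the data inside brackets.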

# copy the content of the first pair of brackets in column BLOOM into new column PETALS
df['PETALS'] = df['BLOOM'].str.extract('(\\(.*?)\\)', expand=False).str.strip()
# the capture group includes the opening bracket, so remove it (literal match, not a regex)
df['PETALS'] = df['PETALS'].str.replace("(", "", regex=False)

# copy the content of all brackets in column BLOOM into new column ALL_PETALS_BRACKETS
df['ALL_PETALS_BRACKETS'] = df['BLOOM'].str.findall('(\\(.*?)\\)')
df[['NAME','BLOOM','PETALS', 'ALL_PETALS_BRACKETS']]
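The bracket extraction above captures the opening bracket and strips it afterwards. A possible alternative, sketched here on a few shortened strings from the sample data, is to put the capture group inside the escaped brackets so that no clean-up step is needed. The names bloom, first and all_groups are illustrative only.

```python
import pandas as pd

# Shortened BLOOM strings based on the sample data frame.
bloom = pd.Series([
    'Medium-large, full (26-40 petals), borne mostly solitary bloom form.',
    'Medium, double (17-25 petals), full (26-40 petals), cluster-flowered.',
    'Large, very double bloom form.',
])

# \( and \) match the literal brackets; only the text between them is captured.
first = bloom.str.extract(r'\((.*?)\)', expand=False)
all_groups = bloom.str.findall(r'\((.*?)\)')

print(first.tolist())
print(all_groups.tolist())
```

Rows without brackets give NaN from extract and an empty list from findall.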

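Later I realized that this approach gets the petal values for only some of the rows. Petals can be specified in several ways in the BLOOM column. Another common pattern is "2 digits to 2 digits". There is also the pattern "2 digits petals".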

If the regex is written as r'(\d{2}\s+petals… it captures the word petals followed by…

df['source_col'].str.extract(r'\b(\d{2}\s+to\s+\d{2})\s*petal', expand=False)?

Yes, I misunderstood the question.

This does not work in my case, but thank you for the regex. With its help I was able to figure it out.

All the examples in this post helped me. Thank you. I had to modify the regex a little to make it work for me, and add ".str.strip()" after the extract.

# solution provided by Wiktor Stribiżew
df['PETALS_Wiktor_S'] = df['BLOOM'].str.extract(r'(?<!\d)(\d{2}\s+to\s+\d{2})\s*petal', expand=False)

# my modification, which worked on the main df and not only on the test one
# copy the part of column BLOOM that matches the pattern "two digits to two digits"
df['PETALS5'] = df['BLOOM'].str.extract(r'(\d{2}\s+to\s+\d{2})', expand=False).str.strip()

# I also came across cases where the pattern is two digits followed by the word "petals"
# copy the part of column BLOOM that matches two digits followed by "petals" and a period
df['PETALS6'] = df['BLOOM'].str.extract(r'(\d{2}\s+petals+\.)', expand=False).str.strip()
df
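
The three formats seen in this thread ("dd to dd petals", "dd-dd petals" and "dd petals") could also be handled in a single pass with one pattern and named groups. This is only a sketch under that assumption, not the approach used above. PETALS_RE and the column names low and high are hypothetical, and the strings are shortened versions of the sample rows.

```python
import pandas as pd

# Shortened BLOOM strings based on the sample data frame.
bloom = pd.Series([
    'Orange-pink.  75 petals.  Large, very double bloom form.',
    'Mild fragrance.  35 to 40 petals.  Medium, double (17-25 petals).',
    'Red.  Average diameter 4".  Medium-large, full (26-40 petals).',
    'Carmine-pink.  Large, very double, in small clusters.',
])

PETALS_RE = (
    r'(?<!\d)(?P<low>\d{2})'               # first two-digit number
    r'(?:\s*(?:to|-)\s*(?P<high>\d{2}))?'  # optional "to dd" or "-dd"
    r'\s*petals'                           # must be followed by "petals"
)

# Named groups become the column names; only the first match per row is used.
petals = bloom.str.extract(PETALS_RE)
# A single count such as "75 petals" is treated as a range of 75 to 75.
petals['high'] = petals['high'].fillna(petals['low'])
petals = petals.apply(pd.to_numeric)
print(petals)
```

Rows with no petal count stay NaN in both columns. Like the patterns above, this sketch only recognizes two-digit numbers.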