Using a while condition or extractall with regex (or something else) to handle new data
Tags: regex, python-3.x, pandas, while-loop
I read a dataset from a file and assumed the whole file had the same layout (500-600 lines in total), so I used this code to build the DataFrame I need:
import pandas as pd

with open("dataset.txt", 'r') as infile:
    l = [x.replace(']', ',').replace("[", '').replace('"', '').replace('\n', '').strip().split(',') for x in infile]
df = pd.DataFrame(l)
df['A'] = list(range(len(df.index)))
del df[2]
df.rename(columns={1: 'nutrient'}, inplace=True)
df[['amount_S']] = df['nutrient'].str.extract(pat=r'(?:\'\s\')(S|\d+\.\d+)', expand=True).fillna(0)
# regex=True / regex=False are spelled out because the str.replace default changed in pandas 2.0
df['nutrient'] = df['nutrient'].str.replace(pat=r'\'\s\'S|\d+', repl='', regex=True)
df['nutrient'] = df['nutrient'].str.replace('\'', repl='', regex=False)
df['nutrient'] = df['nutrient'].str.replace('.', repl='', regex=False)
The code outputs a DataFrame that is ready to be joined with another dataset and pivoted.
Now I have found that my file also contains entries like the following:
0,['' '' '' '']
1,['Size' 'S' 'Size' 'M']
2,['Energy (kJ)' '351' 'Energy (kJ)' '617']
3,['Protein (g)' '2.3' 'Protein (g)' '4']
4,['Carbohydrates (g)' '15.4' 'Carbohydrates (g)' '26.9']
5,['Sugars (g)' '1.9' 'Sugars (g)' '3.3']
6,['Total Fat (g)' '0.6' 'Total Fat (g)' '1']
7,['Saturated Fat' '0.1' 'Saturated Fat' '0.1']
8,['Trans Fat (g)' '0' 'Trans Fat (g)' '0']
9,['Dietary Fibre (g)' '1.9' 'Dietary Fibre (g)' '3.4']
10,['Sodium (mg)' '2' 'Sodium (mg)' '4']
11,['Serving Size (g)' '75' 'Serving Size (g)' '125']
0,['' '' '' '' '' '' '' '']
1,['Size' 'S' 'Size' 'M' 'Size' 'L' 'Size' 'XL']
2,"['Energy (kJ)' '1431' 'Energy (kJ)' '2030' 'Energy (kJ)' '2863' 'Energy (kJ)' '3383']"
3,"['Protein (g)' '5.7' 'Protein (g)' '8.1' 'Protein (g)' '11.4' 'Protein (g)' '13.5']"
4,"['Carbohydrates (g)' '41.5' 'Carbohydrates (g)' '58.8' 'Carbohydrates (g)' '82.9' 'Carbohydrates (g)' '98']"
5,"['Sugars (g)' '1.2' 'Sugars (g)' '1.7' 'Sugars (g)' '2.4' 'Sugars (g)' '2.9']"
6,"['Total Fat (g)' '17.9' 'Total Fat (g)' '25.4' 'Total Fat (g)' '35.9' 'Total Fat (g)' '42.4']"
7,"['Saturated Fat' '7.9' 'Saturated Fat' '11.2' 'Saturated Fat' '15.8' 'Saturated Fat' '18.7']"
8,"['Trans Fat (g)' '0' 'Trans Fat (g)' '0' 'Trans Fat (g)' '0' 'Trans Fat (g)' '0']"
9,"['Dietary Fibre (g)' '3.7' 'Dietary Fibre (g)' '5.3' 'Dietary Fibre (g)' '7.5' 'Dietary Fibre (g)' '8.8']"
10,"['Sodium (mg)' '305' 'Sodium (mg)' '432.1' 'Sodium (mg)' '609' 'Sodium (mg)' '720']"
11,"['Serving Size (g)' '110' 'Serving Size (g)' '156' 'Serving Size (g)' '220' 'Serving Size (g)' '260']"
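I want to move the numeric data into new columns (amount_M, amount_L, amount_XL). The 'nutrient' column does not need to be repeated. What is the best way to handle these cases?
Use: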
import ast
import pandas as pd

# read the file into a 2-column DataFrame
df = pd.read_csv('file5.csv', names=['a','b'])
# put commas between the quoted items, then convert each row to a list
df['b'] = df['b'].str.replace(r"'\s+'", "','", regex=True).apply(ast.literal_eval)
# remove rows with 0 in column a
df = df[df['a'] != 0]
#print (df)
fin = {}
# create a dictionary of DataFrames - groupby by a helper Series -
# each group must start with the value 1 in column a so the groups can be told apart
for i, x in dict(tuple(df.groupby(df['a'].eq(1).cumsum().sub(1)))).items():
    # print (x)
    # create a DataFrame from column b, the first row is the header
    df2 = pd.DataFrame(x.b.values.tolist()[1:], columns=x.b.iloc[0])
    # remove duplicated column names
    df2 = df2.loc[:, ~df2.columns.duplicated()]
    # print (df2)
    # store the output in the dictionary (if necessary)
    fin[i] = df2
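The comma-insertion step in the code above may be easier to follow on a single string. The sketch below uses the standard-library re module instead of pandas (an assumption made only to keep the example small); the sample string is taken from the question's data.

```python
import ast
import re

raw = "['Size' 'S' 'Size' 'M']"

# Without commas, Python joins adjacent string literals into one string,
# so the row would collapse into a single item.
collapsed = ast.literal_eval(raw)

# Insert a comma between every pair of adjacent quoted items.
as_literal = re.sub(r"'\s+'", "','", raw)
row = ast.literal_eval(as_literal)

print(collapsed)  # one-element list
print(row)        # four-element list
```

This is why the replacement has to run before ast.literal_eval: the text only becomes a proper list literal once the items are separated by commas.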
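Although you already have an answer, I would like to offer my approach as well.
Here the varying number of sizes is taken into account while building the DataFrame with a regular expression and a generator: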
(?:
\G(?!\A)\D+ # match after the last match + non-digit
(?P<value>\d+(?:\.\d+)?) # capture digits
| # or
'(?P<key>[A-Z][^']+)' # match between 'A-Z...'
)
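Thank you very much for this work, but even with the comments I can't say the code is entirely clear to me. Could you explain explicitly the steps happening here: "for i, x in dict(tuple(df.groupby(df['a'].eq(1).cumsum().sub(1)))).items()"?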
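df['a'].eq(1).cumsum().sub(1) means: first compare column a with 1 (df['a'].eq(1) is the same as df['a'] == 1). This gives True and False values, with True on the rows where a is 1, and cumsum then numbers the groups. So a pattern such as True, False, True, False becomes 1, 1, 2, 2, and the final sub(1) subtracts 1 so that counting starts from 0, giving 0, 0, 1, 1. Finally, dict(tuple(...)) converts the groupby object into a dictionary of DataFrames, and items() is used for looping over it.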
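The explanation above can be checked on a toy column. This is a minimal sketch: the Series a and the frame demo are made-up stand-ins for the real data, not part of the original answer.

```python
import pandas as pd

# Toy version of column 'a': the row number restarts at 1 for every table
# (rows with 0 are assumed to have been filtered out already).
a = pd.Series([1, 2, 3, 1, 2])

starts = a.eq(1)           # True where a new table begins
running = starts.cumsum()  # group counter starting at 1
key = running.sub(1)       # same counter, shifted to start at 0

# Hypothetical payload column, only there to show which rows land in which group.
demo = pd.DataFrame({'a': a, 'b': list('vwxyz')})
groups = dict(tuple(demo.groupby(key)))

for i, part in groups.items():
    print(i, part['b'].tolist())
```

Each value in groups is an ordinary DataFrame holding the rows of one table, which is what the loop in the answer receives as x.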
print (fin[0])
Size S M
0 Energy (kJ) 351 617
1 Protein (g) 2.3 4
2 Carbohydrates (g) 15.4 26.9
3 Sugars (g) 1.9 3.3
4 Total Fat (g) 0.6 1
5 Saturated Fat 0.1 0.1
6 Trans Fat (g) 0 0
7 Dietary Fibre (g) 1.9 3.4
8 Sodium (mg) 2 4
9 Serving Size (g) 75 125
print (fin[1])
Size S M L XL
0 Energy (kJ) 1431 2030 2863 3383
1 Protein (g) 5.7 8.1 11.4 13.5
2 Carbohydrates (g) 41.5 58.8 82.9 98
3 Sugars (g) 1.2 1.7 2.4 2.9
4 Total Fat (g) 17.9 25.4 35.9 42.4
5 Saturated Fat 7.9 11.2 15.8 18.7
6 Trans Fat (g) 0 0 0 0
7 Dietary Fibre (g) 3.7 5.3 7.5 8.8
8 Sodium (mg) 305 432.1 609 720
9 Serving Size (g) 110 156 220 260
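If a single frame with amount_S, amount_M, amount_L and amount_XL columns is wanted, as the question asks, the per-table frames could be combined roughly as follows. This is only a sketch: the shortened fin dictionary is a stand-in built from the values printed above, and the column name table is a hypothetical label for the dictionary key.

```python
import pandas as pd

# Stand-ins for fin[0] and fin[1], shortened to two rows each.
fin = {
    0: pd.DataFrame({'Size': ['Energy (kJ)', 'Protein (g)'],
                     'S': ['351', '2.3'], 'M': ['617', '4']}),
    1: pd.DataFrame({'Size': ['Energy (kJ)', 'Protein (g)'],
                     'S': ['1431', '5.7'], 'M': ['2030', '8.1'],
                     'L': ['2863', '11.4'], 'XL': ['3383', '13.5']}),
}

combined = (
    pd.concat(fin, names=['table'])   # dict keys become index level 'table'
      .reset_index(level='table')     # move the table id into a column
      .reset_index(drop=True)
      .rename(columns={'Size': 'nutrient'})
)

# Convert the size columns to numbers and give them the amount_ prefix.
size_cols = [c for c in combined.columns if c not in ('table', 'nutrient')]
combined[size_cols] = combined[size_cols].apply(pd.to_numeric)
combined = combined.rename(columns={c: f'amount_{c}' for c in size_cols})
print(combined)
```

Tables that lack a size (here table 0 has no L or XL) get NaN in the corresponding columns, so the nutrient column appears only once per row.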
# third-party 'regex' module (supports \G), not the standard-library re
import regex as re
import pandas as pd

# construct the regular expression
rx = re.compile(r'''(?:\G(?!\A)\D+(?P<value>\d+(?:\.\d+)?)|'(?P<key>[A-Z][^']+)')''')
# skip empty lines and lines without any number
valid = re.compile(r'\d+,\D+\d')
# generator expression: one list per valid line
result = ([m.group('key') if m.group('key') is not None else m.group('value')
           for m in rx.finditer(line)]
          for line in your_string_here.split('\n')
          if valid.match(line))
df = pd.DataFrame(result, columns = ['nutrition', 'S', 'M', 'L', 'XL'])
print(df)
nutrition S M L XL
0 Energy (kJ) 1644 None None None
1 Protein (g) 20.9 None None None
2 Carbohydrates (g) 33.6 None None None
3 Sugars (g) 1.8 None None None
4 Total Fat (g) 18.7 None None None
5 Saturated Fat 4.9 None None None
6 Trans Fat (g) 0 None None None
7 Dietary Fibre (g) 5.2 None None None
8 Sodium (mg) 845 None None None
9 Serving Size (g) 180 None None None
10 Energy (kJ) 351 617 None None
11 Protein (g) 2.3 4 None None
12 Carbohydrates (g) 15.4 26.9 None None
13 Sugars (g) 1.9 3.3 None None
14 Total Fat (g) 0.6 1 None None
15 Saturated Fat 0.1 0.1 None None
16 Trans Fat (g) 0 0 None None
17 Dietary Fibre (g) 1.9 3.4 None None
18 Sodium (mg) 2 4 None None
19 Serving Size (g) 75 125 None None
20 Energy (kJ) 1431 2030 2863 3383
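For readers who cannot install the regex module, a similar per-line parse seems possible with the standard library alone. The sketch below is a swapped-in technique, not the answer's method: it does not use \G, and parse_line is a hypothetical helper that relies on names and values alternating inside the brackets.

```python
import re

# Every quoted token on a line; names and values alternate.
token = re.compile(r"'([^']*)'")
number = re.compile(r"\d+(?:\.\d+)?")

def parse_line(line):
    """Return [name, value1, value2, ...], or None for rows without numeric values."""
    tokens = token.findall(line)
    values = tokens[1::2]
    # Blank rows and the 'Size' header row have no numeric values, so skip them.
    if not values or not all(number.fullmatch(v) for v in values):
        return None
    return [tokens[0]] + values

lines = [
    "0,['' '' '' '']",
    "1,['Size' 'S' 'Size' 'M']",
    "2,['Energy (kJ)' '351' 'Energy (kJ)' '617']",
    "3,\"['Protein (g)' '5.7' 'Protein (g)' '8.1' 'Protein (g)' '11.4' 'Protein (g)' '13.5']\"",
]
rows = [r for r in map(parse_line, lines) if r is not None]
print(rows)
```

The resulting rows have different lengths, just like the lists produced by the generator above, and could be passed to pd.DataFrame in the same way.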