Python: create a column by removing the unwanted part of a string based on a condition
I'm new to Python and I'm stuck here. I have a DataFrame like the one below, and I'm trying to create a new column that contains only the macro genre from the "Genres" column.
DataFrame:
import pandas as pd
d = {'Genres': ['Finance', 'Arcade', 'Business', 'Photography', 'Entertainment;Brain Games', 'Medical', 'Tools', 'Casual;Brain Games', 'Medical', 'Entertainment'],
'Last Updated': ['March 10, 2018', 'May 24, 2018', 'April 11, 2018', 'November 6, 2014', 'March 9, 2018', 'May 17, 2018', 'June 3, 2016', 'April 10, 2016', 'July 16, 2018', 'May 17, 2017']}
df = pd.DataFrame(data=d)
df
   Genres                     Last Updated
0  Finance                    March 10, 2018
1  Arcade                     May 24, 2018
2  Business                   April 11, 2018
3  Photography                November 6, 2014
4  Entertainment;Brain Games  March 9, 2018
5  Medical                    May 17, 2018
6  Tools                      June 3, 2016
7  Casual;Brain Games         April 10, 2016
8  Medical                    July 16, 2018
9  Entertainment              May 17, 2017
The desired output looks like this:
   Genres                     macro_genres   Last Updated
0  Finance                    Finance        March 10, 2018
1  Arcade                     Arcade         May 24, 2018
2  Business                   Business       April 11, 2018
3  Photography                Photography    November 6, 2014
4  Entertainment;Brain Games  Entertainment  March 9, 2018
5  Medical                    Medical        May 17, 2018
6  Tools                      Tools          June 3, 2016
7  Casual;Brain Games         Casual         April 10, 2016
8  Medical                    Medical        July 16, 2018
9  Entertainment              Entertainment  May 17, 2017
What I tried:
def macro_genre(i):
    for i in df['Genres']:
        if ';' in i:
            j = i.split(';')[0]
            return j
        else:
            return i

df['macro_genres'] = df['Genres'].apply(macro_genre)
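But it doesn't work. It creates the column, but repeats the first value for the whole column.
When I try the body of the function outside the function, it works fine.
Any tips would be greatly appreciated! Thanks

You can simply use str.split(';'). If ; is not present in the string, nothing happens: a list containing the original string is returned (so you can always use [0]):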
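The accessor chain described above can be sketched as follows. This is a minimal example on a shortened copy of the question's data; the four sample rows are an abbreviation chosen here for illustration.

```python
import pandas as pd

# Shortened sample of the question's 'Genres' column (assumption: four rows are enough to illustrate)
df = pd.DataFrame({
    'Genres': ['Finance', 'Arcade', 'Entertainment;Brain Games', 'Casual;Brain Games'],
})

# .str.split(';') turns every cell into a list; .str[0] keeps the first element.
# Cells without ';' become one-element lists, so [0] returns the original string.
df['macro_genres'] = df['Genres'].str.split(';').str[0]

print(df)
```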
Prints:
   Genres                     Last_Updated      macro_genres
0  Finance                    March 10, 2018    Finance
1  Arcade                     May 24, 2018      Arcade
2  Business                   April 11, 2018    Business
3  Photography                November 6, 2014  Photography
4  Entertainment;Brain_Games  March 9, 2018     Entertainment
5  Medical                    May 17, 2018      Medical
6  Tools                      June 3, 2016      Tools
7  Casual;Brain Games         April 10, 2016    Casual
8  Medical                    July 16, 2018     Medical
9  Entertainment              May 17, 2017      Entertainment
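As for why the attempt in the question repeats the first value: Series.apply already calls the function once per cell, but the function ignores its argument, starts its own loop over df['Genres'], and returns during the first iteration, so every call yields 'Finance'. A corrected version of that function might look like the sketch below; the parameter name genre and the three sample rows are chosen here for illustration.

```python
import pandas as pd

# Small sample standing in for the question's DataFrame (assumption)
df = pd.DataFrame({
    'Genres': ['Finance', 'Entertainment;Brain Games', 'Casual;Brain Games'],
})

def macro_genre(genre):
    # apply() passes one cell value at a time, so no loop over the column is needed
    if ';' in genre:
        return genre.split(';')[0]
    return genre

df['macro_genres'] = df['Genres'].apply(macro_genre)

print(df)
```

The if/else is kept only to mirror the original attempt; since split returns the whole string as a single element when the separator is absent, the function body could be reduced to return genre.split(';')[0].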

One possibility is to use map:
df['macro_genres'] = df['Genres'].astype(str).map(lambda x: x.split(';')[0])
Output:
>>> df
   Genres                     macro_genres   Last Updated
0  Finance                    Finance        March 10, 2018
1  Arcade                     Arcade         May 24, 2018
2  Business                   Business       April 11, 2018
3  Photography                Photography    November 6, 2014
4  Entertainment;Brain Games  Entertainment  March 9, 2018
5  Medical                    Medical        May 17, 2018
6  Tools                      Tools          June 3, 2016
7  Casual;Brain Games         Casual         April 10, 2016
8  Medical                    Medical        July 16, 2018
9  Entertainment              Entertainment  May 17, 2017
Run-time comparison on a 1k DataFrame:
#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
535 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
1.36 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
527 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Run-time comparison on a 10k DataFrame:
#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
3.62 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
10 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
3.47 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Run-time comparison on a 50k DataFrame:
#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
17 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
16.7 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Run-time comparison on a 100k DataFrame:
#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
34.1 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#map method
>>> %timeit -n 1000 df['Genres'].astype(str).map(lambda x : x.split(';')[0])
35.5 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
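To reproduce a comparison like this outside IPython, something along the following lines could be used. It is only a sketch: the synthetic data, the helper make_frame, the chosen sizes, and the repeat count of 20 are assumptions made here, not the setup used for the numbers above, and absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample values to repeat; an assumption, not the answerer's benchmark data
base = ['Finance', 'Arcade', 'Entertainment;Brain Games', 'Casual;Brain Games', 'Medical']

def make_frame(n_rows):
    # Repeat the sample genres until the frame has exactly n_rows rows
    reps = n_rows // len(base) + 1
    return pd.DataFrame({'Genres': (base * reps)[:n_rows]})

# The three approaches discussed in the answers, each taking a Series
methods = {
    'apply': lambda s: s.apply(lambda x: x.split(';')[0]),
    'str.split': lambda s: s.str.split(';').str[0],
    'map': lambda s: s.map(lambda x: x.split(';')[0]),
}

for n_rows in (1_000, 10_000):
    genres = make_frame(n_rows)['Genres']
    for name, func in methods.items():
        # Average wall-clock time over 20 calls
        seconds = timeit.timeit(lambda: func(genres), number=20) / 20
        print(f'{n_rows:>6} rows  {name:<9}  {seconds * 1e3:.3f} ms per call')
```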

Comments:
I was going to say that df['macro_genres'] = df['Genres'].str.split(';').str[0] is probably the better answer, but from a quick %timeit, this answer performs better: 258 µs rather than 474 µs.
@DavidErickson Why not use map? I think it is faster than apply. See the run-time comparison on a 10k DataFrame in my answer.
@grayrigel I don't think you will see a performance difference between apply and map, as your answer shows.
@DavidErickson Yes, you are right. I ran tests with different lengths. I take back my statement that map is faster. It is slightly better on small DataFrames but seems to become slower on large ones.
Upvoted for the benchmarks. It would be nice to see a comparison for DataFrames of different lengths.
@AndrejKesely Thanks for the upvote. I have updated my answer with 1K, 10K, 50K and 100K DataFrames. I take back my statement that map is faster. It is slightly better on small DataFrames but seems to become slower on large ones.
Thanks a lot @Grayrigel! I was sure there was a clean and simple solution, but I couldn't get there. It worked perfectly!
Does speed matter? Wouldn't it be better to use the cleaner, more idiomatic approach?
@AMC I think speed matters, especially when dealing with large DataFrames. I'm not sure what could be better than half a line of code. What do you suggest? Do you have another approach?
Please provide a minimal reproducible example.
Sorry about that. I'm new here. Which part should be minimal and reproducible, the DataFrame itself? I typed it in because it is just a small part of a larger DataFrame.
"Which part should be minimal and reproducible, the DataFrame itself?" It should be possible to copy and paste your code and data and run the code right away.
Thanks for the tip, AMC. Although a solution has already been given, I have included the code that generates the DataFrame.