Python: extracting data from a CSV file with numpy

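I'm working with numpy and trying to find out which platform sold the most copies in North America.

I have a CSV file containing a lot of data that looks like this: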

Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33
5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37
6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01
8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02
9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62
10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76

I want to print the platform with the highest sales in North America, together with that sales figure. How can I do this?

With pandas, this is fairly straightforward.

Code:

import pandas as pd

# read the CSV data into a DataFrame (`data` is defined under "Test data" below)
df = pd.read_csv(data, skipinitialspace=True)

# sum NA_Sales per platform
platform_roll_up = df.groupby('Platform')['NA_Sales'].sum()

# find the platform with the largest total
idx_max = platform_roll_up.idxmax()

# show the platform and its total sales
print(idx_max, platform_roll_up[idx_max])

Test data:

from io import StringIO

data = StringIO(u"""
    Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
    1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
    2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
    3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
    4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33
    5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37
    6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
    7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01
    8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02
    9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62
    10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
    11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76
""")

Results:

Wii 101.71

Loading this file with genfromtxt is straightforward:

In [280]: data=np.genfromtxt('stack42602390.csv',delimiter=',',names=True, dtype=None)

In [281]: data
Out[281]: 
array([ ( 1, b'Wii Sports', b'Wii', 2006, b'Sports', b'Nintendo',  41.49,  29.02,   3.77,  8.46,  82.74),
       ( 2, b'Super Mario Bros.', b'NES', 1985, b'Platform', b'Nintendo',  29.08,   3.58,   6.81,  0.77,  40.24),
       ( 3, b'Mario Kart Wii', b'Wii', 2008, b'Racing', b'Nintendo',  15.85,  12.88,   3.79,  3.31,  35.82),
....
       (11, b'Nintendogs', b'DS', 2005, b'Simulation', b'Nintendo',   9.07,  11.  ,   1.93,  2.75,  24.76)], 
      dtype=[('Rank', '<i4'), ('Name', 'S25'), ('Platform', 'S3'), ('Year', '<i4'), ('Genre', 'S12'), ('Publisher', 'S8'), ('NA_Sales', '<f8'), ('EU_Sales', '<f8'), ('JP_Sales', '<f8'), ('Other_Sales', '<f8'), ('Global_Sales', '<f8')])
The index of the row with the largest NA_Sales value is:

In [283]: np.argmax(data['NA_Sales'])
Out[283]: 0
and the corresponding record is:

In [284]: data[0]
Out[284]: (1, b'Wii Sports', b'Wii', 2006, b'Sports', b'Nintendo',  41.49,  29.02,  3.77,  8.46,  82.74)

To make the most of this array, you will need to read up on structured arrays.
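The argmax above picks the single best-selling title, while the question asks for the platform with the largest total. A minimal sketch of that grouping with plain numpy follows; it is not part of the original answer. It assumes the sample rows are held in a string (`csv_text`) instead of the file used above, passes `encoding="utf-8"` so the text fields come back as `str` rather than bytes, and uses a hypothetical helper name, `top_platform`.

```python
import io
import numpy as np

# Assumption: the sample rows from the question, held in memory for illustration.
csv_text = """Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33
5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37
6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26
7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01
8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02
9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62
10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31
11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76
"""

# Structured array: one named field per CSV column.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                     dtype=None, encoding="utf-8")


def top_platform(records, group_field="Platform", value_field="NA_Sales"):
    """Return (label, total) for the group with the largest summed value."""
    # labels: sorted unique platforms; inverse: index into labels for every row
    labels, inverse = np.unique(records[group_field], return_inverse=True)
    # bincount with weights adds up value_field separately for each label
    totals = np.bincount(inverse, weights=records[value_field])
    best = np.argmax(totals)
    return str(labels[best]), float(totals[best])


platform, total = top_platform(data)
print(platform, round(total, 2))  # Wii 101.71
```

Because `np.unique` discovers the platform names itself, nothing has to be hard-coded per platform, so the same call should work unchanged with thousands of distinct platforms.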

What have you tried so far?

I hard-coded all the different platforms as masks, like: maskNES = (data[:,2] == 'NES'), then assigned each one to a variable, like: pfNES = data[maskNES][:,6].sum(). Finally, I compared all the platforms to find the one with the highest value. This seems like a silly way to do it. What if I had thousands of different platforms?

Oh, and I put the csv data into a matrix called "data".

Thanks for the quick answer! I'm trying to find a solution that works with numpy.ndarray, which has no iloc attribute. Should I stay away from ndarray in this case? Also, I'm trying to find the total NA_Sales value over all titles for platform X, rather than the highest single value. By the way, I'm very new to Python!

Thank you very much for the answer; your edited version is exactly what I wanted.

I tried this solution but ran into a problem: in my csv file some titles contain commas, and I can't add quotechar='"' to np.genfromtxt.

The csv package handles quotes, but the numpy readers do not. genfromtxt accepts input from anything that feeds it lines, so you can preprocess the lines and clean them up so that they can be parsed with a simple delimiter. This has been discussed in many previous SO questions, including a recent genfromtxt example with filtered input.
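The preprocessing idea in the last comment can be sketched as follows. This is only an illustration under stated assumptions, not the example that comment refers to: the two data rows are invented, the helper name `requote` is hypothetical, and a tab is assumed never to occur inside a field. `csv.reader` parses the quoted fields, each row is re-joined with tabs, and the cleaned lines are handed to `np.genfromtxt`.

```python
import csv
import io
import numpy as np

# Assumption: made-up rows; the second title contains a comma, so it is quoted.
raw = '''Rank,Name,Platform,NA_Sales
1,Example Game,Wii,41.49
2,"Example Game, Part 2",NES,3.5
'''


def requote(lines, out_delimiter="\t"):
    """Yield each CSV row re-joined with a delimiter that is not used in the data."""
    # csv.reader honours the quotes, so "Example Game, Part 2" stays one field
    for fields in csv.reader(lines):
        yield out_delimiter.join(fields)


# genfromtxt accepts a list of strings in place of a file name
cleaned = list(requote(io.StringIO(raw)))
data = np.genfromtxt(cleaned, delimiter="\t", names=True,
                     dtype=None, encoding="utf-8")

print(data["Name"])
print(data["NA_Sales"].sum())
```

For a real file, the `io.StringIO(raw)` argument would be replaced by a file object opened with `newline=""`, as the csv module documentation recommends.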