Cleaning big data using Python

python pandas

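I have to clean an input data file in Python. Because of typos, a data field may contain a string instead of a number. I would like to identify all fields that are strings and fill them with NaN using pandas. I would also like to log the index of those fields.

One of the crudest ways is to loop through every field and check whether it is a number, but this consumes a lot of time if the data is big.

My csv file contains data similar to the following table:

Country  Count  Sales
USA          1  65000
UK           3   4000
IND          8      g
SPA          3   9000
NTH          5  80000

.... Assume that there are 60,000 such rows in the data.

Ideally, I would like to identify that row IND has an invalid value under the SALES column. Any suggestions on how to do this efficiently?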

Try to convert the "sales" string to an int. If it is well formed, carry on; if it is not, int raises a ValueError, which we catch and replace with a placeholder.

bad_lines = []

# fname is the path to the input file
with open(fname) as f:
    header = f.readline()
    for j, l in enumerate(f):
        country, count, sales = l.split()
        try:
            sales_count = int(sales)
        except ValueError:
            # not a valid integer: store a placeholder and record the line index
            sales_count = 'NaN'
            bad_lines.append(j)
        # shove in to your data structure
        print(country, count, sales_count)

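You may need to edit the line that splits each row (your example was copied with spaces rather than tabs). Replace the print line with whatever you want to do with the data. You may also need to replace the 'NaN' string with the pandas NaN.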
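As a minimal sketch of that last point, the loop could store a real float NaN instead of the string 'NaN' and collect the rows for a DataFrame. The in-memory sample text, the parse_rows helper and the assumption that Count is always a valid integer are illustrative choices, not part of the original answer.

```python
import io
import math

import pandas as pd


def parse_rows(lines):
    """Hypothetical helper: return (rows, bad_lines) for whitespace-separated lines."""
    rows, bad_lines = [], []
    for j, line in enumerate(lines):
        country, count, sales = line.split()
        try:
            sales_count = int(sales)
        except ValueError:
            sales_count = float('nan')  # a real NaN, not the string 'NaN'
            bad_lines.append(j)
        # Assumption: Count is always a valid integer in this sample
        rows.append((country, int(count), sales_count))
    return rows, bad_lines


# Assumed sample data standing in for the real file
sample = io.StringIO(
    "Country Count Sales\n"
    "USA 1 65000\n"
    "UK 3 4000\n"
    "IND 8 g\n"
)
header = sample.readline().split()
rows, bad_lines = parse_rows(sample)
df = pd.DataFrame(rows, columns=header)
print(bad_lines)
print(df)
```

Because the placeholder is a float NaN, pandas stores the Sales column as floats and pd.isnull can be used on it directly.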

import os
import numpy as np
import pandas as pd

filename = os.path.expanduser('~/tmp/data.csv')
# genfromtxt turns Sales entries it cannot parse as floats into nan
df = pd.DataFrame(
        np.genfromtxt(
            filename, delimiter = '\t', names = True, dtype = '|O4,<i4,<f8'))
print(df)

To find the countries whose Sales value is NaN, you could compute

print(df['Country'][np.isnan(df['Sales'])])

which yields the pandas Series:

2    IND
Name: Country

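read_csv has a na_values argument for this:

na_values: list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

Using pd.isnull, you can select only the rows with NaN in the 'Sales' column, or in the 'Country' series: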

In [3]: df[pd.isnull(df['Sales'])]
Out[3]: 
  Country  Count  Sales
2     IND      8    NaN

In [4]: df[pd.isnull(df['Sales'])]['Country']
Out[4]: 
2    IND
Name: Country
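The dict form of na_values described above can be sketched as follows. This is only an illustration: the in-memory sample stands in for the real file, and it assumes the invalid strings (here just 'g') are known in advance, which is the main limitation of this approach.

```python
import io

import pandas as pd

# Assumed sample data standing in for the real file
sample = io.StringIO(
    "Country Count Sales\n"
    "USA 1 65000\n"
    "UK 3 4000\n"
    "IND 8 g\n"
    "SPA 3 9000\n"
    "NTH 5 80000\n"
)

# Treat 'g' as missing only in the Sales column
df = pd.read_csv(sample, sep=r'\s+', na_values={'Sales': ['g']})

# Log the index of every row whose Sales value was read as NaN
bad_index = df.index[df['Sales'].isnull()].tolist()
print(bad_index)
print(df.loc[bad_index, 'Country'])
```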

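If it's already in the DataFrame, you could use apply to convert the strings that are numbers into integers (using str.isdigit), as in the In [12] to In [14] session further down.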

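A plain loop over the file that prints every row whose Sales value is not an integer (the file.csv loop listed further down) reports:

IND g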

I suggest using a regular expression:

import re

ss = '''Country  Count  Sales
USA   ,      3  , 65000
UK    ,      3  ,  4000
IND   ,      8  ,     g
SPA   ,     ju  ,  9000
NTH   ,      5  , 80000
XSZ   ,    rob  ,    k3'''

with open('fofo.txt', 'w') as f:
    f.write(ss)

print(ss)
print()

delimiter = ','

# Groups 3 and 6 capture the Count and Sales fields only when they are numeric
regx = re.compile(r'(.+?(?:{0}))'
                  r'(( *\d+?)| *.+?)'
                  r'( *(?:{0}))'
                  r'(( *\d+?)| *.+?)'
                  r'( *\r?\n?)$'.format(delimiter))

def READ(filepath, regx=regx):
    # newline='' leaves the line endings of the file untouched
    with open(filepath, newline='') as f:
        yield f.readline()  # the header line is passed through unchanged
        for line in f:
            m = regx.match(line)
            if None in m.group(3, 6):
                g2, g3, g5, g6 = m.group(2, 3, 5, 6)
                # right-align 'NaN' to the width of the invalid field
                tr = ('%%%ds' % len(g2) % 'NaN' if g3 is None else g3,
                      '%%%ds' % len(g5) % 'NaN' if g6 is None else g6)
                modified_line = regx.sub(r'\g<1>%s\g<4>%s\g<7>' % tr, line)
                print('------------------------------------------------\n'
                      '%r with aberration\n'
                      '%r modified line'
                      % (line, modified_line))
                yield modified_line
            else:
                yield line

with open('modified.txt', 'w', newline='') as g:
    g.writelines(READ('fofo.txt'))

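Looping over a 60,000-line file really won't take long. In my opinion the time spent on this approach is hardly worth noticing. Can you show what you have tried, with a benchmark proving that it really is a heavy load for your computer? If it does take longer, use the multiprocessing module, but it really shouldn't take more than a few seconds, depending of course on how many lines you need to edit.

int(sales) should give the same thing as int(sales.strip()) (int doesn't care about whitespace). You could also do country, count, sales = l.split() or maybe country, count, sales = l.split(None, 2)

@mgilson Awesome, python is even smarter than I thought (again).

I think predicting the dtypes may be impossible/unreasonable on larger DataFrames, especially if there are more columns than in this toy example...

Nice. Why did you comment out the read_csv code? It seems to work.

@Monir Uncommented, and commented out the to_dict(). Though apply is the important part of my answer (and strangely, no other answer seems to use it).
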
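The benchmark asked for in the first comment could look roughly like the sketch below. The 60,000 synthetic rows, the 1% error rate and the row layout are assumptions made for illustration; timings depend on the machine, so no figure is quoted here.

```python
import random
import time

random.seed(0)  # reproducible synthetic data

# 60,000 synthetic rows; roughly 1% carry a non-numeric Sales value ('g')
lines = [
    "C%d %d %s" % (i, i % 10, 'g' if random.random() < 0.01 else str(i * 10))
    for i in range(60000)
]

start = time.perf_counter()
bad_lines = []
for j, line in enumerate(lines):
    country, count, sales = line.split()
    try:
        int(sales)
    except ValueError:
        bad_lines.append(j)  # record the index of the invalid row
elapsed = time.perf_counter() - start

print("%d bad rows found in %.3f s" % (len(bad_lines), elapsed))
```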
df = pd.read_csv('city.csv', sep=r'\s+', na_values=['g'])

In [2]: df
Out[2]:
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000

import numpy as np
import pandas as pd

df = pd.DataFrame({'Count': {0: 1, 1: 3, 2: 8, 3: 3, 4: 5}, 'Country': {0: 'USA', 1: 'UK', 2: 'IND', 3: 'SPA', 4: 'NTH'}, 'Sales': {0: '65000', 1: '4000', 2: 'g', 3: '9000', 4: '80000'}})

In [12]: df
Out[12]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8      g
3     SPA      3   9000
4     NTH      5  80000

In [13]: df['Sales'] = df['Sales'].apply(lambda x: int(x) 
                                                  if str.isdigit(x)
                                                  else np.nan)

In [14]: df
Out[14]: 
  Country  Count  Sales
0     USA      1  65000
1      UK      3   4000
2     IND      8    NaN
3     SPA      3   9000
4     NTH      5  80000
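A vectorized alternative to the apply call, not taken from any of the answers, is pd.to_numeric with errors='coerce', which also makes it easy to log the index of the invalid fields as the question asks. This is a sketch on the same toy data; the bad_index name is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'UK', 'IND', 'SPA', 'NTH'],
    'Count': [1, 3, 8, 3, 5],
    'Sales': ['65000', '4000', 'g', '9000', '80000'],
})

# Values that cannot be parsed as numbers become NaN
sales = pd.to_numeric(df['Sales'], errors='coerce')

# Rows that had a value in the input but failed to parse
bad_index = df.index[sales.isnull() & df['Sales'].notnull()].tolist()

df['Sales'] = sales
print(bad_index)
print(df)
```

Unlike str.isdigit, this conversion also accepts negative and decimal numbers, which may or may not be what the data calls for.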

with open('file.csv') as f:
    f.readline()  # skip the header

    for line in f:
        currentline = line.split(',')
        sales = currentline[2].strip()
        try:
            int(sales)
        except ValueError:
            # report the country and the value that is not an integer
            print(currentline[0], sales)

Running the regular-expression script prints:

Country  Count  Sales
USA   ,      3  , 65000
UK    ,      3  ,  4000
IND   ,      8  ,     g
SPA   ,     ju  ,  9000
NTH   ,      5  , 80000
XSZ   ,    rob  ,    k3

------------------------------------------------
'IND   ,      8  ,     g\r\n' with aberration
'IND   ,      8  ,   NaN\r\n' modified line
------------------------------------------------
'SPA   ,     ju  ,  9000\r\n' with aberration
'SPA   ,    NaN  ,  9000\r\n' modified line
------------------------------------------------
'XSZ   ,    rob  ,    k3' with aberration
'XSZ   ,    NaN  ,   NaN' modified line