Python, extracting data from a file

python pandas text-files

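I've read through all of the pandas documentation, but I think I need a practical example to understand it.

I have a .TXT file that contains all of my SQL data:

INSERT INTO jos_users VALUES ('4065', 'lel lel', 'joel',
'chazaa@frame.com', 'd0c9f71c7bc8c9', 'Membre', '0', '0', '2',
'2013-01-31 17:15:29', '2014-12-10 11:29:13', '', '{}');
INSERT INTO jos_users VALUES ('4066', 'jame lea', 'jamal',
'jamal.stan@frame.com', 'd0c9f71c7774c9', 'Membre', '0', '0', '2',
'2012-11-31 08:15:29', '2012-12-10 12:29:13', '', '{}');

(about 17,000 lines), and there are no column names in my .txt file.

What I want to achieve:
Create the columns myself
Rearrange the content by column (for example, I want to select column 1 and display it)

My current code, which displays garbage: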
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.mpl_style', 'default') 
plt.rcParams['figure.figsize'] = (15, 5)


df = pd.read_csv('2.txt', sep=',', na_values=['g'], error_bad_lines=False)

print df

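OK, here is a sample script I have just written. I don't claim this is the most efficient way to clean up a SQL script; ideally, if you have access to the original database, you should be able to export it as csv directly.
Anyway, what the code below does is open the text file and remove the leading INSERT INTO jos_users VALUES ( text, the closing parenthesis and semicolon, the quote characters (not required, but I prefer this style) and the space after each comma.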
In [91]:

import pandas as pd

with open(r'c:\data\temp sql.txt', 'rt') as f, open(r'c:\data\clean.csv', 'wt') as clean:
    for line in f:
        if line.strip():  # skip blank lines
            l = line.replace('INSERT INTO jos_users VALUES (', '')
            l = l.replace(", '", ",'")  # drop the space after each comma
            l = l.replace("'", '')      # strip the quote characters
            l = l.replace(');', '')     # strip the statement terminator
            clean.write(l)
# the with block closes both files, so no explicit close() calls are needed
# read the file back in, there is no header so you need to specify this
df = pd.read_csv(r'c:\data\clean.csv', header=None)
df
Out[91]:
     0         1      2                     3               4       5   6   \
0  4065   lel lel   joel      chazaa@frame.com  d0c9f71c7bc8c9  Membre   0   
1  4066  jame lea  jamal  jamal.stan@frame.com  d0c9f71c7774c9  Membre   0   

   7   8                    9                    10  11  12  
0   0   2  2013-01-31 17:15:29  2014-12-10 11:29:13 NaN  {}  
1   0   2  2012-11-31 08:15:29  2012-12-10 12:29:13 NaN  {}  
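The question also asks how to create the columns and select one of them. A minimal sketch of that step follows, assuming the cleaned lines look like the two rows shown above. The column names are hypothetical guesses made for illustration (the source file has none), and `io.StringIO` stands in for the cleaned csv file so the example is self-contained.

```python
import io

import pandas as pd

# Two cleaned rows, as produced by the script above (no header line).
cleaned = (
    "4065,lel lel,joel,chazaa@frame.com,d0c9f71c7bc8c9,Membre,0,0,2,"
    "2013-01-31 17:15:29,2014-12-10 11:29:13,,{}\n"
    "4066,jame lea,jamal,jamal.stan@frame.com,d0c9f71c7774c9,Membre,0,0,2,"
    "2012-11-31 08:15:29,2012-12-10 12:29:13,,{}\n"
)

# Hypothetical column names: pick whatever describes your own data.
columns = [
    "id", "name", "username", "email", "password", "usertype", "block",
    "send_email", "gid", "register_date", "last_visit", "activation", "params",
]

# header=None says the file has no header row; names= supplies the labels.
df = pd.read_csv(io.StringIO(cleaned), header=None, names=columns)

print(df["username"])           # select a single column by name
print(df[["id", "email"]])      # select several columns, in the order you want
```

Passing a list of names to `df[...]` is also how the columns can be reordered, since the result follows the order of the list.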

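Edit: the following approach is much slower than writing the modified data to a file and then using read_csv() to read it into a DataFrame. For a 34,000-line file it takes roughly 23 minutes versus roughly 3 seconds.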
import pandas as pd
import numpy as np
import re

pd.set_option('display.width', 1000)
#Pre-allocate all the space needed by your DataFrame:
df = pd.DataFrame(index=np.arange(18000), columns=np.arange(13))

pattern = r""" #Find all single quoted sequences:
    '          #Match a single quote, followed by...
    (          #(start a capture group)
      [^']*    #not a single quote, 0 or more times, followed by...
    )          #(end the capture group)
    '          #a single quote
"""

regex = re.compile(pattern, flags=re.X)

with open('data.txt') as f:  #the with block closes the file when the loop ends
    for i, line in enumerate(f):
        data = regex.findall(line)  #findall() returns a list of all the strings that matched the pattern's capture group

        if data:
            df.iloc[i] = data  #insert data at row i

print(df)

--output:--
         0         1      2                     3               4       5    6    7    8                    9                    10   11   12
0      4065   lel lel   joel      chazaa@frame.com  d0c9f71c7bc8c9  Membre    0    0    2  2013-01-31 17:15:29  2014-12-10 11:29:13        {}
1      4066  jame lea  jamal  jamal.stan@frame.com  d0c9f71c7774c9  Membre    0    0    2  2012-11-31 08:15:29  2012-12-10 12:29:13        {}
...
...
1798    NaN       NaN    NaN                   NaN             NaN     NaN  NaN  NaN  NaN                  NaN                  NaN  NaN  NaN
1799    NaN       NaN    NaN                   NaN             NaN     NaN  NaN  NaN  NaN                  NaN                  NaN  NaN  NaN
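The slowdown is probably due, at least in part, to assigning rows one at a time with `df.iloc[i]`. A possible in-memory alternative, sketched here under the assumption that every statement sits on a single line, is to collect the matches in a plain list and build the DataFrame once at the end. The `lines` list is a stand-in for iterating over the open file, and no timing is claimed for this variant.

```python
import re

import pandas as pd

regex = re.compile(r"'([^']*)'")  # same pattern as above, written compactly

# Stand-in for the file: the two statements from the question, one per line.
lines = [
    "INSERT INTO jos_users VALUES ('4065', 'lel lel', 'joel', 'chazaa@frame.com', "
    "'d0c9f71c7bc8c9', 'Membre', '0', '0', '2', '2013-01-31 17:15:29', "
    "'2014-12-10 11:29:13', '', '{}');\n",
    "INSERT INTO jos_users VALUES ('4066', 'jame lea', 'jamal', 'jamal.stan@frame.com', "
    "'d0c9f71c7774c9', 'Membre', '0', '0', '2', '2012-11-31 08:15:29', "
    "'2012-12-10 12:29:13', '', '{}');\n",
]

# Collect one list of values per line, skipping lines without any match.
rows = [values for values in (regex.findall(line) for line in lines) if values]

# Build the DataFrame in a single step; every value is still a string here.
df = pd.DataFrame(rows)
print(df)
```

Because nothing is pre-allocated, the result has exactly one row per matching line and no trailing NaN rows.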

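re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.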
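To make the group behaviour described above concrete, here is a small sketch comparing a pattern with no group, one group and two groups. The sample line is a shortened, made-up version of the statements in the question, and the date pattern is only there to illustrate the tuple case.

```python
import re

# Shortened sample line, for illustration only.
line = "INSERT INTO jos_users VALUES ('4065', 'lel lel', '', '{}');"

# No capture group: each item is the whole match, quotes included.
no_group = re.findall(r"'[^']*'", line)

# One capture group: each item is just the text inside the quotes.
one_group = re.findall(r"'([^']*)'", line)

# Two capture groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\d{4})-(\d{2})", "2013-01-31 and 2014-12-10")

print(no_group)    # ["'4065'", "'lel lel'", "''", "'{}'"]
print(one_group)   # ['4065', 'lel lel', '', '{}']
print(two_groups)  # [('2013', '01'), ('2014', '12')]
```

Note that the empty value `''` still yields an entry (an empty string) in the one-group case, which keeps the number of columns stable from line to line.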

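As a first step you need to clean up the TXT file; at the moment it is a SQL script rather than a csv file. You can strip everything up to and including the opening parenthesis, then remove the trailing parenthesis and any newlines.
Thanks for your answer, but how can I do that?
"What I want to achieve: create the columns myself": containing exactly what data? "Rearrange the content by column (for example, I want to select column 1 and display it)": how would displaying column 1 rearrange the data?

Writing the changed data to a file, then reading it in with read_csv():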
import pandas as pd
import numpy as np
import re
import time

pd.set_option('display.width', 1000)

pattern = r"""
    '          #Match a single quote, followed by...
    (          #start a capture group.
      [^']*    #not a quote, 0 or more times, followed by...
    )          #end capture group.
    '          #a single quote
"""

regex = re.compile(pattern, flags=re.X)

#data2.txt holds the two insert statements from the question, repeated 17,000 times
with open('data2.txt') as fin, open('data.csv', 'w') as fout:
    for line in fin:
        data = regex.findall(line)

        if data:
            print(*data, file=fout, sep=',')

df = pd.read_csv(
    'data.csv', 
    sep=',', 
    header=None,
    names=np.arange(13),  #column names: 0 - 12
)


print(df)
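One caveat with joining the values by hand: if a quoted value itself contains a comma, the csv file ends up with an extra column on that line. A hedged sketch of one way around this is to let the standard-library `csv.writer` do the quoting. The input lines below are invented for illustration, and `io.StringIO` stands in for the csv file on disk.

```python
import csv
import io
import re

import pandas as pd

regex = re.compile(r"'([^']*)'")

# Hypothetical input: the second value of the first line contains a comma.
lines = [
    "INSERT INTO jos_users VALUES ('1', 'Doe, John', 'jdoe');\n",
    "INSERT INTO jos_users VALUES ('2', 'jame lea', 'jamal');\n",
]

buffer = io.StringIO()        # stands in for open('data.csv', 'w', newline='')
writer = csv.writer(buffer)   # quotes any field that contains the delimiter

for line in lines:
    data = regex.findall(line)
    if data:
        writer.writerow(data)

buffer.seek(0)                # rewind so read_csv starts at the beginning
df = pd.read_csv(buffer, header=None)
print(df)
```

read_csv understands the quoting that csv.writer produces by default, so the value with the embedded comma stays in a single column.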