Python: my headers are in the first column of my txt file. I want to create a DataFrame

Tags: python, dataframe, jupyter-notebook

Sample data from the text file

[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole@123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]
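I wonder if anyone can help me. You can see my sample dataset above. What I want to do (please tell me if there is a more efficient way) is loop over the first column and, wherever one of the unique IDs appears (for example first_name, last_name, role), append the value from that row to a list for that ID. Doing this for every unique ID should leave me with a DataFrame whose columns are those IDs. I have read about MultiIndex and I am not sure whether it would be a better solution, but I could not get it to work (I am very new to Python).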


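I am sure there is a more optimal way to do this, but this approach gets the list of unique row names, extracts the matching rows for each name in a loop, and combines them into a new data frame. Finally, the columns are renamed to the desired names.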

import pandas as pd
import numpy as np
import io

data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole@123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]
'''

df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)

new_cols = df[0].unique()
new_df = pd.DataFrame()
for col in new_cols:
    # All values for this key, renumbered from 0 (so values line up by position, not by record)
    tmp = df[df[0] == col].reset_index(drop=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
new_df.columns = new_cols
new_df['User'] = None
new_df = new_df[['User','employeeNo','last_name','first_name','language','email','department','role','Location']]

new_df
   User employeeNo last_name first_name language                 email    department            role Location
0  None        123     Toole    Michael  english  michael.toole@123.ie     Marketing  Marketing Lead    Spain
1  None        456   Ronaldo       Juan  Spanish   juan.ronaldo@sms.ie  Data Science       Team Lead      NaN
2  None        998       Lee     Damian  english  damian.lee@email.com           NaN             NaN      NaN
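
In the output above, Location=Spain lands on row 0 although it belongs to the second user, because the loop lines values up by position rather than by record. One way to keep each value with its own record is to number the records from the [User] marker lines and pivot on that number. The following is only a sketch under assumptions: the shortened sample string and the names lines, record_id, pairs and fixed are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Shortened, hypothetical sample in the same format as the question's file.
sample = """[User]
employeeNo=123
last_name=Toole
email = michael.toole@123.ie
[User]
employeeNo=456
last_name= Ronaldo
Location=Spain
[User]
employeeNo=998
last_name=Lee
[User]
"""

# One string per line, stripped, with blank lines dropped.
lines = pd.Series(sample.splitlines()).str.strip()
lines = lines[lines != '']

# Every [User] line starts a new record; the running count is the record number.
is_marker = lines == '[User]'
record_id = is_marker.cumsum()

# Split the remaining lines at the first '=' only and tidy up whitespace.
pairs = lines[~is_marker].str.split('=', n=1, expand=True)
pairs.columns = ['key', 'value']
pairs['key'] = pairs['key'].str.strip()
pairs['value'] = pairs['value'].str.strip()
pairs['record'] = record_id[~is_marker]   # aligned on the original line index

# One row per record, one column per key; keys a record lacks become NaN.
fixed = pairs.pivot(index='record', columns='key', values='value')
print(fixed)
```

With this layout a record that has no Location line simply gets NaN in that column, and the trailing [User] marker produces no row because no key=value lines follow it.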

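You have a text file in which every record starts with a [User] line and the data lines have the format key=value. I know of no module that handles this automatically, but it is easy to parse by hand. The code could be: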

import pandas as pd

data = []                                  # a list of records
row = None                                 # current record; None until the first [User] line
with open('file.txt') as fd:
    for line in fd:
        line = line.strip()                # strip end of line and surrounding whitespace
        if not line:                       # skip blank lines
            continue
        if line == '[User]':               # new record
            row = {}                       # row will be a key: value dict
            data.append(row)
        elif row is not None and '=' in line:
            k, v = line.split('=', 1)      # split on the first = character only
            row[k.strip()] = v.strip()     # "email = x" and "email=x" give the same key

# list of key: value dicts => dataframe; empty records (a trailing [User]) are dropped
df = pd.DataFrame([r for r in data if r])

With the sample data shown, we get:

  employeeNo last_name first_name language                 email    department            role Location
0        123     Toole    Michael  english  michael.toole@123.ie     Marketing  Marketing Lead      NaN
1        456   Ronaldo       Juan  Spanish   juan.ronaldo@sms.ie  Data Science       Team Lead    Spain
2        998       Lee     Damian  english  damian.lee@email.com           NaN             NaN      NaN
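
The hand-written parser can also be packaged as a function that works on text already held in memory, which makes it easy to try on a sample without creating file.txt. This is a minimal sketch, not the answer's own code: the function name parse_user_blocks, its marker parameter and the shortened sample are assumptions made for illustration. It splits the whole text on the marker, so it assumes the literal string [User] never appears inside a value.

```python
import pandas as pd

def parse_user_blocks(text, marker='[User]'):
    """Return one dict per non-empty block of key=value lines after a marker."""
    records = []
    for block in text.split(marker):
        record = {}
        for line in block.splitlines():
            if '=' not in line:            # skips blank lines and stray text
                continue
            key, value = line.split('=', 1)
            record[key.strip()] = value.strip()
        if record:                         # ignore empty blocks (e.g. a trailing marker)
            records.append(record)
    return records

# Shortened, hypothetical sample in the same format as the question's file.
sample = """[User]
employeeNo=123
last_name=Toole
email = michael.toole@123.ie
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
email=juan.ronaldo@sms.ie
Location=Spain
[User]
employeeNo=998
last_name=Lee
[User]
"""

records = parse_user_blocks(sample)
df = pd.DataFrame(records)                 # missing keys become NaN
print(df)
```

For a real file, the text could be read with open('file.txt').read() inside a with block and passed to the same function.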

Rewritten after testing showed that the previous version offset values between records

import pandas as pd
# Revised from previous answer - ensures key value pairs are contained to the same
# record - previous version assumed the first record had all the expected keys - 
# inadvertently assigned (Location) value of second record to the first record 
# which did not have a Location key 
# This version should perform better - only dealing with one single df
#  - and using pandas own pivot() function

textFile = 'file.txt'
filter = '[User]'

# Sanity check - how many [User] markers are we processing?
with open(textFile, 'r') as textFileOpened:   # the with block closes the file again
    initialRead = textFileOpened.read()
userCount = initialRead.count(filter)  # sample has 4 [User] entries - but only three actual unique records
print('User Count {}'.format(userCount))

# Create a list to collect one [FileRow, UserSeq, KeyValue] entry per line
allData = []
userSeq = 0

# Iterate through file - tag each line with its file row and the current user sequence
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if filter in line:
            userSeq += 1  # a [User] marker starts the next record, keeping its pairs grouped
        allData.append([fileLineSeq, userSeq, line])

df = pd.DataFrame(allData)

df.columns = ['FileRow','UserSeq','KeyValue']  # rename columns
userSeparators = df[df['KeyValue'] == str(filter+'\n') ].index # Locate [User Records]
df.drop(userSeparators, inplace = True) # Remove [User] records
df = df.replace(' = ' ,  '=' , regex=True ) # Input data dirty - cleaning up
df = df.replace('\n' ,  '' , regex=True ) # remove the new lines appended during the list generation

# print(df) # Test as necessary here

# split KeyValue column into two
df[['Key', 'Value']] = df.KeyValue.str.split('=', n=1, expand=True)  # split at the first '=' only
# very powerful function - convert to table
df = df.pivot(index='UserSeq', columns='Key', values='Value') 
print(df)

Result

User Count 4
Key     Location    department                 email employeeNo first_name language last_name            role
UserSeq                                                                                                      
1            NaN     Marketing  michael.toole@123.ie        123    Michael  english     Toole  Marketing Lead
2          Spain  Data Science   juan.ronaldo@sms.ie        456       Juan  Spanish   Ronaldo       Team Lead
3            NaN           NaN  damian.lee@email.com        998     Damian  english       Lee             NaN
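
As the result shows, pivot() sorts the columns alphabetically and keeps UserSeq as the index. If the file's own column order and a plain 0-based index are preferred, the frame can be reordered afterwards. The snippet below is a sketch under assumptions: pivoted is a hand-built stand-in for part of the result above, and wanted is a hypothetical preferred order, neither of which appears in the original answer.

```python
import pandas as pd

# Stand-in for part of the pivoted result printed above (values copied from it).
pivoted = pd.DataFrame(
    {
        'Location': [None, 'Spain', None],
        'department': ['Marketing', 'Data Science', None],
        'employeeNo': ['123', '456', '998'],
        'last_name': ['Toole', 'Ronaldo', 'Lee'],
    },
    index=pd.Index([1, 2, 3], name='UserSeq'),
)
pivoted.columns.name = 'Key'

# Hypothetical preferred order, following the order of keys in the file.
wanted = ['employeeNo', 'last_name', 'department', 'Location']

# Reorder the columns, replace UserSeq with a 0-based index, drop the 'Key' label.
tidy = pivoted.reindex(columns=wanted).reset_index(drop=True)
tidy.columns.name = None
print(tidy)
```

Any name in wanted that is missing from the frame would come back as an all-NaN column, which can be useful when some files lack optional keys such as Location.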

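Comments

You have not shown the text file, only an image of a spreadsheet. I cannot guess the format of the text file from that, so I cannot help you. Please show the file content as copyable text in the question itself.

Sample data text file added.

Thanks @r-beginners. This looks like what I need. I am currently getting an error: ValueError: Length mismatch: Expected axis has 0 elements, new values have 19 elements. I think it might be because of new_df = pd.DataFrame(index=[0,1])? Why is an index set here when the new df is created? Just trying to understand the logic, thanks very much.

The column names of my data frame are 0 and 1, so I used tmp[1] to get the data column. You need to specify it by your own data column name. There is no need to index the initial data frame; that was left over from writing the code. tmp[1] needs to be changed to textFile['your data column name'].

Thanks very much r-beginners, the code was a huge help! The only thing I found is that the headers do not change for me (I have "data" on all headers). I will play around with the code to see if I can solve this; any suggestions would be greatly appreciated! Thanks again.

Hi @irnerd, I would like to come back to this if I can. Note that the first user has no Location attribute, so I should get NaN there, but the code walks through the list and takes the next Location value it can find (which actually belongs to the second user). Is there a way to stop this from happening?

Hi @sqlworrier, apologies for the delay. You may have solved this already, but if not, see the new answer, dated the same as this comment.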