Python: my headers are in the first column of my txt file, and I want to create a new DataFrame
Tags: python, dataframe, jupyter-notebook
Sample data from the text file:
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole@123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]
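I wonder if anyone can help me. You can see my sample dataset above. What I want to do (please tell me if there is a more efficient way) is loop over the first column and, wherever one of the unique IDs appears (e.g. first_name, last_name, role), append the value from that row to a list for that ID. I would do this for every unique ID, so that I end up with one list of values per ID.
I have read about multi-indexing and I am not sure whether it would be a better solution, but I could not get it to work (I am very new to Python).
I am sure there is a more optimal way to do this, but one approach is to get the list of unique row names, extract the rows for each name in a loop, and combine them into a new DataFrame. Finally, update it with the desired column names.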
import pandas as pd
import numpy as np
import io
data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole@123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]
'''
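# '[User]' lines are treated as comments and skipped; each remaining line is key=value
df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)
new_cols = df[0].unique()  # unique keys found in the first column
new_df = pd.DataFrame()
for col in new_cols:
    # all values for this key, renumbered from 0 in order of appearance
    tmp = df[df[0] == col].reset_index(drop=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
# Caveat: values are aligned by order of appearance, not by record, so a key
# that is missing from an earlier record picks up a later record's value.
new_df.columns = new_cols
new_df['User'] = None
new_df = new_df[['User', 'employeeNo', 'last_name', 'first_name', 'language',
                 'email', 'department', 'role', 'Location']]
new_df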
   User  employeeNo  last_name  first_name  language                 email    department            role  Location
0  None         123      Toole     Michael   english  michael.toole@123.ie     Marketing  Marketing Lead     Spain
1  None         456    Ronaldo        Juan   Spanish   juan.ronaldo@sms.ie  Data Science       Team Lead       NaN
2  None         998        Lee      Damian   english  damian.lee@email.com           NaN             NaN       NaN
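The table above puts Spain in row 0 even though only the second record has a Location line. A quick check like the following can flag keys that do not occur in every record before you rely on positional alignment. This is only a sketch: the two-record sample and the names pairs, counts and incomplete are made up for illustration, and it assumes at least one key (here employeeNo) is present in every record.

```python
import io

import pandas as pd

# Hypothetical two-record sample: only the second record has a Location key
sample = """[User]
employeeNo=123
[User]
employeeNo=456
Location=Spain
"""

# Same parsing idea as the answer: '[User]' lines are skipped as comments
pairs = pd.read_csv(io.StringIO(sample), sep="=", comment="[",
                    header=None, names=["key", "value"])

# Count how often each key occurs; keys seen less often than the most
# frequent key are missing from at least one record (assumption: the most
# frequent key appears exactly once in every record)
counts = pairs.groupby("key").size()
incomplete = counts[counts < counts.max()].index.tolist()
print(incomplete)
```

If the list is not empty, the values for those keys cannot be matched to records by position alone, and each line needs to be tied to its own [User] block instead.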
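You have a text file in which every record starts with a [User] line and the data lines have the form key=value. I do not know of a module that handles this automatically, but it is easy to parse by hand. The code could be: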
import pandas as pd

with open('file.txt') as fd:
    data = []                          # a list of records
    for line in fd:
        line = line.strip()            # strip end of line
        if line == '[User]':           # new record
            row = {}                   # row will be a key: value dict
            data.append(row)
        else:
            k, v = line.split('=', 1)  # split on the first = character
            row[k] = v
df = pd.DataFrame(data)                # list of key: value dicts => dataframe
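With the sample data shown, we get the following. The email column appears twice because the first record writes "email = ..." with spaces around the =, so its key is 'email ' (with a trailing space) rather than 'email':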
   employeeNo  last_name  first_name  language                 email    department            role                 email  Location
0         123      Toole     Michael   english  michael.toole@123.ie     Marketing  Marketing Lead                   NaN       NaN
1         456    Ronaldo        Juan   Spanish                   NaN  Data Science       Team Lead   juan.ronaldo@sms.ie     Spain
2         998        Lee      Damian   english                   NaN           NaN             NaN  damian.lee@email.com       NaN
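If the stray whitespace around = and the trailing [User] line in the sample are a concern, a slightly more defensive variant of the same hand-written parser might look like this. It is a sketch, not part of the original answer: the function name parse_records and its marker parameter are made up here, and it assumes that blank lines and records without any key=value lines can simply be ignored.

```python
import pandas as pd


def parse_records(lines, marker="[User]"):
    """Collect key=value lines into one dict per marker line."""
    records = []
    row = None
    for raw in lines:
        line = raw.strip()
        if not line:                 # ignore blank lines
            continue
        if line == marker:           # a marker line starts a new record
            row = {}
            records.append(row)
        elif row is not None and "=" in line:
            key, value = line.split("=", 1)
            # strip both sides so 'email = x' and 'email=x' share one key
            row[key.strip()] = value.strip()
    # drop records with no data, e.g. a trailing marker line
    return [r for r in records if r]


# Shortened version of the sample data, keeping the untidy lines
sample = """[User]
employeeNo=123
email = michael.toole@123.ie
[User]
employeeNo=456
last_name= Ronaldo
email=juan.ronaldo@sms.ie
Location=Spain
[User]
""".splitlines()

df = pd.DataFrame(parse_records(sample))
print(df)
```

To read from a file instead, pass the open file object, for example parse_records(open('file.txt')), since iterating over a file yields its lines.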
Rewritten after testing the previous version for the offset values:
import pandas as pd

# Revised from previous answer - ensures key/value pairs stay within the same
# record. The previous version assumed the first record had all the expected
# keys and inadvertently assigned the (Location) value of the second record to
# the first record, which has no Location key.
# This version should perform better - it deals with one single df
# and uses pandas' own pivot() function.
textFile = 'file.txt'
userMarker = '[User]'  # renamed from `filter`, which shadows a Python builtin

# Decoration - a check and balance: how many users are we processing?
with open(textFile, 'r') as textFileOpened:
    initialRead = textFileOpened.read()
userCount = initialRead.count(userMarker)  # sample has 4 [User] entries - but only three actual unique records
print('User Count {}'.format(userCount))

# Create a list so we can manipulate and interrogate the data
allData = []
userSeq = 0

# Iterate through the file - tag every line with its file row and its [User] sequence
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if userMarker in line:
            userSeq = userSeq + 1  # ensures each key/value pair is grouped with its record
        allData.append([fileLineSeq, userSeq, line])

df = pd.DataFrame(allData, columns=['FileRow', 'UserSeq', 'KeyValue'])
userSeparators = df[df['KeyValue'] == userMarker + '\n'].index  # locate [User] lines
df.drop(userSeparators, inplace=True)    # remove [User] lines
df = df.replace(' = ', '=', regex=True)  # input data is dirty - clean up
df = df.replace('\n', '', regex=True)    # remove the newlines kept from the file
# print(df)  # test as necessary here

# split KeyValue column into two, on the first = only
df[['Key', 'Value']] = df.KeyValue.str.split('=', n=1, expand=True)
# very powerful function - convert to table
df = df.pivot(index='UserSeq', columns='Key', values='Value')
print(df)
Result:
User Count 4
Key      Location    department                 email  employeeNo  first_name  language  last_name            role
UserSeq
1             NaN     Marketing  michael.toole@123.ie         123     Michael   english      Toole  Marketing Lead
2           Spain  Data Science   juan.ronaldo@sms.ie         456        Juan   Spanish    Ronaldo       Team Lead
3             NaN           NaN  damian.lee@email.com         998      Damian   english        Lee             NaN
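The same idea (number each line by the [User] block it belongs to, then pivot) can also be written without an explicit loop by taking a cumulative sum over the marker lines. The following is a sketch on a shortened in-memory sample rather than file.txt; the names lines, is_marker, record_id, pairs and table are chosen here for illustration.

```python
import pandas as pd

# Shortened, hypothetical sample; the trailing [User] has no data lines
text = """[User]
employeeNo=123
role=Marketing Lead
[User]
employeeNo=456
role = Team Lead
Location=Spain
[User]
"""

lines = pd.Series(text.splitlines()).str.strip()
lines = lines[lines != ""]                  # ignore blank lines

is_marker = lines == "[User]"
record_id = is_marker.cumsum()              # 1 for the first block, 2 for the second, ...

# Split the data lines on the first '=' only and tidy both halves
pairs = lines[~is_marker].str.split("=", n=1, expand=True)
pairs.columns = ["Key", "Value"]
pairs["Key"] = pairs["Key"].str.strip()
pairs["Value"] = pairs["Value"].str.strip()
pairs["UserSeq"] = record_id[~is_marker]    # aligned on the original line index

# One row per record, one column per key; missing keys become NaN
table = pairs.pivot(index="UserSeq", columns="Key", values="Value")
print(table)
```

Note that pivot() raises a ValueError if the same key occurs twice inside one record, so duplicated keys would need to be handled before this step.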
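You have not shown the text file but an image of a spreadsheet. I cannot guess the format of the text file from that, so I cannot help you. Please show the file contents as copyable text in the question itself.

Added a sample data text file.

Thanks @r-beginners. This looks like what I need. I am currently getting an error: ValueError: Length mismatch: Expected axis has 0 elements, new values have 19 elements. I think it might be because of new_df = pd.DataFrame(index=[0,1])? Why is an index set when the new df is created? Just trying to understand the logic here, thank you very much.

The column names of my data frame are 0 and 1, so I use tmp[1] to get the data column. You need to specify it by the name of your data column. The initial data frame does not need an index; that was a leftover from writing the code. tmp[1] needs to be changed to textFile['your data column name'].

Thank you very much r-beginners, the code helped me a great deal! The only thing I have found is that the headers do not change for me (I have "data" on all headers). I will play with the code and see if I can solve it; any suggestions would be much appreciated! Thanks again.

Hi @irnerd, I would like to come back to this topic if I may. Note that the first user has no Location attribute, so I should get NaN, but the code walks through the list and takes the second Location value it can find (which actually belongs to the second user). Is there a way to stop this from happening?

Hi @sqlworrier - apologies for the delay - you may have solved this already, but if not, see the new answer, dated the same as this comment.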
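On the ValueError quoted in the comments: pandas raises a "Length mismatch" error whenever the number of names assigned to .columns differs from the number of columns the frame actually has. One way to see it, as a minimal sketch (the asker's exact code is not shown, so this only reproduces the kind of error, not necessarily its cause):

```python
import pandas as pd

empty = pd.DataFrame()   # no columns were ever added to this frame
message = None
try:
    # Two names for a frame with zero columns: the lengths do not match
    empty.columns = ["employeeNo", "last_name"]
except ValueError as err:
    message = str(err)
    print(message)
```

So if the loop never appends a column, for instance because the comparison against the key column matches nothing, the later new_df.columns = new_cols assignment fails in this way.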