Python: reading a table with pandas, how to get text into a DataFrame
Here is my text. I need to create a DataFrame with one column for the state name and another for the town name. I know how to remove the university names, but how do I tell pandas that every [edit] marks a new state? The expected output is a DataFrame with those two columns.
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
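I am not sure whether I can use read_table, and if so, how. I did import everything into a DataFrame, but the states and cities ended up in the same column. I also tried a list, but the problem is the same.
I need something like this: if a row contains [edit], then that row names the state for every row after it, up to the next [edit] row.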
Maybe pandas can do it, but you can easily do it with a plain loop:

    data = '''Alabama[edit]
    Auburn (Auburn University)[1]
    Florence (University of North Alabama)
    Jacksonville (Jacksonville State University)[2]
    Alaska[edit]
    Fairbanks (University of Alaska Fairbanks)[2]
    Arizona[edit]
    Flagstaff (Northern Arizona University)[6]
    Tempe (Arizona State University)
    Tucson (University of Arizona)'''

    # ---

    result = []
    state = None

    for line in data.split('\n'):
        if line.endswith('[edit]'):
            # remember new state
            state = line[:-6]  # without `[edit]`
        else:
            # add state, city to result
            city, rest = line.split(' ', 1)
            result.append([state, city])

    # --- display ---

    for state, city in result:
        print(state, city)

This prints:

    Alabama Auburn
    Alabama Florence
    Alabama Jacksonville
    Alaska Fairbanks
    Arizona Flagstaff
    Arizona Tempe
    Arizona Tucson

If you read from a file, iterate over the file object instead of data.split('\n') and strip the trailing newline from each line first.

Now you can use result to create a DataFrame.
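As a minimal sketch of that last step, the list of [state, city] pairs can be passed straight to the pandas.DataFrame constructor. The column names 'state' and 'town' are an assumption taken from the wording of the question, and the three sample rows stand in for the result list built by the loop.

```python
import pandas as pd

# Sample of the [state, city] pairs produced by the loop.
result = [
    ['Alabama', 'Auburn'],
    ['Alabama', 'Florence'],
    ['Alaska', 'Fairbanks'],
]

# Column names are illustrative; pick whatever fits your data.
df = pd.DataFrame(result, columns=['state', 'town'])
print(df)
```

Each inner list becomes one row, so no further reshaping is needed.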
If you read from a file, the loop becomes:

    result = []
    state = None

    with open('your_file') as f:
        for line in f:
            line = line.strip()  # remove '\n'
            if line.endswith('[edit]'):
                # remember new state
                state = line[:-6]  # without `[edit]`
            else:
                # add state, city to result
                city, rest = line.split(' ', 1)
                result.append([state, city])

    # --- display ---

    for state, city in result:
        print(state, city)

Using pandas, you can do the following:

    import pandas as pd

    df = pd.read_table('data', sep='\n', header=None, names=['town'])
    df['is_state'] = df['town'].str.contains(r'\[edit\]')
    df['groupno'] = df['is_state'].cumsum()
    df['index'] = df.groupby('groupno').cumcount()
    df['state'] = df.groupby('groupno')['town'].transform('first')
    # regex=True is passed explicitly so the patterns are treated as regular expressions
    df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
    df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)
    df = df.loc[~df['is_state']]
    df = df[['state', 'town']]

which yields

         state          town
    1  Alabama        Auburn
    2  Alabama      Florence
    3  Alabama  Jacksonville
    5   Alaska     Fairbanks
    7  Arizona     Flagstaff
    8  Arizona         Tempe
    9  Arizona        Tucson

Here is a breakdown of what the code is doing. After loading the text file into a DataFrame, use str.contains to identify the rows that are states. Use cumsum to take the cumulative sum of the True/False values, where True is treated as 1 and False as 0:

    df = pd.read_table('data', sep='\n', header=None, names=['town'])
    df['is_state'] = df['town'].str.contains(r'\[edit\]')
    df['groupno'] = df['is_state'].cumsum()
    #                                               town  is_state  groupno
    # 0                                    Alabama[edit]      True        1
    # 1                    Auburn (Auburn University)[1]     False        1
    # 2           Florence (University of North Alabama)     False        1
    # 3  Jacksonville (Jacksonville State University)[2]     False        1
    # 4                                     Alaska[edit]      True        2
    # 5    Fairbanks (University of Alaska Fairbanks)[2]     False        2
    # 6                                    Arizona[edit]      True        3
    # 7       Flagstaff (Northern Arizona University)[6]     False        3
    # 8                 Tempe (Arizona State University)     False        3
    # 9                   Tucson (University of Arizona)     False        3

Now, for each groupno number, we can assign a unique integer to each row in the group:

    df['index'] = df.groupby('groupno').cumcount()
    #                                               town  is_state  groupno  index
    # 0                                    Alabama[edit]      True        1      0
    # 1                    Auburn (Auburn University)[1]     False        1      1
    # 2           Florence (University of North Alabama)     False        1      2
    # 3  Jacksonville (Jacksonville State University)[2]     False        1      3
    # 4                                     Alaska[edit]      True        2      0
    # 5    Fairbanks (University of Alaska Fairbanks)[2]     False        2      1
    # 6                                    Arizona[edit]      True        3      0
    # 7       Flagstaff (Northern Arizona University)[6]     False        3      1
    # 8                 Tempe (Arizona State University)     False        3      2
    # 9                   Tucson (University of Arizona)     False        3      3

Again, for each groupno number, we can find the state by selecting the first town in each group:

    df['state'] = df.groupby('groupno')['town'].transform('first')
    #                                               town  is_state  groupno  index          state
    # 0                                    Alabama[edit]      True        1      0  Alabama[edit]
    # 1                    Auburn (Auburn University)[1]     False        1      1  Alabama[edit]
    # 2           Florence (University of North Alabama)     False        1      2  Alabama[edit]
    # 3  Jacksonville (Jacksonville State University)[2]     False        1      3  Alabama[edit]
    # 4                                     Alaska[edit]      True        2      0   Alaska[edit]
    # 5    Fairbanks (University of Alaska Fairbanks)[2]     False        2      1   Alaska[edit]
    # 6                                    Arizona[edit]      True        3      0  Arizona[edit]
    # 7       Flagstaff (Northern Arizona University)[6]     False        3      1  Arizona[edit]
    # 8                 Tempe (Arizona State University)     False        3      2  Arizona[edit]
    # 9                   Tucson (University of Arizona)     False        3      3  Arizona[edit]

We basically have the desired DataFrame; all that is left is to prettify the result. We can use str.replace to remove the [edit] from the states and everything after the first parenthesis from the towns:

    df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
    df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)

Remove the rows where the town is really a state:

    df = df.loc[~df['is_state']]

And finally, keep only the desired columns:

    df = df[['state', 'town']]
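The same idea can be written more compactly. The sketch below is a variation, not the answer's own method: it swaps the groupby/transform step for Series.where followed by ffill, and it builds the one-column frame from a list of lines instead of read_table, since newer pandas releases may reject sep='\n'. The shortened inline data string stands in for the file contents.

```python
import pandas as pd

data = '''Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]'''

# One row per line of text, in a single 'town' column.
df = pd.DataFrame({'town': data.splitlines()})

# True on the header rows that name a state.
is_state = df['town'].str.endswith('[edit]')

# Keep the text only on state rows, then carry it down to the rows below.
df['state'] = df['town'].where(is_state).ffill()
df['state'] = df['state'].str.replace('[edit]', '', regex=False)

# Drop everything from the first " (" onward to leave just the town name.
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)

# Remove the state header rows and order the columns.
df = df.loc[~is_state, ['state', 'town']].reset_index(drop=True)
print(df)
```

Forward-filling avoids the helper columns groupno and index, at the cost of relying on the state rows always appearing before their towns.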
Thanks, works great. Only city, rest = line.split(' ', 1) does not work for places like New York, but that is easy to fix.
Maybe split(' (') should work.
Yes, that is how I fixed it.
This is great, but is it faster than a for loop and a list?
For large enough text files, using pandas' column-based vectorized functions should be faster than a Python loop. For small files, the Python loop may be faster. For very large text files, the pandas code should be much faster than the equivalent code using a Python loop.
@lucarlig: After some benchmarking, the Python loop seems to be faster for any reasonably sized file (after all, there are not that many states and towns). So in practice, furas's answer is the more practical solution.
Perfect. Thanks!
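A sketch of the fix discussed in the comments: splitting on ' (' rather than on the first space keeps multi-word names intact. The helper name parse_towns and the sample lines are made up for illustration and are not taken from the original data.

```python
def parse_towns(lines):
    """Return [state, town] pairs from lines where '[edit]' marks a state."""
    result = []
    state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith('[edit]'):
            # remember the new state, without the trailing `[edit]`
            state = line[:-len('[edit]')]
        else:
            # Everything before the first " (" is the town name;
            # a line without parentheses is kept whole.
            town = line.split(' (', 1)[0]
            result.append([state, town])
    return result


# Made-up sample lines, for illustration only.
sample = [
    'Example State[edit]',
    'New York (Example University)[1]',
    'Tempe (Arizona State University)',
]
print(parse_towns(sample))
```

Because the function accepts any iterable of strings, an open file object can be passed in directly in place of the list.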