How to skip blank lines when reading a text file in Python

python pandas

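I am trying to solve a problem from the Introduction to Data Science course on Coursera:

Return a DataFrame of towns and the states they are in from the university_towns.txt list. The format of the DataFrame should be: DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Ypsilanti"] ], columns=["State", "RegionName"] )

My script looks like this: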

import re

import numpy as np
import pandas as pd

def get_university_towns():  # function name is illustrative; the post shows only the body
    # names should be an ordered collection, so pass a list rather than a set
    uni_towns = pd.read_csv('university_towns.txt', header=None, names=['RegionName'])
    # Rows containing "edit" are state headers; copy them into State and fill downward
    uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
    uni_towns['State'] = uni_towns['State'].replace('', np.nan).ffill()
    # Removing (...) from region names
    uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))
    # Cutting at the first "(" catches parentheses that are never closed
    split_string = "("
    uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])
    # Removing [...] from region names and state names
    uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
    uni_towns['State'] = uni_towns['State'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))
    uni_towns = pd.DataFrame(uni_towns, columns=['State', 'RegionName']).sort_values(by=['State', 'RegionName'])
    return uni_towns

The first line obviously reads the text file. After that, every value in RegionName that contains the word edit is also a state:

uni_towns['State'] = np.where(uni_towns['RegionName'].str.contains('edit'), uni_towns['RegionName'], '')
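To make the tagging step concrete, here is a minimal sketch on a few made-up rows (the sample values are assumptions, not taken from the real file). It uses Series.where in place of np.where plus replace, which should give the same result for this case: header rows keep their text, all other rows become NaN, and ffill carries each state down to the towns below it.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as university_towns.txt
sample = pd.DataFrame({'RegionName': [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]})

# Rows that contain "edit" are treated as state headers
is_state = sample['RegionName'].str.contains('edit')

# Keep the text only on header rows (NaN elsewhere), then forward-fill
# so every town row inherits the state header above it
sample['State'] = sample['RegionName'].where(is_state).ffill()

print(sample[['State', 'RegionName']])
```

Note that the header rows themselves still remain in RegionName after this step; nothing here removes them.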

Then I remove everything between parentheses () and square brackets [] from the RegionName values:

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\([^)]*\)', '', x))

uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: re.sub(r'\[[^\]]*\]', '', x))

So if a value looks like Alabama[edit] or Tuscaloosa (University of Alabama), it becomes Alabama or Tuscaloosa, respectively.
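The two substitutions can be tried in isolation with only the standard library. This is a small sketch using the same two patterns as the script; the helper name strip_annotations and the final strip() call are additions for illustration (the original script does not trim the space left behind before a removed parenthesis).

```python
import re

PAREN = re.compile(r'\([^)]*\)')     # a complete "(...)" group
BRACKET = re.compile(r'\[[^\]]*\]')  # a complete "[...]" group

def strip_annotations(text):
    """Remove complete (...) and [...] groups, then trim whitespace."""
    text = PAREN.sub('', text)
    text = BRACKET.sub('', text)
    return text.strip()

examples = ['Alabama[edit]', 'Tuscaloosa (University of Alabama)[1]']
cleaned = [strip_annotations(s) for s in examples]
print(cleaned)
```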

Then I do the same for the State column, because I move values from RegionName into it whenever they contain [edit].

I use the code below because a few rows look like "Tuscaloosa (University of Alabama", with only an opening "(", so the regex pattern does not match them:
uni_towns['RegionName'] = uni_towns['RegionName'].apply(lambda x: x.split(split_string, 1)[0])
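The difference between the two approaches shows up on a single unclosed value. The sketch below is standard library only; the last variant, a single pattern that cuts from the first "(" to the end of the string, is an alternative suggested here as an assumption, not something from the original script.

```python
import re

PAREN = re.compile(r'\([^)]*\)')

unclosed = 'Tuscaloosa (University of Alabama'

# The pattern requires a closing ")", so nothing is removed here
after_regex = PAREN.sub('', unclosed)

# Splitting at the first "(" keeps only the text before it
after_split = unclosed.split('(', 1)[0].strip()

# Alternative: cut from the first "(" to the end of the string,
# whether or not the parenthesis is ever closed
after_cut = re.sub(r'\s*\(.*', '', unclosed)

print(repr(after_regex), repr(after_split), repr(after_cut))
```

The single-pattern variant also discards anything that follows the parenthesis, such as a trailing [1], which happens to be what this dataset needs but may not suit other inputs.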

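The final result is: 567 rows × 2 columns

     State      RegionName
0    Alabama    Alabama
1    Alabama    Auburn
2    Alabama    Florence
3    Alabama    Jacksonville
...
564  Wisconsin  Whitewater
551  Wisconsin  Wisconsin
566  Wyoming    Laramie
565  Wyoming    Wyoming

whereas the correct result should be 517 rows × 2 columns.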
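The listing above contains rows such as Alabama / Alabama, where the state header itself shows up as a town. That may account for the surplus (567 - 517 = 50), although this is an inference from the sample output and not something confirmed in the post. A minimal sketch for counting and dropping such rows, using a made-up miniature of the result:

```python
import pandas as pd

# Hypothetical miniature of the result shown above
result = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alabama', 'Wyoming', 'Wyoming'],
    'RegionName': ['Alabama', 'Auburn', 'Florence', 'Laramie', 'Wyoming'],
})

# Header rows are the ones where the "town" is just the state name
is_header = result['State'] == result['RegionName']
print(is_header.sum(), 'header rows')

# Keep only the genuine town rows
towns = result[~is_header].reset_index(drop=True)
print(towns)
```

This comparison assumes both columns have been cleaned the same way; a stray trailing space in either one would hide a match.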
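After looking at the txt file, I see that some entries are followed by two consecutive \n when read, but the script does not detect this and treats the second line, the part before the \n, as still belonging to the same row.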
The documentation shows that the read_csv function has a skip_blank_lines option, so you can add skip_blank_lines=True to the read_csv call.
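A quick way to see the effect of this option is to read the same text twice. The sample text below is made up, and io.StringIO stands in for the real file. One caveat worth keeping in mind: skip_blank_lines already defaults to True in read_csv, so passing it explicitly may not change the row count of the original script.

```python
import io

import pandas as pd

# Hypothetical file content with blank lines between entries
text = "Alabama[edit]\n\nAuburn (Auburn University)[1]\n\n\nAlaska[edit]\n"

# Blank lines are dropped while parsing
with_skip = pd.read_csv(io.StringIO(text), header=None,
                        names=['RegionName'], skip_blank_lines=True)

# Blank lines are kept and show up as NaN rows
without_skip = pd.read_csv(io.StringIO(text), header=None,
                           names=['RegionName'], skip_blank_lines=False)

print(len(with_skip), len(without_skip))
```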
with open('university_towns.txt') as f:
    lines = f.readlines()

last_data = []
for line in lines:
    if line == "\n":  # skip lines that hold nothing but a newline
        continue
    last_data.append(line.strip("\n"))  # remove the newline at the end of the string
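A slightly more forgiving variant of the same idea treats whitespace-only lines as blank too, which a comparison against "\n" alone would miss. This is a sketch on made-up content, with io.StringIO standing in for the opened file.

```python
import io

# Hypothetical file content; the fourth line holds only spaces
sample_file = io.StringIO(
    "Alabama[edit]\n"
    "\n"
    "Auburn (Auburn University)[1]\n"
    "   \n"
    "Alaska[edit]\n"
    "Fairbanks\n"
)

# Keep only lines that still have text after stripping whitespace
non_blank = [line.strip() for line in sample_file if line.strip()]
print(non_blank)
```

The resulting list can be passed to pd.DataFrame(non_blank, columns=['RegionName']) if the rest of the script expects a DataFrame.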