Python: parsing a large text document, keeping only the account number and a specific keyword ("Market Value")
I have a large text document (~20,000 lines) whose body looks like this:
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Charge Group
Portfolio Fee
Date
Our / Your Ref
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Debt Instruments
PORTFOLIO FEE
CS
USD
USD 219.12 USD 219.12
14,136,666.31
Invoice Account / Name:
021346676343/ abcdefgcopr
M0919-031 / Page 3 of 35
Charge Group
Portfolio Fee
Date
Our / Your Re
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Equity Instruments
USD 788,640.00 USD 12.22
USD 12.22
PORTFOLIO FEE-
EC_CS
Invoice Account / Name:
123498761233/ somethingelsecorporation
Charge Group
Portfolio Fee
Date
Our / Your Ref
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Blocks like this repeat thousands of times.
The output I am trying to get:
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Market Value
Invoice Account / Name:
021346676343/ abcdefgcopr
Market Value
Invoice Account / Name:
123498761233/ somethingelsecorporation
Market Value
Since I have never tried anything like this before, I have two questions:
1. How do I identify and keep lines like these:
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
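which do not have a fixed length?
2. Is it sensible to use nltk for this, or can it be handled with regular expressions and string processing?

You can use string processing to search for and find what you are looking for: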
result = []
with open('num.txt', 'r') as file:
    data = file.readlines()

for indx, row in enumerate(data):
    if 'Invoice Account' in row:
        accountnumber = data[indx+1].split('/')[0].strip()  # Get account number from next line
        companyname = data[indx+1].split('/')[1].strip()  # Get company name from next line
        # Store all results in a dictionary, you could print, store in other ways as well.
        info = {'Account Number': accountnumber,
                'Company Name': companyname,
                'Market Value': '',
                }
        # Append the dictionary to a list called result
        result.append(info)
You can then access the data directly from each dictionary, each of which holds the values for a single company:
for entry in result:
    print(f"""Account Name: {entry['Company Name']}
Account Number: {entry['Account Number']}
Market Value: {entry['Market Value']}
""")
Output:
Account Name: XYZCORPORATIONS
Account Number: 0234523454
Market Value:
Account Name: abcdefgcopr
Account Number: 021346676343
Market Value:
Account Name: somethingelsecorporation
Account Number: 123498761233
Market Value:
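The loop above assumes that every "Invoice Account" line is followed by another line and that this line contains a "/". A slightly more defensive sketch of the same idea is shown below. The function name `parse_accounts` and the in-memory `sample` list are made up for illustration and are not part of the original answer; `str.partition` is used instead of `split` so that a missing separator does not raise an `IndexError`.

```python
def parse_accounts(lines):
    """Return one dict per 'Invoice Account' header found in lines."""
    results = []
    for idx, row in enumerate(lines):
        if 'Invoice Account' not in row:
            continue
        # Guard against a header that is the very last line of the input.
        if idx + 1 >= len(lines):
            break
        # partition() never raises, even when the '/' separator is missing;
        # in that case the company name is simply an empty string.
        number, _sep, name = lines[idx + 1].partition('/')
        results.append({'Account Number': number.strip(),
                        'Company Name': name.strip()})
    return results


# Hypothetical sample data, shaped like the document in the question.
sample = [
    'Invoice Account / Name:\n',
    '0234523454 / XYZCORPORATIONS\n',
    'Charge Group\n',
    'Invoice Account / Name:\n',
    '021346676343/ abcdefgcopr\n',
    'Invoice Account / Name:\n',  # trailing header with nothing after it
]
accounts = parse_accounts(sample)
print(accounts)
```

With a real file, the list returned by `file.readlines()` could be passed in place of `sample`.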
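You can do it using only regex:
import re

with open('file.txt', 'r') as f:
    # Raw string, so no escaping is needed; each match is the header line plus the line after it
    matches = re.findall(r'Invoice Account / Name:\n.*', f.read())

with open('result.txt', 'w') as f:
    for m in matches:
        f.write(f'{m}\nMarket Value\n')
Output file: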
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Market Value
Invoice Account / Name:
021346676343/ abcdefgcopr
Market Value
Invoice Account / Name:
123498761233/ somethingelsecorporation
Market Value
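If the account number and company name are needed as separate values rather than as one matched string, the same regex idea can be extended with capture groups. The following is only a sketch under two assumptions that the question does not confirm: account numbers consist solely of digits, and the number and name always share the line directly below the header. The names `PATTERN`, `extract_accounts` and `sample_text` are illustrative.

```python
import re

# Header line, then a "digits / name" line directly below it.
# [ \t]* is used instead of \s* so optional whitespace cannot spill over a newline.
PATTERN = re.compile(
    r'^Invoice Account / Name:[ \t]*\n[ \t]*(\d+)[ \t]*/[ \t]*(.+)$',
    re.MULTILINE,
)


def extract_accounts(text):
    """Return a list of (account_number, company_name) tuples."""
    return [(m.group(1), m.group(2).strip()) for m in PATTERN.finditer(text)]


# Hypothetical sample text, shaped like the document in the question.
sample_text = (
    "Invoice Account / Name:\n"
    "0234523454 / XYZCORPORATIONS\n"
    "Charge Group\n"
    "Market Value\n"
    "Invoice Account / Name:\n"
    "021346676343/ abcdefgcopr\n"
    "M0919-031 / Page 3 of 35\n"
)
pairs = extract_accounts(sample_text)
print(pairs)
```

The page-footer line `M0919-031 / Page 3 of 35` also contains a "/", but it is not matched because it does not directly follow a header line.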
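Do you want to store the actual market value, or just the text "Market Value"?
Just the text "Market Value". I don't need any other numbers apart from the account number.
What exactly is the problem? What have you tried, and have you done any research?
On a larger dataset this is much faster than my answer. +1
@Alderven, thanks for your solution so far. However, it assumes there is only one "Market Value" keyword between each "Invoice Account / Name". In the txt file, "Market Value" sometimes appears several times between two account numbers:
Invoice Account / Name:
0234523454 / XYZCORPORATIONS
Market Value
Market Value
Invoice Account / Name:
021346676343/ abcdefgcopr
Market Value
Invoice Account / Name:
123498761233/ somethingelsecorporation
Market Value
Market Value
Market Value
Apologies for not explaining this earlier, rookie mistake.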
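The last comment asks for every occurrence of "Market Value" between two account headers to be kept, which neither answer does as written. One possible way to get that is a single pass that keeps the header line, the line right after it, and each line equal to the keyword. This is a sketch only: the function name `keep_accounts_and_keyword` and the sample list are hypothetical, and it assumes the keyword always sits alone on its own line, as in the sample in the question.

```python
def keep_accounts_and_keyword(lines, keyword='Market Value'):
    """Keep each header, the account line after it, and every keyword line."""
    kept = []
    take_next = False  # True right after a header: the next line holds the account
    for raw in lines:
        line = raw.strip()
        if take_next:
            kept.append(line)
            take_next = False
        elif line.startswith('Invoice Account'):
            kept.append(line)
            take_next = True
        elif line == keyword:
            kept.append(line)
    return kept


# Hypothetical sample with the keyword repeated inside the first block.
sample = [
    'Invoice Account / Name:',
    '0234523454 / XYZCORPORATIONS',
    'Charge Group',
    'Market Value',
    'Charge Amt Invoice Amt',
    'Market Value',
    'Invoice Account / Name:',
    '021346676343/ abcdefgcopr',
    'Market Value',
]
filtered = keep_accounts_and_keyword(sample)
print('\n'.join(filtered))
```

Because the function only iterates over `lines` once, an open file object could be passed directly instead of a list, which avoids loading the whole document into memory.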