Python: parsing a large text document and keeping only the "account number" and a specific keyword ("Market Value")

I have a large text document (~20,000 lines) whose body looks like this:

Invoice Account / Name: 
0234523454 / XYZCORPORATIONS
Charge Group
Portfolio Fee
Date
Our / Your Ref
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Debt Instruments
PORTFOLIO FEE
CS
USD 
USD 219.12 USD 219.12
14,136,666.31
 Invoice Account / Name: 
021346676343/ abcdefgcopr
M0919-031  / Page 3 of 35
Charge Group
Portfolio Fee
Date
Our / Your Re
Security / Category
Charge Item
No of Units
Market Value
Charge Amt Invoice Amt
30-Sep-2019
Equity Instruments
USD 788,640.00 USD 12.22
USD 12.22
PORTFOLIO FEE-
EC_CS
 Invoice Account / Name: 
123498761233/ somethingelsecorporation
Charge Group
Portfolio Fee
Date
Our / Your Ref
Blocks like this are repeated thousands of times. I am trying to produce the following output:

Invoice Account / Name: 
    0234523454 / XYZCORPORATIONS
Market Value
Invoice Account / Name: 
    021346676343/ abcdefgcopr
Market Value
Invoice Account / Name: 
    123498761233/ somethingelsecorporation
Market Value
Since I have never tried anything like this before, I have a few questions:
1. How do I identify and keep lines like these:

Invoice Account / Name: 
0234523454 / XYZCORPORATIONS
which do not have a fixed length?

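2. How do I keep only the keyword "Market Value" and drop the rest?

3. Is it wise to use nltk for this, or can it be handled with regular expressions and string processing?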

    You can use plain string processing to search for and find what you are looking for:

    result = []

    with open('num.txt', 'r') as file:
        data = file.readlines()

    for indx, row in enumerate(data):
        # The account details are on the line after the header,
        # so ignore a header that happens to be the last line of the file.
        if 'Invoice Account' in row and indx + 1 < len(data):

            # Split "number / name" once on the first slash
            accountnumber, _, companyname = data[indx + 1].partition('/')

            # Store all results in a dictionary; you could print or store them in other ways as well.
            info = {'Account Number': accountnumber.strip(),
                    'Company Name': companyname.strip(),
                    'Market Value': '',  # left empty: only the label is wanted, not the amount
                    }

            # Append the dictionary to a list called result
            result.append(info)

    
    You can then access the data directly from each dictionary, each of which holds the values for a single company:

    for data in result:
        print(f"""Account Name: {data['Company Name']}
    Account Number: {data['Account Number']}
    Market Value: {data['Market Value']}
    """)
    
    Output:

    Account Name: XYZCORPORATIONS
    Account Number: 0234523454
    Market Value: 
    
    Account Name: abcdefgcopr
    Account Number: 021346676343
    Market Value: 
    
    Account Name: somethingelsecorporation
    Account Number: 123498761233
    Market Value: 
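
The same line-by-line idea can be wrapped in a function that accepts any list of lines, which makes it easy to try without a file on disk. This is only a minimal sketch: the function name `extract_accounts` and the `sample` list are hypothetical, and it assumes the account details always sit on the line directly after the header.

```python
def extract_accounts(lines):
    """Return one dict per 'Invoice Account' header found in lines."""
    lines = list(lines)
    accounts = []
    for i, row in enumerate(lines):
        # Assumption: the account details are on the line after the header
        if 'Invoice Account' in row and i + 1 < len(lines):
            number, _, name = lines[i + 1].partition('/')
            accounts.append({'Account Number': number.strip(),
                             'Company Name': name.strip()})
    return accounts


# Hypothetical sample, shaped like the data in the question
sample = [
    'Invoice Account / Name: \n',
    '0234523454 / XYZCORPORATIONS\n',
    'Charge Group\n',
    'Market Value\n',
    ' Invoice Account / Name: \n',
    '021346676343/ abcdefgcopr\n',
]
accounts = extract_accounts(sample)
print(accounts)
```

Passing an open file object instead of `sample` should work the same way, since the function only needs something it can turn into a list of lines.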
    

    You can do it using only regex:

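    import re

    with open('file.txt', 'r') as f:
        # Raw string avoids invalid escapes; [ \t]* tolerates trailing whitespace after "Name:"
        matches = re.findall(r'Invoice Account / Name:[ \t]*\n.*', f.read())

    with open('result.txt', 'w') as f:
        # Write each header and account line, followed by the "Market Value" label
        for m in matches:
            f.write(f'{m}\nMarket Value\n')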
    
    Output file:

    Invoice Account / Name:
    0234523454 / XYZCORPORATIONS
    Market Value
    Invoice Account / Name:
    021346676343/ abcdefgcopr
    Market Value
    Invoice Account / Name:
    123498761233/ somethingelsecorporation
    Market Value
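
If the account number and company name are needed separately, the regex approach can be extended with named groups. The following is a sketch under assumptions, not part of the answer above: the pattern name `ACCOUNT_RE`, the group names and the sample `text` are hypothetical, and it assumes the account number consists only of digits.

```python
import re

# Assumed layout: header line, then "<digits> / <name>" on the next line
ACCOUNT_RE = re.compile(
    r'Invoice Account / Name:[ \t]*\n'
    r'[ \t]*(?P<number>\d+)[ \t]*/[ \t]*(?P<name>[^\n]+)'
)

# Hypothetical sample, shaped like the data in the question
text = (
    'Invoice Account / Name: \n'
    '0234523454 / XYZCORPORATIONS\n'
    'Market Value\n'
    ' Invoice Account / Name: \n'
    '021346676343/ abcdefgcopr\n'
)

pairs = [(m.group('number'), m.group('name').strip())
         for m in ACCOUNT_RE.finditer(text)]
print(pairs)
```

`re.finditer` yields match objects one at a time, so the groups can be read by name instead of splitting the matched string afterwards.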
    

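    Do you want to store the actual market value, or just the text "Market Value"?

    Only the text "Market Value". I don't need any other numbers apart from the account number.

    What exactly is the problem? Have you tried anything or done any research?

    On a larger dataset this is much faster than my answer. +1

    @Alderven, thanks for the solution so far. But it assumes there is only one "Market Value" keyword between each "Invoice Account / Name". In the txt file, however, "Market Value" sometimes occurs several times between two account numbers:

    Invoice Account / Name:
    0234523454 / XYZCORPORATIONS
    Market Value
    Market Value
    Invoice Account / Name:
    021346676343/ abcdefgcopr
    Market Value
    Invoice Account / Name:
    123498761233/ somethingelsecorporation
    Market Value
    Market Value
    Market Value

    Apologies for not explaining this earlier, rookie mistake.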
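
One possible way to handle the repeated-keyword case raised in the last comment is a single pass that keeps each header, the line after it, and every line that is exactly "Market Value". This is a hedged sketch rather than an answer from the thread: the function name `keep_accounts_and_market_value` and the sample lines are hypothetical, and it assumes the keyword always appears alone on its own line.

```python
def keep_accounts_and_market_value(lines):
    """Keep each account header, the line after it, and every 'Market Value' line."""
    kept = []
    take_next = False  # True when the previous line was an account header
    for raw in lines:
        line = raw.strip()
        if take_next:
            kept.append(line)  # the "number / name" line
            take_next = False
        elif line.startswith('Invoice Account / Name:'):
            kept.append(line)
            take_next = True
        elif line == 'Market Value':
            kept.append(line)  # kept once per occurrence
    return kept


# Hypothetical sample with the keyword repeated between two accounts
sample = [
    'Invoice Account / Name: ',
    '0234523454 / XYZCORPORATIONS',
    'Market Value',
    'USD 219.12 USD 219.12',
    'Market Value',
    ' Invoice Account / Name: ',
    '021346676343/ abcdefgcopr',
    'Market Value',
]
kept = keep_accounts_and_market_value(sample)
print('\n'.join(kept))
```

Because it reads one line at a time, the same function can be fed an open file object directly, which avoids loading the whole document into memory.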