VB.NET: How do I parse a text file by a unique identifying header or the FF (form feed) character?

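I have a text file from a credit reporting agency. The file contains unique file numbers, which I can find with a regex. The problem is that the data I want to extract for each file number is never in exactly the same location.

For example, if the text file contains file number TP067283, the birth date or Social Security number may appear in one column for that report and in a different position for another file number.

What stays constant is that every unique file number has a header line containing "TransUnion Credit Report" and the report is terminated by "End of TransUnion Report".

The data I need is always between these two markers.

For example: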

RF_TP067283               TRANSUNION CREDIT REPORT                          

 <FOR>          <SUB NAME>          <MKT SUB>  <INFILE>   <DATE>      <TIME> 
 (I) Y CH0001434                     06 CH     12/16      05/21/19    10:22CT

 <SUBJECT>                                          <SSN>                    
 TA****, K**                                        ###-##-####              
 <CURRENT ADDRESS>                                               <DATE RPTD> 
 1307 Blah CT., WHEELING IL. 66666                                12/16       
 ----------------------------------------------------------------------------
 S P E C I A L   M E S S A G E S                                             
 ****IDVISION ALERTS : CLEAR FOR ALL SEARCHES PERFORMED***                   
 ----------------------------------------------------------------------------
 M O D E L   P R O F I L E                                                   
 ***RECOVERY MODEL 1.0: NOT SCORED: INSUFFICIENT CREDIT***                   
 ----------------------------------------------------------------------------


                             END OF TRANSUNION REPORT                        
                                                                            
RF_TP067284               TRANSUNION CREDIT REPORT                          

 <FOR>          <SUB NAME>          <MKT SUB>  <INFILE>   <DATE>      <TIME> 
 (I) Y CH0001434                     07 RK      4/05      05/21/19    10:22CT

 <SUBJECT>                                          <SSN>        <BIRTH DATE>
 P****, A*****  K.                                 ***-**-****   2/87       

 <CURRENT ADDRESS>                                               <DATE RPTD> 
 93 W. AUBURNDALE AV., CORTLAND IL. 66666                        10/06       

 ----------------------------------------------------------------------------
 S P E C I A L   M E S S A G E S                                             
 ****IDVISION ALERTS : CLEAR FOR ALL SEARCHES PERFORMED***                   
 ----------------------------------------------------------------------------
 M O D E L   P R O F I L E                                                   
 ***RECOVERY MODEL 1.0 SCORE +519  : ***                                     
 ----------------------------------------------------------------------------
 C R E D I T   S U M M A R Y      * * *    T O T A L  F I L E  H I S T O R Y 
 PR=0 COL=5  NEG=13 HSTNEG=0     TRD=27 RVL=11 INST=16 MTG=0  OPN=0  INQ=9  
 C R E D I T  R E P O R T  S E R V I C E D  B Y :                            
 TRANSUNION                                                    800-888-4213  
 2 BALDWIN PLACE, P.O. BOX 1000 CHESTER, PA 19016                            
 CONSUMER DISCLOSURES CAN BE OBTAINED ONLINE THROUGH TRANSUNION AT:          
      HTTP://WWW.TRANSUNION.COM                                              

                             END OF TRANSUNION REPORT 
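The file number is always at the top left of the header line. The information I want to extract is always between the header and the end-of-report line.

For example, the first report has an <SSN> field below the subject line, but it does not have a <BIRTH DATE>. Sometimes the fields are not in the same position either.

What I have tried so far is a StreamReader, but if a file number is missing the birth date or the SSN, my output columns end up uneven because of the missing data.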
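
One way to keep the output columns even is to treat each report as a unit, so that every report produces exactly one row and a missing field becomes an empty cell. The following is a minimal sketch of that grouping step, not a definitive implementation. The module name `ReportSplitter` and the function `SplitReports` are hypothetical, and the sketch assumes that any form feed characters in the file are page-break noise that can simply be dropped.

```vbnet
Imports System
Imports System.Collections.Generic

Module ReportSplitter
    ' Hypothetical helper: groups lines into one list per report.
    ' A block starts at the header line and ends at the footer line.
    Public Function SplitReports(lines As IEnumerable(Of String)) As List(Of List(Of String))
        Dim reports As New List(Of List(Of String))
        Dim current As List(Of String) = Nothing
        Dim formFeed As String = Convert.ToChar(12).ToString()

        For Each rawLine As String In lines
            ' Assumption: form feeds carry no data, so they are removed.
            Dim line As String = rawLine.Replace(formFeed, "")

            If line.Contains("TRANSUNION CREDIT REPORT") Then
                current = New List(Of String)   ' header found: start a new block
            End If

            If current IsNot Nothing Then
                current.Add(line)
                If line.Contains("END OF TRANSUNION REPORT") Then
                    reports.Add(current)        ' footer found: block is complete
                    current = Nothing
                End If
            End If
        Next

        Return reports
    End Function

    Sub Main()
        Dim sample As String() = {
            "RF_TP067283               TRANSUNION CREDIT REPORT",
            " <SUBJECT>                <SSN>",
            "                          END OF TRANSUNION REPORT",
            "RF_TP067284               TRANSUNION CREDIT REPORT",
            "                          END OF TRANSUNION REPORT"
        }
        For Each report In SplitReports(sample)
            ' Prints "RF_TP067283: 3 lines" and then "RF_TP067284: 2 lines"
            Console.WriteLine(report(0).Substring(0, 11) & ": " & report.Count & " lines")
        Next
    End Sub
End Module
```

For a real file, the result of `File.ReadLines(path)` could be passed to `SplitReports` so the input is streamed rather than loaded at once.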

Here is my current code:

Dim textfile = "C:\Users\username\DeskTop\Fucked up sample data.txt"

Sub Main()
    Dim foundfileNumbers = FindFileNumbers(textfile)
    For Each filenumber In foundfileNumbers
        getFileNumberData(filenumber.ToString, textfile)
        'Console.WriteLine(filenumber.ToString)
    Next
End Sub

Public Function FindFileNumbers(ByVal textfile As String)
    Dim filereader As New System.IO.StreamReader(textfile)
    Dim pages As List(Of String) = New List(Of String)
    Do While filereader.Peek() <> -1

        Dim regexPattern = "TP[0-9]{6}"
        Dim reg = New Regex(regexPattern)

        Dim currenttext As String = Nothing

        Dim textline As String = filereader.ReadLine()
        currenttext = textline.Substring(0, 77)
        currenttext.IndexOf("", StringComparison.InvariantCultureIgnoreCase)
        Dim matches = reg.Matches(currenttext)

        For Each m In matches
            'Console.WriteLine(m.ToString)
            pages.Add(m.ToString)
        Next m

    Loop
    Return pages

End Function
This is where I'm stuck:

Public Function getFileNumberData(ByVal filenumber As String, ByVal textfile As String) As String
    Dim returnElement As String = Nothing
    Dim filereader As New System.IO.StreamReader(textfile)

    Do While filereader.Peek() <> -1

        Dim textline = filereader.ReadLine()
        If textline.Substring(0, 77).Contains(filenumber) Then
            Do While filereader.Peek() <> -1
                Dim textline2 = filereader.ReadLine()
                If textline2.Substring(0, 77).Contains("BIRTH DATE") Then
                    Dim textline3 = filereader.ReadLine()
                    returnElement = textline3.Substring(0, 77)
                    Console.WriteLine(returnElement)
                Else
                    returnElement = "No DOB"
                    Console.WriteLine(returnElement)
                End If

            Loop
        End If

    Loop
    Return returnElement
End Function
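
Two things stand out in the function above: the `Else` branch runs for every line that lacks "BIRTH DATE", so "No DOB" is printed once per line, and the inner loop never stops at the end-of-report line, so it can read a birth date that belongs to a later report. A possible alternative, sketched here under stated assumptions, is to search only the lines of one report, find the column where the tag starts, and read the same column on the next line. The names `FieldLookup` and `GetFieldBelowTag` are hypothetical, and the sketch assumes the value begins at or after the tag's column, which may not hold for every layout.

```vbnet
Imports System
Imports System.Collections.Generic

Module FieldLookup
    ' Hypothetical helper: finds the line containing the tag (e.g. "<BIRTH DATE>")
    ' and reads up to `width` characters from the same column on the next line.
    ' Returns "" when the tag is absent from the report.
    Public Function GetFieldBelowTag(report As IList(Of String), tag As String, width As Integer) As String
        For i As Integer = 0 To report.Count - 2
            Dim column As Integer = report(i).IndexOf(tag, StringComparison.Ordinal)
            If column >= 0 Then
                Dim valueLine As String = report(i + 1)
                If column >= valueLine.Length Then Return ""
                ' Clamp the width so Substring cannot run past the end of a short line.
                Dim available As Integer = Math.Min(width, valueLine.Length - column)
                Return valueLine.Substring(column, available).Trim()
            End If
        Next
        Return ""
    End Function

    Sub Main()
        ' PadRight keeps the tag line and the value line column-aligned.
        Dim report As New List(Of String) From {
            "RF_TP067284               TRANSUNION CREDIT REPORT",
            " <SUBJECT>".PadRight(26) & "<SSN>".PadRight(13) & "<BIRTH DATE>",
            " P****, A*****  K.".PadRight(26) & "***-**-****".PadRight(13) & "2/87",
            "                          END OF TRANSUNION REPORT"
        }
        Console.WriteLine("[" & GetFieldBelowTag(report, "<BIRTH DATE>", 12) & "]") ' [2/87]
        Console.WriteLine("[" & GetFieldBelowTag(report, "<SSN>", 11) & "]")        ' [***-**-****]
        Console.WriteLine("[" & GetFieldBelowTag(report, "<DATE RPTD>", 11) & "]")  ' []
    End Sub
End Module
```

Because a missing tag yields an empty string only after the whole report has been scanned, each report contributes exactly one value per field.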
Imports System.IO
Imports System.IO.Path
Imports Microsoft.Office.Interop.Excel
Imports Microsoft.Office.Interop
Imports System.Text.RegularExpressions
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text
Module Module1
    Dim xlApp As New Excel.Application
    Dim xlWorkbook As Excel.Workbook
    Dim xlWorksheet1 As Excel.Worksheet
    Dim counter As Integer = 0
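    ' Module-level fields get explicit types; without "As String" they are typed as Object.
    Dim CBR As String = Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "CBR")
    Dim textfile As String = "C:\Users\username\DeskTop\CBR large data.txt"

    Dim TextFileFolder As String = Combine(CBR, "Text Files")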
    Sub Main()
        If Not Directory.Exists(CBR) Then
            Directory.CreateDirectory(CBR)
        End If
        If Not Directory.Exists(TextFileFolder) Then
            Directory.CreateDirectory(TextFileFolder)
        End If
        Dim rmt = "C:\Users\xborja\DeskTop\CBR large data.txt"

        xlWorkbook = xlApp.Workbooks.Add()
        xlWorksheet1 = CType(xlWorkbook.Sheets("Sheet1"), Excel.Worksheet)
        xlWorksheet1.Cells(1, 1) = "FILENO"
        xlWorksheet1.Cells(1, 2) = "NUMBER"
        xlWorksheet1.Cells(1, 3) = "DOB"
        xlWorksheet1.Cells(1, 4) = "HIGH RISK"
        xlWorksheet1.Cells(1, 5) = "SCORE"
        xlWorksheet1.Cells(1, 6) = "PR"
        xlWorksheet1.Cells(1, 7) = "MTG"
        xlWorksheet1.Range("C:C").NumberFormat = "m/d/yyyy"




        ' Read the file line by line and hand each complete report to CheckAndProcess.
        ' Using closes the reader even if an exception is thrown.
        Using reader = File.OpenText(textfile)
            ' Preallocate a reasonably sized buffer to limit reallocations.
            Dim builder = New StringBuilder(9000)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                builder.AppendLine(line)
                If line.Contains("END OF TRANSUNION REPORT") Then
                    CheckAndProcess(builder.ToString())
                    builder.Clear()
                End If
                line = reader.ReadLine()
            End While
        End Using

        xlWorkbook.SaveAs(CBR & "/" & System.DateTime.Now.ToString("yyyyMMdd") & " CBR.xlsx")
        xlWorkbook.Close(True)
        Process.Start("explorer.exe", CBR)



        xlApp.Quit()
        ' Release COM objects from the innermost outward: worksheet, workbook, then application.
        releaseObject(xlWorksheet1)
        releaseObject(xlWorkbook)
        releaseObject(xlApp)
    End Sub
    Public Sub CheckAndProcess(ByVal page As String)
        Dim filenumber = Nothing
        Dim foundfileNumbers = FindFileNumbers(textfile)
        Dim BirthDateidx = Nothing
        Dim SSNidx = Nothing
        Dim ScoreIdx = Nothing
        Dim PRidx = Nothing
        Dim MTGidx = Nothing
        Dim DOB = Nothing
        Dim SSN = Nothing
        Dim Risk = Nothing
        Dim PR = Nothing
        Dim MTG = Nothing
        Dim Score = Nothing

        Dim pdfDoc As New Document()

        For Each filenumber In foundfileNumbers

            If page.Contains((filenumber)) Then
                Dim pdfWrite As PdfWriter = PdfWriter.GetInstance(pdfDoc, New FileStream("C:\Users\xborja\Desktop\CBR\Text Files\" & filenumber & ".pdf", FileMode.Create))
                pdfDoc.Open()
                pdfDoc.Add(New Paragraph(page))

                counter += 1
                xlWorksheet1.Cells(counter + 1, 1) = filenumber
                If Not page.Contains("BIRTH DATE") Then
                    Console.WriteLine("no DOB")
                    DOB = " "
                    xlWorksheet1.Cells(counter + 1, 3) = DOB

                Else
                    BirthDateidx = page.IndexOf("BIRTH DATE") + 78 'returning birth date index plus enough characters to go to next line
                    Console.WriteLine(page.Substring(BirthDateidx, 5).Replace(" ", "0"))
                    DOB = page.Substring(BirthDateidx, 5).Replace(" ", "0")
                    xlWorksheet1.Cells(counter + 1, 3) = DOB

                End If

                If Not page.Contains("SSN") Then
                    Console.WriteLine("no ssn")
                    SSN = " "
                    xlWorksheet1.Cells(counter + 1, 8) = SSN

                Else
                    SSNidx = page.IndexOf("SSN") + 78
                    Console.WriteLine(page.Substring(SSNidx, 11))
                    SSN = page.Substring(SSNidx, 11)
                    xlWorksheet1.Cells(counter + 1, 8) = SSN

                End If

                If page.Contains("CLEAR FOR ALL SEARCHES PERFORMED") Then
                    Risk = "N"
                    xlWorksheet1.Cells(counter + 1, 4) = Risk
                    Console.WriteLine("N")

                Else
                    Risk = "Y"
                    xlWorksheet1.Cells(counter + 1, 4) = Risk
                    Console.WriteLine("Y")

                End If
                If Not page.Contains("RECOVERY MODEL 1.0 SCORE") Then
                    Console.WriteLine("no score")
                    Score = " "
                    xlWorksheet1.Cells(counter + 1, 5) = Score
                Else
                    ScoreIdx = page.IndexOf("RECOVERY MODEL 1.0 SCORE") + 26
                    Console.WriteLine(page.Substring(ScoreIdx, 3))
                    Score = page.Substring(ScoreIdx, 3)
                    xlWorksheet1.Cells(counter + 1, 5) = Score

                End If

                If page.Contains("PR=") Then
                    PRidx = page.IndexOf("PR=") + 3
                    Console.WriteLine(page.Substring(PRidx, 1))
                    PR = page.Substring(PRidx, 1)
                    xlWorksheet1.Cells(counter + 1, 6) = PR

                Else
                    Console.WriteLine("NO PR")
                    PR = "0"
                    xlWorksheet1.Cells(counter + 1, 6) = PR

                End If

                If page.Contains("MTG=") Then
                    MTGidx = page.IndexOf("MTG=") + 4
                    Console.WriteLine(page.Substring(MTGidx, 1))
                    MTG = page.Substring(MTGidx, 1)
                    xlWorksheet1.Cells(counter + 1, 7) = MTG

                Else
                    Console.WriteLine("NO MTG")
                    MTG = "0"
                    xlWorksheet1.Cells(counter + 1, 7) = MTG

                End If

                ' Save the raw report text only for the file number that matched this page.
                File.AppendAllText("C:\Users\xborja\Desktop\CBR\Text Files\" & filenumber & ".txt", page)

            End If


        Next
        pdfDoc.Close()

    End Sub
    ' Scans the whole file and returns every file number (TP followed by six digits).
    Public Function FindFileNumbers(ByVal textfile As String) As List(Of String)
        Dim pages As New List(Of String)
        ' Build the regex once instead of once per line.
        Dim reg As New Regex("TP[0-9]{6}")

        ' Using closes the reader even if an exception is thrown.
        Using filereader As New System.IO.StreamReader(textfile)
            Do While filereader.Peek() <> -1
                Dim textline As String = filereader.ReadLine()
                ' Only inspect the first 77 characters; shorter lines are used whole
                ' so that Substring cannot throw ArgumentOutOfRangeException.
                Dim currenttext As String = textline.Substring(0, Math.Min(77, textline.Length))

                For Each m As Match In reg.Matches(currenttext)
                    pages.Add(m.Value)
                Next
            Loop
        End Using

        Return pages
    End Function
    Private Sub releaseObject(ByVal aCell As Object)
        Try
            System.Runtime.InteropServices.Marshal.ReleaseComObject(aCell)
            aCell = Nothing
        Catch ex As Exception
            aCell = Nothing
        Finally
            GC.Collect()
        End Try
    End Sub



End Module
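
The module above reads a single character after "PR=" and "MTG=" and a fixed three characters for the score, so a count of 10 or more would be cut short. A regex capture is one possible way around that. This is a minimal sketch rather than a drop-in replacement; `SummaryFields`, `GetCounter`, and `GetScore` are hypothetical names, and the patterns assume the summary layout shown in the sample reports.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SummaryFields
    ' Hypothetical helper: returns all digits following "NAME=", or the fallback when absent.
    ' The \b anchor keeps "NEG" from matching inside "HSTNEG".
    Public Function GetCounter(page As String, name As String, fallback As String) As String
        Dim m As Match = Regex.Match(page, "\b" & Regex.Escape(name) & "=\s*(\d+)")
        Return If(m.Success, m.Groups(1).Value, fallback)
    End Function

    ' Hypothetical helper: returns the score after "RECOVERY MODEL 1.0 SCORE"
    ' without a leading plus sign, or "" when the report was not scored.
    Public Function GetScore(page As String) As String
        Dim m As Match = Regex.Match(page, "RECOVERY MODEL 1\.0 SCORE\s*([+-]?\d+)")
        Return If(m.Success, m.Groups(1).Value.TrimStart("+"c), "")
    End Function

    Sub Main()
        Dim page As String =
            "***RECOVERY MODEL 1.0 SCORE +519  : ***" & Environment.NewLine &
            "PR=0 COL=5  NEG=13 HSTNEG=0     TRD=27 RVL=11 INST=16 MTG=0  OPN=0  INQ=9"
        Console.WriteLine(GetScore(page))                 ' 519
        Console.WriteLine(GetCounter(page, "NEG", "0"))   ' 13
        Console.WriteLine(GetCounter(page, "MTG", "0"))   ' 0
        Console.WriteLine(GetCounter(page, "BKR", "0"))   ' 0 (fallback, tag not present)
    End Sub
End Module
```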