VB.NET: How do I parse a text file by a unique identifying header or the FF (form feed) character?
I have a text file from a credit reporting agency. The file contains unique file numbers, which I can find with a regex. The problem is that the data I want to extract for each file number is never in exactly the same place.

For example, if the text file contains file number TP067283, the birth date or Social Security number may appear in one column for that file number and in a different position for another.

What is constant is that every file number has its own header line, "TRANSUNION CREDIT REPORT", and is terminated by "END OF TRANSUNION REPORT". The data always lies between these two markers.

For example:
RF_TP067283 TRANSUNION CREDIT REPORT
<FOR> <SUB NAME> <MKT SUB> <INFILE> <DATE> <TIME>
(I) Y CH0001434 06 CH 12/16 05/21/19 10:22CT
<SUBJECT> <SSN>
TA****, K** ###-##-####
<CURRENT ADDRESS> <DATE RPTD>
1307 Blah CT., WHEELING IL. 66666 12/16
----------------------------------------------------------------------------
S P E C I A L M E S S A G E S
****IDVISION ALERTS : CLEAR FOR ALL SEARCHES PERFORMED***
----------------------------------------------------------------------------
M O D E L P R O F I L E
***RECOVERY MODEL 1.0: NOT SCORED: INSUFFICIENT CREDIT***
----------------------------------------------------------------------------
END OF TRANSUNION REPORT
RF_TP067284 TRANSUNION CREDIT REPORT
<FOR> <SUB NAME> <MKT SUB> <INFILE> <DATE> <TIME>
(I) Y CH0001434 07 RK 4/05 05/21/19 10:22CT
<SUBJECT> <SSN> <BIRTH DATE>
P****, A***** K. ***-**-**** 2/87
<CURRENT ADDRESS> <DATE RPTD>
93 W. AUBURNDALE AV., CORTLAND IL. 66666 10/06
----------------------------------------------------------------------------
S P E C I A L M E S S A G E S
****IDVISION ALERTS : CLEAR FOR ALL SEARCHES PERFORMED***
----------------------------------------------------------------------------
M O D E L P R O F I L E
***RECOVERY MODEL 1.0 SCORE +519 : ***
----------------------------------------------------------------------------
C R E D I T S U M M A R Y * * * T O T A L F I L E H I S T O R Y
PR=0 COL=5 NEG=13 HSTNEG=0 TRD=27 RVL=11 INST=16 MTG=0 OPN=0 INQ=9
C R E D I T R E P O R T S E R V I C E D B Y :
TRANSUNION 800-888-4213
2 BALDWIN PLACE, P.O. BOX 1000 CHESTER, PA 19016
CONSUMER DISCLOSURES CAN BE OBTAINED ONLINE THROUGH TRANSUNION AT:
HTTP://WWW.TRANSUNION.COM
END OF TRANSUNION REPORT
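The file number is always at the top left of the header line. The information I want to extract is always between the header and the end marker.
For example, one report has a <BIRTH DATE> field and another does not (in the sample above, the second report has one and the first does not). Sometimes the fields are not in the same position either.
What I have tried so far is a StreamReader, but if a file number is missing the birth date or Social Security number, my data columns end up uneven because of the missing data.
Here is my code so far: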
Imports System.Text.RegularExpressions

Dim textfile = "C:\Users\username\DeskTop\Fucked up sample data.txt"

Sub Main()
    Dim foundfileNumbers = FindFileNumbers(textfile)
    For Each filenumber In foundfileNumbers
        getFileNumberData(filenumber.ToString, textfile)
        'Console.WriteLine(filenumber.ToString)
    Next
End Sub

' Collects every file number (TP followed by six digits) found in the file.
Public Function FindFileNumbers(ByVal textfile As String) As List(Of String)
    Dim pages As New List(Of String)
    Dim reg = New Regex("TP[0-9]{6}")
    Using filereader As New System.IO.StreamReader(textfile)
        Do While filereader.Peek() <> -1
            Dim textline As String = filereader.ReadLine()
            ' Search only the first 77 characters; guard against shorter lines,
            ' which would make Substring(0, 77) throw.
            Dim currenttext As String = textline.Substring(0, Math.Min(77, textline.Length))
            For Each m As Match In reg.Matches(currenttext)
                'Console.WriteLine(m.Value)
                pages.Add(m.Value)
            Next
        Loop
    End Using
    Return pages
End Function
This is where I am stuck:
Public Function getFileNumberData(ByVal filenumber As String, ByVal textfile As String) As String
    Dim returnElement As String = Nothing
    Dim filereader As New System.IO.StreamReader(textfile)
    Do While filereader.Peek() <> -1
        Dim textline = filereader.ReadLine()
        If textline.Substring(0, 77).Contains(filenumber) Then
            Do While filereader.Peek() <> -1
                Dim textline2 = filereader.ReadLine()
                If textline2.Substring(0, 77).Contains("BIRTH DATE") Then
                    Dim textline3 = filereader.ReadLine()
                    returnElement = textline3.Substring(0, 77)
                    Console.WriteLine(returnElement)
                Else
                    returnElement = "No DOB"
                    Console.WriteLine(returnElement)
                End If
            Loop
        End If
    Loop
    Return returnElement
End Function
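
One way to keep the columns even is to look up each value by the position of its header inside a single report and return a placeholder when the header is absent. The following is only a minimal sketch, not the code from the question: the name ValueUnderHeader and the module name are hypothetical, and it assumes the real file is fixed-width, so that each value sits directly under its header on the next line.

```vbnet
Imports System
Imports System.Collections.Generic

Module ReportFieldSketch
    ' Returns the text found directly under a column header such as "<BIRTH DATE>",
    ' or Nothing when the report has no such header.
    ' Assumption: values are aligned under their headers (fixed-width layout).
    Function ValueUnderHeader(reportLines As IList(Of String), header As String, width As Integer) As String
        For i As Integer = 0 To reportLines.Count - 2
            Dim col As Integer = reportLines(i).IndexOf(header, StringComparison.Ordinal)
            If col >= 0 Then
                Dim valueLine As String = reportLines(i + 1)
                ' The value line may be shorter than the header line
                If col >= valueLine.Length Then Return Nothing
                Dim len As Integer = Math.Min(width, valueLine.Length - col)
                Return valueLine.Substring(col, len).Trim()
            End If
        Next
        Return Nothing
    End Function

    Sub Main()
        ' Two illustrative reports: one with a birth date column, one without
        Dim withDob As New List(Of String) From {
            "<SUBJECT>                 <SSN>        <BIRTH DATE>",
            "P****, A***** K.          ***-**-****  2/87"
        }
        Dim withoutDob As New List(Of String) From {
            "<SUBJECT>                 <SSN>",
            "TA****, K**               ###-##-####"
        }
        Console.WriteLine(If(ValueUnderHeader(withDob, "<BIRTH DATE>", 12), "No DOB"))
        Console.WriteLine(If(ValueUnderHeader(withoutDob, "<BIRTH DATE>", 12), "No DOB"))
    End Sub
End Module
```

Because the lookup is done once per report rather than once per line, a missing field produces exactly one placeholder and every report still yields one row.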
Imports System.IO
Imports System.IO.Path
Imports Microsoft.Office.Interop.Excel
Imports Microsoft.Office.Interop
Imports System.Text.RegularExpressions
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text

Module Module1
    Dim xlApp As New Excel.Application
    Dim xlWorkbook As Excel.Workbook
    Dim xlWorksheet1 As Excel.Worksheet
    Dim counter As Integer = 0
    Dim CBR = Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "CBR")
    Dim textfile = "C:\Users\username\DeskTop\CBR large data.txt"
    Dim TextFileFolder = Combine(CBR, "Text Files")

    Sub Main()
        If Not Directory.Exists(CBR) Then
            Directory.CreateDirectory(CBR)
        End If
        If Not Directory.Exists(TextFileFolder) Then
            Directory.CreateDirectory(TextFileFolder)
        End If

        ' Create the workbook and write the header row
        xlWorkbook = xlApp.Workbooks.Add()
        xlWorksheet1 = CType(xlWorkbook.Sheets("Sheet1"), Excel.Worksheet)
        xlWorksheet1.Cells(1, 1) = "FILENO"
        xlWorksheet1.Cells(1, 2) = "NUMBER"
        xlWorksheet1.Cells(1, 3) = "DOB"
        xlWorksheet1.Cells(1, 4) = "HIGH RISK"
        xlWorksheet1.Cells(1, 5) = "SCORE"
        xlWorksheet1.Cells(1, 6) = "PR"
        xlWorksheet1.Cells(1, 7) = "MTG"
        xlWorksheet1.Range("C:C").NumberFormat = "m/d/yyyy"

        ' Buffer lines until the end-of-report marker, then process the complete report
        Using reader = File.OpenText(textfile)
            Dim builder = New StringBuilder(9000) ' Pre-sized buffer to avoid repeated reallocations
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                builder.AppendLine(line)
                If line.Contains("END OF TRANSUNION REPORT") Then
                    CheckAndProcess(builder.ToString())
                    builder.Clear()
                End If
                line = reader.ReadLine()
            End While
        End Using

        xlWorkbook.SaveAs(CBR & "/" & System.DateTime.Now.ToString("yyyyMMdd") & " CBR.xlsx")
        xlWorkbook.Close(True)
        Process.Start("explorer.exe", CBR)
        xlApp.Quit()
        releaseObject(xlApp)
        releaseObject(xlWorkbook)
        releaseObject(xlWorksheet1)
    End Sub

    ' Processes one complete report (header line through end marker):
    ' writes it to a PDF and a text file and adds one row to the worksheet.
    Public Sub CheckAndProcess(ByVal page As String)
        Dim filenumber = Nothing
        Dim foundfileNumbers = FindFileNumbers(textfile)
        Dim BirthDateidx = Nothing
        Dim SSNidx = Nothing
        Dim ScoreIdx = Nothing
        Dim PRidx = Nothing
        Dim MTGidx = Nothing
        Dim DOB = Nothing
        Dim SSN = Nothing
        Dim Risk = Nothing
        Dim PR = Nothing
        Dim MTG = Nothing
        Dim Score = Nothing
        Dim pdfDoc As New Document()
        For Each filenumber In foundfileNumbers
            If page.Contains(filenumber) Then
                Dim pdfWrite As PdfWriter = PdfWriter.GetInstance(pdfDoc, New FileStream("C:\Users\xborja\Desktop\CBR\Text Files\" & filenumber & ".pdf", FileMode.Create))
                pdfDoc.Open()
                pdfDoc.Add(New Paragraph(page))
                counter += 1
                xlWorksheet1.Cells(counter + 1, 1) = filenumber

                ' Birth date: optional field, leave the cell blank when absent
                If Not page.Contains("BIRTH DATE") Then
                    Console.WriteLine("no DOB")
                    DOB = " "
                    xlWorksheet1.Cells(counter + 1, 3) = DOB
                Else
                    BirthDateidx = page.IndexOf("BIRTH DATE") + 78 'returning birth date index plus enough characters to go to next line
                    Console.WriteLine(page.Substring(BirthDateidx, 5).Replace(" ", "0"))
                    DOB = page.Substring(BirthDateidx, 5).Replace(" ", "0")
                    xlWorksheet1.Cells(counter + 1, 3) = DOB
                End If

                ' SSN: same next-line offset as the birth date
                If Not page.Contains("SSN") Then
                    Console.WriteLine("no ssn")
                    SSN = " "
                    xlWorksheet1.Cells(counter + 1, 8) = SSN
                Else
                    SSNidx = page.IndexOf("SSN") + 78
                    Console.WriteLine(page.Substring(SSNidx, 11))
                    SSN = page.Substring(SSNidx, 11)
                    xlWorksheet1.Cells(counter + 1, 8) = SSN
                End If

                ' High risk flag: "N" only when every search came back clear
                If page.Contains("CLEAR FOR ALL SEARCHES PERFORMED") Then
                    Risk = "N"
                    xlWorksheet1.Cells(counter + 1, 4) = Risk
                    Console.WriteLine("N")
                Else
                    Risk = "Y"
                    xlWorksheet1.Cells(counter + 1, 4) = Risk
                    Console.WriteLine("Y")
                End If

                ' Score: the three digits after "RECOVERY MODEL 1.0 SCORE +"
                If Not page.Contains("RECOVERY MODEL 1.0 SCORE") Then
                    Console.WriteLine("no score")
                    Score = " "
                    xlWorksheet1.Cells(counter + 1, 5) = Score
                Else
                    ScoreIdx = page.IndexOf("RECOVERY MODEL 1.0 SCORE") + 26
                    Console.WriteLine(page.Substring(ScoreIdx, 3))
                    Score = page.Substring(ScoreIdx, 3)
                    xlWorksheet1.Cells(counter + 1, 5) = Score
                End If

                ' PR and MTG: the single character after "PR=" / "MTG=", defaulting to 0
                If page.Contains("PR=") Then
                    PRidx = page.IndexOf("PR=") + 3
                    Console.WriteLine(page.Substring(PRidx, 1))
                    PR = page.Substring(PRidx, 1)
                    xlWorksheet1.Cells(counter + 1, 6) = PR
                Else
                    Console.WriteLine("NO PR")
                    PR = "0"
                    xlWorksheet1.Cells(counter + 1, 6) = PR
                End If
                If page.Contains("MTG=") Then
                    MTGidx = page.IndexOf("MTG=") + 4
                    Console.WriteLine(page.Substring(MTGidx, 1))
                    MTG = page.Substring(MTGidx, 1)
                    xlWorksheet1.Cells(counter + 1, 7) = MTG
                Else
                    Console.WriteLine("NO MTG")
                    MTG = "0"
                    xlWorksheet1.Cells(counter + 1, 7) = MTG
                End If

                ' Write the text copy only for the file number this report belongs to
                File.AppendAllText("C:\Users\xborja\Desktop\CBR\Text Files\" & filenumber & ".txt", page)
            End If
        Next
        pdfDoc.Close()
    End Sub

    ' Collects every file number (TP followed by six digits) found in the file.
    Public Function FindFileNumbers(ByVal textfile As String) As List(Of String)
        Dim pages As New List(Of String)
        Dim reg = New Regex("TP[0-9]{6}")
        Using filereader As New System.IO.StreamReader(textfile)
            Do While filereader.Peek() <> -1
                Dim textline As String = filereader.ReadLine()
                ' Search only the first 77 characters; guard against shorter lines,
                ' which would make Substring(0, 77) throw.
                Dim currenttext As String = textline.Substring(0, Math.Min(77, textline.Length))
                For Each m As Match In reg.Matches(currenttext)
                    pages.Add(m.Value)
                Next
            Loop
        End Using
        Return pages
    End Function

    Private Sub releaseObject(ByVal aCell As Object)
        Try
            System.Runtime.InteropServices.Marshal.ReleaseComObject(aCell)
            aCell = Nothing
        Catch ex As Exception
            aCell = Nothing
        Finally
            GC.Collect()
        End Try
    End Sub
End Module
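
The fixed-length Substring calls above read exactly one character after "PR=" and "MTG=" and exactly three after the score prefix, so a value such as NEG=13 would be cut short. As a hedged alternative, the same fields can be captured with regular expressions. This is a sketch only: the helper name Capture and the patterns are assumptions based on the sample report shown earlier, not part of the code above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexFieldSketch
    ' Returns the first capture group of the pattern, or the fallback when nothing matches.
    Function Capture(page As String, pattern As String, fallback As String) As String
        Dim m As Match = Regex.Match(page, pattern)
        Return If(m.Success, m.Groups(1).Value, fallback)
    End Function

    Sub Main()
        ' A shortened report built from lines of the sample data
        Dim page As String =
            "RF_TP067284 TRANSUNION CREDIT REPORT" & Environment.NewLine &
            "***RECOVERY MODEL 1.0 SCORE +519 : ***" & Environment.NewLine &
            "PR=0 COL=5 NEG=13 HSTNEG=0 TRD=27 RVL=11 INST=16 MTG=0 OPN=0 INQ=9" & Environment.NewLine &
            "END OF TRANSUNION REPORT"

        Console.WriteLine(Capture(page, "(TP[0-9]{6})", ""))                          ' TP067284
        Console.WriteLine(Capture(page, "RECOVERY MODEL 1\.0 SCORE\s*\+?(\d+)", " ")) ' 519
        Console.WriteLine(Capture(page, "\bPR=(\d+)", "0"))                           ' 0
        Console.WriteLine(Capture(page, "\bMTG=(\d+)", "0"))                          ' 0
        Console.WriteLine(Capture(page, "\bNEG=(\d+)", "0"))                          ' 13
    End Sub
End Module
```

The \b word boundary keeps "NEG=" from matching inside "HSTNEG=", and the fallback argument plays the role of the Else branches in CheckAndProcess.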