C#: How can I efficiently extract meaningful data from a PDF with a set structure?
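Tags: c#, regex, winforms, pdf, itextsharp

I have been developing an application to help manage a department of a health management organization. This is my first commercial software, and I am excited to be wrapping it up this week if I can solve this problem.

The problem is this. The MIS department receives a PDF four times a year. The PDF contains two pieces of information:

a) a list of all the hospitals registered under the organization
b) a list of the enrollees registered at each hospital

My task is to write a program that retrieves all the hospitals from the PDF and registers them in the application's database, then retrieves all the enrollees and registers them in the database under their respective hospitals (this is managed with a foreign key relationship in the database).

I wrote a solution using Regex that registers all the hospitals. Apart from some latency while parsing the PDF (which is 4,000 pages long), it works very well.

The problem is that my solution for registering the enrollees is not as effective as it should be: roughly 2 out of every 10 enrollees are not registered because of flaws in my code.

When I moved the already partly working solution to the client's server, where it will eventually reside, I got an error that says "Source not found". But when I run it in debug mode to check what the problem might be, it extracts the enrollee details as expected. So I am confused.

I would be very grateful for help with either of the following: a) the "Source not found" error, or b) why my code works on my development machine but not on the server.

I am including my code below. I would also include a snapshot of the PDF, but I doubt attachments are supported.

Thanks.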
private void extractEnrolleesFromPDF(string enrolleeExtraction, string hospital)
{
int start;
int end;
string substring;
try
{
MatchCollection policyNumbers = Regex.Matches(enrolleeExtraction, @"(\*)(\d{8})(\*)");
foreach (var policyNumber in policyNumbers)
{
Match match = Regex.Match(enrolleeExtraction, "\\" + policyNumber.ToString());
if (match.Success)
{
//Store the first occurrence of the enrollee's policy number
start = match.Index;
Match match2 = Regex.Match(enrolleeExtraction.Substring(start + 10), @"(\*)");
if (match2.Success)
{
end = match2.Index + 9;
substring = enrolleeExtraction.Substring(start, end);
enrolleePolicyNumber.Add(substring);
}
}
}
//Extract enrollee data and insert into the database
ArrayList individualEnrolees = new ArrayList();
int numberOfEnrollees = enrolleePolicyNumber.Count;
bool principal = false;
string fName;
string lName;
DateTime dob;
string sex;
string hospitalCode = hospital.Substring(1, 7);
for (int i = 0; i < numberOfEnrollees; i++)
{
string enrolleePolNumber;
Match policyNumber = Regex.Match(enrolleePolicyNumber[i].ToString(), @"((\*)(\d{8})(\*))");
if (policyNumber.Success)
{
enrolleePolNumber = policyNumber.Value;
}
MatchCollection enrolleeRecords = Regex.Matches(enrolleePolicyNumber[i].ToString(), @"(\d{1})(\s)(\D*)(\d{2})/(\d{2})/(\d{4})");
//Empty the array list each time to avoid going over the same records over and over again
individualEnrolees.Clear();
foreach (var record in enrolleeRecords)
{
individualEnrolees.Add(record);
}
//The way our search works at the moment is that it uses the pattern *--------* at the beginning and end to
//mark where an enrollee's records begin and end. The problem now is that the last record does not have
//that pattern at the end. So we need to find a way to retrieve the last record and add it to the collection we parse
//for the enrollee data.
try
{
Match lastPolicyNumberInHospital = Regex.Match(enrolleeExtraction, @"(\*)(\d{8})(\*)", RegexOptions.RightToLeft);
string lastRecord = enrolleeExtraction.Substring(lastPolicyNumberInHospital.Index);
enrolleePolicyNumber.Add(lastRecord);
}
catch (Exception ex)
{
MessageBox.Show("Failed to extract last record: " + ex.Message);
}
foreach (var record in individualEnrolees)
{
string princ;
string[] splitEnrolleeData = record.ToString().Split(' ');
//splitSectionCount counts how many sections our split enrollee data has
int splitSectionCount = splitEnrolleeData.Count();
//if we have five sections then we expect the Principal or Spouse record to be
//on index 1
if (splitSectionCount == 5)
{
princ = splitEnrolleeData[1].ToString();
if (princ == "Principal")
{
principal = true;
}
else
{
principal = false;
}
}
//if we have four sections then we expect the Principal or Spouse record to be
//on index 0,
//i.e. merged with the serial number, so we inspect index 0
//(the check below looks for the character "0")
else if (splitSectionCount == 4)
{
if (splitEnrolleeData[0].ToString().Contains("0"))
{
principal = true;
}
else if (!splitEnrolleeData[0].ToString().Contains("0"))
{
principal = false;
}
}
//TO-DO: Eliminate this comment block if the else-if above works properly
//princ = splitEnrolleeData[1].ToString();
//if (princ == "Principal")
//{
// principal = true;
//}
//else
//{
// principal = false;
//}
enrolleePolNumber = policyNumber.Value.Substring(1, policyNumber.Value.Length - 2);
//if we have 5 sections as expected carry on and register the enrollee as usual
//if not, if we have 4 do something else
//this is because some enrollees in the NHIS PDF aren't split properly, returning
//4 items instead of 5
if (splitSectionCount == 5)
{
lName = splitEnrolleeData[2].ToString();
fName = splitEnrolleeData[3].ToString();
dob = Convert.ToDateTime(splitEnrolleeData[4].ToString());
hosp = getHospitalID(hospitalCode);
if (principal == true)
{
if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
{
registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
}
}
else if (principal == false)
{
if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
{
registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());
}
}
}
else if (splitSectionCount == 4)
{
lName = splitEnrolleeData[1].ToString();
fName = splitEnrolleeData[2].ToString();
dob = Convert.ToDateTime(splitEnrolleeData[3].ToString());
hosp = getHospitalID(hospitalCode);
if (principal == true)
{
if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
{
registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
}
}
else if (principal == false)
{
if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
{
registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());
//else if (!parentExists(enrolleePolNumber))
//{
//}
}
}
}
}
}
}
catch (Exception ex)
{
MetroFramework.MetroMessageBox.Show(this, "Error retrieving substring: " + ex.Message);
}
}
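
For comparison, here is a minimal sketch of one way the block splitting described in the comments above could be done in a single pass, so that the last record (which has no trailing `*--------*` marker) is picked up without special handling. The class and method names (`EnrolleeParser`, `SplitByPolicyNumber`, `FindRecords`, `TryParseDob`) are hypothetical and not part of the original code. The `dd/MM/yyyy` date format is an assumption about the PDF and may need to be swapped. The sketch parses dates with an explicit format because `Convert.ToDateTime(string)` uses the current culture, which can differ between a development machine and a server; whether that is related to the failure on the server is not confirmed.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not the original author's code.
public static class EnrolleeParser
{
    // Policy-number marker as used in the question, e.g. *12345678*
    private static readonly Regex PolicyMarker = new Regex(@"\*(\d{8})\*");

    // One serial digit, optional whitespace, non-digit text, then a date.
    // Optional whitespace also covers records where the serial number is
    // merged with the following word.
    private static readonly Regex Record = new Regex(
        @"(?<serial>\d)\s*(?<text>\D+?)\s*(?<dob>\d{2}/\d{2}/\d{4})");

    // Returns one (policy number, block text) pair per marker. Each block
    // runs to the next marker, or to the end of the text for the last one.
    public static List<KeyValuePair<string, string>> SplitByPolicyNumber(string text)
    {
        var blocks = new List<KeyValuePair<string, string>>();
        MatchCollection markers = PolicyMarker.Matches(text);
        for (int i = 0; i < markers.Count; i++)
        {
            int start = markers[i].Index + markers[i].Length;
            int end = (i + 1 < markers.Count) ? markers[i + 1].Index : text.Length;
            blocks.Add(new KeyValuePair<string, string>(
                markers[i].Groups[1].Value,
                text.Substring(start, end - start)));
        }
        return blocks;
    }

    // Finds the individual enrollee records inside one block.
    public static MatchCollection FindRecords(string block)
    {
        return Record.Matches(block);
    }

    // Parses a date with a fixed format, independent of the machine's
    // regional settings. ASSUMPTION: the PDF uses day/month/year.
    public static bool TryParseDob(string value, out DateTime dob)
    {
        return DateTime.TryParseExact(
            value, "dd/MM/yyyy", CultureInfo.InvariantCulture,
            DateTimeStyles.None, out dob);
    }
}
```

Each block returned by `SplitByPolicyNumber` could then be passed to `FindRecords`, with the named groups `serial`, `text` and `dob` read from each match instead of splitting on spaces and counting sections. Because the markers are taken by position rather than searched for again by value, a policy number that appears more than once still yields one block per occurrence.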