How can I extract HTML-wrapped strings from a column in a SQL table into a new table?

I have a column named BIO in a SQL Server 2005 table. The data in the BIO column is formatted like this:

<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006&ndash;&nbsp;&nbsp;</A>; Postdoctoral Fellow, Toronto Western Hosp. 2000&ndash;06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. &amp; Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>

My current attempt returns results like this:

CONTACT_ID  SN  GN  HO  OC  PO  DB  PA  BY  ED
3   AARON   Raymond Leonard B.Sc.   business coach, professional speaker, real estate entrepreneur  D>AARON
5   AATAMI  Pita    C.Q.    business executive; Kuujjuaq
7   ABBOTT  Anthony C.  P.C.    lawyer  Montreal
8   ABBOTT  Elizabeth   M.A.    historian   Ottawa
9   ABBOTT  (Caroline) Louise   D>ABBOTT    writer, photographer, filmmaker Montreal
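
I could go ahead and manually add a substring expression for each differently named anchor, but the problem is that I don't know all the "names" used in the anchors. The table has more than 22,000 records, and I would have to comb through them to make sure I capture every name. Also, not every BIO has every anchor: if you look at the result for "ABBOTT (Caroline) Louise", she has no "HO" anchor, so the query returns the incorrect data "D>ABBOTT". And although I haven't seen it yet in the limited results I've pulled, some records have multiple anchors with the same name (for example, two "HO" anchors), which I think will cause problems.

The last problem is that not all anchor names are 2 letters long, so the offset of 11 that I use with CHARINDEX is wrong for those anchor names.

Is there a better way to do this? Any help would be greatly appreciated.

Update: I added CASE statements so that incorrect data is discarded when the anchor name does not exist in the current record.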

SELECT  CONTACT_ID
    ,'SN' = 
        CASE
            WHEN CHARINDEX('<A name=SN>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=SN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=SN>', [BIO])) - CHARINDEX('<A name=SN>', [BIO])-11)))
        END     
    ,'GN' = 
        CASE
            WHEN CHARINDEX('<A name=GN>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=GN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=GN>', [BIO])) - CHARINDEX('<A name=GN>', [BIO])-11)))
        END
    ,'HO' = 
        CASE
            WHEN CHARINDEX('<A name=HO>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=HO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=HO>', [BIO])) - CHARINDEX('<A name=HO>', [BIO])-11)))
        END
    ,'OC' = 
        CASE
            WHEN CHARINDEX('<A name=OC>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=OC>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=OC>', [BIO])) - CHARINDEX('<A name=OC>', [BIO])-11)))
        END
    ,'PO' = 
        CASE
            WHEN CHARINDEX('<A name=PO>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PO>', [BIO])) - CHARINDEX('<A name=PO>', [BIO])-11)))
        END
    ,'BD' = 
        CASE
            WHEN CHARINDEX('<A name=BD>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BD>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BD>', [BIO])) - CHARINDEX('<A name=BD>', [BIO])-11)))
        END
    ,'PA' = 
        CASE
            WHEN CHARINDEX('<A name=PA>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PA>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PA>', [BIO])) - CHARINDEX('<A name=PA>', [BIO])-11)))
        END
    ,'BY' = 
        CASE
            WHEN CHARINDEX('<A name=BY>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BY>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BY>', [BIO])) - CHARINDEX('<A name=BY>', [BIO])-11)))
        END
    ,'ED' = 
        CASE
            WHEN CHARINDEX('<A name=ED>', [BIO]) = 0 THEN NULL
            ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=ED>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=ED>', [BIO])) - CHARINDEX('<A name=ED>', [BIO])-11)))
        END
--INTO [cww].[dbo].[BioDetails]
FROM [cww].[dbo].[Contacts]
ORDER BY CONTACT_ID

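I would create a table with the contact id, anchor type, and value, parse the data and add records to it, and then use a crosstab to spit it out. That makes it easy to find all the anchor types, allows multiple or missing anchor types, and so on.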
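
To make the "crosstab" idea concrete, here is a minimal Java sketch. It assumes the BIO data has already been parsed into long-format rows of contact id, anchor type, and value. The class name AnchorCrosstab, its methods, and the sample rows (including the contact ids) are made up for illustration, and joining repeated anchor types with "; " is just one possible choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of the "long table, then crosstab" idea.
final class AnchorCrosstab {

    /**
     * Pivots long-format rows of {contactId, anchorType, value} into one map per
     * contact, keyed by anchor type. Repeated anchor types are joined with "; ".
     */
    static Map<String, Map<String, String>> pivot(List<String[]> longRows) {
        Map<String, Map<String, String>> byContact = new LinkedHashMap<>();
        for (String[] row : longRows) {
            byContact.computeIfAbsent(row[0], id -> new LinkedHashMap<>())
                     .merge(row[1], row[2], (first, next) -> first + "; " + next);
        }
        return byContact;
    }

    /** Collects every anchor type seen, sorted, to serve as the crosstab's columns. */
    static List<String> columns(List<String[]> longRows) {
        TreeSet<String> names = new TreeSet<>();
        for (String[] row : longRows) {
            names.add(row[1]);
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        // Made-up sample rows in (contact id, anchor type, value) form.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"1", "SN", "AARTS"});
        rows.add(new String[] {"1", "HO", "B.Sc."});
        rows.add(new String[] {"1", "HO", "M.Sc."});
        rows.add(new String[] {"2", "SN", "ABBOTT"});
        rows.add(new String[] {"2", "OC", "writer, photographer, filmmaker"});

        List<String> cols = columns(rows);
        System.out.println("CONTACT_ID\t" + String.join("\t", cols));
        for (Map.Entry<String, Map<String, String>> contact : pivot(rows).entrySet()) {
            StringBuilder line = new StringBuilder(contact.getKey());
            for (String col : cols) {
                // A contact with no anchor of this type gets NULL instead of stray data.
                String cell = contact.getValue().get(col);
                line.append('\t').append(cell == null ? "NULL" : cell);
            }
            System.out.println(line);
        }
    }
}
```

The long format is what lets you discover the anchor types instead of hard-coding them. Inside SQL Server 2005 itself, the same reshaping can be done with the PIVOT operator or with MAX(CASE ...) expressions grouped by contact id, once the list of anchor types is known.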


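You may need to make multiple passes over the data, or consider using a tool that isn't primarily set-based the way SQL is, such as C#.

I don't know how to do this in pure T-SQL. If you can retrieve the CONTACT_ID and BIO columns in an application, you can iterate over the result set, parse the BIO data as XML, and then use XPath to get the name attribute values and the anchor bodies, building a map of the data to insert into the new table. Since you don't know all the different names that might exist, you may need to recreate the table on each run, so store the names you find in a set and use that set to generate the CREATE TABLE statement after iterating over all the rows.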

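The DB code is pure fantasy, but here is a snippet showing how you might do it with the XOM XML library. I'm not sure it will work, because your attribute values aren't quoted, but you may be able to find a less picky parser, and I'm sure you could do something similar in .NET.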

import java.util.*;
import nu.xom.*; // XOM: Builder, Document, Nodes, Element

// db, ResultSet and ResultRow are imaginary placeholders for your own data-access code
ResultSet results = db.query("select CONTACT_ID, BIO from [cww].[dbo].[Contacts]");

// Set is an interface; LinkedHashSet keeps columns in the order they are first seen
Set<String> newTableColumns = new LinkedHashSet<String>();
newTableColumns.add("CONTACT_ID");

List<Map<String,String>> dataToInsert = new ArrayList<Map<String,String>>();
Builder parser = new Builder();

for (ResultRow resultRow : results) { // iterate over the result set

    Map<String,String> rowDataToInsert = new HashMap<String,String>();
    rowDataToInsert.put("CONTACT_ID", resultRow.get("CONTACT_ID"));

    // parse the BIO data as an XML document (fails if the markup is not well-formed XML)
    Document doc = parser.build(resultRow.get("BIO"), "");

    // query the document using XPath; XML names are case-sensitive, so match <A>, not <a>
    Nodes namedAnchors = doc.query("//A[@name]");

    for (int nItr = 0; nItr < namedAnchors.size(); nItr++) {

        Element anchor = (Element) namedAnchors.get(nItr);
        String name = anchor.getAttributeValue("name");
        String anchorBody = anchor.getValue();

        newTableColumns.add(name);
        // note: a repeated name (e.g. two HO anchors) overwrites the earlier value here
        rowDataToInsert.put(name, anchorBody);

    }

    // we've stored all the anchor data from this row, so put it away
    dataToInsert.add(rowDataToInsert);
}

// create your table
db.createTable("NEW_TABLE_NAME", newTableColumns);

// insert into your new table
db.batchInsert("NEW_TABLE_NAME", dataToInsert);
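
Since the unquoted attribute values may trip up a strict XML parser, one alternative is a small regular-expression extractor. The following is only a sketch under assumptions: the class AnchorExtractor is hypothetical (it is not part of XOM), it assumes named anchors are never nested inside each other, and it leaves entities such as &ndash; undecoded. It does cope with anchor names of any length and with repeated names, because it returns one pair per anchor rather than one value per name.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XOM or of the snippet above.
final class AnchorExtractor {

    // Matches <A name=XX>body</A>; the name may be quoted or unquoted and of any length.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+name\\s*=\\s*[\"']?([^\"'\\s>]+)[\"']?\\s*>(.*?)</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any tag left inside an anchor body, such as <EM>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns one (name, text) pair per named anchor, in document order. */
    static List<Map.Entry<String, String>> extract(String html) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String text = TAG.matcher(m.group(2)).replaceAll("").trim();
            pairs.add(new SimpleEntry<>(m.group(1), text));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Shortened from the sample BIO in the question.
        String bio = "<BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG>"
                + "<A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>; "
                + "Office: <A name=OF1_CT>Scarborough</A></BODY>";
        for (Map.Entry<String, String> pair : extract(bio)) {
            System.out.println(pair.getKey() + " = " + pair.getValue());
        }
    }
}
```

Each pair, together with the CONTACT_ID, maps directly onto a row of a (contact id, anchor type, value) table, so repeated anchors such as the two HO entries are preserved instead of overwritten.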

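Ouch. I think this has become the best case I've seen for not storing HTML in a database and then expecting to get the data back out. As mentioned, this table has plenty of other problems too. I'd suggest creating a set of tables to permanently hold the data you want to extract, writing manual queries as needed, and yes, you will have to validate as you go. When you're all done, drop that thing.

Yes, that's the downside of working with someone else's data. The whole point of this exercise is to convert the BIO column into relational data, with a reusable script and a lookup table for each type of anchor, because we get a new copy of this data every year and it needs to be made relational so it can be searched better on the website.

Don't the SQL Server guys have an XML field type? Use that to store XML.

Interesting, I wasn't aware of that, since I'm not really a "SQL Server" guy. I'll look into it and see if it helps me. Thanks very much. Also, this is just how I receive the data; it wasn't my doing.

Thanks for the input, especially since C# may be way over my head. Is there any chance you could provide a code sample for parsing the anchor names into anchor columns, and explain what you mean by a "crosstab"? I'm not really a SQL expert. I'm more of an HTML, CSS, JS, PHP, .NET, Ruby person, but sometimes some SQL comes up at work.

Thanks very much. I think I'm going to try something like this. I'll let you know how it goes.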