Java: How do I get HTML data from a table?

java html regex

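I'm trying to write a web scraper that takes the tables from a website and converts them into a ".csv" file.

I use Jsoup to pull the page down into a Document and then read the data from doc.html(), as shown below. At the moment the reader finds 18 tables on my test site, but none of the table data (<td>) tags.

Any idea what is going wrong?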

ArrayList<Data_Log> container = new ArrayList<Data_Log>();
ArrayList<ListData_Log> containerList = new ArrayList<ListData_Log>();
ArrayList<String> tableNames = new ArrayList<String>();// Stores native names of tables
ArrayList<Double> meanStorage = new ArrayList<Double>();// Stores data mean per table
ArrayList<String> processlog = new ArrayList<String>();// Keeps a record of all actions taken per iteration
ArrayList<Double> modeStorage = new ArrayList<Double>();
Calendar cal;

private static final long serialVersionUID = -8174362940798098542L;

public void takeData() throws IOException {
    if (testModeActive == true) {
        System.out.println("Initializing Data Cruncher with developer logs");
        System.out.println("Taking data from: " + dataSource);
    }
    int irow = 0;
    int icolumn = 0;
    int iTable = 0;
    // int iListno = 0;
    // int iListLevel;

    String u = null;
    boolean recording = false;
    boolean duplicate = false;
    Document doc = Jsoup.connect(dataSource).get();
    Webtitle = doc.title();
    Pattern tb = Pattern.compile("<table");
    Matcher tB = tb.matcher(doc.html());
    Pattern ttl = Pattern.compile("<title>(//s+)</title>");
    Matcher ttl2= ttl.matcher(doc.html());
    Pattern tr = Pattern.compile("<tr");
    Matcher tR = tr.matcher(doc.html());
    Pattern td = Pattern.compile("<td(//s+)</td>");
    Matcher tD = td.matcher(doc.html());
    Pattern tdc = Pattern.compile("<td class=(//s+)>(//s+)</td>");
    Matcher tDC = tdc.matcher(doc.html());
    Pattern tb2 = Pattern.compile("</table>");
    Matcher tB2 = tb2.matcher(doc.html());
    Pattern th = Pattern.compile("<th");
    Matcher tH = th.matcher(doc.html());
    while (tB.find()) {
        iTable++;

        while (ttl2.find()) {
            tableNames.add(ttl2.group(1));
        }
        while (tR.find()) {

            while (tD.find()||tH.find()) {
                u = tD.group(1);
                Data_Log v = new Data_Log();
                v.setTable(iTable);
                v.dataSort(u);
                v.setRow(irow);
                v.setColumn(icolumn);
                container.add(v);
                icolumn++;
            }
            while(tDC.find()) {
                u = tDC.group(2);
                Data_Log v = new Data_Log();
                v.setTable(iTable);
                v.dataSort(u);
                v.setRow(irow);
                v.setColumn(icolumn);
                container.add(v);
                icolumn++;
            }
            irow++;
        }

        if (tB2.find()) {
            irow = 0;
            icolumn = 0;
        }
    }

Since you are already using jsoup, use it:
var url = "<your url>";
var doc = Jsoup.connect(url).get();
var tables = doc.body().getElementsByTag("table");
tables.forEach(table -> {
    System.out.println(table.id());
    System.out.println(table.className());  
    System.out.println(table.getElementsByTag("td"));
});
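
Once the cell text is out of each table, the remaining step in the question is writing it as CSV. The following is a minimal sketch of that step only, using the standard library; the class name CsvSketch, its method names, and the sample rows are hypothetical and not part of the original answer.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CsvSketch {

    // Quote a field only when it contains a comma, a quote, or a line break;
    // embedded quotes are doubled, following the usual CSV convention.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join the cells of one table row into a single CSV line.
    static String toCsvLine(List<String> cells) {
        return cells.stream().map(CsvSketch::escape).collect(Collectors.joining(","));
    }

    // Join all rows of one table into CSV text, one line per row.
    static String toCsv(List<List<String>> rows) {
        return rows.stream().map(CsvSketch::toCsvLine).collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Hypothetical cell texts, standing in for what the scraper would collect.
        List<List<String>> rows = List.of(
                List.of("Name", "Value"),
                List.of("Smith, John", "say \"hi\""));
        System.out.println(toCsv(rows));
        // prints:
        // Name,Value
        // "Smith, John","say ""hi"""
    }
}
```

To feed it from jsoup, one option is to loop over table.select("tr"), then over row.select("th, td") inside each row, and collect cell.text() for every cell into a list per row.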

As for your attempt to parse HTML with regular expressions, here is some recommended reading
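
On the patterns themselves, one concrete reason the code in the question finds tables but no cells is that "//s+" is not whitespace: it matches two literal slashes followed by one or more 's' characters. Even the intended "\\s+" would only match whitespace, not cell contents. The sketch below illustrates this on a small HTML fragment; the class name RegexSketch, the helper countMatches, and the fragment are hypothetical, and the last pattern is shown only for contrast, since it still breaks on nested or irregular markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {

    // Count how many times the regex matches anywhere in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical fragment standing in for doc.html().
        String html = "<table><tr><td>1</td><td class=\"x\">2</td></tr></table>";

        // The table pattern from the question does match.
        System.out.println(countMatches("<table", html));              // 1

        // The cell pattern from the question: "//s+" means "//" plus 's' characters.
        System.out.println(countMatches("<td(//s+)</td>", html));      // 0

        // With the escape fixed, "\\s+" is whitespace only, so cells still do not match.
        System.out.println(countMatches("<td(\\s+)</td>", html));      // 0

        // A pattern that actually spans the cell contents (still fragile on real pages).
        System.out.println(countMatches("<td[^>]*>(.*?)</td>", html)); // 2
    }
}
```

Two further things to watch for in the original loop: each Matcher scans the whole document independently, so nesting the loops does not restrict cells to the current row or table, and calling tD.group(1) after tD.find() returned false (when only tH.find() succeeded) throws an IllegalStateException.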