JSoup: selecting from HTML in Java on Unix

java unix select syntax

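I have a program that extracts certain elements (the article authors' names) from many articles on the PubMed site. Although the program works correctly on my PC (Windows), it returns an empty list when I try to run it on Unix. I suspect this is because the syntax should be somewhat different on a Unix system. The problem is that the JSoup documentation doesn't mention anything about it. Does anyone know anything about this? My code looks like this: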

Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get();
System.out.println("connected");
Elements authors = doc.select("div.auths >*");
System.out.println("number of elements is " + authors.size());

The last System.out.println always reports a size of 0, so the program can't do anything further.

Thanks in advance.

Full example:

protected static void searchLink(HashMap<String, HashSet<String>> authorsMap,
        HashMap<String, HashSet<String>> reverseAuthorsMap,
        String fileLine) throws IOException, ParseException, InterruptedException
{
    JSONParser parser = new JSONParser();
    JSONObject jsonObj = (JSONObject) parser.parse(fileLine.substring(0, fileLine.length() - 1));
    String pmidString = (String) jsonObj.get("pmid");
    System.out.println(pmidString);

    Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get();
    System.out.println("connected");
    Elements authors = doc.select("div.auths >*");
    System.out.println("found the element");

    HashSet<String> authorsList = new HashSet<>();
    System.out.println("authors list hashSet created");
    System.out.println("number of elements is " + authors.size());
    for (int i = 0; i < authors.size(); i++)
    {
        // add the current name to the names list
        authorsList.add(authors.get(i).text());

        // pmidList variable
        HashSet<String> pmidList;
        System.out.println("stage 1");
        // if the author name is new, then create the list, add the current pmid and put it in the map
        if (!authorsMap.containsKey(authors.get(i).text()))
        {
            pmidList = new HashSet<>();
            pmidList.add(pmidString);
            System.out.println("made it to searchLink");
            authorsMap.put(authors.get(i).text(), pmidList);
        }
        // if the author name has been found before, get the list of articles and add the current pmid
        else
        {
            System.out.println("Author exists in map");
            pmidList = authorsMap.get(authors.get(i).text());
            pmidList.add(pmidString);

            authorsMap.put(authors.get(i).text(), pmidList);
            //authorsMap.put((String) authorName, null);
        }

        // finally, add the pmid-authorsList to the map
        reverseAuthorsMap.put(pmidString, authorsList);
        System.out.println("reverseauthors populated");
    }
}

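I have a thread pool, and each thread uses this method to populate two maps. The fileLine parameter is a line that I parse as JSON, keeping its "pmid" field. I use this string to access the article's URL and parse the author names out of the HTML. The rest should work (it does on my PC), but because authors.size() is always 0, the for loop directly below the "number of elements" System.out never executes.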
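
Separately from the empty selection, the thread-pool setup described above raises one point worth checking: plain HashMap and HashSet are not safe when several threads write to them at once. The following is a minimal sketch of the same two-map bookkeeping, assuming the maps can be swapped for ConcurrentHashMap. The class name AuthorIndex and the method record are hypothetical, the second PMID is made up, and the author names are passed in as plain strings so the sketch runs with only the JDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class AuthorIndex {
    // author name -> PMIDs of the articles that author appears on
    final Map<String, Set<String>> authorsMap = new ConcurrentHashMap<>();
    // PMID -> names of the authors of that article
    final Map<String, Set<String>> reverseAuthorsMap = new ConcurrentHashMap<>();

    // Records one article. computeIfAbsent creates the set atomically on first
    // use, so concurrent callers never overwrite each other's sets.
    // Note: unlike the loop in the question, an article with no authors
    // still gets an (empty) entry in reverseAuthorsMap here.
    void record(String pmid, List<String> authorNames) {
        Set<String> authorsOfArticle =
                reverseAuthorsMap.computeIfAbsent(pmid, k -> ConcurrentHashMap.newKeySet());
        for (String name : authorNames) {
            authorsOfArticle.add(name);
            authorsMap.computeIfAbsent(name, k -> ConcurrentHashMap.newKeySet()).add(pmid);
        }
    }

    public static void main(String[] args) {
        AuthorIndex index = new AuthorIndex();
        index.record("24312906", Arrays.asList("Liu W", "Chen D"));
        index.record("00000000", Arrays.asList("Liu W")); // made-up PMID, for illustration only
        System.out.println("PMIDs for Liu W: " + index.authorsMap.get("Liu W"));
        System.out.println("Authors of 24312906: " + index.reverseAuthorsMap.get("24312906"));
    }
}
```

In the original method, the names would come from authors.get(i).text() instead of a prepared list. This does not address why the selection is empty; it only keeps the maps consistent once authors are found.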
I tried an approach exactly like the one you are attempting:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class Test {
  public static void main (String[] args) throws IOException {
    String docId = "24312906";
    if (args.length > 0) {
      docId = args[0];
    }

    String url = "http://www.ncbi.nlm.nih.gov/pubmed/" + docId;
    Document doc = Jsoup.connect(url).timeout(60000).userAgent("Mozilla/25.0").get();
    Elements authors = doc.select("div.auths >*");

    System.out.println("os.name=" + System.getProperty("os.name"));
    System.out.println("os.arch=" + System.getProperty("os.arch"));

    // System.out.println("doc=" + doc);
    System.out.println("authors=" + authors);
    System.out.println("authors.length=" + authors.size());

    for (Element a : authors) {
      System.out.println("  author: " + a);
    }
  }
}

My operating system is Linux:
# uname -a
Linux graphene 3.11.0-13-generic #20-Ubuntu SMP Wed Oct 23 07:38:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 13.10
Release:        13.10
Codename:       saucy

Running the program produces:
os.name=Linux
os.arch=amd64
authors=<a href="/pubmed?term=Liu%20W%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Liu W</a>
<a href="/pubmed?term=Chen%20D%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Chen D</a>
authors.length=2
  author: <a href="/pubmed?term=Liu%20W%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Liu W</a>
  author: <a href="/pubmed?term=Chen%20D%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Chen D</a>

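Since your code throws no exceptions, I suspect you are getting a document back, just not the one your code expects. Printing the document so you can see what you actually received may help as well.

Can you provide a complete example? By complete I don't mean including all the processing you want to do, just an example that is complete in itself.

I understand what you mean, but this is part of a large body of code that I can't paste here. I'm sure the problem lies in the syntax of doc.select, and nothing I could provide would help you with it, because the code works unless you run it on Unix. Thanks for your attention.

The first code snippet is a good start, but it needs a value for pmidString (and should be in a class).

I did as you suggested, and apparently I am not getting the right HTML. The HTML that comes back explains that the site is blocking me, God knows why. At least now I know where the problem really is. Thanks a lot for your help!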
System.out.println("url='" + "http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString+ "'");
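
The diagnosis reached in the comments (the server sent back a page other than the expected one) can be sketched as a small check on the raw response. This is only an illustration under stated assumptions: the class ResponseCheck, the method describe, and the "Access denied" page are all hypothetical, and the check deliberately avoids JSoup so that it runs with only the JDK. In the real program, printing doc or doc.title() and testing doc.select("div.auths").isEmpty() would serve the same purpose.

```java
class ResponseCheck {
    // Summarizes what the server actually sent: the HTTP status, whether a
    // substring that the expected page must contain is present, and a short
    // preview of the body with whitespace collapsed.
    static String describe(int status, String body, String marker) {
        boolean found = body.contains(marker);
        String preview = body.length() <= 80 ? body : body.substring(0, 80) + "...";
        return "status=" + status
                + " markerFound=" + found
                + " preview=" + preview.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String marker = "class=\"auths\"";
        // Shape of the page the selector expects (trimmed down for illustration).
        String expected = "<html><body><div class=\"auths\"><a>Liu W</a></div></body></html>";
        // Made-up stand-in for a block page; the real wording is not known.
        String blocked = "<html><body><h1>Access denied</h1></body></html>";

        System.out.println(describe(200, expected, marker));
        System.out.println(describe(403, blocked, marker));
    }
}
```

When markerFound is false, the selector is not at fault: there is simply no div.auths in the page that came back, and the preview usually shows why.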