How can I programmatically perform a search without using an API?

Tags: search, screen-scraping

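I want to create a program that enters a string into a text box on a site such as Google (without using their public API), submits the form, and scrapes the results. Is this possible? I assume that getting the results will require HTML scraping, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out the query strings/parameters?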


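Most of the time, you can simply send a plain HTTP POST request.

I'd suggest experimenting a bit to get an understanding of how the web works.

Nearly every programming language and framework has a way to send raw requests.


And you can always program against the Internet Explorer ActiveX control. I believe many programming languages support it.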
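As a rough illustration of the "just send an HTTP POST" idea, here is a minimal sketch using the java.net.http API that ships with Java 11 and later. The class name FormPost, the URL, and the field names are placeholders I made up for the example, not anything a particular site requires. The request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormPost {
  /** Encodes name/value pairs as application/x-www-form-urlencoded. */
  public static String encodeForm(Map<String, String> fields) {
    StringJoiner body = new StringJoiner("&");
    for (Map.Entry<String, String> field : fields.entrySet()) {
      body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
          + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
    }
    return body.toString();
  }

  /** Builds (but does not send) a POST request carrying the encoded form. */
  public static HttpRequest buildPost(String url, Map<String, String> fields) {
    return HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "application/x-www-form-urlencoded")
        .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
        .build();
  }

  public static void main(String[] args) {
    // Hypothetical form fields and URL, for illustration only.
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("q", "screen scraping");
    fields.put("hl", "en");

    HttpRequest request = buildPost("https://example.com/search", fields);
    System.out.println(request.method() + " " + request.uri());
    System.out.println(encodeForm(fields));
  }
}
```

To actually send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and read body() from the response.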

Well, here's the HTML from the Google page:

<form action="/search" name=f><table cellpadding=0 cellspacing=0><tr valign=top>
<td width=25%>&nbsp;</td><td align=center nowrap>
<input name=hl type=hidden value=en>
<input type=hidden name=ie value="ISO-8859-1">
<input autocomplete="off" maxlength=2048 name=q size=55 title="Google Search" value="">
<br>
<input name=btnG type=submit value="Google Search">
<input name=btnI type=submit value="I'm Feeling Lucky">
</td><td nowrap width=25% align=left>
<font size=-2>&nbsp;&nbsp;<a href=/advanced_search?hl=en>
Advanced Search</a><br>&nbsp;&nbsp;
<a href=/preferences?hl=en>Preferences</a><br>&nbsp;&nbsp;
<a href=/language_tools?hl=en>Language Tools</a></font></td></tr></table>
</form>

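I believe doing this would violate the terms of use (consult a lawyer about that: programmers are not good at giving legal advice!), but technically you could search for foobar by visiting the corresponding search URL and, as you say, scraping the resulting HTML. You will probably also need to fake the User-Agent HTTP header, and maybe some other headers.


Maybe there are search engines whose terms of use do not forbid this; you and your lawyer would be well advised to look around and see whether that is indeed the case.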
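Going by the form shown above: it has no method attribute, so a browser submits it with GET, appending the hl, ie and q fields to the /search action as a query string. A minimal sketch of building that request by hand follows. The host search.example.com and the class name SearchUrl are placeholders of mine (the form only gives a relative action), and whether a real site accepts such a request is a separate question.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class SearchUrl {
  // Hypothetical host; the form above only specifies the relative action "/search".
  static final String BASE = "https://search.example.com";

  /** Builds the GET URL using the hidden fields (hl, ie) and the text field (q). */
  public static String build(String query) {
    return BASE + "/search?hl=en&ie=ISO-8859-1&q="
        + URLEncoder.encode(query, StandardCharsets.ISO_8859_1);
  }

  /** Builds (but does not send) a GET request with a caller-supplied User-Agent. */
  public static HttpRequest request(String query, String userAgent) {
    return HttpRequest.newBuilder(URI.create(build(query)))
        .header("User-Agent", userAgent)
        .GET()
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request = request("foobar", "Mozilla/5.0 (X11; Linux x86_64)");
    System.out.println(request.uri());
    System.out.println(request.headers().firstValue("User-Agent").orElse(""));
  }
}
```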

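If you download Cygwin and add Cygwin\bin to your path, you can use curl to retrieve a page and grep/sed/whatever to parse the results. Why fill out the form when, with Google, you can just use the query string parameters anyway? With curl you can also post data, set header information, and so on. I use it to call web services from the command line.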
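For readers who would rather stay in one language, the "fetch, then grep" idea can be approximated in Java. The sketch below covers only the filtering half: it pulls href values out of anchor tags in a string of HTML that has already been downloaded. LinkGrep is a name I made up, and a regular expression like this is a quick-and-dirty filter in the same spirit as grep, not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkGrep {
  // Matches href="...", href='...' or an unquoted href value inside an <a> tag.
  private static final Pattern HREF = Pattern.compile(
      "<a\\s[^>]*?href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
      Pattern.CASE_INSENSITIVE);

  /** Returns every href value found in anchor tags, in document order. */
  public static List<String> links(String html) {
    List<String> found = new ArrayList<>();
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      // Exactly one of the three capture groups is set per match.
      for (int g = 1; g <= 3; g++) {
        if (m.group(g) != null) {
          found.add(m.group(g));
          break;
        }
      }
    }
    return found;
  }

  public static void main(String[] args) {
    String html = "<a href=/preferences?hl=en>Preferences</a><br>"
        + "<a class=x href=\"/language_tools?hl=en\">Language Tools</a>";
    for (String link : links(html)) {
      System.out.println(link);
    }
  }
}
```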

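What I would do is create a small program that can automatically submit any form data to any place and come back with the results. This is easy to do in Java. The task goes as follows:

  • Connect to the web server.
  • Parse the page.
  • Get the first form on the page.
  • Fill in the form data.
  • Submit the form.
  • Read (and parse) the results.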
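The middle steps of that list (parse the page, get the first form, fill in the data) can be sketched without any third-party library. This is only an illustration under loud assumptions: FirstForm is a hypothetical class of mine, the regular expressions cope with simple markup only, and the network steps (connecting and submitting) are left out so the example runs offline.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstForm {
  private static final Pattern FORM = Pattern.compile(
      "<form\\b([^>]*)>(.*?)</form>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
  private static final Pattern INPUT = Pattern.compile(
      "<input\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

  public final String action;
  public final Map<String, String> fields = new LinkedHashMap<>();

  private FirstForm(String action) {
    this.action = action;
  }

  /** Reads one attribute (quoted or unquoted) out of a tag's attribute text. */
  static String attr(String attrs, String name) {
    Matcher m = Pattern.compile(
        "(?:^|\\s)" + Pattern.quote(name) + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE).matcher(attrs);
    if (!m.find()) return null;
    if (m.group(1) != null) return m.group(1);
    if (m.group(2) != null) return m.group(2);
    return m.group(3);
  }

  /** Finds the first form in the page and records its non-submit inputs. */
  public static FirstForm parse(String html) {
    Matcher form = FORM.matcher(html);
    if (!form.find()) throw new IllegalArgumentException("no form found");
    FirstForm result = new FirstForm(attr(form.group(1), "action"));
    Matcher input = INPUT.matcher(form.group(2));
    while (input.find()) {
      String name = attr(input.group(1), "name");
      String type = attr(input.group(1), "type");
      if (name == null || "submit".equalsIgnoreCase(type)) continue;
      String value = attr(input.group(1), "value");
      result.fields.put(name, value == null ? "" : value);
    }
    return result;
  }

  /** Encodes the current field values as a URL query string. */
  public String query() {
    StringJoiner joined = new StringJoiner("&");
    for (Map.Entry<String, String> e : fields.entrySet()) {
      joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
          + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
    }
    return joined.toString();
  }

  public static void main(String[] args) {
    String html = "<form action=\"/search\" name=f>"
        + "<input name=hl type=hidden value=en>"
        + "<input maxlength=2048 name=q size=55 value=\"\">"
        + "<input name=btnG type=submit value=\"Search\"></form>";
    FirstForm form = parse(html);
    form.fields.put("q", "foobar");              // fill in the form data
    System.out.println(form.action + "?" + form.query());
  }
}
```

For anything beyond a throwaway script, a proper HTML parser or a library that models forms directly is a sturdier choice than regular expressions.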
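The solution you pick will depend on a variety of factors, including:

  • Whether you need to emulate JavaScript
  • What you need to do with the data afterwards
  • Which languages you are proficient in
  • Application speed (is this for one query or 100,000?)
  • How soon the application needs to be working
  • Whether it is a one-off or will have to be maintained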
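For example, you could use an existing command-line application to submit the data for you, then process the resulting web page(s) with awk or sed.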

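Another trick for screen scraping is to download a sample HTML file and manually parse it in vi (or Vim). Save the keystrokes to a file, and whenever you run the query, apply those keystrokes to the resulting web page(s) to extract the data. This solution is not maintainable, nor is it 100% reliable (but screen scraping from a website rarely is). It works, and it is fast.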

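Example

Below is a semi-generic Java class for submitting website forms (specifically, for logging into a website), in the hope that it might be useful. Do not use it for evil.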

import java.io.FileInputStream;

import java.util.Enumeration;
import java.util.Hashtable;  
import java.util.Properties; 

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebResponse;

public class FormElements extends Properties
{                                           
  private static final String FORM_URL = "form.url";
  private static final String FORM_ACTION = "form.action";

  /** These are properly provided property parameters. */
  private static final String FORM_PARAM = "form.param.";

  /** These are property parameters that are required; must have values. */
  private static final String FORM_REQUIRED = "form.required.";            

  private Hashtable<String, String> fields = new Hashtable<String, String>( 10 );

  private WebConversation webConversation;

  public FormElements()
  {                    
  }                    

  /**
   * Retrieves the HTML page, populates the form data, then sends the
   * information to the server.                                      
   */                                                                
  public void run()                                                  
    throws Exception                                                 
  {                                                                  
    WebResponse response = receive();                                
    WebForm form = getWebForm( response );                           

    populate( form );

    form.submit();
  }               

  protected WebResponse receive()
    throws Exception             
  {                              
    WebConversation webConversation = getWebConversation();
    GetMethodWebRequest request = getGetMethodWebRequest();

    // Fake the User-Agent so the site thinks that encryption is supported.
    //
    request.setHeaderField( "User-Agent",
      "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040913" );

    return webConversation.getResponse( request );
  }                                               

  protected void populate( WebForm form )
    throws Exception                     
  {                                      
    // First set all the .param variables.
    //                                    
    setParamVariables( form );            

    // Next, set the required variables.
    //                                  
    setRequiredVariables( form );       
  }                                     

  protected void setParamVariables( WebForm form )
    throws Exception                              
  {                                               
    for( Enumeration e = propertyNames(); e.hasMoreElements(); )
    {                                                           
      String property = (String)(e.nextElement());              

      if( property.startsWith( FORM_PARAM ) )
      {                                      
        String fieldName = getProperty( property );
        String propertyName = property.substring( FORM_PARAM.length() );
        String fieldValue = getField( propertyName );                   

        // Skip blank fields (most likely, this is a blank last name, which
        // means the form wants a full name).                              
        //                                                                 
        if( "".equals( fieldName ) )                                       
          continue;                                                        

        // If this is the first name, and the last name parameter is blank,
        // then append the last name field to the first name field.        
        //                                                                 
        if( "first_name".equals( propertyName ) &&                         
            "".equals( getProperty( FORM_PARAM + "last_name" ) ) )         
          fieldValue += " " + getField( "last_name" );                     

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }                                            
    }                                              
  }                                                

  protected void setRequiredVariables( WebForm form )
    throws Exception                                 
  {                                                  
    for( Enumeration e = propertyNames(); e.hasMoreElements(); )
    {                                                           
      String property = (String)(e.nextElement());              

      if( property.startsWith( FORM_REQUIRED ) )
      {                                         
        String fieldValue = getProperty( property );
        String fieldName = property.substring( FORM_REQUIRED.length() );

        // If the field starts with a ~, then copy the field.
        //                                                   
        if( fieldValue.startsWith( "~" ) )                   
        {                                                    
          String copyProp = fieldValue.substring( 1, fieldValue.length() );
          copyProp = getProperty( copyProp );                              

          // Since the parameters have been copied into the form, we can   
          // eke out the duplicate values.                                 
          //                                                               
          fieldValue = form.getParameterValue( copyProp );                 
        }                                                                  

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }                                            
    }                                              
  }                                                

  private void showSet( String fieldName, String fieldValue )
  {                                                          
    System.out.print( "<p class='setting'>" );               
    System.out.print( fieldName );                           
    System.out.print( " = " );                               
    System.out.print( fieldValue );                          
    System.out.println( "</p>" );                            
  }                                                          

  private WebForm getWebForm( WebResponse response )
    throws Exception                                
  {                                                 
    WebForm[] forms = response.getForms();          
    String action = getProperty( FORM_ACTION );     

    // Search from the last form backwards and return the first one whose
    // action matches the configured form.action property.
    //
    for( int i = forms.length - 1; i >= 0; i-- )
      if( forms[ i ].getAction().equalsIgnoreCase( action ) )
        return forms[ i ];

    // Sadly, no form was found.
    //
    throw new Exception( "No form found with action: " + action );
  }                             

  private GetMethodWebRequest getGetMethodWebRequest()
  {
    return new GetMethodWebRequest( getProperty( FORM_URL ) );
  }

  private WebConversation getWebConversation()
  {
    if( this.webConversation == null )
      this.webConversation = new WebConversation();

    return this.webConversation;
  }

  public void setField( String field, String value )
  {
    Hashtable<String, String> fields = getFields();
    fields.put( field, value );
  }

  private String getField( String field )
  {
    Hashtable<String, String> fields = getFields();
    String result = fields.get( field );

    return result == null ? "" : result;
  }

  private Hashtable<String, String> getFields()
  {
    return this.fields;
  }

  public static void main( String args[] )
    throws Exception
  {
    FormElements formElements = new FormElements();

    formElements.setField( "first_name", args[1] );
    formElements.setField( "last_name", args[2] );
    formElements.setField( "email", args[3] );
    formElements.setField( "comments",  args[4] );

    FileInputStream fis = new FileInputStream( args[0] );
    formElements.load( fis );
    fis.close();

    formElements.run();
  }
}
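Run it along the lines of the java command shown at the end of this answer, after the sample properties file (substituting the paths to HttpUnit and to the FormElements class for $CLASSPATH).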

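Legality

Another answer mentioned that this might violate the terms of use. Check into that first, before you spend any time looking into a technical solution. Extremely good advice.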

$ cat com.mellon.properties

form.url=https://www.mellon.com/contact/index.cfm
form.action=index.cfm
form.param.first_name=name
form.param.last_name=
form.param.email=emailhome
form.param.comments=comments

# Submit Button
#form.submit=submit

# Required Fields
#
form.required.to=zzwebmaster
form.required.phone=555-555-1212
form.required.besttime=5 to 7pm
$ java -cp $CLASSPATH FormElements com.mellon.properties "John" "Doe" "John.Doe@gmail.com" "To whom it may concern  ..."
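To make the link between the properties file and the class a little more concrete: keys starting with form.param. map a logical field name to the parameter name the form expects, and keys starting with form.required. carry fixed values. The sketch below shows one way to pull such prefixed entries out of a java.util.Properties object. PrefixedProperties is a helper I made up for illustration, not part of the class above, and the properties are loaded from a string so the example runs without a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class PrefixedProperties {
  /** Returns the entries whose key starts with prefix, keyed by the remainder. */
  public static Map<String, String> withPrefix(Properties props, String prefix) {
    Map<String, String> result = new TreeMap<>();
    for (String key : props.stringPropertyNames()) {
      if (key.startsWith(prefix)) {
        result.put(key.substring(prefix.length()), props.getProperty(key));
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    // A few lines taken from the sample properties file above.
    Properties props = new Properties();
    props.load(new StringReader(
        "form.param.first_name=name\n"
        + "form.param.last_name=\n"
        + "form.required.besttime=5 to 7pm\n"));

    System.out.println(withPrefix(props, "form.param."));
    System.out.println(withPrefix(props, "form.required."));
  }
}
```

A blank value, as with form.param.last_name here, comes back as an empty string rather than null, which is what the class's blank-field check relies on.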