Python: Creating a function to extract a number of movies, and each movie's attributes, from a URL

Tags: python, web-scraping, html-parsing, imdb

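I am creating a function read_from_url(url, num_of_m=50) that extracts a given number of movies from a URL. It returns a list of dictionaries, each representing one movie. Can anyone tell me what I am doing wrong on line 67 (marked with a comment)?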


line 203, in <module>
    main()
line 198, in main
    test_read_from_url()
line 141, in test_read_from_url
    print read_from_url(url, 21)
line 67, in read_m_from_url
    tables = movie_table[0]  # create list for tables
line 958, in __getitem__
    return self.attrs[key]
KeyError: 0

Comments:

soup.find should be changed to soup.find_all if tables is meant to come from a list. Note that a list itself has no find_all attribute.

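@Arman, so for that part, should it be:

    movie_table = soup.find_all('table', attrs={'class': "results"})
    tables = movie_table[0]
    trs = tables.find_all('tr')

I'm not getting any errors in this part, so I guess I've fixed it...?

Run the edited code and report the errors here, then edit your question.
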
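The KeyError: 0 in the traceback is consistent with the point made in the comments: find() returns a single tag, and indexing a tag looks up an HTML attribute rather than a list position. The following is a minimal sketch of that behaviour, not the asker's actual page. It is written for Python 3 with the bs4 package, and the HTML sample is made up. Only the class names results, number, and title are taken from the question's code.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the IMDb results page (assumption).
HTML = """
<table class="results">
  <tr><td class="number">1.</td><td class="title">Oh, My God!</td></tr>
  <tr><td class="number">2.</td><td class="title">Another Film</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns one Tag (or None). Indexing a Tag looks up an HTML
# attribute, so table[0] asks for an attribute named 0.
table = soup.find("table", attrs={"class": "results"})
try:
    table[0]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc  # the same kind of error as in the traceback

# find_all() returns a list-like ResultSet, so [0] picks the first table.
first_table = soup.find_all("table", attrs={"class": "results"})[0]
rows = first_table.find_all("tr")

# find() on a row returns a single cell; get_text() returns its text.
titles = [row.find("td", attrs={"class": "title"}).get_text(strip=True)
          for row in rows]
ranks = [row.find("td", attrs={"class": "number"}).get_text(strip=True).rstrip(".")
         for row in rows]
print(titles, ranks)
```

With find() the result can be used directly (table.find_all('tr')), so the [0] step is only needed after find_all(). The same distinction matters for the cells: find_all and findChildren return collections, which have no encode or replace methods, so the text has to be taken from an individual cell first.
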
def read_m_from_url(url, num_of_m=50):
    # This function reads a number of movies from a URL. For example, if you set num_of_m=25, it reads 25 movies from the page. The default value is 50.
    #url = 'http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2016' #MAY NEED TO TAKE THIS OUT SINCE READ IS IN MAIN
    html_string = util.read_html(url) # Given a URL, read the HTML file as a string. I have implemented this read_html function in util_imdb.py. Please take a look.
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    # Fetch the table that includes all the movies. In our lecture, we talked about the find and find_all functions.
    # For example, find_all('table') will give you all tables on the page. The find and find_all functions can take two parameters:
    # in the code below, 'table' is the tag name and 'results' is an attribute value of the tag. You can also write movie_table = soup.find('table', {'class': 'results'}).
    # Here you explicitly say: I want to find a table with attribute class = 'results'.
    # Since each IMDb page has only one table with class = 'results', we can use find rather than find_all. find_all will return a list of table tags, while find() will return only one table.
    movie_table = soup.find('table', attrs = {'class': "results"}) # equivalent to  movie_table = soup.find('table', {'class':'results'})
    tables = movie_table[0] #line 67. create list for tables
    tb = tables.find_all('movie_table')[0]
    trs = tb.find_all('tr')
    list_movies = [] # initialize the return value, a list of movies
        # Using count track the number of movies processed. now it's 0 - No movie has been processed yet.
    count = 0 #increase count by 1 for every movie processed
    '''
    #Add your code here...., based on the following pseudo code.'''

    for tr in trs: # each row represents information of a movie
      dict_each_movie = {} # create an empty dictionary

      # your code to fetch title first.
      title = tr.findChildren('td', attrs= {'class': "title"})
      title = title.encode("ascii", "ignore") # convert the unicode string into an ascii string
      util.process_str_with_comma(title) # this method is in util_imdb.py.
      # Sometimes a title can include a comma (e.g. "Oh, My God!"). This will cause a problem
      # if your code outputs the title to a CSV file. To deal with this problem, we use quotation marks to enclose the title.
      # When you load the CSV using the Python package pandas
      # (or SAS or many other packages for processing CSV) and pandas sees a string in the CSV enclosed in "", it will recognize it as a CSV field with commas within it.
      dict_each_movie["title"] = title

      # your code to fetch year
      year = tr.findchildren('td', attrs= {'class': "year_type"})#tag is 'a' on page source.
      year = year.encode("ascii","ignore")
      dict_each_movie["rank"] = rank

      # your code to fetch rank. Rank here means the number (such as 1., 2.) in front of the image of each movie. Remove the '.'
      dotted_rank = tr.findChildren('td', attrs = {'class': "number"})
      rank = dotted_rank.replace(".", "") #takes out period at end
      rank = rank.encode("ascii","ignore")
      dict_each_movie["year"] = year

      # your code to fetch genres. Here I used try except; you can implement this part in a different way without using exception handling
      genres = [] # a movie can have a list of genre values, or none
      try: # you need to deal with exception, since a movie may not have a tag for genres. If there are genres:
          genre = tr.findChildren('td', attrs = {'class': "genre"})
          genre = genre.encode("ascii", "ignore")
          genres.append(genre)
          #  '''find_all genres #add all the genres to the list "genres". Remember first encode('ascii', ignore) and then add to the list'''
      except:
          genres = []
          "do nothing. genres is still [], an empty list"
      finally: # whether an exception or not, you want to do the following
            dict_each_movie["genres"] = genres

      # your code to fetch runtime. Again, there are some movies that do not have a runtime value
      runtime = ""
      try:
           runtime = tr.findchildren('td', attrs = {'class': "runtime"})#find runtime
           runtime = runtime.encode('ascii','ignore')
           runtime.remove('mins.')#a runtime string looks like "90 mins." you need to remove " mins."
      except:
           runtime = "" #do nothing
      finally:
           dict_each_movie["runtime"] = runtime

      #your code to fetch rating
      rating = tr.findChildren('td', attrs = {'class': "rating-rating"})
      rating = rating.runtime.encode('ascii','ignore')
      dict_each_movie["rating"] = rating

      list_movies.append(dict_each_movie)
      count += 1
      if count == num_of_m:
          break
      '''now we are done with processing a movie, increment count
      check if we have processed num_of_m movies (if count == num_of_m)? if so, break.'''

    return list_movies

def test_read_m_from_url():

    url = "http://www.imdb.com/search/title?at=0&sort=user_rating&start=51&title_type=feature&year=2005,2014"
    print read_m_from_url(url, 21)
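
The comments inside the function describe three string clean-ups: dropping the trailing period from a rank, removing " mins." from a runtime, and quoting titles that contain commas before writing CSV. A minimal sketch of those steps using only the standard library follows. It is written for Python 3. The helper names clean_rank, clean_runtime, and to_csv_line are hypothetical, and the sketch does not reproduce util.process_str_with_comma, whose implementation is not shown in the question.

```python
import csv
import io


def clean_rank(dotted_rank):
    """Turn a rank such as '1.' into '1' (hypothetical helper)."""
    return dotted_rank.strip().rstrip(".")


def clean_runtime(runtime):
    """Turn '90 mins.' into '90'.

    Strings have no remove() method, so replace() is used instead.
    """
    return runtime.replace("mins.", "").strip()


def to_csv_line(fields):
    """Format one CSV row; csv.writer quotes fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


line = to_csv_line([clean_rank("1."), "Oh, My God!", clean_runtime("90 mins.")])
print(line)
```

Letting the csv module do the quoting avoids adding quotation marks by hand, and pandas.read_csv reads such quoted fields back as a single value.
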