Python: creating a function to extract a number of movies, and each movie's attributes, from a URL
Tags: python, web-scraping, html-parsing, imdb

I am creating a function read_m_from_url(url, num_of_m=50) to extract a number of movies from a URL. It will also return a list of dictionaries, each dictionary representing one movie. Can anyone tell me what I am doing wrong on line 67 (marked with a comment)?

    line 203, in <module>
        main()
    line 198, in main
        test_read_m_from_url()
    line 141, in test_read_m_from_url
        print read_m_from_url(url, 21)
    line 67, in read_m_from_url
        tables = movie_table[0] #line 67. create list for tables
    line 958, in __getitem__
        return self.attrs[key]
    KeyError: 0
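The KeyError: 0 at the bottom of the traceback can be reproduced in isolation. The following is a minimal sketch, assuming the bs4 package is installed and using a made-up one-row table in place of the IMDb page. find() returns a single Tag, and indexing a Tag looks up an HTML attribute by name (the `return self.attrs[key]` frame in the traceback), so `[0]` asks for an attribute called 0, which does not exist.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page (not the real IMDb markup).
html = '<table class="results"><tr><td>row</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag object, not a list of tags.
table = soup.find("table", attrs={"class": "results"})
print(type(table).__name__)  # Tag

# Square brackets on a Tag read an HTML attribute by name.
print(table["class"])  # ['results']

# There is no attribute named 0, so this is the KeyError from the question.
try:
    table[0]
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]
print("KeyError key:", error_key)
```

So line 67 fails before any table rows are read: the object being indexed is already the table itself.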
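soup.find should be changed to soup.find_all. Also, if tables is a list, it has no find_all attribute.

@Arman, so for that part it should be:

    movie_table = soup.find_all('table', attrs={'class': "results"})
    tables = movie_table[0]
    trs = tables.find_all('tr')

I'm not getting any error in this part, so I guess I've fixed it...???

Run the edited code and report the error here, then edit your question.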
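The fix discussed in the comments can be written two ways, sketched below under the assumption that bs4 is installed; the HTML string is a hypothetical stand-in for the page returned by util.read_html. Either keep find() and use its result directly, or switch to find_all() and index the returned list. Note also that the question's `tables.find_all('movie_table')` searches for a tag literally named movie_table, which would match nothing, so that step is dropped here.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
HTML = """
<table class="results">
  <tr><td class="title">First</td></tr>
  <tr><td class="title">Second</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Option 1: find() returns one Tag (or None), so use it directly.
movie_table = soup.find("table", attrs={"class": "results"})
if movie_table is None:
    raise ValueError("no <table class='results'> found on the page")
rows_via_find = movie_table.find_all("tr")

# Option 2: find_all() returns a list-like ResultSet, so index it first.
movie_tables = soup.find_all("table", attrs={"class": "results"})
rows_via_find_all = movie_tables[0].find_all("tr")

print(len(rows_via_find), len(rows_via_find_all))
```

Both options yield the same rows; the None check in option 1 simply gives a clearer error if the page layout is not what the code expects.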
def read_m_from_url(url, num_of_m=50):
    # This function reads a number of movies from a url. For example, if you set
    # num_of_m=25, you want to read 25 movies from the page. The default value is 50.
    #url = 'http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2016' #MAY NEED TO TAKE THIS OUT SINCE READ IS IN MAIN
    # Given a url, you need to read the html file as a string. I have implemented
    # this read_html function in util_imdb.py. Please take a look.
    html_string = util.read_html(url)
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    # Fetch the table that includes all the movies. In our lecture, we talked about
    # the find and find_all functions. For example, find_all('table') will give you
    # all tables on the page. The find and find_all functions can take two
    # parameters: in the code below, 'table' is the tag name and 'results' is an
    # attribute value of the tag. You can also write
    # movie_table = soup.find('table', {'class': 'results'}).
    # Here you explicitly say: I want to find a table with attribute class = 'results'.
    # Since each imdb page has only one table with class = 'results', we can use
    # find rather than find_all. find_all will return a list of table tags, while
    # find() will return only one table.
    movie_table = soup.find('table', attrs = {'class': "results"}) # equivalent to movie_table = soup.find('table', {'class': 'results'})
    tables = movie_table[0] #line 67. create list for tables
    tb = tables.find_all('movie_table')[0]
    trs = tb.find_all('tr')
    list_movies = [] # initialize the return value, a list of movies
    # Use count to track the number of movies processed. Now it's 0: no movie has been processed yet.
    count = 0 # increase count by 1 for every movie processed
    # Add your code here, based on the following pseudo code.
    for tr in trs: # each row represents information of a movie
        dict_each_movie = {} # create an empty dictionary
        # your code to fetch title first.
        title = tr.findChildren('td', attrs= {'class': "title"})
        title = title.encode("ascii", "ignore") # convert the unicode string into an ascii string
        util.process_str_with_comma(title) # this method is in util_imdb.py.
        # Sometimes, a title can include a comma (e.g. "Oh, My God!"). This will cause
        # a problem if your code outputs the title to a csv file. To deal with this
        # problem, we use quotation marks to enclose the title.
        # When you load the csv using the python package pandas (or SAS or many other
        # packages for processing csv), when pandas sees a string in the csv enclosed
        # in "", it will recognize it as a csv field with commas within it.
        dict_each_movie["title"] = title
        # your code to fetch year
        year = tr.findchildren('td', attrs= {'class': "year_type"}) # tag is 'a' on page source.
        year = year.encode("ascii","ignore")
        dict_each_movie["rank"] = rank
        # your code to fetch rank. Rank here means the number (such as 1., 2.) in
        # front of the image of each movie. Remove the '.'
        dotted_rank = tr.findChildren('td', attrs = {'class': "number"})
        rank = dotted_rank.replace(".", "") # takes out period at end
        rank = rank.encode("ascii","ignore")
        dict_each_movie["year"] = year
        # your code to fetch genres. Here I used try except; you can implement this
        # part in a different way without using exception handling
        genres = [] # a movie can have a list of genre values, or none
        try: # you need to deal with exceptions, since a movie may not have a tag for genres. If there are genres:
            genre = tr.findChildren('td', attrs = {'class': "genre"})
            genre = genre.encode("ascii", "ignore")
            genres.append(genre)
            # find_all genres, add all the genres to the list "genres".
            # Remember to first encode('ascii', 'ignore') and then add to the list
        except:
            genres = []
            "do nothing. genres is still [], an empty list"
        finally: # whether an exception or not, you want to do the following
            dict_each_movie["genres"] = genres
        # your code to fetch runtime. Again, there are some movies that do not have a runtime value
        runtime = ""
        try:
            runtime = tr.findchildren('td', attrs = {'class': "runtime"}) # find runtime
            runtime = runtime.encode('ascii','ignore')
            runtime.remove('mins.') # a runtime string looks like "90 mins."; you need to remove " mins."
        except:
            runtime = "" # do nothing
        finally:
            dict_each_movie["runtime"] = runtime
        # your code to fetch rating
        rating = tr.findChildren('td', attrs = {'class': "rating-rating"})
        rating = rating.runtime.encode('ascii','ignore')
        dict_each_movie["rating"] = rating
        list_movies.append(dict_each_movie)
        # Now we are done with processing a movie, so increment count, then check
        # whether we have processed num_of_m movies (count == num_of_m). If so, break.
        count += 1
        if count == num_of_m:
            break
    return list_movies
def test_read_m_from_url():
    url = "http://www.imdb.com/search/title?at=0&sort=user_rating&start=51&title_type=feature&year=2005,2014"
    print read_m_from_url(url, 21)
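Beyond line 67, the per-row code above calls string methods such as encode and replace on the result of findChildren, which is a list of tags rather than a string, and Python strings have no remove method. One possible way to read a cell is sketched below, assuming bs4 is installed. The sample row is hypothetical: its class names are taken from the question's code, but the real IMDb markup is not shown there and may differ. The helper names cell_text and parse_row are made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup: class names come from the question's code, but the
# real IMDb page layout is not shown there and may differ.
SAMPLE_ROW = """
<table class="results">
  <tr>
    <td class="number">1.</td>
    <td class="title">Oh, My God!</td>
    <td class="runtime">90 mins.</td>
  </tr>
</table>
"""


def cell_text(row, css_class):
    """Return the stripped text of the first matching <td>, or "" if absent."""
    cell = row.find("td", attrs={"class": css_class})
    return cell.get_text(strip=True) if cell is not None else ""


def parse_row(row):
    """Build one movie dictionary from a single <tr> tag."""
    return {
        "title": cell_text(row, "title"),
        "rank": cell_text(row, "number").rstrip("."),
        "runtime": cell_text(row, "runtime").replace(" mins.", ""),
        "genre": cell_text(row, "genre"),  # cell is missing in the sample, so ""
    }


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
movie = parse_row(soup.find("tr"))
print(movie)

# csv.writer quotes any field containing a comma, so the title stays one field.
buffer = io.StringIO()
csv.writer(buffer).writerow([movie["rank"], movie["title"], movie["runtime"]])
print(buffer.getvalue().strip())  # 1,"Oh, My God!",90
```

Returning "" for a missing cell replaces the try/except/finally blocks with an explicit check, and letting the csv module do the quoting is an alternative to quoting titles by hand.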