Web Scraping In Python Using BeautifulSoup Library — Part 2
Missed the previous article? Click here.
In the previous article we discussed grabbing the title and rating of each book. All other information about a book is available at that book's own URL. Let's pick a random book and try to access its information.
As we already know how to capture the URL of a book:
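The embedded snippet is not visible in this extract; a minimal sketch of the idea, demonstrated on a small HTML fragment shaped like one book entry on the catalogue page (the fragment itself is an illustrative stand-in, not the live page):

```python
import bs4

# Illustrative stand-in for one book entry on a catalogue page
html = '''
<article class="product_pod">
  <h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
</article>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
book_link = soup.select('article h3 a')[0]  # same selector used in Part 1
book_url = 'http://books.toscrape.com/catalogue/' + book_link['href']
print(book_url)  # → http://books.toscrape.com/catalogue/in-her-wake_980/index.html
```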
Now, for each book, we need a separate request and a separate soup object.
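The snippet itself is not shown in this extract; the pattern is presumably along these lines (the variable names are assumptions, and html.parser from the standard library is used here, where the article later uses lxml):

```python
import requests
import bs4

# A fresh request and a fresh soup object for this one book
book_url = 'https://books.toscrape.com/catalogue/in-her-wake_980/index.html'
res = requests.get(book_url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
print(res.text[:60])  # the raw HTML of the book's page
```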
Running the above code will give you the HTML of that page in raw text format.
Visit : https://books.toscrape.com/catalogue/in-her-wake_980/index.html
Look for the 'Product Information' table on that webpage and inspect it.
You will find that the th and td tags contain the required information, which can be captured using the select method.
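The two-line snippet referenced below is not included in this extract; a sketch of what it likely looked like, demonstrated on an inline copy of the Product Information table (the td values here are made-up placeholders):

```python
import bs4

# Inline copy of the Product Information table; td values are placeholders
html = '''
<table class="table table-striped">
  <tr><th>UPC</th><td>a1b2c3d4</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£12.84</td></tr>
  <tr><th>Price (incl. tax)</th><td>£12.84</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Availability</th><td>In stock</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

book_features_column_list = soup.select('tr th')  # line 1: all th tags
print(book_features_column_list)
print(book_features_column_list[2].text)          # line 2 → 'Price (excl. tax)'
```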
Line 1 in above code returns a list :
[<th>UPC</th>,
<th>Product Type</th>,
<th>Price (excl. tax)</th>,
<th>Price (incl. tax)</th>,
<th>Tax</th>,
<th>Availability</th>,
<th>Number of reviews</th>]
Line 2 returns :
'Price (excl. tax)'
Here .text is used to grab the text present between the tags, as in <th>Tax</th>.
Similarly, to select the td tags:
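Again, the snippet is missing from this extract; presumably something like the following, shown here on a one-row illustrative table:

```python
import bs4

# One illustrative table row; the real page has seven such rows
html = '<table><tr><th>Tax</th><td>£0.00</td></tr></table>'
soup = bs4.BeautifulSoup(html, 'html.parser')

book_column_values = soup.select('tr td')          # all td tags of the table
print([item.text for item in book_column_values])  # → ['£0.00']
```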
Using a for loop to get all the information about each book in the form of a dataframe:
%%time
import requests
import bs4
import pandas as pd

new_df = pd.DataFrame()  # empty dataframe
index_num = 0            # index of dataframe

for page_num in range(1, 51):  # total pages
    request = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')  # url of each page
    soup_ = bs4.BeautifulSoup(request.text, 'lxml')
    book_info_list = soup_.select('article h3 a')
    for n in range(0, 20):  # 20 books per page
        res = requests.get('http://books.toscrape.com/catalogue/' + book_info_list[n]['href'])  # url of each book on the page
        soup = bs4.BeautifulSoup(res.text, 'lxml')
        book_features_column_list = soup.select('tr th')  # book features like price, availability of each book
        book_column_values = soup.select('tr td')         # values of those features
        column_names = [item.text for item in book_features_column_list]
        column_values = [item.text for item in book_column_values]
        d = dict(zip(column_names, column_values))  # dictionary with features as columns and values as rows
        df = pd.DataFrame(d, index=[index_num + n])
        new_df = pd.concat([new_df, df])  # append df of every book (DataFrame.append was removed in pandas 2.0)
    index_num += 20  # index for each page incremented by 20
In the above code, one for loop is nested inside another for loop, so it should look like the image given below:
In the previous article we created two lists containing the titles and ratings of the books. Let's add those two columns to this dataframe.
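The snippet is not included in this extract; assuming the two lists from Part 1 are named title_list and rating_list, the idea is a column assignment per list (demonstrated here on a tiny stand-in dataframe):

```python
import pandas as pd

# Tiny stand-in for new_df and for the two lists built in Part 1
new_df = pd.DataFrame({'UPC': ['a1', 'b2']})
title_list = ['A Light in the Attic', 'Tipping the Velvet']  # assumed name
rating_list = ['Three', 'One']                               # assumed name

new_df['Title'] = title_list    # each list becomes a new column
new_df['Rating'] = rating_list
print(new_df.columns.tolist())  # → ['UPC', 'Title', 'Rating']
```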
For every book there is a category associated with it, like fiction, thriller, romance, etc. The category can be found in a ul tag.
Using a for loop to get the categories of each book in the form of a list:
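The loop itself is not shown in this extract; a sketch of the likely idea. On each book's page the category appears in the breadcrumb ul (Home / Books / Category / Title), so the loop would request every book page and collect the third link's text. The fragments below are inline stand-ins for two fetched pages; in the article the loop presumably calls requests.get for all 1000 book URLs instead:

```python
import bs4

# Inline stand-ins for two fetched book pages (assumed breadcrumb structure)
book_pages = [
    '<ul class="breadcrumb"><li><a>Home</a></li><li><a>Books</a></li>'
    '<li><a>Thriller</a></li><li class="active">In Her Wake</li></ul>',
    '<ul class="breadcrumb"><li><a>Home</a></li><li><a>Books</a></li>'
    '<li><a>Poetry</a></li><li class="active">A Light in the Attic</li></ul>',
]

category_list = []
for page_html in book_pages:
    soup = bs4.BeautifulSoup(page_html, 'html.parser')
    category_list.append(soup.select('ul li a')[2].text)  # third breadcrumb link is the category

print(category_list)  # → ['Thriller', 'Poetry']
```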
The above code may take around 10 minutes to execute.
Now we need to add this category list as a column in our dataframe.
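The missing snippet is presumably a single column assignment, assuming the list is named category_list (tiny stand-in data again):

```python
import pandas as pd

new_df = pd.DataFrame({'Title': ['In Her Wake', 'A Light in the Attic']})  # stand-in
category_list = ['Thriller', 'Poetry']                                     # stand-in

new_df['Category'] = category_list  # the list becomes a new column
print(new_df['Category'].tolist())  # → ['Thriller', 'Poetry']
```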
To store the dataframe we created as a CSV file:
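The snippet is not shown in this extract; presumably a single to_csv call, where the filename here is an assumption:

```python
import pandas as pd

new_df = pd.DataFrame({'Title': ['In Her Wake'], 'Category': ['Thriller']})  # stand-in
new_df.to_csv('books_data.csv')  # written to the current working directory
```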
You can check this file saved in the current working directory.
Happy learning !!!
In total, the scraping code above can take around 12 minutes to execute, and it returns a dataframe of shape (1000, 7).