Web Scraping In Python Using BeautifulSoup Library — Part 2

Gaurav Patil
3 min read · Apr 13, 2022

Missed the previous article? Click here.

In the previous article, we discussed grabbing the title and rating of each book. All the other information about a book is available at that book's own URL. Let's pick a random book and try to access that information.

As we already know, to capture the URL of a book:
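The original snippet is not shown here, so below is a minimal reconstruction based on Part 1 (soup_ is the soup object of a listing page, matching the names used in the loop later in this article):

import requests
import bs4

# soup of a catalogue (listing) page, as built in Part 1
request = requests.get('http://books.toscrape.com/catalogue/page-1.html')
soup_ = bs4.BeautifulSoup(request.text, 'lxml')

book_info_list = soup_.select('article h3 a')
book_info_list[0]['href']  # relative URL, e.g. 'a-light-in-the-attic_1000/index.html'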

Now, for each book, we need a separate request and soup object.
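A sketch for the book used in this walkthrough (the 'lxml' parser is an assumption; any parser BeautifulSoup supports will do):

res = requests.get('https://books.toscrape.com/catalogue/in-her-wake_980/index.html')
soup = bs4.BeautifulSoup(res.text, 'lxml')
res.text  # raw HTML of the page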

Running the above code gives you the HTML of that page as raw text.

Visit: https://books.toscrape.com/catalogue/in-her-wake_980/index.html

Look for the 'Product Information' table on that webpage and inspect it.

You will find that the th and td tags contain the required information, which can be captured using the select method.
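The original snippet is not reproduced here, but given the outputs below it was most likely these two lines:

soup.select('tr th')
soup.select('tr th')[2].text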

Line 1 in the above code returns a list:

[<th>UPC</th>,
<th>Product Type</th>,
<th>Price (excl. tax)</th>,
<th>Price (incl. tax)</th>,
<th>Tax</th>,
<th>Availability</th>,
<th>Number of reviews</th>]

Line 2 returns:

'Price (excl. tax)'

Here, the .text attribute is used to grab the text present between the tags, as in <th>Tax</th>.

Similarly, to select the td tags:
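Again a reconstruction, mirroring the th snippet above:

soup.select('tr td')          # list of <td> tags
soup.select('tr td')[2].text  # the value paired with 'Price (excl. tax)'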

Using a for loop to gather all the information about each book into a dataframe:

%%time
import pandas as pd

new_df = pd.DataFrame()  # empty dataframe
index_num = 0  # starting index of the dataframe

for page_num in range(1, 51):  # 50 listing pages in total
    # url of each listing page
    request = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')
    soup_ = bs4.BeautifulSoup(request.text, 'lxml')
    book_info_list = soup_.select('article h3 a')

    for n in range(0, 20):  # 20 books per page
        # url of each book on the above page
        res = requests.get('http://books.toscrape.com/catalogue/' + book_info_list[n]['href'])
        soup = bs4.BeautifulSoup(res.text, 'lxml')

        # book features such as price and availability
        book_features_column_list = soup.select('tr th')
        book_column_values = soup.select('tr td')  # values of those features

        column_names = [item.text for item in book_features_column_list]
        column_values = [item.text for item in book_column_values]

        # making a dictionary with features as columns and values as rows
        d = dict(zip(column_names, column_values))

        df = pd.DataFrame(d, index=[index_num + n])
        # DataFrame.append is deprecated; pd.concat does the same job
        new_df = pd.concat([new_df, df])

    index_num += 20  # index offset incremented by 20 for each page

In the above code, one for loop is nested inside the other, so make sure your indentation matches the snippet above. This code can take around 12 minutes to execute and returns a dataframe of shape (1000, 7).

In the previous article, we created two lists containing the titles and ratings of the books. Let's add those two columns to this dataframe.
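A sketch, assuming the two lists from Part 1 are named titles_list and ratings_list (the names are illustrative):

new_df['Title'] = titles_list
new_df['Rating'] = ratings_list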

For every book, there is an associated category, such as fiction, thriller, romance, etc.

The category can be found in a ul tag: the breadcrumb at the top of each book's page.
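For a single book's soup, a sketch (assuming the breadcrumb is the ul with class 'breadcrumb', where the third link is the category):

soup.select('ul.breadcrumb li a')[2].text  # category of this book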

Using a for loop to get the category of each book in the form of a list:
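The original loop is not shown here; a reconstruction that mirrors the product-information loop above:

category_list = []

for page_num in range(1, 51):
    request = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')
    soup_ = bs4.BeautifulSoup(request.text, 'lxml')
    book_info_list = soup_.select('article h3 a')

    for n in range(0, 20):
        res = requests.get('http://books.toscrape.com/catalogue/' + book_info_list[n]['href'])
        soup = bs4.BeautifulSoup(res.text, 'lxml')
        # the third breadcrumb link holds the category
        category_list.append(soup.select('ul.breadcrumb li a')[2].text)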

The above code may take around 10 minutes to execute.

Now we need to add this category list as a column to our dataframe.
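Assuming the names above, this is a single assignment:

new_df['Category'] = category_list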

To store the dataframe we created as a CSV file:
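A one-liner with pandas (the filename books_data.csv is illustrative):

new_df.to_csv('books_data.csv')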

You can find this file saved in the current working directory.

Happy learning!!!

