Web Scraping In Python Using BeautifulSoup Library — Part 2

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

Missed the previous article? Click here.

In the previous article we discussed grabbing the title and rating of each book. All the other information about a book is available at that book's URL. Let's pick a random book and try to access its information.

As we already know, to capture the URL of a book:
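A quick recap of Part 1 (a sketch using the same selectors; each book's link sits in the a tag inside the article's h3):

import requests
import bs4

# Fetch the first catalogue page, as in Part 1
request = requests.get('http://books.toscrape.com/catalogue/page-1.html')
soup = bs4.BeautifulSoup(request.text, 'lxml')

# Each book's relative URL is in the href of <article><h3><a>
book_info_list = soup.select('article h3 a')
book_url = 'http://books.toscrape.com/catalogue/' + book_info_list[0]['href']

book_url now holds the full address of the first book on the page.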

Now, for each book, we need a separate request and soup object.
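Something like this (a sketch, using the 'In Her Wake' page as the example book):

import requests
import bs4

# A separate request and soup object for one particular book
res = requests.get('https://books.toscrape.com/catalogue/in-her-wake_980/index.html')
soup = bs4.BeautifulSoup(res.text, 'lxml')
print(res.text[:500])  # first part of the raw HTML of the book's page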

Running the above code will give you the HTML of that page as raw text.

Visit: https://books.toscrape.com/catalogue/in-her-wake_980/index.html

Look for the 'Product Information' table on that webpage and 'inspect' it.

You will find that the th and td tags contain the required information, which can be captured using the select method.
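The two lines referenced below might look like this (a sketch; the soup object is rebuilt here so the snippet runs on its own):

import requests
import bs4

res = requests.get('https://books.toscrape.com/catalogue/in-her-wake_980/index.html')
soup = bs4.BeautifulSoup(res.text, 'lxml')

book_features_column_list = soup.select('tr th')  # Line 1: all table headers
book_features_column_list[2].text                 # Line 2: text of the third header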

Line 1 in the above code returns a list:

[<th>UPC</th>,
<th>Product Type</th>,
<th>Price (excl. tax)</th>,
<th>Price (incl. tax)</th>,
<th>Tax</th>,
<th>Availability</th>,
<th>Number of reviews</th>]

Line 2 returns:

'Price (excl. tax)'

Here the .text attribute is used to grab the text present between the tags, as in <th>Tax</th>.

Similarly, to select the td tags:
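A sketch, reusing the same soup object (rebuilt here so the snippet is self-contained):

import requests
import bs4

res = requests.get('https://books.toscrape.com/catalogue/in-her-wake_980/index.html')
soup = bs4.BeautifulSoup(res.text, 'lxml')

book_column_values = soup.select('tr td')  # the values of the table
column_values = [item.text for item in book_column_values]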

Using a for loop to get all the information about each book in the form of a dataframe:

%%time
new_df = pd.DataFrame()  # empty dataframe
index_num = 0  # row index of the dataframe

for page_num in range(1, 51):  # 50 pages in total
    # URL of each page
    request = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')
    soup_ = bs4.BeautifulSoup(request.text, 'lxml')
    book_info_list = soup_.select('article h3 a')

    for n in range(0, 20):  # 20 books per page
        # URL of each book on the above page
        res = requests.get('http://books.toscrape.com/catalogue/' + book_info_list[n]['href'])
        soup = bs4.BeautifulSoup(res.text, 'lxml')

        # book features like price and availability of each book
        book_features_column_list = soup.select('tr th')
        book_column_values = soup.select('tr td')  # values of those features

        column_names = [item.text for item in book_features_column_list]
        column_values = [item.text for item in book_column_values]

        # dictionary with features as columns and values as a row
        d = dict(zip(column_names, column_values))

        df = pd.DataFrame(d, index=[index_num + n])
        # append the row for every book
        # (DataFrame.append was removed in pandas 2.0, hence pd.concat)
        new_df = pd.concat([new_df, df])

    index_num += 20  # index offset incremented by 20 per page

In the above code, one for loop is nested inside another, so it should look like the image given below:

In the previous article we created two lists containing the titles and ratings of the books. Let's add those two columns to this dataframe.
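Assuming the two lists from Part 1 are named title_list and rating_list (hypothetical names; one entry per book, in the same scraping order as new_df's rows), attaching them is a single assignment each. A sketch with dummy data:

import pandas as pd

# Dummy stand-ins for the scraped dataframe and the Part 1 lists
new_df = pd.DataFrame({'UPC': ['a897fe39b1053632', '90fa61229261140a']})
title_list = ['A Light in the Attic', 'Tipping the Velvet']  # from Part 1
rating_list = ['Three', 'One']                               # from Part 1

new_df['Title'] = title_list
new_df['Rating'] = rating_list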

For every book, there is an associated category, like fiction, thriller, romance, etc.

The category can be found in a ul tag.

Using a for loop to get the category of each book in the form of a list:
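A sketch of that loop, shown over the first catalogue page only; changing range(1, 2) to range(1, 51) covers all 50 pages, and that full run is what takes the time noted below. The category is the third link in each book page's breadcrumb ul:

import requests
import bs4

category_list = []
for page_num in range(1, 2):  # use range(1, 51) for all 50 pages
    request = requests.get(f'http://books.toscrape.com/catalogue/page-{page_num}.html')
    soup_ = bs4.BeautifulSoup(request.text, 'lxml')
    book_info_list = soup_.select('article h3 a')
    for n in range(0, 20):
        res = requests.get('http://books.toscrape.com/catalogue/' + book_info_list[n]['href'])
        soup = bs4.BeautifulSoup(res.text, 'lxml')
        # Breadcrumb reads Home > Books > <category> > <title>; index 2 is the category
        category_list.append(soup.select('ul.breadcrumb li a')[2].text)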

The above code may take around 10 minutes to execute.

Now we need to add this category list as a column to our dataframe.
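With the lengths matching (1000 rows, 1000 categories), it is a single assignment. A sketch with dummy data:

import pandas as pd

new_df = pd.DataFrame({'Title': ['In Her Wake', 'Set Me Free']})  # dummy rows
category_list = ['Thriller', 'Young Adult']                       # dummy categories

new_df['Category'] = category_list  # one category per row, in scraping order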

To store the dataframe we created as a CSV file:
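With pandas this is one call to to_csv (a sketch; 'books_data.csv' is a hypothetical filename, and a small stand-in dataframe is used here):

import pandas as pd

new_df = pd.DataFrame({'Title': ['In Her Wake'], 'Category': ['Thriller']})  # stand-in
new_df.to_csv('books_data.csv', index=False)  # written to the current working directory

Passing index=False keeps the row index out of the file.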

You can check this file saved in the current working directory.

The main scraping loop can take around 12 minutes to execute, and it will return a dataframe of shape (1000, 7).

Happy learning !!!


I am a keen learner and diligent teacher with special interest in mathematics and machine learning.

Gaurav Patil
