Web Scraping In Python Using BeautifulSoup Library — Part 1
What is web scraping?
Web scraping means extracting data from websites. The data available on websites is generally unstructured and embedded in HTML. We can extract it into a dataframe and store it in a CSV file. That is what we will be doing in this article.
Website to be scraped — https://books.toscrape.com/catalogue/page-1.html
We need three libraries: `requests`, `bs4` and `pandas`. Install them by running the following commands in the command prompt.
pip install bs4
pip install requests
pip install pandas
Now let’s import the necessary libraries.
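A minimal sketch of the imports this walkthrough relies on, assuming the three packages installed above. The alias `pd` is a common convention, not something this article prescribes.

```python
import requests                  # sends HTTP requests to the website
from bs4 import BeautifulSoup    # parses the returned HTML
import pandas as pd              # tabular storage (dataframe, CSV export)
```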
There are 50 pages in total on the website, each listing 20 books. We will first scrape a single page, say the second one. All the pages can then be scraped using the same logic.
To establish a connection between the Python environment we are using and the webpage to be scraped, we use the `requests` library.
Run the code snippet above. The output `<Response [200]>` means the request succeeded and the connection was established.
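The request itself might look like the sketch below. The helper name `fetch_page`, the `BASE_URL` template and the 10-second timeout are assumptions made for illustration; only the URL pattern comes from the site address given earlier.

```python
import requests

# URL pattern of the catalogue pages; {} is replaced by the page number
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def fetch_page(page_number, timeout=10):
    """Request one catalogue page and return the Response object."""
    url = BASE_URL.format(page_number)
    return requests.get(url, timeout=timeout)


# With an internet connection, in a notebook cell:
# response = fetch_page(2)
# response        # a successful request is displayed as <Response [200]>
```

Wrapping the call in a function keeps the network access explicit, so the same helper can be reused later for the other pages.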
Now let’s use the `BeautifulSoup` library to create the soup object.
The code above returns the HTML of the webpage as raw text.
Visit the webpage using the URL, right-click and hit ‘Inspect’.
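As a sketch, the soup object can be built as follows. To keep the example runnable offline, a tiny hand-written HTML string stands in for `response.text`; that stand-in and the choice of the built-in `html.parser` are assumptions.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text (hypothetical, heavily trimmed page)
html = """
<html>
  <head><title>All products | Books to Scrape - Sandbox</title></head>
  <body><h1>All products</h1></body>
</html>
"""

# Parse the raw HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")

# Printing the soup shows the page's HTML as text
print(soup.prettify())
```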
To select an HTML tag, we use the `select` method from the BeautifulSoup library. Let’s try selecting the `article` tag. This tag holds all the information about each book, as you can see in the image given below. (Look at the box appearing above the blue box.)
The code above returns all the information in that `article` tag. But if we right-click on the title of a book and hit ‘Inspect’, we can see that the `a` tag contains the title information, as shown in the image given below.
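A small offline sketch of selecting the `article` tag. The HTML string is a trimmed stand-in for one book card, so its exact structure is an assumption; on the live page, `soup` would be built from the response instead.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for one book card on the page (assumed structure)
html = """
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns every matching tag
articles = soup.select("article")
print(articles)
```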
To select tags nested inside another tag, we again use the `select` method.
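One way to express "an `a` tag inside an `article` tag" is a descendant selector. The selector `article h3 a` used here is an assumption: in this trimmed stand-in card, as on the live page as far as I can tell, the cover image is also wrapped in an `a` tag, so going through `h3` keeps only the title link.

```python
from bs4 import BeautifulSoup

# Stand-in book card with two links: the cover image and the title
html = """
<article class="product_pod">
  <div class="image_container">
    <a href="in-her-wake_980/index.html"><img alt="In Her Wake"></a>
  </div>
  <h3><a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# A space in a CSS selector means "descendant of"
title_links = soup.select("article h3 a")
print(title_links)
```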
Always remember that the `select` method returns a list, so we can check the number of elements in it using the `len` function. Since there are 20 books on a single page, the code given below should return 20 as output.
Since the `select` method returns a list, we can inspect each element of that list using indexing.
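The count can be checked offline with a generated stand-in page. The 20 placeholder cards and their made-up titles are assumptions that only mimic the number of books on a real page; the selector `article h3 a` is likewise an assumption.

```python
from bs4 import BeautifulSoup

# Build a stand-in page with 20 placeholder book cards (titles are made up)
card = (
    '<article class="product_pod">'
    '<h3><a href="book-{0}/index.html" title="Book {0}">Book {0}</a></h3>'
    "</article>"
)
html = "".join(card.format(i) for i in range(1, 21))

soup = BeautifulSoup(html, "html.parser")
title_links = soup.select("article h3 a")

# select() returns a list, so len() counts the matched tags
print(len(title_links))
```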
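A sketch of the indexing step, using the "In Her Wake" link as a one-card stand-in for the page. The variable name `title_links` and the selector are assumptions; the three `print` calls at the end correspond to lines 1, 2 and 3 discussed next.

```python
from bs4 import BeautifulSoup

# One-card stand-in for the page, built around the first title link
html = (
    '<article class="product_pod"><h3>'
    '<a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a>'
    "</h3></article>"
)
soup = BeautifulSoup(html, "html.parser")
title_links = soup.select("article h3 a")

print(title_links[0])           # line 1: the whole <a> tag
print(title_links[0]["title"])  # line 2: value of the title attribute
print(title_links[0]["href"])   # line 3: value of the href attribute
```

Note that `print` shows strings without the surrounding quotes that a notebook cell displays.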
In the above code, line 1 gives output like:
<a href="in-her-wake_980/index.html" title="In Her Wake">In Her Wake</a>
Line 2 gives:
'In Her Wake'
The name of the book is stored under the key ‘title’, so the tag works like a key-value pair in a dictionary.
Line 3 gives:
'in-her-wake_980/index.html'
The link of the book is stored under the key ‘href’; again, it works like a key-value pair in a dictionary.
Now let’s look at the rating of each book in the `article` tag:
We can see in the image above that a `p` tag contains the rating of the book, so we access that tag using the `select` method.
As we already know, the `select` method returns a list, so we inspect any element using indexing.
Line 1 gives:
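The rating lookup can be sketched as below. The class-based selector `article p.star-rating` is an assumption: a book card contains more than one `p` tag, so the class is used to pick out the rating. The HTML is a trimmed stand-in for one card, and the three `print` calls match lines 1, 2 and 3 discussed next.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for one book card (assumed structure)
html = """
<article class="product_pod">
  <p class="star-rating One">
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
  </p>
  <p class="instock availability">In stock</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <p> tags carrying the star-rating class are matched
ratings = soup.select("article p.star-rating")

print(ratings[0])              # line 1: the whole <p> tag
print(ratings[0]["class"])     # line 2: list of the tag's classes
print(ratings[0]["class"][1])  # line 3: the rating word
```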
<p class="star-rating One">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
The rating ‘One’ is stored under the key ‘class’, so we again use the key-value analogy of a dictionary, which gives the output:
['star-rating', 'One']
As the above output is a list, we grab the rating ‘One’ using indexing, which is done in line 3.
Now that we know how to grab the title and the rating of a book, let’s use a `for` loop to grab the titles and ratings of all 20 books from each of the 50 pages. The total of 50 pages can be seen at the bottom of the page.
So for each page we need a separate `request` and a separate `soup` object.
The code above can take 5 to 6 minutes to run. At the end, you can check that all 1000 (20 × 50) titles and ratings are stored in the respective lists.
There is more to web scraping, which is covered in the next article.
Thank you!!
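Putting the pieces together, the loop over all pages might be sketched like this. The function names `parse_page` and `scrape_all`, the selectors, the timeout and the use of `response.text` are assumptions for illustration, not necessarily the exact code behind this article.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def parse_page(html):
    """Return (titles, ratings) for every book card in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [a["title"] for a in soup.select("article h3 a")]
    ratings = [p["class"][1] for p in soup.select("article p.star-rating")]
    return titles, ratings


def scrape_all(first_page=1, last_page=50):
    """Request each page in turn and collect all titles and ratings."""
    all_titles, all_ratings = [], []
    for page_number in range(first_page, last_page + 1):
        # A separate request and a separate soup object for every page
        response = requests.get(BASE_URL.format(page_number), timeout=10)
        titles, ratings = parse_page(response.text)
        all_titles.extend(titles)
        all_ratings.extend(ratings)
    return all_titles, all_ratings


# With an internet connection:
# titles, ratings = scrape_all()
# len(titles), len(ratings)
```

Keeping the parsing in its own function means it can be tried on a single saved page before all 50 requests are sent.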