The World Wide Web is made up of billions of interlinked documents, more commonly known as web pages. The source code for websites is written in Hypertext Markup Language (HTML). HTML source code is a mixture of human-readable information and machine-readable code known as tags. A web browser (e.g. Chrome, Firefox, Safari or Edge) processes the source code, interprets the tags, and displays the information they contain to the user.

Special software is used to extract from the source code only the information that is useful to people. These programs are referred to as web scrapers, spiders, and bots. They search the source code of a web page using predefined patterns and extract the information it contains. The information obtained through web scraping is summarised, combined, analysed, and stored for further use.

In the following, we explain why Python is particularly well-suited for creating web scrapers and provide you with an introduction to the topic and a tutorial.

Why should you use Python for web scraping?

The popular programming language Python is a great tool for creating web scraping software. Since websites are constantly being modified, web content changes over time. For example, the website’s design may be modified or new page components may be added. A web scraper is programmed according to the specific structure of a page. If the structure of the page is changed, the scraper must be updated. This is particularly easy to do with Python.

Python is also effective for text processing and web resource retrieval, both of which form the technical foundations for web scraping. Furthermore, Python is an established standard for data analysis and processing. In addition to its general suitability, Python has a thriving programming ecosystem. This ecosystem includes libraries, open-source projects, documentation, and language references as well as forum posts, bug reports, and blog articles.

There are multiple sophisticated tools for performing web scraping with Python. Here we will introduce you to three popular tools: Scrapy, Selenium, and BeautifulSoup. For some hands-on experience, you can use our tutorial on web scraping with Python based on BeautifulSoup. This will allow you to directly familiarise yourself with the scraping process.

Web scraping overview

The basic procedure for the scraping process is easy to explain. First, the scraper developer analyses the HTML source code of the page in question. Usually, there are unique patterns that are used to extract the desired information. The scraper is programmed using these patterns. The rest of the work is done automatically by the scraper:

  1. It requests the website via the URL address.
  2. It automatically extracts the structured data corresponding to the patterns.
  3. The extracted data is summarised, stored, analysed, combined, etc.
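
The pattern-matching in steps 2 and 3 can be sketched with nothing but Python's standard library. The HTML snippet and the price pattern below are made up for illustration:

```python
import re

# Step 1 (simplified): the downloaded HTML source code of a page
html = '<span class="price">19.99</span><span class="price">24.50</span>'

# Step 2: extract the structured data matching a predefined pattern
prices = re.findall(r'<span class="price">([\d.]+)</span>', html)

# Step 3: summarise the extracted data
total = sum(float(p) for p in prices)
print(prices, total)
```

In practice, a dedicated HTML parser is more robust than regular expressions, which is where tools such as BeautifulSoup come in.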

Applications for web scraping

Web scraping is highly versatile. In addition to search engine indexing, web scraping is used for a range of other purposes including:

  • creating contact databases;
  • monitoring and comparing the prices of online offers;
  • merging data from different online sources;
  • tracking online presence and reputation;
  • collecting financial, weather and other data;
  • monitoring web content for changes;
  • collecting data for research purposes; and
  • data mining.

Demonstrating web scraping through an example

Let us consider a website for selling used cars. When you navigate to the website in your browser, you will be shown a list of cars. Below we will examine an example of source code for a car listing:

raw_html = """
<h1>Used cars for sale</h1>
<ul class="cars-listing">
    <li class="car-listing">
        <div class="car-title">
            Volkswagen Beetle
        </div>
        <div class="car-description">
            <span class="car-make">Volkswagen</span>
            <span class="car-model">Beetle</span>
            <span class="car-build">1973</span>
        </div>
        <div class="sales-price">
            € <span class="car-price">14,998.—</span>
        </div>
    </li>
</ul>
"""

A web scraper can search through the available online listings of used cars. The scraper will search for a specific model in accordance with what the developer intended. In our example, the model is a Volkswagen Beetle. In the source code, the information for the make and model of the car is tagged with the CSS classes 'car-make' and 'car-model'. By using these class names, the desired information can be easily scraped. Here is an example using BeautifulSoup:

# import the BeautifulSoup parser
from bs4 import BeautifulSoup

# parse the HTML source code stored in raw_html
html = BeautifulSoup(raw_html, 'html.parser')
# extract the tag with the class 'car-title'
title_tag = html.find(class_='car-title')
car_title = title_tag.text.strip()
# if this car is a Volkswagen Beetle
if car_title == 'Volkswagen Beetle':
    # jump up from the car title to the wrapping <li> tag
    listing = title_tag.find_parent('li')
    # find the car price within the listing
    car_price = listing.find(class_='sales-price').text.strip()
    # output the car price
    print(car_price)

Legal risks of web scraping

As handy as web scraping is, it also comes with some legal risks. Since the website operator intended their website to be used by humans, automated data retrieval using web scrapers can constitute a violation of the terms of use. This is especially true when large amounts of data are retrieved from multiple pages simultaneously or in rapid succession, since a human could not interact with the website in this way.

Furthermore, the automated retrieval, storage and analysis of the data published on the website may constitute a violation of copyright law. If the scraped data contains personally identifiable information, storing and analysing it without the consent of the person concerned might violate current data protection regulations, e.g. GDPR or CCPA. For example, it is prohibited to scrape Facebook profiles to collect personal information.

Note

Violating privacy and copyright laws may result in severe penalties. You should ensure that you do not break any laws if you intend to use web scraping. Under absolutely no circumstances should you circumvent existing access restrictions.

Technical limitations of web scraping

It is often in the interest of website operators to limit the automated scraping of their online offers. Firstly, if large numbers of scrapers access the website, this can negatively affect its performance. Secondly, there are often internal areas of a website that should not appear in search results.

The robots.txt standard was established to limit scrapers’ access to websites. To do so, the website operator places a text file called robots.txt in the root directory of the website. In this file, there are specific entries that define which scrapers or bots are allowed to access which areas of the website. The entries in the robots.txt file always apply to the entire domain.

The following is an example of a robots.txt file that disallows scraping by any bot across the entire website:

# Any bot
User-agent: *
# Disallow for the entire root directory
Disallow: /

Adhering to the restrictions laid out in the robots.txt file is completely voluntary. The bots are supposed to comply with the specifications, but technically, this cannot be enforced. Therefore, to effectively regulate web scrapers’ access to their websites, website operators also use more aggressive techniques. These techniques include limiting throughput and blocking the IP addresses of scrapers that repeatedly access the site in violation of the specifications.
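
A well-behaved scraper can check these specifications itself before requesting a page. Python's standard library includes the module urllib.robotparser for this purpose; the bot name 'MyBot' below is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the example above, as a list of lines
robots_txt = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(robots_txt)

# Check whether a hypothetical bot may fetch a given URL
allowed = parser.can_fetch("MyBot", "https://www.example.com/page.html")
print(allowed)
```

Since the example file disallows the entire root directory for every bot, the check returns False for any URL on the site.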

APIs as an alternative to web scraping

While web scraping can be useful, it is not the preferred method for obtaining data from websites. There is often a better way to get this done. Many website operators provide their data in a structured, machine-readable format. This data is accessed via special interfaces called application programming interfaces (APIs).

There are significant advantages to using an API:

  • The API is explicitly made available by the provider for the purpose of accessing the data: There are fewer legal risks, and it is easier for the provider to control access to the data. For example, an API key may be required to access the data. The provider can also limit throughput more precisely.
  • The API delivers the data directly in a machine-readable format: This eliminates the need to tediously extract the data from the source code. In addition, the data structure is separate from its graphical presentation. The structure therefore remains the same even if the website design is changed.

If there is an API available that provides access to all the data, this is the preferred way to access it. However, scraping can in principle be used to retrieve all text presented in a human-readable format on web pages.
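
To illustrate the difference, here is what the used-car data from the earlier example might look like when delivered by an API as JSON. The response shown is made up for illustration:

```python
import json

# Hypothetical JSON response from a used-car API
api_response = '{"cars": [{"make": "Volkswagen", "model": "Beetle", "build": 1973, "price": 14998}]}'

# The data is already machine-readable: no HTML parsing is needed
data = json.loads(api_response)
for car in data["cars"]:
    print(car["make"], car["model"], car["price"])
```

Because the structure is defined by the API rather than by the page layout, this code keeps working even if the website is redesigned.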

Python web scraping tools

In the Python ecosystem, there are several well-established tools for executing a web scraping project:

  • Scrapy
  • Selenium
  • BeautifulSoup

In the following, we will go over the advantages and disadvantages of each of these three tools.

Web scraping with Scrapy

The Python web scraping tool Scrapy uses an HTML parser to extract information from the HTML source code of a page. This results in the following schema illustrating web scraping with Scrapy:

URL → HTTP request → HTML → Scrapy

The core concept for scraper development with Scrapy is the web spider. These are small programs based on Scrapy. Each spider is programmed to scrape a specific website and crawls across the web from page to page, as a spider is wont to do. Object-oriented programming is used for this purpose: each spider is its own Python class.

In addition to the core Python package, the Scrapy installation comes with a command-line tool. The spiders are controlled using this Scrapy shell. In addition, existing spiders can be uploaded to the Scrapy cloud. There the spiders can be run on a schedule. As a result, even large websites can be scraped without having to use your own computer and home internet connection. Alternatively, you can set up your own web scraping server using the open-source software Scrapyd.

Scrapy is a sophisticated platform for performing web scraping with Python. The architecture of the tool is designed to meet the needs of professional projects. For example, Scrapy contains an integrated pipeline for processing scraped data. Page retrieval in Scrapy is asynchronous, which means that multiple pages can be downloaded at the same time. This makes Scrapy well suited for scraping projects in which a high volume of pages needs to be processed.

Web scraping with Selenium

The free-to-use software Selenium is a framework for automated software testing of web applications. While it was originally developed to test websites and web apps, the Selenium WebDriver with Python can also be used to scrape websites. Although Selenium itself is not written in Python, the software’s functions can be accessed from Python.

Unlike Scrapy or BeautifulSoup, Selenium does not use the page’s HTML source code. Instead, the page is loaded in a browser without a user interface. The browser interprets the page’s source code and generates a Document Object Model (DOM). This standardised interface makes it possible to test user interactions. For example, clicks can be simulated and forms can be filled out automatically. The resulting changes to the page are reflected in the DOM. This results in the following schema illustrating web scraping with Selenium:

URL → HTTP request → HTML → Selenium → DOM

Since the DOM is generated dynamically, Selenium also makes it possible to scrape pages with content created in JavaScript. Being able to access dynamic content is a key advantage of Selenium. Selenium can also be used in combination with Scrapy or BeautifulSoup. Selenium delivers the source code, while the second tool parses and analyses it. This results in the following schema:

URL → HTTP request → HTML → Selenium → DOM → HTML → Scrapy/BeautifulSoup

Web scraping with BeautifulSoup

BeautifulSoup is the oldest of the Python web scraping tools presented here. Like Scrapy, it is also an HTML parser. This results in the following schema illustrating web scraping with BeautifulSoup:

URL → HTTP request → HTML → BeautifulSoup

Unlike Scrapy, developing scrapers with BeautifulSoup does not require object-oriented programming. Instead, scrapers are coded as a simple script. Using BeautifulSoup is thus probably the easiest way to fish specific information out of the “tag soup”.

Comparison of Python web scraping tools

Each of the three tools we have covered has its advantages and disadvantages. In the table below, you will find an overview summarising them:

                                     Scrapy   Selenium   BeautifulSoup
Easy to learn                        ++       +          +++
Accesses dynamic content             ++       +++        +
Creates complex applications         +++      +          ++
Able to cope with HTML errors        ++       +          +++
Optimised for scraping performance   +++      +          +
Strong ecosystem                     +++      +          ++

Summary

So, which tool should you use for your project? To put it briefly, if you want the development process to go quickly or if you want to familiarise yourself with Python and web scraping first, you should use BeautifulSoup. Now, if you want to develop sophisticated web scraping applications in Python and have the necessary know-how to do so, you should opt for Scrapy. However, if your primary goal is to scrape dynamic content with Python, you should go for Selenium.

Tutorial on web scraping with Python and BeautifulSoup

Here we will show you how to extract data from a website with BeautifulSoup. First, you will need to install Python and a few tools. The following is required:

  • Python version 3.4 or higher,
  • the Python package manager pip, and
  • the venv module.

Please follow the installation instructions found on the Python installation page.

Alternatively, if you have the free-to-use package manager Homebrew installed on your system, you can install Python with the following command:

brew install python

Note

The following code and explanations were written in Python 3 on macOS. In principle, the code should run on other operating systems. However, you may have to make some modifications, especially if you are using Windows.

Setting up a Python web scraping project on your own device

Here, we are going to create the project folder web Scraper for the Python tutorial on the desktop. Note that the folder name contains a space, so it must be quoted or escaped in shell commands. Open the command-line terminal (e.g. Terminal.app on Mac). Then, copy the following lines of code into the terminal and execute them.

# Switch to the desktop folder
cd ~/Desktop/
# Create project directory
mkdir "./web Scraper/" && cd "./web Scraper/"
# Create virtual environment
# Ensures for instance that pip3 is used later
python3 -m venv ./env
# Activate virtual environment
source ./env/bin/activate
# Install packages
pip install requests
pip install beautifulsoup4

Scraping quotes and authors using Python and BeautifulSoup

The website Quotes to Scrape provides a selection of quotes. This is a service provided specifically for scraping tests. So, there is no need to worry about violating the terms of use.

Let us begin. Open the command-line terminal (e.g. Terminal.app on Mac) and launch the Python interpreter from your Python project folder web Scraper. Copy the following lines of code into the terminal and execute them:

# Switch to the project directory
cd ~/Desktop/web\ Scraper/
# Activate virtual environment
source ./env/bin/activate
# Launch the Python interpreter
# Since we are in the virtual environment, Python 3 will be used
python

Now, copy the following code into the Python interpreter in the command-line terminal. Then, press Enter (several times if necessary) to execute the code. You can also save the code as a file called scrape_quotes.py in your project folder web Scraper. In this case, you can run the Python script using the command python scrape_quotes.py.

Executing the code should result in a file called quotes.csv being created in your Python project folder web Scraper. This will be a table containing the quotes and authors. You can open this file with any spreadsheet program.

# Import modules
import requests
import csv
from bs4 import BeautifulSoup
# Website address
url = "http://quotes.toscrape.com/"
# Execute GET request
response = requests.get(url)
# Parse the HTML document from the source code using BeautifulSoup
html = BeautifulSoup(response.text, 'html.parser')
# Extract all quotes and authors from the HTML document
quotes_html = html.find_all('span', class_="text")
authors_html = html.find_all('small', class_="author")
# Create a list of the quotes
quotes = list()
for quote in quotes_html:
    quotes.append(quote.text)
# Create a list of the authors
authors = list()
for author in authors_html:
    authors.append(author.text) 
# For testing: combine the entries from both lists and output them
for t in zip(quotes, authors):
    print(t)
# Save the quotes and authors in a CSV file in the current directory
# Open the file using Excel, LibreOffice, etc.
with open('./quotes.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file, dialect='excel')
    csv_writer.writerows(zip(quotes, authors))

Using Python packages for web scraping

Every web scraping project is different. Sometimes, you just want to check the website for any changes. Other times, you are looking to perform complex analyses. With Python, you have a wide selection of packages at your disposal.

  1. Use the following code in the command-line terminal to install packages with pip3.
pip3 install <package>
  2. Integrate modules in the Python script with import.
from <package> import <module>

The following packages are often used in web scraping projects:

Package     Use
venv        Manage a virtual environment for the project
requests    Request websites
lxml        Use alternative parsers for HTML and XML
csv         Read and write spreadsheet data in CSV format
pandas      Process and analyse data
scrapy      Use Scrapy
selenium    Use Selenium WebDriver

Tip

Use the Python Package Index (PyPI) for an overview of available Python packages.
