Search engines like Google have been using so-called crawlers for a long time. Crawlers search the internet for user-defined terms. They are special types of bots that visit website after website to establish and categorise associations with search results. The first crawler was released in 1993, when the first search engine, JumpStation, was launched.

Web scraping or web harvesting is a crawling technique. We explain how it works, why it’s used, and how it can be blocked if necessary.

Web scraping: a definition

During the process of web scraping, data is extracted from websites and stored so that it can be analysed or otherwise exploited. Many different types of information are collected when scraping – for instance, contact data such as email addresses or telephone numbers, as well as individual search terms or URLs. The results are then collected in local databases or tables.

Definition

During web scraping, texts are read from websites in order to obtain and store information. This is comparable to an automatic copy-and-paste process. For image searches, this technique is referred to as image scraping.
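This “automatic copy-and-paste” idea can be sketched in a few lines of Python using only the standard library. The sample HTML and the contact details in it are invented for illustration; in practice, the page content would first be fetched from a live website.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would fetch this from
# a live website first (e.g. with urllib.request).
SAMPLE_HTML = """
<html><body>
  <h1>Contact</h1>
  <p>Email: info@example.com</p>
  <p>Phone: 0123 456789</p>
</body></html>
"""

class TextScraper(HTMLParser):
    """Collects the visible text of a page – an automated copy-and-paste."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text found between the HTML tags.
        text = data.strip()
        if text:
            self.chunks.append(text)

scraper = TextScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.chunks)
# ['Contact', 'Email: info@example.com', 'Phone: 0123 456789']
```

The collected chunks could then be written to a local table or database, as described above.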

How web scraping works

There are different approaches to scraping, but a distinction is generally made between manual and automatic scraping. Manual scraping refers to the manual copying and pasting of information and data. This is rather like cutting out and collecting newspaper articles. Manual scraping is only performed when certain pieces of information are to be obtained and stored. It’s a highly effort-intensive process that is rarely used for large quantities of data.

Automatic scraping is when software or an algorithm is used to search through multiple websites and extract information. Depending on the type of website and content, special software is available for this purpose. A number of approaches exist for automatic scraping:

  • Parsers: A parser is used to convert text into a new structure. In HTML parsing, for example, the software reads an HTML document and stores the information it contains. DOM parsing uses the client-side rendering of content in the browser to extract data.
  • Bots: A bot is a piece of software dedicated to performing certain tasks automatically. Bots can be used in web harvesting to automatically search through websites and collect data.
  • Text: Anyone proficient with the command line can use Unix grep, or regular expressions in languages such as Python or Perl, to comb web pages for certain terms. This is a very simple method of scraping data, but it takes more work than using dedicated software.
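The text-based approach in the last bullet can be illustrated with a short sketch: instead of calling grep on the command line, a regular expression in Python combs a piece of page text for email addresses. The sample text and the deliberately simple email pattern are assumptions for illustration, not a production-grade matcher.

```python
import re

# Hypothetical raw page text, as a stand-in for a downloaded website.
page_text = "Support: support@example.com | Sales: sales@example.org"

# A deliberately simple email pattern, for illustration only.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", page_text)
print(emails)  # ['support@example.com', 'sales@example.org']
```

The same pattern idea works for telephone numbers, URLs, or any other text-based search term.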
Note

In our tutorial, we show you what to keep in mind when web scraping with Python. Selenium WebDriver can be easily integrated into this process to collect data.

Why is web scraping used?

Web scraping is used for a range of tasks. For example, it allows contact details or specific information to be collected quickly. Scraping is commonplace in a professional context as a way to gain advantages over competitors. Data harvesting enables a company to view all of a competitor’s products and compare them with its own. Web scraping can also be helpful with financial data: the information is read from an external website, placed in a tabular format, and then analysed or further processed.
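Placing harvested information in a tabular format can be as simple as writing the records out as CSV. A minimal sketch, assuming a list of hypothetical product records that have already been scraped:

```python
import csv
import io

# Hypothetical records, e.g. harvested from a competitor's product pages.
products = [
    {"name": "Widget A", "price": "19.99"},
    {"name": "Widget B", "price": "24.50"},
]

# Write the records as a CSV table (io.StringIO stands in for a file here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(products)

print(buffer.getvalue())
```

From here, the table can be saved to disk or loaded into a spreadsheet or database for comparison and analysis.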

A good example of web scraping is Google. The search engine uses the technology to display weather information or price comparisons for hotels and flights. Many common price comparison portals also practise scraping to show information from many different websites and providers.

Is web scraping legal?

Scraping is not always legal, and scrapers must first consider the copyrights of a website. For some web shops and providers, web scraping can certainly have negative consequences – for example, if the page ranking suffers as a result of aggregators. From time to time, companies sue comparison portals to compel them to cease web scraping. In one such case, however, the US Ninth Circuit Court of Appeals ruled that scraping was not illegal and did not violate anti-hacking laws where the information was freely accessible. However, companies are at liberty to install technical measures to prevent scraping.

In other words, scraping is legal when the extracted data is freely accessible to third parties on the internet. To stay on the right side of the law, it’s important to consider the following points when web scraping:

  • Consider and observe copyright. If data is copyright-protected, it may not be published elsewhere.
  • Website operators have a right to install technical measures to prevent web scraping. These must not be circumvented.
  • If use of the data is tied to a user registration or usage agreement, this data may not be scraped.
  • Scraping technology must not be used to hide advertising, general terms and conditions, or disclaimers.

Although scraping is permitted in many cases, it can certainly lead to destructive consequences or even be misused for illegal purposes. The technology is often used for spam, for example: spammers can harvest email addresses and then send spam emails to those recipients.

How to block web scraping

To prevent web scraping, website operators can take a range of different measures. The robots.txt file is used, for example, to tell search engine bots which pages they may not crawl; because well-behaved bots honour it, this also prevents automatic scraping by such software bots. IP addresses belonging to bots can also be blocked. Contact data and personal information can be concealed, and sensitive data such as telephone numbers can be stored as an image or assembled via CSS, reducing the effectiveness of data scraping. Moreover, there are many providers of anti-bot services that can set up a firewall for a fee. Google Search Console can also be used to configure notifications that inform website operators if their data has been scraped.
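Python’s standard library includes a robots.txt parser, which shows the convention from the scraper’s side: a well-behaved bot checks the file before fetching a page. The robots.txt content and the URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that asks all bots to avoid /private/.
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite bot checks each URL before fetching it.
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # False
print(parser.can_fetch("*", "https://example.com/public/page.html"))   # True
```

Note that robots.txt is only a convention: it blocks cooperative bots, which is why operators combine it with IP blocking and the other measures above.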

Please note the legal disclaimer relating to this article.
