Website scraping, or content scraping, is the practice of automatically extracting content from a web page. Scraping is typically performed by a program that sends a series of automated requests to the site. This program is often called a web scraping bot or content scraping bot.
It’s important to first understand that web scraping, on its own, isn’t necessarily illegal; it is better thought of as a gray area. If the bot only scrapes content that the website makes publicly available, it isn’t doing anything inherently malicious.
However, attackers and cybercriminals also use web scraping for malicious purposes, such as stealing hidden or unpublished information or republishing your content somewhere else.
Essentially, preventing website scraping is about making it more difficult for bots and scripts to extract the target content or data from your website. However, we also need to consider real users and search engine bots, and make sure it isn’t difficult for them to access that content.
Unfortunately, this can be easier said than done, and to do it effectively, it helps to know how these scrapers work so we can understand what prevents them from working optimally.
How Web Scraping Works: Different Types
There are several different techniques for performing web scraping, and therefore several different types of web scraper bots:
- Spiders
The ‘standard’ type. Googlebot belongs in this category, as do common website copiers like HTTrack. Spider bots simply follow links on your website to other pages until they find the target data. They can also use an HTML parser to extract the target data from each page.
- HTML parsers
This type of web scraper extracts information from your pages based on patterns in your HTML. These bots typically ignore everything except the HTML markup to stay efficient.
- Shell scripts
These scrapers use common Unix tools to extract data, for example curl or wget to download pages and grep to pull out the data, all tied together with a shell script (hence the name). They are very simple to use, but also very easy to stop.
- Screen scrapers
This type of bot actually opens your website in a real browser before extracting the desired content from the page by:
- Taking a screenshot of the rendered page, and then using Optical Character Recognition (OCR) tools to extract the text content. This is fairly difficult and expensive, so this type of attack is rare.
- Extracting the HTML from the browser, and then using an HTML parser to extract the target information or content.
As we can see, although the actual techniques are different, there are underlying similarities between these four methods, and so many web scrapers will behave similarly. Below we will discuss the methods of preventing web scraping while considering these four different techniques.
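To make the HTML parser approach concrete, here is a minimal sketch of how such a scraper might pull data out of a page using only Python’s standard library. The sample page and the `post-title` class name are assumptions made up for illustration; a real scraper would target whatever patterns your own markup exposes.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of every <h2 class="post-title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "post-title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that sits inside a matching <h2>.
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Hypothetical page used only for this example.
SAMPLE_PAGE = """
<html><body>
  <h2 class="post-title">First post</h2>
  <p>Body text the scraper ignores.</p>
  <h2 class="post-title">Second post</h2>
</body></html>
"""

if __name__ == "__main__":
    print(extract_titles(SAMPLE_PAGE))
```

Note how tightly the scraper is coupled to one specific pattern in the markup: it finds nothing on a page that does not use that exact tag and class.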
How To Prevent Website Scraping
1. Bot Management Software
Since bots are the main culprit behind web scraping attacks, we can effectively prevent website scraping if we can stop these malicious bot activities.
The thing is, we can’t simply rely on a free and obsolete bot mitigation solution due to two main challenges:
- We wouldn’t want to accidentally block traffic from good bots that are beneficial for our site, such as Google’s indexing bot. The bot detection solution must be able to differentiate effectively between good and bad bots.
- Today’s malicious bots can rotate between hundreds of user agents and IP addresses and may use AI technologies to imitate human behavior. Differentiating these bots from legitimate human users can be a major challenge.
Due to the sophistication of today’s scraping bots, a bot management solution capable of behavior-based detection in real time is now a necessity. DataDome is an advanced solution that uses AI and machine learning technologies to detect and manage bots in real time.
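As a small illustration of the first challenge above, here is a hedged sketch of one way to check whether a client claiming to be Googlebot really is one: a reverse DNS lookup on the IP, followed by a forward lookup on the returned hostname. This follows the general approach Google describes for verifying its crawlers, but the function name, the domain list, and the injectable lookup parameters are assumptions for this example, and it is in no way a substitute for a full bot management solution.

```python
import socket

# Assumption: reverse DNS names of Google's common crawlers end in one of these.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` resolves to a Google crawler hostname and back.

    The two lookup callables are injectable so the logic can be exercised
    without network access; by default they use the system resolver.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex only returns IPv4 addresses.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # The forward lookup must lead back to the original address,
        # otherwise the reverse DNS record could simply be forged.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers failed DNS lookups (socket.herror, socket.gaierror).
        return False
```

A user agent string alone proves nothing, since any bot can send `Googlebot` in its headers; a check like this ties the claim to DNS records the impostor does not control.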
2. Monitor Your Traffic and Manage Unusual Activities
You should check your traffic logs regularly and use appropriate analytics tools to look for unusual activity, such as a sudden increase in page views, an increase in bounce rate (web scrapers tend to visit only the target page and not the other pages), or a slowdown in site speed, among other signs.
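The traffic log check described above can be sketched roughly as follows. This example assumes access log lines whose first whitespace-separated field is the client IP address (as in the common log format); the function name, the sample lines, and the threshold are all hypothetical and would need tuning against your own traffic.

```python
from collections import Counter


def flag_heavy_clients(log_lines, max_requests):
    """Return {ip: request_count} for clients exceeding max_requests."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return {ip: n for ip, n in counts.items() if n > max_requests}


# Made-up log excerpt: one client requests many pages in a few seconds.
SAMPLE_LOG = [
    f'203.0.113.5 - - [10/Oct/2023:13:55:{36 + i} +0000] '
    f'"GET /pricing?page={i} HTTP/1.1" 200 512'
    for i in range(4)
] + [
    '198.51.100.7 - - [10/Oct/2023:13:55:41 +0000] "GET / HTTP/1.1" 200 1024',
]

if __name__ == "__main__":
    print(flag_heavy_clients(SAMPLE_LOG, max_requests=3))
```

Counting per IP address is only a starting point, since sophisticated bots rotate addresses, but it is often enough to surface the simpler scrapers.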
You have several options for mitigating web scrapers once you’ve detected their presence:
- Only allow these scrapers to perform a limited number of requests and actions in a certain time frame. For example, allow only three login attempts, or only a certain number of searches per second from one user agent, and so on.
- If you are 100% sure about the presence of malicious scrapers, you can block them altogether. When you do block, however, make sure to use a nondescript error message so you don’t tell the scraper what triggered the block.
- Use CAPTCHA or other challenge-based approaches to test whether the client is actually a web scraper bot. Because CAPTCHA farm services exist, however, you can’t rely on this approach alone.
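The rate-limiting option above can be illustrated with a simple sliding-window limiter. This is a minimal in-memory sketch, not a production implementation: the class name, the choice of client identifier, and the limits are assumptions for the example, and a real deployment would usually keep this state in a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id):
        now = self._clock()
        hits = self._hits[client_id]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
    for attempt in range(1, 5):
        print(attempt, limiter.allow("203.0.113.5"))
```

When `allow` returns False, the server would answer with a generic error rather than explaining which limit was hit, in line with the advice on nondescript error messages.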
Another important consideration when preventing website scraping is not to give the scraper bot a clear path to all of your content in one place. For example, a common mistake is to have a single page listing all your blog posts. Instead, make the posts accessible only through the search function, so that a bot must try every potential query if it wants to reach all of your content.
Web scraper bots run on limited, costly resources, so the idea is that if you slow them down, the operator may simply give up and move on from your website. Our job in preventing website scraping is to make it as hard as possible for these scrapers to extract your content or data, while keeping it easy enough for human users to access that same content.