
Web Scraping Without Getting Blocked


Introduction

We often need to prove "I'm not a robot" on the internet. That's because there are plenty of bots out there, actively doing nasty stuff like spamming forms, clicking like buttons, and so on.

However, in this process many harmless spiders get blocked too, even though they are only trying to access public data, which is not illegal. Public data should be accessible to anyone, regardless of whether the visitor is a human or a bot.

This article highlights a handful of things you should do to avoid getting blacklisted while scraping a webpage. Note that it is intended for responsible scraping, and should not be taken as a guide to doing anything illegal.

Send headers with your request

If a server receives a request without any headers, that is reason enough for it to ignore the request, because it indicates the request is not coming from a browser. Every modern browser sends headers like User-Agent and Cookie (if the server has set one), which isn't the case when you send a bare, headless HTTP request from your Python environment.

You can check the requests sent by your browser in the network tab:

Request headers

Let us try to replicate this request using Python:

import requests

HEADERS = { 
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding':'gzip, deflate',
    'Accept-Language':'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'Cache-Control':'no-cache',
    'Connection':'keep-alive',
    'DNT':'1',
    'Host':'example.com',
    'Pragma':'no-cache',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
}

response = requests.get(url="https://example.com", headers=HEADERS)
print(response)  # <Response [200]>

In many cases, you need to use the Cookie header to authorize your request. You can copy its value from the browser and then reuse it from your Python environment.
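As a rough sketch (the cookie string and the /account path below are made-up placeholders; copy the real Cookie value from your browser's network tab):

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    # placeholder cookie -- replace with the value your browser actually sends
    'Cookie': 'sessionid=abc123; csrftoken=xyz789',
}

# a page that normally requires you to be logged in (hypothetical path)
response = requests.get("https://example.com/account", headers=HEADERS)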

Proxies

When a website receives malicious traffic from a particular IP, it makes sense for them to blacklist that IP.

This is where proxy IPs come into play. You can compile a list of these proxy addresses and then simply keep rotating through them. If one of the proxy IPs gets blocked, you can always switch to another.

You can pass a dictionary of proxies to requests.get (the same proxies argument works with any method in the requests library):

import requests

PROXIES = {
    'http': 'http://144.91.78.58:80',
    'https': 'http://144.91.78.58:80'
}
response = requests.get(url="https://example.com", proxies=PROXIES)

The proxy used in this example might not work by the time you run it. Fortunately, there are many freemium websites that offer lists of SSL proxies.

For your production environment, you can purchase premium proxies to ensure good connectivity, speed, uptime, etc.
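If you do maintain such a list, a minimal rotation sketch could look like this (the addresses are placeholders, and get_with_rotating_proxy is just an illustrative helper, not part of any library):

import random
import requests

# placeholder proxy addresses -- replace with working ones
PROXY_LIST = [
    "http://144.91.78.58:80",
    "http://51.158.68.133:8811",
]

def get_with_rotating_proxy(url):
    # try the proxies in random order until one of them responds
    for proxy in random.sample(PROXY_LIST, len(PROXY_LIST)):
        proxies = {'http': proxy, 'https': proxy}
        try:
            return requests.get(url, proxies=proxies, timeout=10)
        except requests.exceptions.RequestException:
            continue  # proxy is dead or blocked, move on to the next one
    raise RuntimeError("All proxies failed")

response = get_with_rotating_proxy("https://example.com")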

Time interval

Putting a time interval between two consecutive requests is the easiest thing you can do to avoid getting blocked. Make your script sleep for a few seconds between requests; it works even better if the interval is random, which gives your requests a more human touch.

import requests
import random
import time

for n in range(0, 10):
    response = requests.get(url=f"https://example.com/p/{n}")
    # sleep for a random number of seconds before the next request
    time.sleep(random.randint(2, 15))

When time is not a critical factor, consider increasing this interval even further.

Switch UA

The User-Agent header helps the server identify the source of a request: the browser, the device, and the operating system it runs on. Different browsers running on different operating systems have different user-agent strings. Nor is this limited to browsers; practically every HTTP client identifies itself this way, including the crawlers run by Google and Bing.

You can check your user agent here. If you access the page from a different browser or OS, the value will look different. You can spoof your user agent simply by customising the headers you send with your request. This comes in handy, for example, when accessing a mobile-only webpage.

This text file contains 1000 different user-agent strings, which you can use as the value of the User-Agent header like this:

import requests
import random

USER_AGENT_LIST = [
...
]

HEADERS = {
    'User-Agent': random.choice(USER_AGENT_LIST)
}
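Putting it together, assuming you have saved that list to a local file (the filename user_agents.txt is just an assumption), you can pick a fresh user agent for every request:

import random
import requests

# hypothetical filename -- save the linked text file locally first
with open("user_agents.txt") as f:
    USER_AGENT_LIST = [line.strip() for line in f if line.strip()]

for n in range(10):
    # use a different user agent for each request
    headers = {'User-Agent': random.choice(USER_AGENT_LIST)}
    response = requests.get(f"https://example.com/p/{n}", headers=headers)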

Headless scraping with Selenium

Primarily, Selenium is used for automated testing. It lets you automate a browser, so your requests look very much like a human browsing the web. More importantly, it lets you customise everything mentioned above: headers, the user agent, cookies. You can also use your own browser profile.

Running Selenium

First of all, you need a web driver; that's what controls the browser.

In this tutorial, we will be using chromedriver, an open-source web driver built by Google under the Chromium project. You can download it from here.

Note

Make sure you're downloading the right version of chromedriver (ideally same as your Chrome version).

Now we need to install the selenium library for Python:

pip install selenium

The following snippet will open a browser window and load Google (change the value of CHROMEDRIVER_PATH to the path of your chromedriver executable):

from selenium import webdriver

CHROMEDRIVER_PATH = "/path/to/chromedriver"

# launch Chrome through the chromedriver executable
driver = webdriver.Chrome(CHROMEDRIVER_PATH)
driver.get("https://google.com/")

Selenium in action
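To run this truly headlessly (no visible browser window), and to spoof the user agent from Selenium as well, you can pass Chrome options. A minimal sketch, assuming the same chromedriver setup as above (the user-agent string is just the example from earlier):

from selenium import webdriver

CHROMEDRIVER_PATH = "/path/to/chromedriver"

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window
options.add_argument("--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36")

driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)
driver.get("https://example.com/")
print(driver.page_source[:200])  # the rendered HTML is available as usual
driver.quit()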



Permanent link to this article
pygs.me/003
