11.1. Scraping Tips and Tricks

11.1.1. Manage your Webdrivers with webdriver-manager

When using Selenium with Python, you probably did the following:

  • Download the ChromeDriver binary

  • Unzip it

  • Set the path to the driver

This is annoying:

  • The path can change

  • You have to manage the browser drivers for each OS

  • You have to check whether new driver releases are available

Instead of doing this manually, use webdriver-manager.

It makes managing binaries for different browsers easy.

webdriver-manager downloads binaries automatically for you.

So you don’t have to go through the pain of doing it manually.

To use it in your project, see the example below. It’s straightforward and saves you time and energy, especially when you integrate Selenium into your CI/CD pipeline.

By default, webdriver-manager installs the latest version.

But you can also define a specific version of the driver.

!pip install webdriver-manager
# Old way
from selenium import webdriver
driver = webdriver.Chrome('path/to/driver.exe')

# New way with webdriver-manager and Selenium 4
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# New way with webdriver-manager and Selenium 3
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

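# Use a specific driver version
# (executable_path is deprecated in Selenium 4 and removed in later 4.x releases, so pass a Service)
driver = webdriver.Chrome(service=Service(ChromeDriverManager("<your_version>").install()))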
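If the same script has to run under both Selenium 3 and Selenium 4, one option is to pick the call style at runtime. The sketch below is only an illustration and not part of webdriver-manager: `uses_service_api` and `make_chrome_driver` are hypothetical helper names, and it assumes `selenium.__version__` starts with the major version number.

```python
def uses_service_api(selenium_version):
    """Return True when the version string belongs to Selenium 4 or newer."""
    major = int(selenium_version.split(".")[0])
    return major >= 4


def make_chrome_driver():
    """Create a Chrome driver with whichever call style the installed Selenium expects."""
    # Imported lazily so uses_service_api() can be used without Selenium installed.
    import selenium
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    driver_path = ChromeDriverManager().install()
    if uses_service_api(selenium.__version__):
        # Selenium 4 style: wrap the driver path in a Service object
        from selenium.webdriver.chrome.service import Service
        return webdriver.Chrome(service=Service(driver_path))
    # Selenium 3 style: pass the driver path directly
    return webdriver.Chrome(driver_path)
```

Calling `make_chrome_driver()` then behaves like the two "new way" variants above, depending on which Selenium version is installed.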

11.1.2. Speed up your Scraping by Disabling Image Loading

Do you want to speed up your web scraper?

Disable image loading!

Disabling image loading while scraping is a great way to speed up your scraper.

Otherwise, you waste a lot of bandwidth downloading images you never use.

To disable image loading in Selenium, you only have to set one option (like below).

This will save you time and money.

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()

chrome_options.add_argument('--blink-settings=imagesEnabled=false')

# Selenium 4 removed the chrome_options keyword; pass the options object as options=
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.instagram.com/")
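To check how much the option helps for a particular page, you can time a page load with and without it. The following is a rough sketch, not a benchmark: `chrome_args` and `time_page_load` are hypothetical helper names, and it assumes a Chrome driver is already available to Selenium.

```python
import time

IMAGE_OFF_FLAG = "--blink-settings=imagesEnabled=false"


def chrome_args(disable_images):
    """Return the Chrome switches to use for this run."""
    return [IMAGE_OFF_FLAG] if disable_images else []


def time_page_load(url, disable_images):
    """Load url once and return the elapsed seconds of driver.get()."""
    # Imported lazily so chrome_args() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_args(disable_images):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        start = time.perf_counter()
        driver.get(url)
        return time.perf_counter() - start
    finally:
        # Always close the browser, even if the page load fails
        driver.quit()
```

Comparing `time_page_load(url, True)` with `time_page_load(url, False)` over several runs gives a rough idea of the saving. Single measurements vary a lot with network conditions and caching.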

11.1.3. AI-Powered Web Scraper with scrapegraph-ai

Do you want to let AI scrape your website?

Use scrapegraph-ai.

This library uses LLMs and direct graph logic to scrape websites: you only describe the information you need.

See below, where we give it a prompt and a URL.

It also supports a multi-page scraper that extracts information from the top n search results of a search engine.

!pip install scrapegraphai
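import os

from scrapegraphai.graphs import SmartScraperGraph

# Assumption: the API key is stored in the OPENAI_API_KEY environment variable
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]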

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
        "temperature":0,
    },
    "verbose":True,
}

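smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

# Run the scraping pipeline and print the extracted result
result = smart_scraper_graph.run()
print(result)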

'''
Output
{
  "projects": [
    {
      "title": "Rotary Pendulum RL",
      "description": "Open Source project aimed at controlling ..."
    },
    {
      "title": "DQN Implementation from scratch",
      "description": "Developed a Deep Q-Network algorithm to train a ..."
    },
    {
      "title": "Multi Agents HAED",
      "description": "University project which focuses ...."
    },
    {
      "title": "Wireless ESC for Modular Drones",
      "description": "Modular drone architecture ..."
    }
  ]
}
'''
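The multi-page scraper mentioned above can be sketched in the same way. This is a minimal sketch under some assumptions: that the class is exposed as `SearchGraph` in `scrapegraphai.graphs`, that it takes only a prompt and a config (no source URL), and that the `max_results` config key controls how many search results are scraped. `build_graph_config` and `search_and_scrape` are hypothetical helper names.

```python
def build_graph_config(api_key, model="gpt-3.5-turbo", max_results=2):
    """Build a config dict shaped like the graph_config used for SmartScraperGraph."""
    return {
        "llm": {
            "api_key": api_key,
            "model": model,
            "temperature": 0,
        },
        # Assumed option name: number of search results to scrape
        "max_results": max_results,
        "verbose": True,
    }


def search_and_scrape(prompt, config):
    """Run a search-based, multi-page scrape and return the extracted answer."""
    # Imported lazily so build_graph_config() works without scrapegraphai installed.
    # Assumption: SearchGraph takes a prompt and a config, and exposes run().
    from scrapegraphai.graphs import SearchGraph

    search_graph = SearchGraph(prompt=prompt, config=config)
    return search_graph.run()
```

A call such as `search_and_scrape("List open-source reinforcement learning projects.", build_graph_config(OPENAI_API_KEY))` would then query a search engine first and scrape the top results instead of a single URL.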