11.1. Scraping Tips and Tricks

11.1.1. Manage your Webdrivers with `webdriver-manager`

When using Selenium with Python, you have probably done the following:

1. Download the ChromeDriver binary
2. Unzip it
3. Set the path to the driver

This is annoying for several reasons:

- The path to the driver can change
- You have to manage the browser drivers for each OS
- You have to check whether new driver versions have been released

Instead of doing this manually, use webdriver-manager.

It downloads the driver binaries for different browsers automatically, so you don't have to manage them by hand.

Using it in your project is straightforward (see the example below), and it saves you time and energy, especially when you integrate Selenium into your CI/CD pipeline.

By default, webdriver-manager installs the latest driver version, but you can also pin a specific version.

!pip install webdriver-manager

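# Old way: point Selenium at a manually downloaded driver
# (passing the path positionally works in Selenium 3 only)
from selenium import webdriver
driver = webdriver.Chrome('path/to/driver.exe')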

# New way with webdriver-manager and Selenium 4
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# New way with webdriver-manager and Selenium 3
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

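# Use a specific driver version (Selenium 4 style, using the imports above).
# The version is passed positionally because the keyword name differs between
# webdriver-manager releases (`version` in 3.x, `driver_version` in 4.x).
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager("<your_version>").install())
)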
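
In a CI/CD pipeline it can be handy to pin the driver version from outside the code. The following is a minimal sketch of one way to do that, not part of webdriver-manager itself: the environment variable name `CHROMEDRIVER_VERSION` and both helper functions are hypothetical, and the `driver_version` keyword assumes webdriver-manager 4.x.

```python
import os


def manager_kwargs(env=None):
    """Return keyword arguments for ChromeDriverManager based on the environment."""
    env = os.environ if env is None else env
    # CHROMEDRIVER_VERSION is a hypothetical variable name chosen for this sketch.
    version = env.get("CHROMEDRIVER_VERSION", "").strip()
    # An empty dict means "let webdriver-manager pick the latest driver".
    return {"driver_version": version} if version else {}


def make_chrome_driver(env=None):
    """Start Chrome with a driver resolved by webdriver-manager (Selenium 4 style)."""
    # Imported lazily so manager_kwargs() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    driver_path = ChromeDriverManager(**manager_kwargs(env)).install()
    return webdriver.Chrome(service=Service(driver_path))
```

With this in place, a pipeline job could export `CHROMEDRIVER_VERSION` to pin the driver, while local runs without the variable fall back to the latest version.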

11.1.2. Speed up your Scraping by disabling image loading

Do you want to speed up your web scraper?

Disable image loading!

A scraper usually only needs the page's HTML, so downloading every image wastes a lot of bandwidth and slows down each page load.

To disable image loading in Selenium, you only have to set one Chrome option, as shown below.

This saves you time and, if you pay for traffic, money.

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()

chrome_options.add_argument('--blink-settings=imagesEnabled=false')

# `options=` replaces the deprecated `chrome_options=` keyword
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.instagram.com/")
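
To check whether the option actually pays off for a given site, you could time a page load with and without images. The following is a rough sketch under stated assumptions: the helper names `chrome_arguments` and `timed_page_load` are made up for illustration, and it assumes a Selenium 4 setup that can start Chrome on its own.

```python
import time


def chrome_arguments(disable_images=True):
    """Return the Chrome command-line switches for a scraping session."""
    args = []
    if disable_images:
        # Same switch as in the example above: tells Blink not to load images.
        args.append("--blink-settings=imagesEnabled=false")
    return args


def timed_page_load(url, disable_images=True):
    """Load `url` once and return the elapsed wall-clock time in seconds."""
    # Imported lazily so chrome_arguments() works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(disable_images):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        start = time.perf_counter()
        driver.get(url)  # blocks until the page has finished loading
        return time.perf_counter() - start
    finally:
        driver.quit()  # always release the browser process
```

Comparing `timed_page_load(url, True)` with `timed_page_load(url, False)` over several runs gives a rough idea of the gain. Single runs are noisy because of caching and network variance.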

11.1.3. AI-Powered Web Scraper with `scrapegraph-ai`

Do you want to let AI scrape a website for you?

Use scrapegraph-ai.

This library uses LLMs and direct graph logic to scrape websites: you describe the information you need, and it extracts only that.

In the example below, we give it a prompt and a URL.

It also supports a multi-page scraper that extracts information from the top n results of a search engine.

!pip install scrapegraphai

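import os

from scrapegraphai.graphs import SmartScraperGraph

# Read the OpenAI key from the environment instead of hard-coding it
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]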

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
        "temperature":0,
    },
    "verbose":True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions.",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

# Run the graph and print the extracted data
result = smart_scraper_graph.run()
print(result)

'''
Output
{
  "projects": [
    {
      "title": "Rotary Pendulum RL",
      "description": "Open Source project aimed at controlling ..."
    },
    {
      "title": "DQN Implementation from scratch",
      "description": "Developed a Deep Q-Network algorithm to train a ..."
    },
    {
      "title": "Multi Agents HAED",
      "description": "University project which focuses ...."
    },
    {
      "title": "Wireless ESC for Modular Drones",
      "description": "Modular drone architecture ..."
    }
  ]
}
'''
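
Because the structure of the result depends on what the LLM returns, it is worth reading it defensively. The following is a small sketch that assumes the result is a dict shaped like the output above. The helper `project_rows` is hypothetical, and the sample data is copied from that output with its truncated descriptions left as they are.

```python
def project_rows(result):
    """Flatten a result dict into a list of (title, description) tuples."""
    rows = []
    # .get() guards against the model omitting the top-level key.
    for project in result.get("projects", []):
        title = project.get("title", "").strip()
        description = project.get("description", "").strip()
        if title:  # skip entries without a usable title
            rows.append((title, description))
    return rows


# Sample data copied from the output shown above.
sample = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling ...",
        },
        {
            "title": "Wireless ESC for Modular Drones",
            "description": "Modular drone architecture ...",
        },
    ]
}

for title, description in project_rows(sample):
    print(f"{title}: {description}")
```

The same function could be applied to the value returned by `smart_scraper_graph.run()`, provided the model followed the same layout.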
