5.4. Get Data

5.4.1. Scrape Data from Twitter and Other Social Networks with snscrape

Do you want to scrape Twitter data? Without restrictions?

Use snscrape!

snscrape is a Python scraper for social networking services. It collects information such as users, hashtags, threads, and likes.

It can also be used for other social network platforms like Instagram or Facebook.

snscrape offers both a command-line interface (CLI) and a Python module.

For example, to get up to 500 tweets posted by Elon Musk between January 1, 2022 and December 11, 2022, run the CLI command below. The --jsonl flag writes each tweet as one JSON object per line, and the output is redirected to a file.

!pip install snscrape
!snscrape --jsonl --max-results 500 --since 2022-01-01 twitter-search "from:elonmusk until:2022-12-11" > elon-tweets.json
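Because --jsonl writes one JSON object per line, the saved file is in JSON Lines format rather than a single JSON document, despite its .json extension. A minimal sketch for loading it back with the standard library could look like this; load_jsonl is a hypothetical helper name, not part of snscrape.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Only runs if the file written by the CLI command above is present
if Path("elon-tweets.json").exists():
    tweets = load_jsonl("elon-tweets.json")
    print(len(tweets), "tweets loaded")
```

If pandas is available, pd.read_json with lines=True reads the same file directly into a DataFrame.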

5.4.2. Scrape Google Play Reviews

Do you want to scrape Google Play Reviews?

Without effort?

Try google-play-scraper.

google-play-scraper is a Python library that retrieves app reviews from the Google Play Store.

Below you can see how easy it is to get reviews for the LinkedIn app by providing its ID (you can get the ID from the URL of the corresponding Play Store page).

  • You can sort the reviews by their date or relevance

  • You can filter by rating, country and language

!pip install google-play-scraper
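from google_play_scraper import Sort, reviews

# reviews() returns the list of reviews and a continuation token for the next batch
result, _ = reviews(
    'com.linkedin.android',  # app ID taken from the Play Store URL
    lang='en',               # review language
    country='us',            # store country
    sort=Sort.NEWEST,        # newest reviews first
    count=3,                 # number of reviews to fetch
    filter_score_with=5      # keep only 5-star reviews
)

print(result)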
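Once the reviews are fetched, a quick summary often helps more than the raw list. The sketch below counts reviews per star rating. It assumes each review is a dict with a 'score' key; score_distribution is a hypothetical helper, and the sample data is made up to stand in for result.

```python
from collections import Counter


def score_distribution(review_list):
    """Count reviews per star rating (assumes each review dict has a 'score' key)."""
    return Counter(review["score"] for review in review_list if "score" in review)


# Made-up sample shaped like the scraper output; pass `result` from reviews() instead
sample = [
    {"score": 5, "content": "Great app"},
    {"score": 5, "content": "Works well"},
    {"score": 3, "content": "Okay"},
]
print(score_distribution(sample))
```

This is mainly useful when filter_score_with is left out, since a filtered request returns a single rating only.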

5.4.3. Scrape Reviews from App Store

Do you want to scrape App Store Reviews?

Without effort?

Try app_store_scraper!

app_store_scraper is a Python library that retrieves reviews for apps and podcasts from the Apple App Store.

Below you can see how easy it is to get reviews for the Instagram app by providing its ID and app name (you can get both from the URL of the corresponding App Store page).

A nice library for your next side project!

!pip install app_store_scraper
from app_store_scraper import AppStore

# app_name and app_id are derived from the App Store URL:
# https://apps.apple.com/de/app/instagram/id389801252
insta = AppStore(country='us', app_name='instagram', app_id='389801252')

# Fetch two reviews; they are stored on the object as insta.reviews
insta.review(how_many=2)

print(insta.reviews)
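To keep the scraped reviews for later analysis, you can write them to disk. The sketch below stores one JSON object per line and assumes the reviews are plain dicts that may contain datetime values, which JSON cannot encode natively. save_reviews and the sample record are hypothetical.

```python
import json
from datetime import datetime


def save_reviews(review_list, path):
    """Write one JSON object per line; values JSON cannot encode become strings."""
    with open(path, "w", encoding="utf-8") as fh:
        for review in review_list:
            fh.write(json.dumps(review, default=str, ensure_ascii=False) + "\n")
    return len(review_list)


# Made-up record for illustration; the real field names may differ
sample = [{"rating": 5, "review": "Love it", "date": datetime(2022, 1, 1, 12, 0)}]

# With real data this would be called as:
# save_reviews(insta.reviews, "instagram-reviews.jsonl")
```

Passing default=str is a simple choice: dates become readable strings, and you parse them again when loading the file.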

5.4.4. Read CSV files without a problem with clevercsv

Do you want to read CSV files without problems?

Try clevercsv.

clevercsv handles messy CSV files for you.

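The problem with CSV files is that CSV isn’t a strictly standardized file format: delimiters, quote characters, and escape rules vary from file to file.

Thus, every CSV you face could be different.

pandas and Python’s built-in csv module can raise errors or misparse a file when it deviates from the dialect they assume.

clevercsv detects the actual dialect of the file (delimiter, quote character, and escape character) and parses it accordingly.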

!pip install clevercsv
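import clevercsv

# Detect the dialect automatically and load the file into a pandas DataFrame
# (pandas must be installed for read_dataframe)
df = clevercsv.read_dataframe('imdb.csv')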
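To make the idea of a "dialect" concrete, here is a small sketch of dialect detection using csv.Sniffer from the standard library. This is not clevercsv's algorithm, only a simpler stand-in that illustrates the concept; read_rows and the sample text are made up for the example.

```python
import csv
import io


def read_rows(text, candidates=",;\t|"):
    """Guess the delimiter among the candidates, then parse the text with it."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


# Made-up semicolon-separated sample
sample = "title;year;rating\nAlien;1979;8.5\nHeat;1995;8.3\n"
delimiter, rows = read_rows(sample)
print(repr(delimiter))
print(rows[1])
```

The sniffer relies on simple heuristics and can guess wrong on messier files, which is the gap clevercsv aims to close.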

5.4.5. Powerful Web Text Gathering with trafilatura

Do you need a powerful text extractor for web pages?

Try trafilatura!

It’s a Python package for crawling, downloading, and extracting text, metadata, and comments from web pages.

Trafilatura supports different output formats like JSON, XML and CSV.

!pip install trafilatura
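from trafilatura import fetch_url, extract

# Download the page (fetch_url returns None if the download fails)
downloaded = fetch_url('https://github.com/adbar/trafilatura')

# Extract the main text and metadata as XML
result = extract(downloaded, output_format="xml")
print(result)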
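Downloads can fail, so it is worth guarding the extraction step. The sketch below assumes fetch_url returns None when a page cannot be downloaded. extract_or_none is a hypothetical wrapper; it takes the fetch and extract functions as arguments so that the logic can be tried without network access.

```python
def extract_or_none(url, fetch, extract, **extract_kwargs):
    """Download a page and extract its text; return None if the download failed."""
    downloaded = fetch(url)
    if downloaded is None:
        # Nothing to extract, so skip the extraction call entirely
        return None
    return extract(downloaded, **extract_kwargs)


# With trafilatura installed, the call would look like this:
# from trafilatura import fetch_url, extract
# text = extract_or_none('https://github.com/adbar/trafilatura',
#                        fetch_url, extract, output_format="xml")
```

The extraction itself can also come back empty, so check the returned value before using it.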