Python Project - Basic Web Scraper Solutions and Explanations

Last update on October 19 2024 13:04:50 (UTC/GMT +8 hours)

Basic Web Scraper:

Learn web scraping by extracting data from a website using libraries like BeautifulSoup.

Input values:
None (Automated process to extract data from a specified website).

Output value:
Data extracted from the website using libraries like BeautifulSoup

Example:

Input values:
None
Output value:
List all the h1 tags from https://en.wikipedia.org/wiki/Main_Page:
<h1 class="firstHeading mw-first-heading" id="firstHeading" style="display: none"><span class="mw-page-title-main">Main Page</span></h1>
<h1><span class="mw-headline" id="Welcome_to_Wikipedia">Welcome to <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a></span></h1>

Here are two different solutions for a basic web scraper in Python. The goal of the scraper is to extract data (such as all h1 tags) from a website using the 'requests' and 'BeautifulSoup' libraries.

Prerequisites:

To run these scripts, you'll need to have the following libraries installed:

requests: To send HTTP requests to the target website.
BeautifulSoup from bs4: To parse the HTML and extract data.

You can install these libraries using:

pip install requests beautifulsoup4
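
To confirm that both packages are importable after installation, a quick check along these lines can help. It is only a sketch: it relies on the '__version__' attribute that both packages expose, and the versions printed will depend on your environment.

```python
# Quick sanity check that both libraries are installed and importable
import bs4
import requests

# The exact version numbers depend on your environment
print("requests version:", requests.__version__)
print("beautifulsoup4 version:", bs4.__version__)
```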

Solution 1: Basic Web Scraper using 'requests' and 'BeautifulSoup'

Code:

# Solution 1: Basic Web Scraper Using `requests` and `BeautifulSoup`
# Import necessary libraries
import requests  # Used to send HTTP requests
from bs4 import BeautifulSoup  # Used for parsing HTML content

# Function to extract data from the specified website
def scrape_h1_tags(url):
    # Send an HTTP GET request to the specified URL
    # (a timeout prevents the script from waiting indefinitely on a slow server)
    response = requests.get(url, timeout=10)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all the h1 tags on the page
        h1_tags = soup.find_all('h1')

        # Print the extracted h1 tags
        print(f"List all the h1 tags from {url}:")
        for tag in h1_tags:
            print(tag)
    else:
        # Print an error message if the request was not successful
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

# Specify the URL to scrape
url = "https://en.wikipedia.org/wiki/Main_Page"

# Call the function to scrape h1 tags
scrape_h1_tags(url)

Output:

List all the h1 tags from https://en.wikipedia.org/wiki/Main_Page:
<h1 class="firstHeading mw-first-heading" id="firstHeading" style="display: none"><span class="mw-page-title-main">Main Page</span></h1>
<h1 id="Welcome_to_Wikipedia">Welcome to <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a></h1>

Explanation:

The script defines a function 'scrape_h1_tags()' that takes a URL as input and performs the following steps:

Sends an HTTP GET request to the URL using 'requests.get()'.
Checks if the request was successful by examining the status code.
Parses the HTML content of the page using 'BeautifulSoup'.
Finds all h1 tags on the page using 'soup.find_all('h1')'.
Prints the extracted 'h1' tags.

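This solution is straightforward and works well for simple scraping tasks. However, fetching, parsing, and printing all happen inside one function, and network errors raised by 'requests.get()' are not caught, so it is harder to reuse or extend for more complex scenarios.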
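
The parsing steps above can also be tried offline, without sending a request. The following is a minimal sketch, assuming a small hand-written HTML string ('SAMPLE_HTML') and a hypothetical helper 'h1_texts()', neither of which is part of the original solution. It returns the visible text of each h1 tag instead of the full tag markup.

```python
from bs4 import BeautifulSoup  # Used for parsing HTML content

# Hand-written sample page (an assumption for illustration, not real Wikipedia output)
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading"><span>Main Page</span></h1>
<h1 id="Welcome_to_Wikipedia">Welcome to <a href="/wiki/Wikipedia">Wikipedia</a></h1>
<h2>Not a top-level heading</h2>
</body></html>
"""

def h1_texts(html):
    """Hypothetical helper: return the text of every h1 tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() joins the text of nested tags; strip() trims surrounding whitespace
    return [tag.get_text().strip() for tag in soup.find_all("h1")]

print(h1_texts(SAMPLE_HTML))
# → ['Main Page', 'Welcome to Wikipedia']
```

Returning a list rather than printing inside the function makes the result easier to test or pass on to other code.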

Solution 2: Using a Class-Based approach for Reusability and Extensibility

Code:

# Solution 2: Using a Class-Based Approach for Reusability and Extensibility

import requests  # Used to send HTTP requests
from bs4 import BeautifulSoup  # Used for parsing HTML content

class WebScraper:
    """Class to handle web scraping operations"""

    def __init__(self, url):
        """Initialize the scraper with a URL"""
        self.url = url
        self.soup = None

    def fetch_content(self):
        """Fetch content from the website and initialize BeautifulSoup"""
        try:
            # Send an HTTP GET request to the specified URL
            # (a timeout prevents the script from waiting indefinitely on a slow server)
            response = requests.get(self.url, timeout=10)

            # Check if the request was successful
            if response.status_code == 200:
                # Initialize BeautifulSoup with the content
                self.soup = BeautifulSoup(response.text, 'html.parser')
                print(f"Successfully fetched content from {self.url}")
            else:
                print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        except requests.RequestException as e:
            # Handle any exceptions that occur during the request
            print(f"An error occurred: {e}")

    def extract_h1_tags(self):
        """Extract and display all h1 tags from the page content"""
        if self.soup:
            # Find all the h1 tags on the page
            h1_tags = self.soup.find_all('h1')

            # Print the extracted h1 tags
            print(f"List all the h1 tags from {self.url}:")
            for tag in h1_tags:
                print(tag)
        else:
            print("No content fetched. Please call fetch_content() first.")

# Specify the URL to scrape
url = "https://en.wikipedia.org/wiki/Main_Page"

# Create an instance of the WebScraper class
scraper = WebScraper(url)

# Fetch content from the website
scraper.fetch_content()

# Extract and display h1 tags
scraper.extract_h1_tags()

Output:

Successfully fetched content from https://en.wikipedia.org/wiki/Main_Page
List all the h1 tags from https://en.wikipedia.org/wiki/Main_Page:
<h1 class="firstHeading mw-first-heading" id="firstHeading" style="display: none"><span class="mw-page-title-main">Main Page</span></h1>
<h1 id="Welcome_to_Wikipedia">Welcome to <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a></h1>

Explanation:

The script defines a 'WebScraper' class that encapsulates all web scraping functionality, making it more organized and easier to extend.
The '__init__' method initializes the class with the URL to be scraped.
The 'fetch_content' method sends an HTTP GET request, checks the response, and initializes 'BeautifulSoup' with the page content.
The 'extract_h1_tags' method extracts and prints all 'h1' tags from the page.
This approach allows for better reusability and extensibility, making it easier to add more features (e.g., extracting different tags, handling different URLs) in the future.
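
As an illustration of that extensibility, the sketch below shows how tag extraction might be generalized. It assumes a hypothetical 'OfflineScraper' class that parses an HTML string directly (so it runs without a network connection), with hypothetical methods 'extract_tags()' and 'extract_links()'. These names are not part of the original 'WebScraper' class.

```python
from bs4 import BeautifulSoup  # Used for parsing HTML content

class OfflineScraper:
    """Hypothetical variant of WebScraper that parses an HTML string instead of fetching a URL"""

    def __init__(self, html):
        """Initialize BeautifulSoup with the given HTML string"""
        self.soup = BeautifulSoup(html, "html.parser")

    def extract_tags(self, tag_name):
        """Return all tags with the given name, e.g. 'h1', 'h2' or 'a'"""
        return self.soup.find_all(tag_name)

    def extract_links(self):
        """Return the href value of every <a> tag that has one"""
        return [a["href"] for a in self.soup.find_all("a", href=True)]

# Hand-written sample page (an assumption for illustration)
sample_html = """
<h1>Main Page</h1>
<h1>Welcome to <a href="/wiki/Wikipedia">Wikipedia</a></h1>
<h2>From today's featured article</h2>
"""

scraper = OfflineScraper(sample_html)
print(len(scraper.extract_tags("h1")))  # → 2
print(scraper.extract_links())          # → ['/wiki/Wikipedia']
```

The same two methods could be added to 'WebScraper' itself, reusing the 'self.soup' object that 'fetch_content()' creates.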

Note:
Both solutions scrape 'h1' tags from a specified website using 'requests' and 'BeautifulSoup'. Solution 1 is a straightforward, function-based approach, while Solution 2 uses Object-Oriented Programming (OOP) principles for a more modular and maintainable design.