Python Project - Basic URL Crawler for Extracting URLs

Last update on October 18 2024 13:21:48 (UTC/GMT +8 hours)

Basic URL Crawler:

Develop a program that crawls a website and extracts URLs.

Input values:

Starting URL: The URL from which the crawler will start.
Depth (optional): The number of levels the crawler will follow links from the starting URL.
Optional Parameters:

Domain restriction: Whether to restrict crawling to the same domain as the starting URL.
File types to include or exclude (e.g., only HTML pages).
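
The file-type option could be handled with a simple check on the extension of the URL path. The sketch below is one possible approach and is not part of the solution given later; the function name is_allowed_file_type and the HTML_EXTENSIONS set are assumptions made for illustration.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical default: treat extension-less paths and these suffixes as HTML pages
HTML_EXTENSIONS = {"", ".html", ".htm"}

def is_allowed_file_type(url, include=None, exclude=None):
    """Return True if the URL's path extension passes the include/exclude filters."""
    # Only the path is inspected, so query strings and fragments are ignored
    extension = splitext(urlparse(url).path)[1].lower()
    if exclude is not None and extension in exclude:
        return False
    if include is not None:
        return extension in include
    return True

print(is_allowed_file_type("http://example.com/about", include=HTML_EXTENSIONS))
print(is_allowed_file_type("http://example.com/logo.png", include=HTML_EXTENSIONS))
```

An extension check is only a heuristic, since a URL without an extension can still serve a non-HTML file. Inspecting the Content-Type response header would be a more reliable alternative.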

Output values:

Extracted URLs: A list of URLs found during the crawling process.
Status Messages:

Progress updates.
Error messages if the crawling fails (e.g., invalid URL, network issues).
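
An invalid starting URL can be reported before any request is sent. The following is a minimal sketch, assuming that only http and https URLs are acceptable; the helper name is_valid_url and the wording of the messages are illustrative, not part of the specification above.

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Return True if the URL has an http(s) scheme and a network location."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

for candidate in ["http://example.com", "example.com", "ftp://example.com/file"]:
    if is_valid_url(candidate):
        print(f"Starting URL: {candidate}")
    else:
        print(f"Error: invalid URL: {candidate}")
```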

Example:

Example 1: Basic Crawling from a Starting URL

Input:
•	Starting URL: http://example.com
Output:
•	List of extracted URLs:
 
http://example.com/page1
http://example.com/page2
http://example.com/about
http://example.com/contact
Example Console Output:
 
Starting URL: http://example.com
Crawling depth: 1
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.

Example 2: Crawling with Depth Restriction
Input:
•	Starting URL: http://example.com
•	Depth: 2
Output:
•	List of extracted URLs:
 
http://example.com/page1
http://example.com/page2
http://example.com/about
http://example.com/contact
http://example.com/page1/subpage1
http://example.com/page2/subpage2
Example Console Output:
 
Starting URL: http://example.com
Crawling depth: 2
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling http://example.com/page1...
Found URL: http://example.com/page1/subpage1
Crawling http://example.com/page2...
Found URL: http://example.com/page2/subpage2
Crawling completed.
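
To illustrate how the depth setting limits a crawl without touching the network, the sketch below walks a small in-memory link graph breadth-first. The LINKS dictionary is a hypothetical stand-in for the pages in the examples above, and crawl_graph is an illustrative name; this is not the solution given later, which fetches real pages and recurses instead of using a queue.

```python
from collections import deque

# Hypothetical link graph standing in for real pages (no network access needed)
LINKS = {
    "http://example.com": [
        "http://example.com/page1",
        "http://example.com/page2",
        "http://example.com/about",
        "http://example.com/contact",
    ],
    "http://example.com/page1": ["http://example.com/page1/subpage1"],
    "http://example.com/page2": ["http://example.com/page2/subpage2"],
}

def crawl_graph(start, max_depth):
    """Breadth-first crawl of LINKS, following links up to max_depth levels."""
    found = []                     # discovered URLs, in discovery order
    seen = {start}                 # avoids revisiting the start page or duplicates
    queue = deque([(start, 1)])    # (url, level of the page being crawled)
    while queue:
        url, level = queue.popleft()
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            found.append(link)
            if level < max_depth:  # only crawl deeper while within the limit
                queue.append((link, level + 1))
    return found

print(len(crawl_graph("http://example.com", 1)))
print(len(crawl_graph("http://example.com", 2)))
```

With a depth of 1 only the links on the starting page are collected; with a depth of 2 the links found on those pages are collected as well.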

Example 3: Domain Restriction
Input:
•	Starting URL: http://example.com
•	Domain restriction: Yes
Output:
•	List of extracted URLs (only from the same domain):
http://example.com/page1
http://example.com/page2
http://example.com/about
http://example.com/contact
Example Console Output:
Starting URL: http://example.com
Crawling depth: 1
Domain restriction: Yes
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.
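
A domain restriction like the one in Example 3 can be expressed as a comparison of network locations. The sketch below is one way to do it; same_domain is a hypothetical helper, and lowercasing the host is an added assumption so that differences in letter case do not matter.

```python
from urllib.parse import urlparse

def same_domain(url, starting_url):
    """Return True if both URLs share the same network location (case-insensitive)."""
    return urlparse(url).netloc.lower() == urlparse(starting_url).netloc.lower()

links = [
    "http://example.com/page1",
    "http://other.example.org/page",
    "http://EXAMPLE.com/about",
]
kept = [link for link in links if same_domain(link, "http://example.com")]
print(kept)
```

Note that the network location includes any port number, and that this comparison treats a subdomain such as www.example.com as a different domain from example.com.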

Here are two different solutions for building a basic URL crawler that crawls a website and extracts URLs. The first solution uses the requests and BeautifulSoup libraries to perform a simple crawl, while the second solution uses the Scrapy framework for more advanced crawling capabilities.

Prerequisites for Both Solutions:

Install Required Python Libraries:

pip install requests beautifulsoup4 scrapy

Solution 1: Basic URL Crawler Using requests and BeautifulSoup

This solution uses the requests library to fetch web pages and BeautifulSoup to parse the HTML content and extract URLs.

Code:

# Solution 1: Basic URL Crawler Using 'requests' and 'BeautifulSoup'

import requests  # Library for making HTTP requests
from bs4 import BeautifulSoup  # Library for parsing HTML content
from urllib.parse import urljoin, urlparse  # Functions to handle URL joining and parsing

def crawl_website(starting_url, depth=1, domain_restriction=True):
    """Crawl a website starting from a given URL to extract URLs."""
    # Set to store all discovered URLs
    visited_urls = set()

    # Helper function to recursively crawl the website
    def crawl(url, current_depth):
        """Recursively crawl the website up to the specified depth."""
        if current_depth > depth:  # Stop crawling if the maximum depth is reached
            return
        print(f"Crawling {url}...")

        try:
            # Send a GET request to the URL; the timeout keeps a stalled server from hanging the crawl
            response = requests.get(url, timeout=10)
            # Parse the content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Iterate over all <a> tags to find URLs
            for link in soup.find_all('a', href=True):
                # Resolve the full URL
                full_url = urljoin(url, link['href'])
                # Check domain restriction
                if domain_restriction and urlparse(full_url).netloc != urlparse(starting_url).netloc:
                    continue

                # Add the discovered URL to the set
                if full_url not in visited_urls:
                    print(f"Found URL: {full_url}")
                    visited_urls.add(full_url)
                    # Recursively crawl the discovered URL
                    crawl(full_url, current_depth + 1)

        except requests.RequestException as e:
            print(f"Error crawling {url}: {e}")

    # Start crawling from the starting URL
    crawl(starting_url, 1)
    print("Crawling completed.")
    return visited_urls

# Example usage
starting_url = "https://www.python.org"
crawled_urls = crawl_website(starting_url, depth=2, domain_restriction=True)
print("Extracted URLs:", crawled_urls)

Output:

Crawling https://www.python.org...
Found URL: https://www.python.org#content
Crawling https://www.python.org#content...
Found URL: https://www.python.org#python-network
Found URL: https://www.python.org/
Found URL: https://www.python.org/psf/
Found URL: https://www.python.org/jobs/
Found URL: https://www.python.org/community-landing/
Found URL: https://www.python.org#top
Found URL: https://www.python.org#site-map
Found URL: https://www.python.org
Found URL: https://www.python.org/community/irc/
Found URL: https://www.python.org/about/
Found URL: https://www.python.org/about/apps/
Found URL: https://www.python.org/about/quotes/
Found URL: https://www.python.org/about/gettingstarted/
Found URL: https://www.python.org/about/help/
Found URL: https://www.python.org/downloads/
Found URL: https://www.python.org/downloads/source/
Found URL: https://www.python.org/downloads/windows/
. . .

Explanation:

Function crawl_website:

Takes a starting URL, depth, and domain restriction as inputs to control the crawling process.
Uses a recursive helper function crawl to navigate through the website up to the specified depth.

Recursive Crawling:

For each URL, it sends a GET request to fetch the HTML content, uses BeautifulSoup to parse the HTML, and iterates over all <a> tags to find links.
Resolves relative URLs to absolute URLs using urljoin.
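Adds each new URL to a set to avoid duplicates and recursively crawls it if within the domain and depth limits.
Because URLs are compared as plain strings, links that differ only in their fragment (such as https://www.python.org#content) are treated as separate URLs, as the output above shows.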
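
The link-extraction steps described above can be sketched without any network access. The example below swaps in the standard library's html.parser in place of BeautifulSoup so that it runs with no extra installs, and it additionally strips fragments with urldefrag, which the solution above does not do. The names LinkCollector and extract_links and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(page_url, html):
    """Return absolute, fragment-free, de-duplicated links in document order."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links against the page URL
        clean = urldefrag(absolute).url     # drop any "#fragment" part
        if clean not in links:
            links.append(clean)
    return links

sample_html = (
    '<a href="/about/">About</a> '
    '<a href="#content">Skip</a> '
    '<a href="/about/#team">Team</a>'
)
print(extract_links("https://www.python.org", sample_html))
```

Stripping fragments means that links such as #content and #top collapse into the page URL itself, so the same page is not fetched again under a different name.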

Domain Restriction and Error Handling:

When domain restriction is enabled, skips any link whose network location (netloc) differs from that of the starting URL.
Catches requests.RequestException so that a network failure on one page is reported and does not stop the rest of the crawl.
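
The error handling could also be isolated in a small fetch helper. The sketch below is an assumption-laden variant rather than the code above: fetch_html is a hypothetical name, the 10-second timeout is an arbitrary choice, and calling raise_for_status() to treat HTTP error responses as failures is an addition that the solution above does not make.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as errors
        return response.text
    except requests.RequestException as e:
        print(f"Error crawling {url}: {e}")
        return None

# A URL without a scheme is rejected before any network traffic is sent
print(fetch_html("not-a-url"))
```

Returning None lets the caller skip the page and continue with the next link.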