Python Project: Extract Information from URLs

Last update on October 15 2024 12:11:58 (UTC/GMT +8 hours)

URL Analyzer: Build a program that analyzes and extracts information from a given URL.

Input values:

User provides a URL to be analyzed.

Output value:

The URL's components (domain, protocol, path, query parameters), plus the HTTP status, page title, and meta description of the page it points to.

Example:

Input values:
URL to analyze: https://www.example.com/about-us
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /about-us
- Query parameters: None
- HTTP status: 200 OK
- Page title: About Us - Example
- Meta description: Learn more about our company and our mission.

Input values:
URL to analyze: https://www.example.com/products?category=electronics
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /products
- Query parameters: category=electronics
- HTTP status: 200 OK
- Page title: Products - Example
- Meta description: Browse our wide selection of electronics products.

Input values:
URL to analyze: https://www.example.com/non-existent-page
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /non-existent-page
- Query parameters: None
- HTTP status: 404 Not Found
- Error message: The requested page does not exist.
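
The URL-derived fields in the examples above (domain, protocol, path, query parameters) can be computed without any network access. The following is a minimal sketch, assuming that a leading "www." should be dropped to match the sample output and that query parameters should be printed as key=value pairs. The helper name describe_url is hypothetical and is not part of the solution further down.

```python
from urllib.parse import urlparse, parse_qsl


def describe_url(url):
    """Return the URL-derived fields as a dict (no HTTP request is made)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # Lowercased host, without port or credentials
    # Assumption: drop a leading "www." so the result matches the sample output
    domain = host[4:] if host.startswith("www.") else host
    pairs = parse_qsl(parsed.query)  # List of (key, value) tuples, percent-decoded
    query = "&".join(f"{k}={v}" for k, v in pairs) if pairs else "None"
    return {
        "Domain": domain,
        "Protocol": parsed.scheme.upper(),
        "Path": parsed.path,
        "Query parameters": query,
    }


fields = describe_url("https://www.example.com/products?category=electronics")
for name, value in fields.items():
    print(f"- {name}: {value}")
```

The remaining fields (HTTP status, page title, meta description) depend on the server's response, so they require an actual request.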

Solution: Using requests and urllib Modules

Code:

# Import required modules
import requests  # For HTTP requests
from urllib.parse import urlparse, parse_qs  # For URL parsing

# Function to analyze a given URL
def analyze_url(url):
    # Parse URL to extract components
    parsed_url = urlparse(url)
    protocol = parsed_url.scheme.upper()  # Extract and convert protocol to uppercase
    domain = parsed_url.netloc  # Network location: host name plus optional port (e.g. www.example.com)
    path = parsed_url.path  # Extract path
    query = parse_qs(parsed_url.query)  # Parse query parameters into a dictionary

    # Make a request to the URL to get status and HTML content
    try:
        response = requests.get(url, timeout=10)  # Time out instead of hanging on an unresponsive server
        status_code = response.status_code  # Extract HTTP status code
        html_content = response.text  # Get HTML content of the page

        # Extract page title and meta description if available
        page_title = extract_meta(html_content, "<title>", "</title>")
        meta_description = extract_meta(html_content, 'name="description" content="', '"')
        
        # Display the results
        print("Analysis results:")
        print(f"- Domain: {domain}")
        print(f"- Protocol: {protocol}")
        print(f"- Path: {path}")
        print(f"- Query parameters: {query if query else 'None'}")
        print(f"- HTTP status: {status_code}")
        print(f"- Page title: {page_title}")
        print(f"- Meta description: {meta_description if meta_description else 'None'}")

    except requests.RequestException as e:
        print(f"Error analyzing URL: {e}")

# Helper function to extract metadata from HTML content
def extract_meta(html, start_tag, end_tag):
    start_index = html.find(start_tag)
    if start_index == -1:
        return None
    start_index += len(start_tag)
    end_index = html.find(end_tag, start_index)
    if end_index == -1:
        return None  # Closing delimiter missing; avoid returning a truncated slice
    return html[start_index:end_index].strip()

# Example usage
analyze_url("https://www.w3resource.com/")
# analyze_url("https://www.w3resource.com/privacy/")  # Second example; uncomment to run

Output:

Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /
- Query parameters: None
- HTTP status: 200
- Page title: Web development tutorials | w3resource
- Meta description: Web development tutorials on HTML, CSS, JS, PHP, SQL, MySQL, PostgreSQL, MongoDB, JSON and more.

Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /privacy/
- Query parameters: None
- HTTP status: 404
- Page title: 404 Not Found
- Meta description: None
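
The status lines above show only the numeric code, while the examples in the problem statement show a reason phrase as well ("200 OK", "404 Not Found"). One way to produce that format is the standard library's HTTPStatus enum. This is a minimal sketch; the helper name status_line is hypothetical. The requests library also exposes the phrase sent by the server as response.reason.

```python
from http import HTTPStatus


def status_line(code):
    """Format a numeric status code as 'code phrase', e.g. '404 Not Found'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Non-standard code that the standard library does not know about
        return str(code)


print(f"- HTTP status: {status_line(200)}")
print(f"- HTTP status: {status_line(404)}")
```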

Explanation:

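URL Parsing: urlparse splits the URL into protocol (scheme), network location, path, and query string; parse_qs turns the query string into a dictionary that maps each parameter name to a list of values.
HTTP Request: requests.get sends a GET request; the response provides the HTTP status code and the HTML content.
Metadata Extraction: extract_meta finds the page title and meta description by searching the HTML for a start marker and an end marker. This is plain string matching, so it misses tags whose attributes appear in a different order or use different quoting.
Error Handling: Catches requests.RequestException (connection failures, timeouts, malformed URLs) and prints the error instead of crashing. An HTTP error status such as 404 is not an exception; it is reported as the status code.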
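
The metadata extraction step can be made less sensitive to attribute order and quoting by using an HTML parser instead of string search. The following is a minimal sketch using the standard library's html.parser module. It is an alternative to extract_meta, not the method used in the solution above, and the names MetaExtractor and extract_title_and_description are hypothetical.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the first <title> text and the first description meta tag."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and self.description is None:
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            if self.title is None:
                self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it
        if self._in_title:
            self._title_parts.append(data)


def extract_title_and_description(html):
    """Return (title, description); either value is None if not found."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.description


# Sample HTML with content= placed before name=, which string search would miss
sample = """<html><head>
<title>About Us - Example</title>
<meta content="Learn more about our company." name="description">
</head><body></body></html>"""

title, description = extract_title_and_description(sample)
print(f"- Page title: {title}")
print(f"- Meta description: {description if description else 'None'}")
```

In analyze_url, the two extract_meta calls could be replaced by a single call to this function with response.text as the argument.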