Python Project: Extract Information from URLs
URL Analyzer: Build a program that analyzes and extracts information from a given URL.
Input values:
User provides a URL to be analyzed.
Output value:
Extracted information and analysis results for the given URL.
Example:
Input values:
URL to analyze: https://www.example.com/about-us
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /about-us
- Query parameters: None
- HTTP status: 200 OK
- Page title: About Us - Example
- Meta description: Learn more about our company and our mission.

Input values:
URL to analyze: https://www.example.com/products?category=electronics
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /products
- Query parameters: category=electronics
- HTTP status: 200 OK
- Page title: Products - Example
- Meta description: Browse our wide selection of electronics products.

Input values:
URL to analyze: https://www.example.com/non-existent-page
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /non-existent-page
- Query parameters: None
- HTTP status: 404 Not Found
- Error message: The requested page does not exist.
Solution: Using requests and urllib Modules
Code:
# Import required modules
import requests  # For HTTP requests
from urllib.parse import urlparse, parse_qs  # For URL parsing
# Function to analyze a given URL
def analyze_url(url):
    # Parse URL to extract components
    parsed_url = urlparse(url)
    protocol = parsed_url.scheme.upper()  # Extract and convert protocol to uppercase
    domain = parsed_url.netloc  # Extract domain
    path = parsed_url.path  # Extract path
    query = parse_qs(parsed_url.query)  # Parse query parameters into a dictionary
    # Make a request to the URL to get status and HTML content
    try:
        response = requests.get(url, timeout=10)  # Time out instead of hanging on an unresponsive server
        status_code = response.status_code  # Extract HTTP status code
        html_content = response.text  # Get HTML content of the page
        # Extract page title and meta description if available
        page_title = extract_meta(html_content, "<title>", "</title>")
        meta_description = extract_meta(html_content, 'name="description" content="', '"')
        
        # Display the results
        print("Analysis results:")
        print(f"- Domain: {domain}")
        print(f"- Protocol: {protocol}")
        print(f"- Path: {path}")
        print(f"- Query parameters: {query if query else 'None'}")
        print(f"- HTTP status: {status_code}")
        print(f"- Page title: {page_title}")
        print(f"- Meta description: {meta_description if meta_description else 'None'}")
    except requests.RequestException as e:
        print(f"Error analyzing URL: {e}")
# Helper function to extract metadata from HTML content
def extract_meta(html, start_tag, end_tag):
    start_index = html.find(start_tag)
    if start_index == -1:
        return None
    start_index += len(start_tag)
    end_index = html.find(end_tag, start_index)
    if end_index == -1:  # Closing marker missing: treat the value as not found
        return None
    return html[start_index:end_index].strip()
# Example usage
analyze_url("https://www.w3resource.com/")
#analyze_url("https://www.w3resource.com/privacy/")
Output:
Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /
- Query parameters: None
- HTTP status: 200
- Page title: Web development tutorials | w3resource
- Meta description: Web development tutorials on HTML, CSS, JS, PHP, SQL, MySQL, PostgreSQL, MongoDB, JSON and more.

Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /privacy/
- Query parameters: None
- HTTP status: 404
- Page title: 404 Not Found
- Meta description: None
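The output above prints the full host name (www.w3resource.com) and a bare status code, while the example in the problem statement shows example.com, category=electronics, and statuses such as 200 OK. The following is a minimal sketch of one way to format those fields with the standard library only. The helper names describe_url and format_status, and the decision to strip a leading "www.", are assumptions made for illustration and are not part of the original solution.

```python
from http import HTTPStatus
from urllib.parse import urlparse, parse_qsl


def describe_url(url):
    """Return URL components as display-ready strings (hypothetical helper)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # lower-cased host without port or credentials
    # Assumption: drop a leading "www." so the domain matches the example output
    domain = host[4:] if host.startswith("www.") else host
    # parse_qsl returns the parameters as (key, value) pairs in their original order
    pairs = parse_qsl(parsed.query)
    query = "&".join(f"{key}={value}" for key, value in pairs) if pairs else "None"
    return {
        "domain": domain,
        "protocol": parsed.scheme.upper(),
        "path": parsed.path,
        "query": query,
    }


def format_status(code):
    """Append the standard reason phrase to a status code when one is known."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Non-standard code: fall back to the bare number
        return str(code)


info = describe_url("https://www.example.com/products?category=electronics")
for name, value in info.items():
    print(f"- {name}: {value}")
print(f"- status: {format_status(404)}")
```

Stripping "www." is a display choice only; it does not identify the registrable domain for hosts such as shop.example.co.uk, which would need a public-suffix list.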
Explanation:
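- URL Parsing: urlparse splits the URL into its scheme (protocol), network location (domain), and path; parse_qs turns the query string into a dictionary.
- HTTP Request: Sends a GET request with requests.get and reads the HTTP status code and the HTML content from the response.
- Metadata Extraction: Finds the page title and the meta description in the HTML with plain string searches.
- Error Handling: Catches requests.RequestException (for example, a connection failure or an invalid URL) and prints the error instead of crashing. A 404 response is not an exception; it is reported through the status code.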
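The string search used for metadata extraction depends on the exact attribute order and quoting in the page source, so it can miss a description written as content="..." name="description". As a hedged alternative, the sketch below uses the standard library's html.parser module. The class name MetaExtractor, the function extract_title_and_description, and the sample HTML are assumptions introduced for illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the first <title> text and the description meta tag (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            # Attribute order does not matter once the tag has been parsed
            self.description = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect it until </title>
        if self._in_title:
            self._title_parts.append(data)


def extract_title_and_description(html):
    """Return (title, description); either value is None when the tag is absent."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.title, parser.description


sample_html = """<html><head>
<title> About Us - Example </title>
<meta content="Learn more about our company." name="description">
</head><body></body></html>"""

title, description = extract_title_and_description(sample_html)
print(f"- Page title: {title}")
print(f"- Meta description: {description}")
```

In analyze_url, the two extract_meta calls could be replaced by a single call to a function like this one, passing response.text as the argument.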
