Python Project: Extract Information from URLs
URL Analyzer: Build a program that analyzes and extracts information from a given URL.
Input values:
User provides a URL to be analyzed.
Output value:
Extracted information and analysis results for the given URL.
Example:
Input values:
URL to analyze: https://www.example.com/about-us
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /about-us
- Query parameters: None
- HTTP status: 200 OK
- Page title: About Us - Example
- Meta description: Learn more about our company and our mission.

Input values:
URL to analyze: https://www.example.com/products?category=electronics
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /products
- Query parameters: category=electronics
- HTTP status: 200 OK
- Page title: Products - Example
- Meta description: Browse our wide selection of electronics products.

Input values:
URL to analyze: https://www.example.com/non-existent-page
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /non-existent-page
- Query parameters: None
- HTTP status: 404 Not Found
- Error message: The requested page does not exist.
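The fields in these examples that come from the URL itself (domain, protocol, path, query parameters) can be worked out without any network request. The following is a minimal sketch using only urllib.parse. The function name url_components is hypothetical, and stripping a leading "www." is an assumption made to match the "Domain: example.com" lines above.

```python
from urllib.parse import urlparse, parse_qsl

def url_components(url):
    """Return the URL-derived fields shown in the examples (no network access)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Assumption: the examples display the host without a leading "www."
    domain = host[4:] if host.startswith("www.") else host
    # parse_qsl keeps parameters as (key, value) pairs in their original order
    pairs = parse_qsl(parsed.query)
    query = "&".join(f"{k}={v}" for k, v in pairs) if pairs else "None"
    return {
        "Domain": domain,
        "Protocol": parsed.scheme.upper(),
        "Path": parsed.path,
        "Query parameters": query,
    }

# Print the fields in the same "- Label: value" style as the examples
example = "https://www.example.com/products?category=electronics"
for label, value in url_components(example).items():
    print(f"- {label}: {value}")
```

The HTTP status, page title, and meta description still require fetching the page.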
Solution: Using requests and urllib Modules
Code:
# Import required modules
import requests  # For HTTP requests
from urllib.parse import urlparse, parse_qs  # For URL parsing

# Function to analyze a given URL
def analyze_url(url):
    # Parse URL to extract components
    parsed_url = urlparse(url)
    protocol = parsed_url.scheme.upper()  # Extract and convert protocol to uppercase
    domain = parsed_url.netloc  # Extract domain (host, plus port if present)
    path = parsed_url.path  # Extract path
    query = parse_qs(parsed_url.query)  # Parse query parameters into a dictionary

    # Make a request to the URL to get status and HTML content
    try:
        # A timeout keeps the call from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        status_code = response.status_code  # Extract HTTP status code
        html_content = response.text  # Get HTML content of the page

        # Extract page title and meta description if available
        page_title = extract_meta(html_content, "<title>", "</title>")
        meta_description = extract_meta(html_content, 'name="description" content="', '"')

        # Display the results
        print("Analysis results:")
        print(f"- Domain: {domain}")
        print(f"- Protocol: {protocol}")
        print(f"- Path: {path}")
        print(f"- Query parameters: {query if query else 'None'}")
        print(f"- HTTP status: {status_code}")
        print(f"- Page title: {page_title if page_title else 'None'}")
        print(f"- Meta description: {meta_description if meta_description else 'None'}")
    except requests.RequestException as e:
        print(f"Error analyzing URL: {e}")

# Helper function to extract the text between two markers in HTML content
def extract_meta(html, start_tag, end_tag):
    start_index = html.find(start_tag)
    if start_index == -1:
        return None
    start_index += len(start_tag)
    end_index = html.find(end_tag, start_index)
    if end_index == -1:  # Closing marker missing: treat as not found
        return None
    return html[start_index:end_index].strip()

# Example usage
analyze_url("https://www.w3resource.com/")
#analyze_url("https://www.w3resource.com/privacy/")
Output:
Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path:
- Query parameters: None
- HTTP status: 200
- Page title: Web development tutorials | w3resource
- Meta description: Web development tutorials on HTML, CSS, JS, PHP, SQL, MySQL, PostgreSQL, MongoDB, JSON and more.

Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /privacy/
- Query parameters: None
- HTTP status: 404
- Page title: 404 Not Found
- Meta description: None
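The output above shows the bare status code (200, 404), while the examples at the top show it with a reason phrase (200 OK, 404 Not Found). One possible way to add the phrase is the standard library's http.HTTPStatus enum. This is a sketch, and the helper name describe_status is hypothetical and not part of the solution above.

```python
from http import HTTPStatus

def describe_status(code):
    """Return e.g. '404 Not Found' for a known status code, else just the number."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Non-standard codes are not members of HTTPStatus
        return str(code)

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

In analyze_url, the status line could then print describe_status(status_code) instead of the raw number.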
Explanation:
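- URL Parsing: urlparse() splits the URL into protocol (scheme), domain (netloc), and path; parse_qs() turns the query string into a dictionary of parameters.
- HTTP Request: requests.get() sends a GET request and returns the HTTP status code and the HTML content of the page.
- Metadata Extraction: The extract_meta() helper uses plain substring search to pull the page title and meta description out of the HTML, returning None when a marker is not found.
- Error Handling: Any requests.RequestException (for example a connection failure or an invalid URL) is caught and reported as a message instead of crashing the program.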
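The substring search used for metadata extraction only matches when the meta tag is written exactly as name="description" content="...", with that attribute order and quoting. As a hedged alternative, the sketch below uses the standard library's html.parser instead. The names MetaExtractor and extract_title_and_description are hypothetical, and this is one possible approach rather than the method used in the solution above.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and the description <meta> tag from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            # Keep only the first description found
            if self.description is None:
                self.description = (attrs.get("content") or "").strip() or None

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Keep only the first title found
            if self.title is None:
                self.title = "".join(self._title_parts).strip() or None

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

def extract_title_and_description(html):
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.description

# Attribute order is reversed here on purpose; the parser does not depend on it
sample = """<html><head>
<title>About Us - Example</title>
<meta content="Learn more about our company." name="description">
</head><body></body></html>"""
print(extract_title_and_description(sample))
```

Both values come back as None when the page has no title or description, which matches how extract_meta signals a missing marker.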