Python Project: Extract Information from URLs
URL Analyzer: Build a program that analyzes and extracts information from a given URL.
Input values:
User provides a URL to be analyzed.
Output value:
The extracted information and analysis results for the given URL.
Example:
Input values:
URL to analyze: https://www.example.com/about-us
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /about-us
- Query parameters: None
- HTTP status: 200 OK
- Page title: About Us - Example
- Meta description: Learn more about our company and our mission.

Input values:
URL to analyze: https://www.example.com/products?category=electronics
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /products
- Query parameters: category=electronics
- HTTP status: 200 OK
- Page title: Products - Example
- Meta description: Browse our wide selection of electronics products.

Input values:
URL to analyze: https://www.example.com/non-existent-page
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /non-existent-page
- Query parameters: None
- HTTP status: 404 Not Found
- Error message: The requested page does not exist.
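Before looking at the full solution, here is a minimal sketch (illustrative only, the variable names are not part of the solution) of how the standard urllib.parse module splits the second example URL into the components listed above:

from urllib.parse import urlparse, parse_qs

# Illustrative only: split one of the example URLs into its components
url = "https://www.example.com/products?category=electronics"
parts = urlparse(url)

print(parts.scheme)           # 'https'  -> reported as Protocol: HTTPS
print(parts.netloc)           # 'www.example.com' -> Domain
print(parts.path)             # '/products' -> Path
print(parse_qs(parts.query))  # {'category': ['electronics']} -> Query parameters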
Solution: Using requests and urllib Modules
Code:
# Import required modules
import requests                               # For HTTP requests
from urllib.parse import urlparse, parse_qs   # For URL parsing

# Function to analyze a given URL
def analyze_url(url):
    # Parse URL to extract components
    parsed_url = urlparse(url)
    protocol = parsed_url.scheme.upper()   # Extract and convert protocol to uppercase
    domain = parsed_url.netloc             # Extract domain
    path = parsed_url.path                 # Extract path
    query = parse_qs(parsed_url.query)     # Parse query parameters into a dictionary

    # Make a request to the URL to get status and HTML content
    try:
        response = requests.get(url)
        status_code = response.status_code  # Extract HTTP status code
        html_content = response.text        # Get HTML content of the page

        # Extract page title and meta description if available
        page_title = extract_meta(html_content, "<title>", "</title>")
        meta_description = extract_meta(html_content, 'name="description" content="', '"')

        # Display the results
        print("Analysis results:")
        print(f"- Domain: {domain}")
        print(f"- Protocol: {protocol}")
        print(f"- Path: {path}")
        print(f"- Query parameters: {query if query else 'None'}")
        print(f"- HTTP status: {status_code}")
        print(f"- Page title: {page_title}")
        print(f"- Meta description: {meta_description if meta_description else 'None'}")
    except requests.RequestException as e:
        print(f"Error analyzing URL: {e}")

# Helper function to extract the text between two markers in the HTML content
def extract_meta(html, start_tag, end_tag):
    start_index = html.find(start_tag)
    if start_index == -1:
        return None
    start_index += len(start_tag)
    end_index = html.find(end_tag, start_index)
    # If the closing marker is missing, treat the value as unavailable
    if end_index == -1:
        return None
    return html[start_index:end_index].strip()

# Example usage
analyze_url("https://www.w3resource.com/")
#analyze_url("https://www.w3resource.com/privacy/")
Output:
Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /
- Query parameters: None
- HTTP status: 200
- Page title: Web development tutorials | w3resource
- Meta description: Web development tutorials on HTML, CSS, JS, PHP, SQL, MySQL, PostgreSQL, MongoDB, JSON and more.

Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /privacy/
- Query parameters: None
- HTTP status: 404
- Page title: 404 Not Found
- Meta description: None
Explanation:
- URL Parsing: Extracts protocol, domain, path, and query parameters from the URL.
- HTTP Request: Sends a GET request and retrieves HTTP status and HTML content.
- Metadata Extraction: Extracts the page title and meta description from the HTML by simple string searching (a more robust alternative is sketched after this list).
- Error Handling: Handles any request errors gracefully.
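String searching works for simple pages but can match the wrong text on unusual markup. As an alternative sketch (not part of the solution above), the standard library's html.parser module can pull out the title and meta description by walking the actual tags; the class name TitleMetaParser and the variables are illustrative.

from html.parser import HTMLParser

class TitleMetaParser(HTMLParser):
    """Collects the <title> text and the description <meta> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr_dict = dict(attrs)
            if attr_dict.get("name", "").lower() == "description":
                self.description = attr_dict.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()

# Example usage with HTML already downloaded by requests:
# parser = TitleMetaParser()
# parser.feed(html_content)
# print(parser.title, parser.description)

Because the parser inspects real tags and attributes, it is not fooled when the text name="description" happens to appear elsewhere in the page, for example inside a script.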