Build a URL Scraper in Python to Extract URLs from Webpages
URL Scraper:
Build a program that extracts URLs from a given webpage.
Input values:
The user provides the URL of the webpage from which URLs should be extracted.
Output value:
A list of URLs extracted from the given webpage.
Example:
Input values:
Enter the URL of the webpage: https://www.example.com
Output value:
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Solution 1: URL Scraper Using requests and BeautifulSoup
This solution uses the requests library to fetch the webpage content and BeautifulSoup from bs4 to parse the HTML and extract all URLs.
Code:
import requests  # Import requests to make HTTP requests
from bs4 import BeautifulSoup  # Import BeautifulSoup for HTML parsing

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Send a GET request to the webpage
        response = requests.get(webpage_url)
        response.raise_for_status()  # Raise an error if the request was unsuccessful

        # Parse the webpage content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all anchor tags with an href attribute
        anchor_tags = soup.find_all('a', href=True)

        # Extract absolute URLs from the anchor tags
        urls = [tag['href'] for tag in anchor_tags if tag['href'].startswith('http')]

        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs
Output:
Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Explanation:
- Imports requests for making HTTP requests to fetch webpage content.
- Imports BeautifulSoup from bs4 for parsing HTML content.
- extract_urls(webpage_url) function:
- Sends a GET request to the provided URL.
- Parses the response using BeautifulSoup to find all anchor tags (<a>).
- Extracts URLs that start with "http" to ensure they are absolute URLs (relative links are skipped; the sketch after this list shows one way to include them).
- Prints the list of extracted URLs.
- Error Handling:
- Catches and prints any exceptions related to HTTP requests.
- Input from User:
- Takes a URL as input from the user and calls the extract_urls() function.
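Because of the startswith('http') filter, Solution 1 keeps only absolute URLs, so relative links such as /about or page.html are dropped. If you also want those, one option is to resolve each href against the page URL with urllib.parse.urljoin. The sketch below shows the idea; the function name extract_all_urls and the 10-second timeout are illustrative choices, not part of the solution above.

import requests  # Fetch the webpage over HTTP
from bs4 import BeautifulSoup  # Parse the HTML
from urllib.parse import urljoin  # Resolve relative links against the page URL

def extract_all_urls(webpage_url):
    """Extract every link, converting relative hrefs to absolute URLs."""
    response = requests.get(webpage_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = []
    for tag in soup.find_all('a', href=True):
        href = tag['href']
        # Skip in-page anchors and javascript: links
        if href.startswith('#') or href.lower().startswith('javascript:'):
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        urls.append(urljoin(webpage_url, href))
    return urls

for idx, url in enumerate(extract_all_urls("https://www.example.com"), 1):
    print(f"{idx}. {url}")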
Solution 2: URL Scraper Using urllib and re (Regular Expressions)
This solution uses the urllib library to fetch webpage content and regular expressions to extract URLs directly from the HTML.
Code:
import urllib.request  # Import urllib.request to handle HTTP requests
import urllib.error  # Import urllib.error for the URLError exception
import re  # Import re for regular expression matching

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Open the URL and read the webpage content
        with urllib.request.urlopen(webpage_url) as response:
            html_content = response.read().decode('utf-8')  # Decode the content to a string

        # Regular expression to find all absolute URLs in href attributes
        urls = re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html_content)

        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")
    except urllib.error.URLError as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs
Output:
Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Explanation:
- Imports urllib.request to handle HTTP requests and fetch webpage content, and urllib.error for the URLError exception.
- Imports re for using regular expressions to match patterns in the HTML content.
- extract_urls(webpage_url) function:
- Opens the URL using urllib.request.urlopen() and reads the webpage content.
- Uses re.findall() with a regular expression to find all URLs in the HTML content.
- Prints the list of extracted URLs.
- Error Handling:
- Catches and prints any exceptions related to URL errors.
- Input from User:
- Takes a URL as input from the user and calls the extract_urls() function.
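One practical caveat: some servers reject requests that carry the default Python-urllib User-Agent string and respond with HTTP 403. If you run into that, a common workaround is to build the request with urllib.request.Request and set an explicit User-Agent header. The sketch below assumes that scenario; the header value and the function name extract_urls_with_headers are illustrative.

import urllib.request  # Build and send the HTTP request
import urllib.error  # Catch URLError explicitly
import re  # Same regex-based extraction as Solution 2

def extract_urls_with_headers(webpage_url):
    """Fetch a page with a browser-like User-Agent, then extract URLs."""
    request = urllib.request.Request(
        webpage_url,
        headers={'User-Agent': 'Mozilla/5.0 (compatible; URLScraper/1.0)'},  # illustrative value
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            html_content = response.read().decode('utf-8', errors='replace')
    except urllib.error.URLError as e:
        print(f"Error fetching webpage: {e}")
        return []
    return re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html_content)

for idx, url in enumerate(extract_urls_with_headers("https://www.example.com"), 1):
    print(f"{idx}. {url}")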
Summary:
Solution 1 (requests and BeautifulSoup): Uses requests and BeautifulSoup for a more Pythonic approach to parsing the HTML and extracting URLs. This method is easier to read and maintain and is suitable for more complex HTML structures.
Solution 2 (urllib and Regular Expressions): Uses urllib and a regular expression, a lightweight approach that works well for simple URL extraction. However, it may not handle complex HTML structures as robustly as BeautifulSoup; the snippet below shows one case where the regex misses a link that the parser finds.
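To make that difference concrete, consider an anchor tag whose href value is unquoted, which is legal HTML that real pages do contain. The quoted-only regex from Solution 2 misses it, while BeautifulSoup still extracts it. This is a small illustrative snippet, not part of either solution above.

from bs4 import BeautifulSoup
import re

# An anchor with an unquoted href value: valid HTML, but awkward for a quoted-only regex
html = '<a href=https://www.iana.org/domains/example>more information</a>'

# The regex from Solution 2 expects quotes around the href value, so it finds nothing here
print(re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html))  # []

# BeautifulSoup parses the attribute regardless of quoting style
soup = BeautifulSoup(html, 'html.parser')
print([tag['href'] for tag in soup.find_all('a', href=True)])
# ['https://www.iana.org/domains/example']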