w3resource

Python Web Scraping: Download and display the content of robot.txt for en.wikipedia.org

Python Web Scraping: Exercise-2 with Solution

Write a Python program to download and display the content of robot.txt for en.wikipedia.org.

Sample Solution:

Python Code:

import requests
response = requests.get("https://en.wikipedia.org/robots.txt")
test = response.text
print("robots.txt for http://www.wikipedia.org/")
print("===================================================")
print(test)

Sample Output:

robots.txt for http://www.wikipedia.org/
===================================================
# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /
............
#
Disallow: /wiki/Wikipedia:Article_Incubator
Disallow: /wiki/Wikipedia%3AArticle_Incubator
Disallow: /wiki/Wikipedia_talk:Article_Incubator
Disallow: /wiki/Wikipedia_talk%3AArticle_Incubator
#
Disallow: /wiki/Category:Noindexed_pages
Disallow: /wiki/Category%3ANoindexed_pages
#
#
 

Flowchart:

Python Web Scraping Flowchart: Test if a given page is found or not on the server

Python Code Editor:

Have another way to solve this solution? Contribute your code (and comments) through Disqus.

Previous: Write a Python program to test if a given page is found or not on the server.
Next: Write a Python program to get the number of datasets currently listed on data.gov.

What is the difficulty level of this exercise?



Become a Patron!

Follow us on Facebook and Twitter for latest update.

It will be nice if you may share this link in any developer community or anywhere else, from where other developers may find this content. Thanks.

https://198.211.115.131/python-exercises/web-scraping/web-scraping-exercise-2.php