Python Web Scraping: Download and display the content of robots.txt for en.wikipedia.org
Python Web Scraping: Exercise-2 with Solution
Write a Python program to download and display the content of robots.txt for en.wikipedia.org.
Sample Solution:
Python Code:
import requests

# Fetch the robots.txt file for English Wikipedia
response = requests.get("https://en.wikipedia.org/robots.txt")
robots_txt = response.text

print("robots.txt for https://en.wikipedia.org/")
print("===================================================")
print(robots_txt)
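Note that if the request fails (for example with a 404 or a 429 rate-limit response), response.text will hold an error page rather than the robots.txt rules. A small optional guard, shown here as a suggestion rather than as part of the original solution:

import requests

# Raise an exception on any 4xx/5xx status instead of printing an error page
response = requests.get("https://en.wikipedia.org/robots.txt", timeout=10)
response.raise_for_status()
print(response.text)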
Sample Output:
robots.txt for https://en.wikipedia.org/
===================================================
# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#
# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /
............
#
Disallow: /wiki/Wikipedia:Article_Incubator
Disallow: /wiki/Wikipedia%3AArticle_Incubator
Disallow: /wiki/Wikipedia_talk:Article_Incubator
Disallow: /wiki/Wikipedia_talk%3AArticle_Incubator
#
Disallow: /wiki/Category:Noindexed_pages
Disallow: /wiki/Category%3ANoindexed_pages
#
#
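The same task can also be done with only the standard library. Below is a minimal sketch using urllib.request to download the file and urllib.robotparser to query the parsed rules; the page URL passed to can_fetch() is just an illustrative example.

import urllib.request
import urllib.robotparser

robots_url = "https://en.wikipedia.org/robots.txt"

# Download and display the raw robots.txt without third-party packages
with urllib.request.urlopen(robots_url) as resp:
    print(resp.read().decode("utf-8"))

# Parse the rules and ask whether a generic crawler may fetch a given page
parser = urllib.robotparser.RobotFileParser(robots_url)
parser.read()
print(parser.can_fetch("*", "https://en.wikipedia.org/wiki/Main_Page"))

RobotFileParser.can_fetch() returns True or False, which is often more useful inside a crawler than printing the file itself.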
Previous: Write a Python program to test if a given page is found or not on the server.
Next: Write a Python program to get the number of datasets currently listed on data.gov.