NLTK Corpus Exercises with Solution
Python NLTK Corpus [13 exercises with solution]
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.
Each corpus reader class is specialized to handle a specific corpus format. In addition, the nltk.corpus package automatically creates a set of corpus reader instances that can be used to access the corpora in the NLTK data package.
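As a minimal sketch of how these reader instances are typically used, the snippet below summarizes a corpus through the `fileids()` and `words()` methods that NLTK corpus readers provide. The helper name `summarize_corpus` and the choice of the Gutenberg corpus are illustrative assumptions, and `demo()` only works once the data has been fetched with `nltk.download("gutenberg")`.

```python
def summarize_corpus(reader, limit=3):
    """Return (file id, word count) pairs for the first `limit` files.

    `reader` is any object with the fileids() and words(fileid) methods
    that NLTK corpus readers expose.
    """
    return [(fid, len(reader.words(fid))) for fid in reader.fileids()[:limit]]


def demo():
    # Imported here so the helper above stays usable without the NLTK data.
    # Assumes the corpus was downloaded with nltk.download("gutenberg").
    from nltk.corpus import gutenberg

    for fileid, word_count in summarize_corpus(gutenberg):
        print(fileid, word_count)
```

Calling `demo()` prints the first three Gutenberg file ids with their word counts. Any other reader instance from `nltk.corpus` could be passed to the helper in the same way.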
1. Write a Python NLTK program to list all the corpus names.
2. Write a Python NLTK program to get a list of common stop words in various languages.
3. Write a Python NLTK program to check the list of stopwords in various languages.
From Wikipedia:
In computing, stop words are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.
Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words (including lexical words, such as "want") from a query in order to improve performance.
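The filtering described above can be sketched as follows. The function names `load_stop_words` and `remove_stop_words` are hypothetical helpers, the whitespace split is a simplification of real tokenization, and `tiny_list` is a hand-picked stand-in rather than NLTK's actual list. Loading the real list assumes the data was installed with `nltk.download("stopwords")`.

```python
def load_stop_words(language="english"):
    """Return NLTK's stop word list for `language` as a set."""
    # Assumes the data was downloaded with nltk.download("stopwords").
    from nltk.corpus import stopwords

    return set(stopwords.words(language))


def remove_stop_words(text, stop_words):
    """Split `text` on whitespace and drop tokens found in `stop_words`."""
    return [token for token in text.split() if token.lower() not in stop_words]


# A tiny hand-picked list, standing in for a real one, shows the phrase problem:
tiny_list = {"the", "is", "on", "at"}
print(remove_stop_words("The Who is on at nine", tiny_list))  # ['Who', 'nine']
```

The band name "The Who" loses its article here, which is the phrase-search problem the passage mentions. With NLTK installed, `stopwords.fileids()` lists the languages for which a stop word list is available.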
4. Write a Python NLTK program to remove stop words from a given text.
5. Write a Python NLTK program to omit some given stop words from the stopwords list.
6. Write a Python NLTK program to find the definition and examples of a given word using WordNet.
From Wikipedia:
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and are freely available for download from the WordNet website. Both the lexicographic data (lexicographer files) and the compiler (called grind) for producing the distributed database are available.
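A short sketch of reading definitions and usage examples from synsets is given below. It relies on the `name()`, `definition()` and `examples()` methods of NLTK's synset objects. The helper `describe_synsets` and the sample word "fight" are illustrative assumptions, and `demo()` requires the data installed with `nltk.download("wordnet")`.

```python
def describe_synsets(synsets):
    """Collect the name, definition and usage examples of each synset."""
    return [
        {
            "name": synset.name(),
            "definition": synset.definition(),
            "examples": list(synset.examples()),
        }
        for synset in synsets
    ]


def demo(word="fight"):
    # Assumes the data was downloaded with nltk.download("wordnet").
    from nltk.corpus import wordnet

    for entry in describe_synsets(wordnet.synsets(word)):
        print(entry["name"], "-", entry["definition"])
        for example in entry["examples"]:
            print("    e.g.", example)
```

Each synset covers one sense of the word, so a word with several meanings yields several entries.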
7. Write a Python NLTK program to find the sets of synonyms and antonyms of a given word.
From Wikipedia:
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members.
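One possible way to gather synonyms and antonyms walks the lemmas of every synset, since in NLTK's WordNet interface antonym links are attached to lemmas rather than to synsets. The helper name `synonyms_and_antonyms` and the sample word "good" are assumptions made for illustration, and `demo()` requires `nltk.download("wordnet")`.

```python
def synonyms_and_antonyms(synsets):
    """Return (synonyms, antonyms) as two sets of lemma names."""
    synonyms, antonyms = set(), set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            # Antonyms are recorded per lemma, not per synset.
            for antonym in lemma.antonyms():
                antonyms.add(antonym.name())
    return synonyms, antonyms


def demo(word="good"):
    # Assumes the data was downloaded with nltk.download("wordnet").
    from nltk.corpus import wordnet

    synonyms, antonyms = synonyms_and_antonyms(wordnet.synsets(word))
    print("Synonyms:", sorted(synonyms))
    print("Antonyms:", sorted(antonyms))
```

Sets are used so that a lemma shared by several senses is reported only once.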
8. Write a Python NLTK program to get an overview of the tagset, details of a specific tag in the tagset, and details of several related tags, using a regular expression.
9. Write a Python NLTK program to compare the similarity of two given nouns.
10. Write a Python NLTK program to compare the similarity of two given verbs.
11. Write a Python NLTK program to find the number of male and female names in the names corpus. Print the first 10 male and female names.
Note: The names corpus contains about 2,943 male names (male.txt) and 5,001 female names (female.txt). It was compiled by Mark Kantrowitz, with additions by Bill Ross.
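A minimal sketch of counting and sampling the two name lists follows. It assumes the corpus was installed with `nltk.download("names")`, and `summarize_names` is a hypothetical helper that accepts any two lists of names.

```python
def summarize_names(male, female, n=10):
    """Return the size of each list together with its first `n` names."""
    return {
        "male_count": len(male),
        "female_count": len(female),
        "male_sample": list(male[:n]),
        "female_sample": list(female[:n]),
    }


def demo():
    # Assumes the data was downloaded with nltk.download("names").
    from nltk.corpus import names

    summary = summarize_names(names.words("male.txt"), names.words("female.txt"))
    print("Male names:", summary["male_count"], summary["male_sample"])
    print("Female names:", summary["female_count"], summary["female_sample"])
```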
12. Write a Python NLTK program to combine the labeled male and labeled female names from the names corpus, shuffle them, and print the first 15.
13. Write a Python NLTK program to extract the last letter of all the labeled names and create a new array with the last letter of each name and the associated label.
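The labeling, shuffling and last-letter steps from the final exercises could be sketched as below. The helper names `label_names` and `last_letter_pairs` are illustrative assumptions, the fixed seed is there only to make the shuffle repeatable, and `demo()` requires `nltk.download("names")`.

```python
import random


def label_names(male, female):
    """Pair every name with the label 'male' or 'female'."""
    return [(name, "male") for name in male] + [(name, "female") for name in female]


def last_letter_pairs(labeled_names):
    """Return (last letter, label) for each non-empty labeled name."""
    return [(name[-1], label) for name, label in labeled_names if name]


def demo(seed=0):
    # Assumes the data was downloaded with nltk.download("names").
    from nltk.corpus import names

    labeled = label_names(names.words("male.txt"), names.words("female.txt"))
    random.Random(seed).shuffle(labeled)  # seeded so repeated runs agree
    print(labeled[:15])
    print(last_letter_pairs(labeled)[:15])
```

Shuffling before slicing matters because the combined list starts with all the male names. Without it, the first 15 entries would carry a single label.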