w3resource

NLTK Tokenize: Create a list of words from a given string


Write a Python NLTK program to create a list of words from a given string.

Sample Solution:

Python Code-1:

from nltk.tokenize import word_tokenize
text = "Joe waited for the train. The train was late. Mary and Samantha took the bus. I looked for Mary and Samantha at the bus station."
print("\nOriginal string:")
print(text)
print("\nList of words:")
print(word_tokenize(text))

Sample Output:

Original string:
Joe waited for the train. The train was late. Mary and Samantha took the bus. I looked for Mary and Samantha at the bus station.

List of words:
['Joe', 'waited', 'for', 'the', 'train', '.', 'The', 'train', 'was', 'late', '.', 'Mary', 'and', 'Samantha', 'took', 'the', 'bus', '.', 'I', 'looked', 'for', 'Mary', 'and', 'Samantha', 'at', 'the', 'bus', 'station', '.']

It's equivalent to the following code:

Python Code-2:

from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
text = "Joe waited for the train. The train was late. Mary and Samantha took the bus. I looked for Mary and Samantha at the bus station."
print("\nOriginal string:")
print(text)
print("\nList of words:")
print(tokenizer.tokenize(text))

Output:

Original string:
Joe waited for the train. The train was late. Mary and Samantha took the bus. I looked for Mary and Samantha at the bus station.

List of words:
['Joe', 'waited', 'for', 'the', 'train.', 'The', 'train', 'was', 'late.', 'Mary', 'and', 'Samantha', 'took', 'the', 'bus.', 'I', 'looked', 'for', 'Mary', 'and', 'Samantha', 'at', 'the', 'bus', 'station', '.']

Have another way to solve this solution? Contribute your code (and comments) through Disqus.

Previous: Write a Python NLTK program to tokenize sentences in languages other than English.
Next: Write a Python NLTK program to split all punctuation into separate tokens.

What is the difficulty level of this exercise?

Test your Programming skills with w3resource's quiz.



Follow us on Facebook and Twitter for latest update.