How to Tokenize a String into Words or Sentences in Python using the NLTK Module

In this article, we show how to tokenize a string into words or sentences in Python using the NLTK module.

NLTK stands for Natural Language Toolkit, a Python library for working with human language text.

Tokenizing words means extracting the words from a string so that each word stands alone. The tokenized words are returned in a list, with each item in the list being one of the words in the string. For example, tokenizing the string "The grass is green" produces ['The', 'grass', 'is', 'green'].

Tokenizing sentences means extracting the sentences from a string so that each sentence stands alone. The tokenized sentences are returned in a list, with each item in the list being one of the sentences in the string. For example, tokenizing the string "The sky is blue. The sun is yellow. The clouds are white. The hills are green" produces ['The sky is blue.', 'The sun is yellow.', 'The clouds are white.', 'The hills are green'].

Much of natural language processing starts by extracting words and sentences from text, so being able to pull words out of sentences and sentences out of paragraphs is an essential first step.

How to Tokenize Words in a String

First, we show how to tokenize the words in a string.

Below, we tokenize the string 'Python is a great programming language'.

>>> import nltk
>>> string= 'Python is a great programming language'
>>> words= nltk.word_tokenize(string)
>>> words
['Python', 'is', 'a', 'great', 'programming', 'language']

The code above is short.

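We first have to import the nltk module. If the tokenizer data has not been installed yet, NLTK raises a LookupError that names the resource to fetch with nltk.download().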

We then have a string stored in the string variable, 'Python is a great programming language'

We then create another variable, words, which uses the nltk.word_tokenize() function to tokenize the words in the string.

When we type words at the interpreter prompt, we see a list in which each item is one word from the string.
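For a simple sentence like the one above, Python's built-in str.split() gives the same list, so it is fair to ask why a tokenizer is needed. The sketch below uses only str.split(), not NLTK, and the variable names and the second example string are made up for illustration. It shows where whitespace splitting falls short.

```python
# A plain str.split() breaks on whitespace only.
simple = 'Python is a great programming language'
simple_tokens = simple.split()
print(simple_tokens)
# ['Python', 'is', 'a', 'great', 'programming', 'language']

# With punctuation, the marks stay attached to the neighbouring words.
punctuated = 'Python is great, and it is free.'
punctuated_tokens = punctuated.split()
print(punctuated_tokens)
# ['Python', 'is', 'great,', 'and', 'it', 'is', 'free.']
```

A word tokenizer such as nltk.word_tokenize() is meant to split the comma and the period off as separate tokens, which whitespace splitting cannot do.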

How to Tokenize Sentences in a String

We will now show how to tokenize sentences in a string.

It's just like above but now we use the nltk.sent_tokenize() function.

We show this below.

>>> import nltk
>>> paragraph= """The sky is blue. The sun is yellow. The grass is green. The clouds are white."""
>>> sentences= nltk.sent_tokenize(paragraph)
>>> sentences
['The sky is blue.', 'The sun is yellow.', 'The grass is green.', 'The clouds are white.']

We have a variable, paragraph, which stores a few sentences. We use triple quotes so that the text can span multiple lines.

We then create a variable, sentences, which stores the tokenized sentences. This is done using the nltk.sent_tokenize() function.

We then show the output of the sentences variable.

There are 4 sentences in the original string, and the list contains 4 items, one for each sentence.
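One reason to use a sentence tokenizer rather than splitting on periods by hand is abbreviations. The sketch below uses only the standard re module. The helper split_on_periods and the string tricky are hypothetical and not part of NLTK. The helper handles the paragraph above but breaks a sentence that begins with "Dr.", which is the kind of case NLTK's sentence tokenizer is designed to cope with.

```python
import re

def split_on_periods(text):
    # Hypothetical helper: break after '.', '!' or '?' when whitespace follows.
    return re.split(r"(?<=[.!?])\s+", text.strip())

paragraph = "The sky is blue. The sun is yellow. The grass is green. The clouds are white."
print(split_on_periods(paragraph))
# ['The sky is blue.', 'The sun is yellow.', 'The grass is green.', 'The clouds are white.']

# An abbreviation ends in a period too, so the naive rule splits in the wrong place.
tricky = "Dr. Smith arrived late. The meeting had started."
print(split_on_periods(tricky))
# ['Dr.', 'Smith arrived late.', 'The meeting had started.']
```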

So this is how the NLTK module allows us to tokenize strings in Python either into words or sentences.
