Count Vectorizer - Machine Learning

CountVectorizer in NLP

Introduction

When working with text data in Machine Learning or NLP (Natural Language Processing), one of the first challenges is:

How can we convert text into numbers that a machine can understand?

Computers cannot understand sentences like:

"I love Python"
"Python is easy"
"I love coding"

Machine learning algorithms work with numbers, not text.

This is where CountVectorizer comes into the picture.

CountVectorizer converts text documents into numerical vectors by counting how many times each word appears in each document.

It is available in the sklearn.feature_extraction.text module of the scikit-learn library.

What is CountVectorizer?

Definition:

CountVectorizer converts a collection of text documents into a matrix of token counts.

In simple words:

Extract all unique words
Create a vocabulary
Count how many times each word appears
Represent each document as numerical data

Why Do We Need CountVectorizer?

Consider the following dataset:

documents = [
    "I love Python",
    "Python is easy",
    "I love coding"
]

Machine learning algorithms cannot process these strings directly.

We need a numerical representation like the following, with one row per document and one column per word:

coding	easy	is	love	python
0	0	0	1	1
0	1	1	0	1
1	0	0	1	0

Now the text has been converted into numbers.

This numerical representation can be used for:

Sentiment Analysis
Spam Detection
Text Classification
Recommendation Systems
Chatbots
Search Engines

How CountVectorizer Works

CountVectorizer works in three major steps:

Step 1: Tokenization

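Break text into individual words (tokens). By default, CountVectorizer also converts the text to lowercase and keeps only tokens of two or more characters, so a one-letter word such as "I" is dropped.

Example:

I love Python

becomes:

["love", "python"]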
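The tokenization step can be sketched in plain Python. This is only an approximation, assuming scikit-learn's default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters). The helper name tokenize is hypothetical and not part of scikit-learn.

```python
import re

# Assumed to mirror CountVectorizer's default token_pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return its tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("I love Python"))   # ['love', 'python']
print(tokenize("Python is easy"))  # ['python', 'is', 'easy']
```

Under these defaults the one-letter word "I" never becomes a token, which is why it has no column in the count matrix.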

Step 2: Create Vocabulary

Collect all unique words from all documents and sort them alphabetically. Each word's position in the sorted list becomes its column index.

Documents:

[
    "I love Python",
    "Python is easy",
    "I love coding"
]

Unique words:

coding
easy
is
love
python

Vocabulary:

Word	Index
coding	0
easy	1
is	2
love	3
python	4
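The vocabulary table above can be reproduced with a few lines of plain Python. This is a minimal sketch, not scikit-learn's actual implementation: the tokenize helper is hypothetical and assumes the default behaviour (lowercasing, tokens of two or more characters).

```python
import re

documents = [
    "I love Python",
    "Python is easy",
    "I love coding",
]


def tokenize(text):
    """Hypothetical helper: lowercase, keep tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Collect every unique token across all documents.
unique_words = set()
for doc in documents:
    unique_words.update(tokenize(doc))

# Sort alphabetically and assign each word its column index.
vocabulary = {word: index for index, word in enumerate(sorted(unique_words))}

print(vocabulary)
# {'coding': 0, 'easy': 1, 'is': 2, 'love': 3, 'python': 4}
```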

Step 3: Count Frequencies

Now CountVectorizer counts occurrences of each word in every document.

Document 1:

I love Python

Counts:

coding = 0
easy = 0
is = 0
love = 1
python = 1

Vector:

[0, 0, 0, 1, 1]

Document 2:

Python is easy

Vector:

[0, 1, 1, 0, 1]

Document 3:

I love coding

Vector:

[1, 0, 0, 1, 0]

Final Matrix:

Document	coding	easy	is	love	python
I love Python	0	0	0	1	1
Python is easy	0	1	1	0	1
I love coding	1	0	0	1	0

This is called a Document-Term Matrix (DTM).
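The counting step that produces this matrix can be sketched in plain Python as well. The helpers tokenize and count_vector are hypothetical names used for illustration, and the vocabulary is hard-coded from Step 2. scikit-learn itself stores the result as a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

documents = [
    "I love Python",
    "Python is easy",
    "I love coding",
]

# Vocabulary from Step 2: word -> column index.
vocabulary = {"coding": 0, "easy": 1, "is": 2, "love": 3, "python": 4}


def tokenize(text):
    """Hypothetical helper: lowercase, keep tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def count_vector(text, vocabulary):
    """Return one row of the document-term matrix for a single document."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


matrix = [count_vector(doc, vocabulary) for doc in documents]
for row in matrix:
    print(row)
# [0, 0, 0, 1, 1]
# [0, 1, 1, 0, 1]
# [1, 0, 0, 1, 0]
```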

Python Program

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love Python",
    "Python is easy",
    "I love coding"
]

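# Create the vectorizer with default settings
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term matrix (a sparse matrix)
X = vectorizer.fit_transform(documents)

# Mapping of each word to its column index
print("Vocabulary:")
print(vectorizer.vocabulary_)

# Words in column order
print("\nFeatures:")
print(vectorizer.get_feature_names_out())

# Convert the sparse matrix to a dense array for display
print("\nCount Matrix:")
print(X.toarray())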

Output:

Vocabulary:
{'love': 3, 'python': 4, 'is': 2, 'easy': 1, 'coding': 0}

Features:
['coding' 'easy' 'is' 'love' 'python']

Count Matrix:
[[0 0 0 1 1]
 [0 1 1 0 1]
 [1 0 0 1 0]]