CountVectorizer in NLP
Introduction
When working with text data in Machine Learning or NLP (Natural Language Processing), one of the first challenges is:
How can we convert text into numbers that a machine can understand?
Computers cannot understand sentences like:
"I love Python"
"Python is easy"
"I love coding"
Machine learning algorithms work with numbers, not text.
This is where CountVectorizer comes into the picture.
CountVectorizer converts text documents into numerical vectors by counting how many times each word appears.
It is available in the sklearn.feature_extraction.text module of the scikit-learn library.
What is CountVectorizer?
Definition:
CountVectorizer converts a collection of text documents into a matrix of token counts.
In simple words:
- Extract all unique words
- Create a vocabulary
- Count how many times each word appears
- Represent each document as numerical data
Why Do We Need CountVectorizer?
Consider the following dataset:
documents = [
"I love Python",
"Python is easy",
"I love coding"
]
Machine learning algorithms cannot process these strings directly.
We need something like:
| coding | easy | is | love | python |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
Now the text has been converted into numbers.
This numerical representation can be used for:
- Sentiment Analysis
- Spam Detection
- Text Classification
- Recommendation Systems
- Chatbots
- Search Engines
How CountVectorizer Works
CountVectorizer works in three major steps:
Step 1: Tokenization
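Break text into individual words (tokens). By default, CountVectorizer also lowercases the text and keeps only tokens with two or more alphanumeric characters, so the one-letter word "I" is dropped.
Example:
I love Python
becomes:
["love", "python"]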
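To check which tokens CountVectorizer produces for a given string, one option is to call the analyzer function it builds internally. The sketch below assumes scikit-learn's default settings, under which text is lowercased and one-letter words such as "I" are discarded. The second variant, with a custom token_pattern, is an illustrative assumption and is not used elsewhere in this walkthrough.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer uses to turn
# one raw string into a list of tokens (preprocessing + tokenization).
default_analyzer = CountVectorizer().build_analyzer()
default_tokens = default_analyzer("I love Python")
print(default_tokens)  # ['love', 'python']

# Assumption for illustration: widen token_pattern so that
# one-character tokens are kept as well.
keep_short_analyzer = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
short_tokens = keep_short_analyzer("I love Python")
print(short_tokens)  # ['i', 'love', 'python']
```

With the default pattern, "I" never reaches the vocabulary, which is why the tables in this article have no column for it.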
Step 2: Create Vocabulary
Collect all unique words from all documents.
Documents:
[
"I love Python",
"Python is easy",
"I love coding"
]
Unique words:
coding
easy
is
love
python
Vocabulary (unique words sorted alphabetically and numbered from 0):
| Word | Index |
|---|---|
| coding | 0 |
| easy | 1 |
| is | 2 |
| love | 3 |
| python | 4 |
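The vocabulary table above can be reproduced by hand in a few lines. This is a minimal plain-Python sketch of the idea, not scikit-learn's actual implementation; the variable names (tokenized_docs, unique_words, vocabulary) are hypothetical, and the input is assumed to be already tokenized and lowercased with the one-letter word "I" removed.

```python
# Tokens per document after tokenization (lowercased, "I" dropped).
tokenized_docs = [
    ["love", "python"],
    ["python", "is", "easy"],
    ["love", "coding"],
]

# Collect the unique tokens across all documents and sort them.
unique_words = sorted({token for doc in tokenized_docs for token in doc})

# Assign each word its position in the sorted list as a column index.
vocabulary = {word: index for index, word in enumerate(unique_words)}

print(vocabulary)
# {'coding': 0, 'easy': 1, 'is': 2, 'love': 3, 'python': 4}
```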
Step 3: Count Frequencies
Now CountVectorizer counts occurrences of each word in every document.
Document 1:
I love Python
Counts:
coding = 0
easy = 0
is = 0
love = 1
python = 1
Vector:
[0, 0, 0, 1, 1]
Document 2:
Python is easy
Vector:
[0, 1, 1, 0, 1]
Document 3:
I love coding
Vector:
[1, 0, 0, 1, 0]
Final Matrix:
| Document | coding | easy | is | love | python |
|---|---|---|---|---|---|
| I love Python | 0 | 0 | 0 | 1 | 1 |
| Python is easy | 0 | 1 | 1 | 0 | 1 |
| I love coding | 1 | 0 | 0 | 1 | 0 |
This is called a Document-Term Matrix (DTM).
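The counting step that produces this matrix can be sketched without scikit-learn. The snippet below is an illustration under stated assumptions: the vocabulary is hard-coded from the table above, the documents are already tokenized, and count_vector is a hypothetical helper name. The real library stores the result as a sparse matrix rather than nested lists.

```python
from collections import Counter

# Vocabulary in column order, taken from the table above.
vocabulary = ["coding", "easy", "is", "love", "python"]

# Documents after tokenization (lowercased, one-letter "I" dropped).
tokenized_docs = [
    ["love", "python"],
    ["python", "is", "easy"],
    ["love", "coding"],
]

def count_vector(tokens, vocabulary):
    """Return one row of the document-term matrix for a token list."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocabulary]

matrix = [count_vector(doc, vocabulary) for doc in tokenized_docs]
for row in matrix:
    print(row)
# [0, 0, 0, 1, 1]
# [0, 1, 1, 0, 1]
# [1, 0, 0, 1, 0]
```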
Python Program
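from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus: each string is one document
documents = [
    "I love Python",
    "Python is easy",
    "I love coding"
]

# Create the vectorizer with default settings
vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse count matrix
X = vectorizer.fit_transform(documents)

# Mapping from each word to its column index
print("Vocabulary:")
print(vectorizer.vocabulary_)
# Words in column order
print("\nFeatures:")
print(vectorizer.get_feature_names_out())
# toarray() converts the sparse matrix to a dense array for display
print("\nCount Matrix:")
print(X.toarray())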
Output:
Vocabulary:
{'love': 3, 'python': 4, 'is': 2, 'easy': 1, 'coding': 0}
Features:
['coding' 'easy' 'is' 'love' 'python']
Count Matrix:
[[0 0 0 1 1]
 [0 1 1 0 1]
 [1 0 0 1 0]]