Natural Language Processing

Master Natural Language Processing with Python. Learn text preprocessing, sentiment analysis, and named entity recognition (NER), and work with modern transformer models.

Introduction to Natural Language Processing

Natural Language Processing (NLP) is a branch of AI that helps computers understand, interpret, and manipulate human language. It combines computational linguistics with machine learning and deep learning.

Examples:

# Common NLP Tasks:
- Text Classification (Sentiment Analysis)
- Named Entity Recognition (NER)
- Machine Translation
- Question Answering
- Text Summarization
- Chatbots and Conversational AI

Key applications of NLP

Setting Up NLP Environment

Install essential NLP libraries including NLTK, spaCy, and transformers for working with text data.

Examples:

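# Install NLP libraries
pip install nltk spacy transformers

# The transformers pipelines need a deep learning backend; PyTorch is one option
pip install torch

# Install spaCy language models
# (en_core_web_md ships the word vectors used in the Word Embeddings section)
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md

# Install additional tools
# (scikit-learn is used in the Text Classification section)
pip install textblob wordcloud scikit-learn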

Install essential NLP libraries

import nltk
import spacy
from transformers import pipeline

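# Download NLTK data
nltk.download('punkt')      # tokenizer models
nltk.download('punkt_tab')  # tokenizer data that recent NLTK releases look for instead of 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')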

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

Import and initialize NLP libraries
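
Before running the examples, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of any of the libraries above.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

# Import names of the libraries installed above
required = ["nltk", "spacy", "transformers", "textblob"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are installed.")
```

This only checks that a module can be located. It does not verify that spaCy models or NLTK data have been downloaded.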

Text Preprocessing

Text preprocessing is crucial for NLP tasks. It involves cleaning and transforming raw text into a format suitable for analysis.

Examples:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Natural Language Processing is amazing! It's transforming AI."

# Tokenization
tokens = word_tokenize(text.lower())
print(f"Tokens: {tokens}")

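# Remove stopwords and punctuation (isalnum() drops tokens such as "!" and "'s")
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.isalnum() and w not in stop_words]
print(f"Filtered: {filtered_tokens}")

# Lemmatization (lemmatize() treats every word as a noun unless a pos argument
# is given, so verb forms such as "transforming" are left unchanged here)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered_tokens]
print(f"Lemmatized: {lemmatized}")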

Complete text preprocessing pipeline
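
To see what tokenization and stopword filtering do mechanically, here is a dependency-free sketch. The regex tokenizer and the tiny `STOP_WORDS` set are simplifications chosen for illustration; they are not NLTK's tokenizer or its English stopword list, and the function name `simple_preprocess` is hypothetical.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"is", "it", "s", "a", "an", "the", "and", "of", "to"}

def simple_preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

text = "Natural Language Processing is amazing! It's transforming AI."
print(simple_preprocess(text))
# ['natural', 'language', 'processing', 'amazing', 'transforming', 'ai']
```

A regex like this splits "It's" into "it" and "s", which is why "s" appears in the stopword set. A dedicated tokenizer handles contractions, URLs, and numbers more carefully.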

Sentiment Analysis

Sentiment analysis determines the emotional tone of text. It's widely used for analyzing customer reviews, social media, and feedback.

Examples:

from textblob import TextBlob

# Analyze sentiment
text = "I love this product! It's absolutely fantastic."
blob = TextBlob(text)

# Get polarity (-1 to 1) and subjectivity (0 to 1)
sentiment = blob.sentiment
print(f"Polarity: {sentiment.polarity}")
print(f"Subjectivity: {sentiment.subjectivity}")

if sentiment.polarity > 0:
    print("Positive sentiment")
elif sentiment.polarity < 0:
    print("Negative sentiment")
else:
    print("Neutral sentiment")

Simple sentiment analysis with TextBlob
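
TextBlob's default analyzer is lexicon-based: words carry pre-assigned scores that are combined into one polarity value. The sketch below is a much-simplified illustration of that idea, not TextBlob's actual algorithm. The `LEXICON` entries and the function `toy_polarity` are made up for this example, and the sketch ignores negation and intensifiers.

```python
# Made-up word scores in [-1, 1]; a real lexicon has thousands of entries
LEXICON = {"love": 0.5, "fantastic": 0.4, "good": 0.7, "hate": -0.8, "terrible": -1.0}

def toy_polarity(text):
    """Average the scores of known words; return 0.0 when none are found."""
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("I love this product! It's absolutely fantastic."))  # positive value
print(toy_polarity("The package arrived on Tuesday."))                  # 0.0, no known words
```

Because unknown words contribute nothing, a lexicon approach is only as good as its word list, which is one reason the transformer-based approach is often more accurate.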

from transformers import pipeline

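# Use a pre-trained transformer model. Naming the model explicitly avoids the
# "no model was supplied" warning. This model only predicts POSITIVE or NEGATIVE,
# so neutral text such as the third example is still assigned one of the two labels.
sentiment_analyzer = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english'
)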

texts = [
    "This movie was absolutely wonderful!",
    "I hated the service at this restaurant.",
    "The product is okay, nothing special."
]

results = sentiment_analyzer(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, Score: {result['score']:.4f}\n")

Advanced sentiment analysis with transformers
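
A two-class sentiment model returns a label and a confidence score, which makes lukewarm text hard to spot. One possible way to handle this, sketched below, is to convert each result into a signed score and treat low-confidence predictions as neutral. The helper `to_signed_score` and the `neutral_band` threshold are assumptions for illustration, and the sample dictionaries are hand-written stand-ins rather than real model output.

```python
def to_signed_score(result, neutral_band=0.2):
    """Convert {'label', 'score'} into (signed score in [-1, 1], coarse label)."""
    # For a two-class model the score of the predicted label lies in 0.5..1.0,
    # so rescale it to 0..1 and apply the sign of the label.
    strength = 2 * result["score"] - 1
    signed = strength if result["label"] == "POSITIVE" else -strength
    if abs(signed) < neutral_band:
        return signed, "neutral"
    return signed, ("positive" if signed > 0 else "negative")

# Hand-written examples in the same shape the pipeline returns
samples = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.55},
]
for sample in samples:
    signed, label = to_signed_score(sample)
    print(f"{sample} -> {signed:+.2f} ({label})")
```

The threshold is arbitrary and should be tuned on labeled data. A model trained with an explicit neutral class is usually a better choice when neutral text matters.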

Named Entity Recognition (NER)

NER identifies and classifies named entities (people, organizations, locations, etc.) in text.

Examples:

import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

text = """
Apple Inc. was founded by Steve Jobs in Cupertino, California.
The company released the iPhone in 2007.
"""

# Process text
doc = nlp(text)

# Extract named entities
print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_:15} {spacy.explain(ent.label_)}")

Extract named entities using spaCy
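
Once entities are extracted, a common next step is to group them by type. The sketch below does this with plain Python; `group_entities` is a hypothetical helper, and the sample pairs are hand-written stand-ins for what a model might return, not actual spaCy output.

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (text, label) pairs into {label: [texts]}, skipping repeated texts."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

# Hand-written pairs standing in for ((ent.text, ent.label_) for ent in doc.ents)
sample = [
    ("Apple Inc.", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("Cupertino", "GPE"),
    ("California", "GPE"),
    ("2007", "DATE"),
]
print(group_entities(sample))
```

With a real spaCy `doc`, the call would be `group_entities((ent.text, ent.label_) for ent in doc.ents)`.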

Text Classification

Text classification assigns predefined categories to text documents. Common applications include spam detection and topic categorization.

Examples:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample data (too small to split into train and test sets; for illustration only)
texts = [
    "Free money now! Click here!",
    "Meeting scheduled for tomorrow",
    "Win a free iPhone today!",
    "Project deadline is next week",
    "Congratulations! You won the lottery!"
]
labels = ['spam', 'ham', 'spam', 'ham', 'spam']

# Vectorize text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train classifier
clf = MultinomialNB()
clf.fit(X, labels)

# Predict
new_text = ["Important meeting reminder"]
X_new = vectorizer.transform(new_text)
prediction = clf.predict(X_new)
print(f"Prediction: {prediction[0]}")

Text classification with TF-IDF and Naive Bayes
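
To make the vectorization step less of a black box, here is a small pure-Python sketch of TF-IDF weighting. It uses the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is how scikit-learn documents its default behavior. The whitespace tokenization is a simplification, so the numbers will not match `TfidfVectorizer` exactly on arbitrary text, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count multiplied by the smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Scale to unit length; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["free money now", "free iphone today", "meeting tomorrow"]
for doc, vec in zip(docs, tfidf(docs)):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first document, "free" receives a lower weight than "money" because it also appears in the second document. Terms that occur everywhere carry little information for telling documents apart.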

Word Embeddings

Word embeddings represent words as dense vectors that capture semantic meaning. Similar words have similar vector representations.

Examples:

import spacy

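# Load a model that ships with word vectors
# (install it first: python -m spacy download en_core_web_md)
nlp = spacy.load('en_core_web_md')

# Process each word; the resulting Doc exposes its vector via .vector
word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("car")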

# Calculate similarity
similarity = word1.similarity(word2)
print(f"Similarity (king, queen): {similarity:.4f}")

similarity = word1.similarity(word3)
print(f"Similarity (king, car): {similarity:.4f}")

Working with word embeddings in spaCy
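
By default, spaCy's `similarity` is a cosine similarity between vectors. The sketch below computes that measure by hand. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions and are learned from data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not real embeddings
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.1]
car = [0.1, 0.2, 0.9]

print(f"king vs queen: {cosine_similarity(king, queen):.4f}")
print(f"king vs car:   {cosine_similarity(king, car):.4f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score closer to 0, regardless of vector length.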

Text Generation with Transformers

Modern transformer models such as GPT can generate human-like text. The Hugging Face transformers library makes it easy to use pre-trained models.

Examples:

from transformers import pipeline

# Create text generation pipeline
generator = pipeline('text-generation', model='gpt2')

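# Generate text
prompt = "Artificial intelligence is"
result = generator(
    prompt,
    max_length=50,           # total length in tokens, prompt included
    num_return_sequences=1,
    do_sample=True,          # make sampling explicit; temperature has no effect under greedy decoding
    temperature=0.7          # values below 1.0 make the output more conservative
)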

print(result[0]['generated_text'])

Generate text using GPT-2
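
The `temperature` parameter rescales the model's scores before they are turned into probabilities for sampling. The sketch below illustrates the effect with a plain softmax. The logits are made-up scores for three candidate tokens, and `softmax_with_temperature` is a hypothetical helper, not a function from the transformers library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.5, 1.0, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring token, giving more predictable text. Higher temperatures flatten the distribution and increase variety.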

Question Answering

Question answering systems can extract answers from context passages. This is useful for building chatbots and search systems.

Examples:

from transformers import pipeline

# Create QA pipeline
qa_pipeline = pipeline('question-answering')

context = """
Python is a high-level programming language. It was created by 
Guido van Rossum and first released in 1991. Python is known for 
its simple syntax and readability.
"""

question = "Who created Python?"

result = qa_pipeline(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")

Question answering with transformers
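
Extractive question answering models typically score every token as a possible answer start and answer end, then pick the best-scoring span. The sketch below shows that selection step in isolation. The scores are made up, `best_span` and `max_answer_len` are hypothetical names, and real pipelines work on subword tokens and handle more edge cases.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, score) maximizing start + end score with start <= end."""
    best = None
    for i, start_score in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best

tokens = ["It", "was", "created", "by", "Guido", "van", "Rossum", "in", "1991"]
# Made-up scores: higher means the model thinks the answer starts/ends there
start_scores = [0.1, 0.0, 0.2, 0.3, 4.0, 1.0, 0.5, 0.1, 0.8]
end_scores = [0.0, 0.1, 0.1, 0.2, 1.0, 1.5, 4.2, 0.1, 0.9]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # Guido van Rossum
```

Because the answer is always a span copied from the context, this kind of system cannot answer questions whose answer is not stated in the passage.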

Quick Reference

Essential Libraries

  • nltk - Text processing toolkit
  • spacy - Industrial-strength NLP pipelines
  • transformers - Pre-trained models
  • textblob - Simple text processing

Best Practices

  • ✓ Preprocess text for classical models; transformer pipelines tokenize raw text themselves
  • ✓ Use pre-trained models when possible
  • ✓ Consider context and domain
  • ✓ Evaluate on diverse datasets