Advanced Text Summarizer Using TF-IDF and Cosine Similarity
Build a Python function that performs extractive text summarization by selecting the most important sentences based on TF-IDF vectors and sentence similarity measures.
Challenge prompt
Create a function called 'extractive_summarizer' that takes a long-form text as input and returns a concise summary consisting of the top N most important sentences. The importance of sentences should be determined using TF-IDF vectorization and ranked by their centrality measured via cosine similarity between sentences. Your summarizer should preprocess the text by tokenizing sentences and words, computing TF-IDF weights for each word in sentences, building a similarity matrix, and finally selecting the most central sentences for the summary in their original order.
Guidance
- Preprocess the text into sentences and tokenize each sentence into words (consider removing stopwords and applying basic normalization).
- Use TF-IDF vectorization to represent sentences as vectors.
- Build a similarity matrix (using cosine similarity) where each element represents the similarity between two sentences.
- Rank sentences based on their importance derived from the similarity matrix and return the top N sentences in their original order.
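As a rough illustration of the vectorization and similarity steps, here is a minimal sketch that assumes the sentences are already split into a list. The helper name `build_similarity_matrix` and the sample sentences are made up for this example, and scikit-learn's built-in English stopword list stands in for a fuller preprocessing step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_similarity_matrix(sentences):
    """Return an (n, n) cosine-similarity matrix over TF-IDF sentence vectors.

    Hypothetical helper: each sentence is treated as its own "document".
    """
    # Lowercasing and scikit-learn's English stopword list act as a
    # simple substitute for a hand-written preprocessing step.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)  # sparse, one row per sentence
    # Entry [i][j] is the cosine similarity between sentences i and j.
    return cosine_similarity(tfidf)


# Illustrative input only
sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock markets fell sharply today.",
]
sim = build_similarity_matrix(sentences)
```

In this toy input the first two sentences share the terms "cat" and "mat", so their similarity is positive, while the third shares no non-stopword terms with the first and scores zero against it.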
Hints
- Use libraries such as NLTK or spaCy for sentence and word tokenization and stopword removal.
- Leverage scikit-learn's TfidfVectorizer for TF-IDF calculations.
- Consider using PageRank or a simple weighted sum approach on the similarity matrix to score sentences.
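The two scoring options in the last hint could look roughly like the sketch below, written in plain Python so the arithmetic is visible. The function names, the damping factor of 0.85, the fixed iteration count, and the sample matrix are all assumptions made for illustration, not part of the challenge.

```python
def weighted_sum_scores(sim):
    """Score each sentence by its total similarity to all other sentences."""
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]


def pagerank_scores(sim, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style power iteration.

    The similarity matrix is read as a weighted graph; self-similarity
    on the diagonal is ignored.
    """
    n = len(sim)
    weights = [[sim[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out_totals[i] > 0:
                    # Sentence i passes on its score in proportion to similarity
                    rank += scores[i] * weights[i][j] / out_totals[i]
                else:
                    # A sentence similar to nothing spreads its score evenly
                    rank += scores[i] / n
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


# Illustrative 3-sentence similarity matrix (made-up values)
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
deg = weighted_sum_scores(sim)
pr = pagerank_scores(sim)
```

On this small matrix both approaches agree on the ordering: the middle sentence is the most central and the last one the least. The weighted sum is cheaper, while the PageRank variant also rewards being similar to other central sentences.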
Starter code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# One-time downloads of the tokenizer models and the stopword list
# (recent NLTK releases may also require nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')

def extractive_summarizer(text, top_n=3):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    # Preprocessing steps (stopword removal, lowercasing, etc.) go here
    # Calculate TF-IDF vectors for sentences
    # Build similarity matrix
    # Rank sentences by centrality
    # Return top_n sentences in original order
    return []
Expected output
A list of the most important sentences (strings), representing a concise summary of the input text, preserving original sentence ordering.
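One way to produce a list in that shape, given one score per sentence, is sketched below. The helper name `select_top_sentences` and the sample data are hypothetical; the idea is to pick the highest-scoring indices first and then sort those indices so the summary reads in document order.

```python
def select_top_sentences(sentences, scores, top_n=3):
    """Return the top_n highest-scoring sentences in their original order."""
    # Rank sentence indices from highest to lowest score
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n indices, then restore document order
    chosen = sorted(ranked[:top_n])
    return [sentences[i] for i in chosen]


# Illustrative data only
sentences = ["First.", "Second.", "Third.", "Fourth."]
scores = [0.1, 0.9, 0.3, 0.8]
summary = select_top_sentences(sentences, scores, top_n=2)
```

If `top_n` exceeds the number of sentences, the slice simply returns every sentence, so short inputs come back unchanged.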