Python | Advanced | 60 minutes

Advanced Text Summarizer Using TF-IDF and Cosine Similarity

Build a Python function that performs extractive text summarization by selecting the most important sentences based on TF-IDF vectors and sentence similarity measures.

Challenge prompt

Create a function called 'extractive_summarizer' that takes a long-form text as input and returns a concise summary made up of its top N most important sentences. Determine importance by representing each sentence as a TF-IDF vector and ranking sentences by their centrality, measured through the cosine similarity between sentences. The summarizer should split the text into sentences and words, compute TF-IDF weights for the words in each sentence, build a sentence similarity matrix, and return the most central sentences in their original order.
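To make the two core measures concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity over already-tokenized sentences. The helper names `tfidf_vectors` and `cosine` are hypothetical, and the smoothed IDF formula is an assumption chosen to match the one scikit-learn documents for its default settings; the challenge itself does not prescribe a formula.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_sentences):
    """Return one L2-normalized TF-IDF dict (term -> weight) per sentence."""
    n = len(tokenized_sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized_sentences for term in set(tokens))
    vectors = []
    for tokens in tokenized_sentences:
        tf = Counter(tokens)
        # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


sentences = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
    ["stocks", "fell", "sharply"],
]
vecs = tfidf_vectors(sentences)
print(cosine(vecs[0], vecs[1]))  # positive: the sentences share two terms
print(cosine(vecs[0], vecs[2]))  # zero: the sentences share no terms
```

Because every vector is normalized to unit length, the cosine reduces to a plain dot product, which is why sentences with no terms in common score exactly zero.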

Guidance

• Preprocess the text into sentences and tokenize each sentence into words (consider removing stopwords and applying basic normalization).
• Use TF-IDF vectorization to represent sentences as vectors.
• Build a similarity matrix (using cosine similarity) where each element represents similarity between two sentences.
• Rank sentences based on their importance derived from the similarity matrix and return the top N sentences in their original order.
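The last guidance step, ranking and then restoring document order, can be sketched on its own. The example below assumes the similarity matrix has already been built and uses a simple weighted-sum centrality (each sentence's total similarity to all others). The function name `top_central_indices` and the sample matrix are hypothetical.

```python
def top_central_indices(similarity, top_n):
    """Return the positions of the top_n most central sentences, in document order."""
    n = len(similarity)
    # Centrality = total similarity to every other sentence (skip the diagonal).
    scores = [sum(similarity[i][j] for j in range(n) if j != i) for i in range(n)]
    # Highest score first; on ties the earlier sentence wins (sorted is stable).
    ranked = sorted(range(n), key=lambda i: -scores[i])
    # Re-sort the winners by position to restore the original order.
    return sorted(ranked[:top_n])


# Hypothetical 4-sentence similarity matrix (symmetric, 1.0 on the diagonal).
similarity = [
    [1.0, 0.2, 0.1, 0.5],
    [0.2, 1.0, 0.0, 0.4],
    [0.1, 0.0, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]
print(top_central_indices(similarity, top_n=2))  # [0, 3]
```

Sentence 3 scores highest and sentence 0 second, yet the result lists 0 before 3 because selection and ordering are separate steps: rank by score, then sort the chosen indices by position.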

Hints

• Use libraries such as NLTK or spaCy for sentence and word tokenization and stopword removal.
• Leverage scikit-learn's TfidfVectorizer for TF-IDF calculations.
• Consider using PageRank or a simple weighted sum approach on the similarity matrix to score sentences.
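For the PageRank option, one possible approach is plain power iteration over the row-normalized similarity matrix. This is a minimal sketch, not a library implementation: the function name `pagerank_scores`, the damping factor of 0.85, and the fixed iteration count are all assumptions for illustration.

```python
def pagerank_scores(similarity, damping=0.85, iterations=50):
    """Score sentences by power iteration on a square similarity matrix."""
    n = len(similarity)
    weights = []
    for i, row in enumerate(similarity):
        # Zero the diagonal so a sentence does not vote for itself.
        row = [0.0 if j == i else w for j, w in enumerate(row)]
        total = sum(row)
        # Normalize into transition probabilities; a sentence similar to
        # nothing spreads its vote uniformly.
        weights.append([w / total for w in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * weights[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# Hypothetical 4-sentence similarity matrix (symmetric, 1.0 on the diagonal).
similarity = [
    [1.0, 0.2, 0.1, 0.5],
    [0.2, 1.0, 0.0, 0.4],
    [0.1, 0.0, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]
scores = pagerank_scores(similarity)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # index of the highest-scoring sentence
```

Unlike a plain row sum, PageRank weights each vote by the voter's own score, so similarity to an already central sentence counts for more than similarity to a peripheral one.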

Starter code

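import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fetch the tokenizer models and stopword list (skipped if already present).
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)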


def extractive_summarizer(text, top_n=3):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    
    # Preprocessing steps (stopword removal, lowercasing, etc.) go here
    
    # Calculate TF-IDF vectors for sentences
    
    # Build similarity matrix
    
    # Rank sentences by centrality
    
    # Return top_n sentences in original order
    
    return []

Expected output

A list of the most important sentences (strings), representing a concise summary of the input text, preserving original sentence ordering.
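One way to check a solution against this output contract is a small helper that verifies the result is a list of strings, no longer than the requested size, drawn from the source sentences in their original order. The name `is_ordered_selection` and the sample sentences are hypothetical; this is a sketch of a sanity check, not an official grader.

```python
def is_ordered_selection(summary, sentences, top_n):
    """Return True if summary is an in-order selection of at most top_n sentences."""
    if not isinstance(summary, list) or len(summary) > top_n:
        return False
    if not all(isinstance(s, str) for s in summary):
        return False
    # Each summary sentence must occur in `sentences` after the previous match.
    position = 0
    for s in summary:
        try:
            position = sentences.index(s, position) + 1
        except ValueError:
            return False
    return True


sentences = ["First point.", "Second point.", "Third point.", "Fourth point."]
print(is_ordered_selection(["First point.", "Fourth point."], sentences, 3))  # True
print(is_ordered_selection(["Fourth point.", "First point."], sentences, 3))  # False
```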

Core concepts

• Text preprocessing
• TF-IDF vectorization
• Cosine similarity
• Extractive summarization
