However, developing a text summarizer on your own can be a difficult task. You need to know which libraries are required and, just as importantly, how to put them together into working code.
Thankfully, you’ve landed on the right article to learn the entire process of developing an AI text summarizer model using Python. We will try to cover each aspect in detail so you don’t have to worry about anything. Let’s get started.
Understanding Text Summarization
There are two main types of text summarization techniques: extractive and abstractive. In extractive summarization, you pick a few sentences verbatim from the original passage and present them as a boiled-down version.
The revised version still carries the same meaning as the source, but with far fewer words and less redundancy.
Abstractive summarization works completely differently. It requires understanding the context of the original work and writing the summary in your own words, which is how humans normally write summaries. For example, an extractive summary of a news story might copy its opening sentence word for word, while an abstractive one would paraphrase the whole story in a fresh sentence.

The text summarizer that we'll develop today will produce abstractive summaries. That is because we'll use Natural Language Processing (NLP) algorithms at the backend that give the output a human touch.
Setting up the Environment
To begin, you'll need to set up a virtual environment. This keeps the project's packages isolated from your global Python installation and avoids version conflicts between projects.
Start by opening the command prompt with administrative privileges. Then, change the directory to where you'll save the project files and enter the following commands.
python -m venv text_summarization
text_summarization\Scripts\activate
Pressing Enter after each command will create and activate your virtual environment for the project. Now, install a suitable IDE like Atom, Spyder, Visual Studio, etc., and point it at the newly created text_summarization environment.
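Note that the activation command above is for Windows. On macOS or Linux, the equivalent steps are:
python3 -m venv text_summarization
source text_summarization/bin/activate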
Installing Libraries
After you’re done with setting up the basics, it’s time to download and install the required libraries.
To develop a text summarizer model, you need to input the following line in your command prompt.
pip install pandas transformers torch sentencepiece
This single command installs all the required libraries onto your PC or laptop. Why these particular libraries were chosen is another discussion and can be covered in a separate post later.
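If you want to confirm that everything installed correctly, a quick sanity check is to print each library's version (the exact output will vary with your setup):
import pandas, torch, transformers, sentencepiece
print("pandas:", pandas.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("sentencepiece:", sentencepiece.__version__)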
Importing Libraries
Moving on, hop over to your IDE and import the required libraries into the workspace using the following code.
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
# AdamW has been removed from recent transformers releases, so import it from torch
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5Tokenizer
The Pandas library will be used to handle the dataset inputs, and Hugging Face's pre-trained T5 transformer will help us quickly get accurate summarization results.
These lines of code ensure that all your tools are ready at your disposal, so we don't face import errors later on in the execution phase.
Loading the Dataset for Model Training
Once you’ve imported the libraries, the next step is to load and preprocess the data for the abstractive summarization process. To do so, follow the code given below.
# Paths to your dataset
train_file = 'C:/Users/Common/Desktop/train.csv'
test_file = 'C:/Users/Common/Desktop/test.csv'
val_file = 'C:/Users/Common/Desktop/validation.csv'

# Load the datasets; usecols ensures we keep only the columns we need
train_data = pd.read_csv(train_file, usecols=['article', 'highlights'])
test_data = pd.read_csv(test_file, usecols=['article', 'highlights'])
val_data = pd.read_csv(val_file, usecols=['article', 'highlights'])
Here, we have given the CNN/Daily Mail dataset as input. This provides the model with enough examples to train on and return valid results.
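If you don't have these CSV files locally, one possible way to obtain them (an assumption here: you also install the Hugging Face datasets package, which is not part of the earlier pip command) is to export the CNN/Daily Mail splits yourself:
# Requires: pip install datasets
from datasets import load_dataset

ds = load_dataset("cnn_dailymail", "3.0.0")
# Write each split to a CSV with only the two columns this tutorial uses
for split, fname in [("train", "train.csv"), ("test", "test.csv"), ("validation", "validation.csv")]:
    ds[split].to_pandas()[["article", "highlights"]].to_csv(fname, index=False)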
We have given a static input from our Desktop directory. You can also take the file paths dynamically from users, though that requires a few changes to the program, as sketched below.
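For instance, a minimal sketch of prompting for the paths at runtime (the prompt strings here are purely illustrative):
# Ask the user for the dataset locations instead of hard-coding them
train_file = input("Path to training CSV: ").strip()
test_file = input("Path to test CSV: ").strip()
val_file = input("Path to validation CSV: ").strip()
train_data = pd.read_csv(train_file, usecols=['article', 'highlights'])
print(train_data.head())  # quick peek to confirm the file loaded correctly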
Tokenizing the Dataset
After you’re done getting the dataset in the workspace, it is time to initialize and run the transformer for the tokenization process. Start by inputting the following lines of code in the IDE.
# Initialize the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Function to tokenize data
def tokenize_data(texts, max_length=512):
    return tokenizer(texts, max_length=max_length, truncation=True,
                     padding="max_length", return_tensors="pt")

# Custom dataset class
class TextSummaryDataset(Dataset):
    def __init__(self, articles, summaries):
        self.articles = articles
        self.summaries = summaries

    def __len__(self):
        return len(self.articles)

    def __getitem__(self, idx):
        article = self.articles.iloc[idx]
        summary = self.summaries.iloc[idx]
        # Tokenize the inputs and outputs (article and summary)
        encodings = tokenizer(article, max_length=512, truncation=True,
                              padding="max_length", return_tensors="pt")
        labels = tokenizer(summary, max_length=150, truncation=True,
                           padding="max_length", return_tensors="pt")
        # Flatten the tensors (get rid of the extra batch dimension)
        encodings = {key: val.squeeze(0) for key, val in encodings.items()}
        labels = {key: val.squeeze(0) for key, val in labels.items()}
        encodings["labels"] = labels["input_ids"]
        return encodings
Using the above code, you'll initialize the model and tokenize the input content. For efficiency, we defined a helper function for the tokenizer; it returns the encodings needed to process the input while keeping the code readable.
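To see what the tokenizer actually produces, you can run the helper on a short sample string (the sentence here is purely illustrative):
sample = "The quick brown fox jumps over the lazy dog."
enc = tokenize_data(sample, max_length=16)
print(enc["input_ids"].shape)    # torch.Size([1, 16]): one sequence padded to 16 tokens
print(enc["attention_mask"][0])  # 1s mark real tokens, 0s mark padding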
Creating Dataset Objects
The next step is to create dataset objects and prepare the data loaders, including one for the later validation loop. These will feed batches into the text summarizer model that we'll train and test shortly.
# Create dataset objects
train_dataset = TextSummaryDataset(train_data['article'], train_data['highlights'])
test_dataset = TextSummaryDataset(test_data['article'], test_data['highlights'])
val_dataset = TextSummaryDataset(val_data['article'], val_data['highlights'])

# DataLoader: decrease batch size to reduce memory usage
batch_size = 4  # Reduce this if you run into memory issues
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# Check if a GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)  # Lower learning rate for stability
We’ve currently kept the batch size to 4, however, you can decrease this value if your processor isn’t that strong. This will save some time but degrade the overall quality of the output a bit.
Along with the dataset objects and data loader, you’ll also need to initialize the optimizer. You will see why it is important in a bit.
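Before starting the training run, it can help to pull a single batch and confirm the tensor shapes are what the model expects (an optional sanity check):
batch = next(iter(train_loader))
print(batch["input_ids"].shape)  # torch.Size([4, 512]): tokenized articles
print(batch["labels"].shape)     # torch.Size([4, 150]): tokenized summaries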
Running a Training Loop
We must run a training loop with the data loaders and objects created in the previous step. This step is essential and cannot be skipped.
# Training loop
epochs = 3
model.train()
for epoch in range(epochs):
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        # Move the batch to the correct device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Forward pass
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # Backward pass and optimization step
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Print the average loss for each epoch
    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs}, Training Loss: {avg_train_loss}")
Here, the outer loop runs for 3 epochs, meaning the program makes 3 complete passes over the training data, performing a forward and backward pass through the transformer for every batch.
You can reduce the number of epochs to cut training time, but expect a significant reduction in summary quality.
The next part is, logically, running a validation loop to see how well your text summarizer model handles unseen data. To keep the main run quick and simple, we'll leave it optional, but a minimal version is sketched below.
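Here is a minimal validation pass, reusing the val_loader created earlier (a sketch; it reports only the loss, not a direct measure of summary quality):
# Validation loop: measure loss on unseen data without updating weights
model.eval()
val_loss = 0
with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask, labels=labels)
        val_loss += outputs.loss.item()
print(f"Validation Loss: {val_loss / len(val_loader)}")
model.train()  # switch back to training mode if you plan to train further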
Extracting Results
The final step in the entire procedure is to define a summary-generating function that will give us the output.
# Function to generate summaries
def generate_summary(text, max_length=150):
    inputs = tokenizer(text, return_tensors="pt", max_length=512,
                       truncation=True).to(device)
    summary_ids = model.generate(inputs.input_ids, max_length=max_length,
                                 length_penalty=1.5, num_beams=6,
                                 early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Test with an example from the test set
test_article = test_data['article'].iloc[0]
print("Original Article: ", test_article)
print("Generated Summary: ", generate_summary(test_article))
When the model is done training, these lines of code will provide the user with the original article, along with its generated summary.
And that is pretty much it for the entire development process. You can also take user input to generate summaries, for instance with the small loop below. Play around a bit with the base version of the code until you find the perfect fit for your needs.
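A simple interactive loop (the prompt text is just an illustration) lets anyone paste in an article and get a summary back:
# Interactive loop: summarize whatever the user pastes in
while True:
    text = input("\nPaste an article to summarize (or 'q' to quit): ")
    if text.strip().lower() == 'q':
        break
    print("Summary:", generate_summary(text))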
A Deployed Text Summarizer in Action
In our development process, we didn't deploy the AI summarizer on the internet. However, many online tools today do make their models available to web users.
A case in point is the Text Summarizer by Editpad, which is worth mentioning for its high operational efficiency and well-tuned training parameters.

Just like the mentioned tool, you can also add features like 'Show Bullets' to your model to create a one-stop solution for users looking to make information concise; a rough sketch of such a feature follows below. However, this may require more coding and deployment effort on both the front end and the back end.
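As a rough idea (a naive sentence split; production tools likely do something more sophisticated), a 'Show Bullets' view could start as simply as:
# Naive 'Show Bullets' view: print each summary sentence as a bullet
summary = generate_summary(test_article)
for sentence in summary.split('. '):
    sentence = sentence.strip().rstrip('.')
    if sentence:
        print('-', sentence)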
Conclusion
Text summarizers are essential tools that condense content effectively using NLP algorithms and Python. Developing a text summarizer independently can be challenging, but this guide covers everything for your convenience.
It starts with completing the prerequisites, then installing and importing the required libraries. Afterward, you prepare the data, set up the model and optimizer, train the model, and extract the generated summaries.