However, developing a text summarizer on your own can be a difficult task. You need to know which libraries are required and, just as importantly, how to put them together into working code.
Thankfully, you’ve landed on the right article to learn the entire process of developing an AI text summarizer model using Python. We will try to cover each aspect in detail so you don’t have to worry about anything. Let’s get started.
Understanding Text Summarization
There are two main types of text summarization techniques: extractive and abstractive. In extractive summarization, you pick a few sentences verbatim from the original passage and present them as a boiled-down version.
The revised version still carries the same meaning as the source, but with far fewer words and less redundancy.
Abstractive summarization works completely differently. It requires understanding the context of the original work and writing the summary in your own words, which is how humans normally write summaries. For example, an extractive summary of a news story might copy its opening sentence word for word, while an abstractive one would paraphrase the whole story in a fresh sentence.

The text summarizer that we'll develop today will produce abstractive summaries. That is because we'll use Natural Language Processing (NLP) algorithms at the backend that give the output a human touch.
Setting up the Environment
To begin, you'll need to set up a virtual environment. This keeps the project's packages isolated from your global Python installation and avoids version conflicts between projects.
Start by opening the command prompt with administrative privileges. Then, change the directory to where you'll save the project files and enter the following commands.
python -m venv text_summarization
text_summarization\Scripts\activate
Pressing Enter after each command will create and activate your virtual environment for the project. Now, install a suitable IDE like Atom, Spyder, Visual Studio, etc., and point it at the newly created text_summarization environment.
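Note that the activation command above is for Windows. On macOS or Linux, the equivalent steps are:
python3 -m venv text_summarization
source text_summarization/bin/activate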
Installing Libraries
After you’re done with setting up the basics, it’s time to download and install the required libraries.
To develop a text summarizer model, you need to input the following line in your command prompt.
pip install pandas transformers torch sentencepiece
This single command installs all the required libraries onto your PC or laptop. Why these particular libraries were chosen is another discussion and can be covered in a separate post later.
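If you want to confirm that everything installed correctly, a quick sanity check is to print each library's version (the exact output will vary with your setup):
import pandas, torch, transformers, sentencepiece
print("pandas:", pandas.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("sentencepiece:", sentencepiece.__version__)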
Importing Libraries
Moving on, hop over to your IDE and import the required libraries into the workspace using the following code.
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
# AdamW has been removed from recent transformers releases, so import it from torch
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5Tokenizer
The Pandas library will be used to handle the dataset inputs, and Hugging Face's pre-trained T5 transformer will help us quickly get accurate summarization results.
These lines of code ensure that all your tools are ready at your disposal, so we don't face import errors later on in the execution phase.
Loading the Dataset for Model Training
Once you’ve imported the libraries, the next step is to load and preprocess the data for the abstractive summarization process. To do so, follow the code given below.
# Paths to your dataset
train_file = 'C:/Users/Common/Desktop/train.csv'
test_file = 'C:/Users/Common/Desktop/test.csv'
val_file = 'C:/Users/Common/Desktop/validation.csv'

# Load the datasets; usecols ensures we keep only the columns we need
train_data = pd.read_csv(train_file, usecols=['article', 'highlights'])
test_data = pd.read_csv(test_file, usecols=['article', 'highlights'])
val_data = pd.read_csv(val_file, usecols=['article', 'highlights'])
Here, we have given the CNN/Daily Mail dataset as input. This provides the model with enough examples to train on and return valid results.
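If you don't have these CSV files locally, one possible way to obtain them (an assumption here: you also install the Hugging Face datasets package, which is not part of the earlier pip command) is to export the CNN/Daily Mail splits yourself:
# Requires: pip install datasets
from datasets import load_dataset

ds = load_dataset("cnn_dailymail", "3.0.0")
# Write each split to a CSV with only the two columns this tutorial uses
for split, fname in [("train", "train.csv"), ("test", "test.csv"), ("validation", "validation.csv")]:
    ds[split].to_pandas()[["article", "highlights"]].to_csv(fname, index=False)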
We have given a static input from our Desktop directory. You can also take the file paths dynamically from users, though that requires a few changes to the program, as sketched below.
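For instance, a minimal sketch of prompting for the paths at runtime (the prompt strings here are purely illustrative):
# Ask the user for the dataset locations instead of hard-coding them
train_file = input("Path to training CSV: ").strip()
test_file = input("Path to test CSV: ").strip()
val_file = input("Path to validation CSV: ").strip()
train_data = pd.read_csv(train_file, usecols=['article', 'highlights'])
print(train_data.head())  # quick peek to confirm the file loaded correctly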
Tokenizing the Dataset
After you’re done getting the dataset in the workspace, it is time to initialize and run the transformer for the tokenization process. Start by inputting the following lines of code in the IDE.
# Initialize the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Function to tokenize data
def tokenize_data(texts, max_length=512):
    return tokenizer(texts, max_length=max_length, truncation=True,
                     padding="max_length", return_tensors="pt")

# Custom dataset class
class TextSummaryDataset(Dataset):
    def __init__(self, articles, summaries):
        self.articles = articles
        self.summaries = summaries

    def __len__(self):
        return len(self.articles)

    def __getitem__(self, idx):
        article = self.articles.iloc[idx]
        summary = self.summaries.iloc[idx]
        # Tokenize the inputs and outputs (article and summary)
        encodings = tokenizer(article, max_length=512, truncation=True,
                              padding="max_length", return_tensors="pt")
        labels = tokenizer(summary, max_length=150, truncation=True,
                           padding="max_length", return_tensors="pt")
        # Flatten the tensors (get rid of the extra batch dimension)
        encodings = {key: val.squeeze(0) for key, val in encodings.items()}
        labels = {key: val.squeeze(0) for key, val in labels.items()}
        encodings["labels"] = labels["input_ids"]
        return encodings
Using the above code, you'll initialize the model and tokenize the input content. For efficiency, we defined a helper function for the tokenizer; it returns the encodings needed to process the input while keeping the code readable.
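To see what the tokenizer actually produces, you can run the helper on a short sample string (the sentence here is purely illustrative):
sample = "The quick brown fox jumps over the lazy dog."
enc = tokenize_data(sample, max_length=16)
print(enc["input_ids"].shape)    # torch.Size([1, 16]): one sequence padded to 16 tokens
print(enc["attention_mask"][0])  # 1s mark real tokens, 0s mark padding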
Creating Dataset Objects
The next step is to create dataset objects and prepare the data loaders, including one for the later validation loop. These will feed batches into the text summarizer model that we'll train and test shortly.
# Create dataset objects
train_dataset = TextSummaryDataset(train_data['article'], train_data['highlights'])
test_dataset = TextSummaryDataset(test_data['article'], test_data['highlights'])
val_dataset = TextSummaryDataset(val_data['article'], val_data['highlights'])

# DataLoader: decrease batch size to reduce memory usage
batch_size = 4  # Reduce this if you run into memory issues
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# Check if a GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)  # Lower learning rate for stability
We’ve currently kept the batch size to 4, however, you can decrease this value if your processor isn’t that strong. This will save some time but degrade the overall quality of the output a bit.
Along with the dataset objects and data loader, you’ll also need to initialize the optimizer. You will see why it is important in a bit.
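Before starting the training run, it can help to pull a single batch and confirm the tensor shapes are what the model expects (an optional sanity check):
batch = next(iter(train_loader))
print(batch["input_ids"].shape)  # torch.Size([4, 512]): tokenized articles
print(batch["labels"].shape)     # torch.Size([4, 150]): tokenized summaries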
Running a Training Loop
We must run a training loop with the data loaders and objects created in the previous step. This step is essential and cannot be skipped.
# Training loop
epochs = 3
model.train()
for epoch in range(epochs):
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        # Move the batch to the correct device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Forward pass
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # Backward pass and optimization step
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Print the average loss for each epoch
    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs}, Training Loss: {avg_train_loss}")
Here, the outer loop runs for 3 epochs, meaning the program makes 3 complete passes over the training data, performing a forward and backward pass through the transformer for every batch.
You can reduce the number of epochs to cut training time, but expect a significant reduction in summary quality.
The next part is, logically, running a validation loop to see how well your text summarizer model handles unseen data. To keep the main run quick and simple, we'll leave it optional, but a minimal version is sketched below.
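Here is a minimal validation pass, reusing the val_loader created earlier (a sketch; it reports only the loss, not a direct measure of summary quality):
# Validation loop: measure loss on unseen data without updating weights
model.eval()
val_loss = 0
with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask, labels=labels)
        val_loss += outputs.loss.item()
print(f"Validation Loss: {val_loss / len(val_loader)}")
model.train()  # switch back to training mode if you plan to train further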
Extracting Results
The final step in the entire procedure is to define a summary-generating function that will give us the output.
# Function to generate summaries
def generate_summary(text, max_length=150):
    inputs = tokenizer(text, return_tensors="pt", max_length=512,
                       truncation=True).to(device)
    summary_ids = model.generate(inputs.input_ids, max_length=max_length,
                                 length_penalty=1.5, num_beams=6,
                                 early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Test with an example from the test set
test_article = test_data['article'].iloc[0]
print("Original Article: ", test_article)
print("Generated Summary: ", generate_summary(test_article))
When the model is done training, these lines of code will provide the user with the original article, along with its generated summary.
And that is pretty much it for the entire development process. You can also take user input to generate summaries, for instance with the small loop below. Play around a bit with the base version of the code until you find the perfect fit for your needs.
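A simple interactive loop (the prompt text is just an illustration) lets anyone paste in an article and get a summary back:
# Interactive loop: summarize whatever the user pastes in
while True:
    text = input("\nPaste an article to summarize (or 'q' to quit): ")
    if text.strip().lower() == 'q':
        break
    print("Summary:", generate_summary(text))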
A Deployed Text Summarizer in Action
In our development process, we didn't deploy the AI summarizer on the internet. However, many online tools today do make their models available to web users.
A case in point is the Text Summarizer by Editpad, which is worth mentioning for its high operational efficiency and well-tuned training parameters.

Just like the mentioned tool, you can also add features like 'Show Bullets' to your model to create a one-stop solution for users looking to make information concise; a rough sketch of such a feature follows below. However, this may require more coding and deployment effort on both the front end and the back end.
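As a rough idea (a naive sentence split; production tools likely do something more sophisticated), a 'Show Bullets' view could start as simply as:
# Naive 'Show Bullets' view: print each summary sentence as a bullet
summary = generate_summary(test_article)
for sentence in summary.split('. '):
    sentence = sentence.strip().rstrip('.')
    if sentence:
        print('-', sentence)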
Conclusion
Text summarizers are essential tools that condense content effectively using NLP algorithms and Python. Developing a text summarizer independently can be challenging, but this guide covers everything for your convenience.
It starts with completing the prerequisites, then installing and importing the required libraries. Afterward, you prepare the data, set up the model and optimizer, train the model, and extract the generated summaries.