
GPT Summarization

GPT summarization condenses large volumes of text into concise, coherent summaries. It's particularly useful for quickly grasping the main points of articles, reports, and documents, saving valuable time and improving productivity.

This example uses the Amazon Sales dataset, which contains more than 1,000 Amazon product ratings and reviews as listed on the official Amazon website.

Quick Start

Before you start, make sure the NuPIC Inference Server is up and running, and the Python environment is set up.

Now navigate to the directory containing the GPT summarization example:

cd nupic.examples/examples/gpt_summarization

The directory structure looks like this:

gpt_summarization
├── datasets
│   ├── amazon_reviews
│   │   ├── amazon.csv -----------------------> Raw data
│   │   ├── preprocess_dataset.py ------------> Script for extracting data from CSV and saving as .txt
│   │   ├── database/ ------------------------> Preprocessed .txt files
│   │   ├── summaries/ -----------------------> GPT-generated summaries are saved here
│   │   └── README.md
│   └── README.md
├── generate_summaries.py --------------------> Main script for generating summaries
└── README.md

Check that the datasets/amazon_reviews/database/ folder already contains preprocessed reviews in the form of text files. If it doesn't, run python preprocess_dataset.py from the datasets/amazon_reviews/ directory to generate them.
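The preprocessing step is just a CSV-to-text extraction. The following is a hypothetical sketch of that step, not the actual contents of preprocess_dataset.py; in particular, the column name review_content and the output file naming are assumptions about the dataset's layout:

```python
import csv
from pathlib import Path


def preprocess(csv_path: str, out_dir: str,
               text_column: str = "review_content") -> int:
    """Write each non-empty review in the CSV to its own .txt file.

    Hypothetical sketch: the real preprocess_dataset.py may use different
    column names and file naming. Returns the number of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            text = (row.get(text_column) or "").strip()
            if text:
                # One review per file, e.g. database/review_1.txt
                (out / f"review_{i}.txt").write_text(text, encoding="utf-8")
                count += 1
    return count
```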

Next, open generate_summaries.py in a text editor, and check that the model and corresponding prompt formatter are correctly defined. A default installation of NuPIC contains NuPIC-GPT, which is specially optimized to run on CPUs.

def main(url: str, protocol: str, dataset_path: str) -> None:
    """Summarize the dataset."""
    model = "nupic-gpt"
    model = model_naming_mapping.get(model, model)
    prompt_formatter = get_prompt_formatter(model)
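The formatter returned by get_prompt_formatter() wraps messages in the chat template the chosen model expects, so the exact output is model-specific. As a rough illustration of the interface only (a system prompt, a list of user turns, and optional prior assistant turns interleaved between them), a stand-in formatter might look like this; the delimiter strings here are invented, not NuPIC's:

```python
def simple_prompt_formatter(system_prompt, user_messages,
                            assistant_messages=()):
    """Illustrative stand-in for a NuPIC prompt formatter.

    Interleaves user and assistant turns after a system prompt. The real
    formatter applies the model's own chat template; these tags are made up.
    """
    parts = [f"<<SYS>> {system_prompt} <</SYS>>"]
    for i, user in enumerate(user_messages):
        parts.append(f"[USER] {user}")
        # Insert the model's earlier reply between consecutive user turns.
        if i < len(assistant_messages):
            parts.append(f"[ASSISTANT] {assistant_messages[i]}")
    return "\n".join(parts)
```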

Finally, we're ready to generate some summaries!

python generate_summaries.py --dataset_path datasets/amazon_reviews

You should see summaries similar to this:

Summarizing document:  review_1288.txt
Summary: The Sujata Powermatic Plus, a juicer mixer grinder with a 900-watt motor and three jars, is a standout product among popular brands. It excels in both juicing and mixing/grinding, making it versatile for domestic or commercial use. Users appreciate its ease of use and cleaning, though some note that it may waste juice over time with infrequent use. Overall, customers are satisfied with its performance and sturdy build quality.

In More Detail

Let's take a closer look at generate_summaries.py to understand how this works.

We begin with a system prompt that tells the GPT model that it needs to perform a summary task. This system prompt is then concatenated with the document text, which in our case is an Amazon product review.

def summarize_document(document: str,
                       client: TextInferClient,
                       prompt_formatter: Callable[[str, str], str]) -> str:
    system_prompt = (
        "You are a helpful assistant who is proficient in summarizing documents. "
        "You will be given a product review and respond with a concise summary of less "
        "than 50 words. Each summary should start with 'Summary:'."
    )
    prompt = prompt_formatter(system_prompt, [document])

Sometimes the model-generated summary is still too long. That's why summarize_document() contains a condition that asks the model to try again if the initial summary exceeds 70 words: an additional prompt is appended to the original one described above.

    if len(summary.split()) > 70:
        second_prompt = (
            "The summary was too long or did not start with 'Summary:'. "
            "Please try again to generate a 50-word summary starting with 'Summary:'"
        )
        # Recover the model's own turn: if the response didn't echo the
        # input document, keep it as-is; otherwise keep the trailing lines.
        attempt = summary.split(document)

        model_response = (attempt[0]
                          if len(attempt) == 1
                          else "\n".join(summary.split("\n")[-2:]))
        # Resend the conversation: document, the model's first attempt,
        # then the follow-up instruction.
        prompt = prompt_formatter(system_prompt,
                                  [document, second_prompt],
                                  [model_response])