Guides

Document Similarity

Document Similarity lets you retrieve relevant documents from a database based on your query. It’s especially useful in applications such as information retrieval and recommendation engines, allowing you to quickly find information or documents that match specific needs or interests.

From a technical implementation standpoint, Document Similarity is similar to Sentence Similarity, but with an additional step to allow large documents to get passed through the BERT models. To overcome input size limitations of BERT models, large documents first need to be split into smaller chunks before they are passed to the model.
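
To make the chunking step concrete, here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter (the chunker used later in this example). The chunk_size and chunk_overlap values mirror the defaults in configs.py, and the file path is only an illustration:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split one long document into overlapping chunks that fit a BERT input window.
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)

with open("datasets/query/reut2-004-0.txt") as f:
    text = f.read()

chunks = splitter.split_text(text)  # list of strings, each roughly chunk_size characters
print(f"{len(chunks)} chunks; first chunk: {chunks[0]!r}")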

In this example, we look for news articles that are similar to each other. Specifically, we use the Reuters-21578 Text Categorization Collection dataset, a collection of documents that appeared on the Reuters newswire in 1987.

Quick Start

Before you start, make sure the NuPIC Inference Server is up and running, and the Python environment is set up.

Navigate to the directory containing the document similarity example:

cd nupic.examples/examples/document_similarity

The directory should look like this:

document_similarity/
├── get_data.sh  -----------------> Downloads and extracts the raw data
├── parse_data.py ----------------> Helper for get_data.sh
├── datasets
│   ├── database/ ----------------> Extracted data stored here
│   ├── query/ -------------------> Documents here will be used to search the database
│   └── README.md
├── document_similarity.py -------> Main script that runs the document similarity workflow
├── configs.py
├── README.md
├── requirements.txt
└── results/
    └── default_config/ ----------> Results from document_similarity.py

If the datasets/ subfolders are empty, start by downloading some news articles. Most of the articles will be placed in datasets/database/, with a few in datasets/query/.

chmod +x get_data.sh
./get_data.sh

We also need to make sure the Python client is configured to communicate correctly with the Inference Server. In the following snippet from configs.py, we use one of the NuPIC SBERT variants as the embedding model and assume the use case where the Inference Server resides on the same machine as the Python client:

default_config = {
    "data_dir": data_dir,
    "query_docs": query_dir,
    "chunk_size": 100,
    "chunk_overlap": 10,
    "embedding_model": "nupic-sbert.base-v3", <---------------
    "similarity_method": "all",
    "connection_config": {},
    "protocol": "http",
    "url": "localhost:8000", <----------------------------------------
    "load_saved_chunks": False,
}
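
If the Inference Server runs on a different machine, only the connection fields need to change. A minimal sketch, with a placeholder hostname:

remote_config = {
    **default_config,
    "protocol": "http",
    "url": "my-inference-host:8000",  # placeholder; use your server's address
}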

Now we can run the main script. For each article in datasets/query/, it returns the most similar article from datasets/database/.

python document_similarity.py

The results are printed in your terminal. You can also view them in best_matching_doc_all_sim_methods.csv within the results/ subdirectory.

sim_of_mean_embeddings                            mean_chunk_sim                           ...  top_1_chunk_sim                          mean_of_max_query_chunk_sim                         
                            best_match similarity_of_best_match       best_match similarity_of_best_match  ...       best_match similarity_of_best_match                  best_match similarity_of_best_match
reut2-005-1.txt        reut2-011-0.txt                 0.869679  reut2-012-0.txt                 0.315663  ...  reut2-001-1.txt                 0.695103             reut2-009-2.txt                 0.488769
reut2-005-2.txt        reut2-011-0.txt                 0.882855  reut2-012-0.txt                 0.291093  ...  reut2-020-0.txt                 1.000000             reut2-006-0.txt                 0.513613
reut2-009-1.txt        reut2-019-1.txt                 0.823774  reut2-012-0.txt                 0.255548  ...  reut2-008-3.txt                 0.691499             reut2-019-1.txt                 0.460836
reut2-000-3.txt        reut2-009-2.txt                 0.851748  reut2-012-0.txt                 0.285296  ...  reut2-010-1.txt                 0.751023             reut2-009-2.txt                 0.454189
reut2-004-0.txt        reut2-012-1.txt                 0.834486  reut2-012-1.txt                 0.298738  ...  reut2-012-1.txt                 0.748014             reut2-003-0.txt                 0.519579
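
To inspect these results programmatically rather than in the terminal, the CSV can be read back into pandas. A minimal sketch, assuming the file keeps the two-level column header shown above:

import pandas as pd

# header=[0, 1] assumes the CSV keeps the two-level columns
# (similarity method, then best_match / similarity_of_best_match).
results = pd.read_csv(
    "results/default_config/best_matching_doc_all_sim_methods.csv",
    header=[0, 1],
    index_col=0,
)
print(results["sim_of_mean_embeddings"].sort_values(
    "similarity_of_best_match", ascending=False))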

Let's examine one of the rows in detail. The last row asks the database, "Which document is most similar to reut2-004-0.txt?" It turns out the best match is reut2-012-1.txt, with a cosine similarity score of about 0.83. We can preview each document to make sure the results make sense.

Using the command head datasets/query/reut2-004-0.txt:

The Reagan Administration sent to
Congress proposed legislation that would require Congress to
reflect the cost of federal loan subsidies in the government's
budget.
    The legislation would require Congress to approve all
subsidies on loans, to sell off many loans to the private
sector shortly after they are made and to buy private
reinsurance for many guaranteed loans.
    White House officials estimated reinsurance premiums  could
amount to six billion dlrs a year to private companies.

Similarly, we can run head datasets/database/reut2-012-1.txt:

BankAmerica Corp said it placed
its 1.9 billion dlrs in medium- and long-term term loans to the
Brazilian public and private sectors on non-accrual status as
of March 31.
    As a result, the bank's net income for the first quarter
will be reduced by about 40 mln dlrs. If Brazil's suspension of
interest payments remains in effect, earnings for the whole
year will be reduced by a further 100 mln dlrs.
    BankAmerica said, however, that it expects to report a
profit for the first quarter of 1987.

Both articles are about bank loans, so it seems our document similarity example is working!

In More Detail

The similarity between the query and database documents is computed in the following way. First, all documents (both the queries and those in the database) are split into multiple chunks. We can see this in the code within document_similarity.py:

def chunk_single_doc(
    text: str, text_splitter_fn: callable, doc_name: str
) -> pd.DataFrame:
    # split text
    split_text = text_splitter_fn(text)
    split_text = [text.replace("\n", " ") for text in split_text]

    # Keep track of name of doc and also index of chunk within doc, so that if we
    # select a chunk, we can get the original document, and surrounding context
    starts = []
    stops = []
    counter = 0
    for i in range(len(split_text)):
        start = counter
        stop = counter + len(split_text[i])
        starts.append(start)
        stops.append(stop)
        counter = stop

    # Create a dataframe with the chunked text, doc name, and chunk indices
    df = pd.DataFrame(
        {
            "text": split_text,
            "doc_name": doc_name,
            "chunk_start": starts,
            "chunk_stop": stops,
        }
    )

    return df
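
As a quick illustration of how chunk_single_doc might be called, here is a hedged usage sketch; naive_splitter is just a stand-in for the LangChain splitter that the main script builds from chunk_size and chunk_overlap:

# Hypothetical usage; the real text_splitter_fn comes from the configured chunker.
def naive_splitter(text: str, size: int = 100) -> list:
    return [text[i:i + size] for i in range(0, len(text), size)]

with open("datasets/query/reut2-004-0.txt") as f:
    raw_text = f.read()

chunk_df = chunk_single_doc(raw_text, naive_splitter, doc_name="reut2-004-0.txt")
print(chunk_df[["doc_name", "chunk_start", "chunk_stop"]].head())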

Then, each chunk is embedded through a BERT model in the NuPIC Inference Server:

def embed_all_chunks(chunks: pd.DataFrame, client) -> pd.DataFrame:
    embeddings = []
    for i in tqdm(range(len(chunks))):
        embeddings.append(
            client.infer([chunks.loc[i, "text"]])["encodings"].squeeze(0)
        )

    chunks["embedding"] = embeddings

    return chunks
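
The client object here is the NuPIC Python client configured from configs.py. The stub below only mimics the interface this snippet relies on (an infer() call returning a dict with an "encodings" array), so you can see the expected shape of the output without a running server; the embedding dimension is an assumption:

import numpy as np
import pandas as pd

class StubClient:
    """Placeholder standing in for the real NuPIC client."""
    def infer(self, texts):
        # one embedding per input text; 768 dimensions is an assumption
        return {"encodings": np.random.rand(len(texts), 768)}

chunks = pd.DataFrame({"text": ["first chunk of text", "second chunk of text"]})
embedded = embed_all_chunks(chunks, StubClient())
print(embedded["embedding"].iloc[0].shape)  # (768,)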

Finally, a similarity measure is computed between the set of all embeddings from a query and the set of all embeddings of each document in the database. In all cases, this similarity is a variation of cosine similarity. The database documents can then be ranked by this score and the best match returned.

def all_sims(
    query_data: pd.DataFrame, doc_data: pd.DataFrame, save_directory: str
) -> None:
    print("Computing all pairwise similarities with all similarity methods...")

    similarities = {}
    for sim_name, sim_fn in SIMILARITY_METHODS.items():
        # compute all pairwise similarities for given method
        single_method_sims = sim_fn(query_data, doc_data)

        # compute and store in dict the best match
        best_matches = get_best_matches(single_method_sims)
        similarities[sim_name] = best_matches

        # store the computed similarities for similarity method
        single_method_sims.index.name = "database_documents"
        single_method_sims.to_csv(
            os.path.join(save_directory, f"all_pairwise_similarities_{sim_name}.csv")
        )

    similarities = pd.concat(similarities, axis=1)

    similarities.to_csv(
        os.path.join(save_directory, "best_matching_doc_all_sim_methods.csv")
    )

    print(similarities)

    return

Customizing the Example

You can customize the example by creating a new config in configs.py: for example, change the embedding model, adjust the chunking parameters chunk_size and chunk_overlap, or point to a different data directory and query set. Then pass the config name as a command-line argument to document_similarity.py.
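
For instance, a new entry in configs.py might look like the following; the config name and values are illustrative, and the exact command-line form depends on how document_similarity.py parses its arguments:

# configs.py -- hypothetical additional config
long_chunk_config = {
    **default_config,
    "chunk_size": 250,
    "chunk_overlap": 25,
    "similarity_method": "all",
}

# then run, for example:
#   python document_similarity.py long_chunk_config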

The basic ideas in this example can be extended in straightforward ways. You may replace the LangChain chunker with another chunking tool, as long as it takes a string and returns a list of strings. Here we assume the data are small enough to be stored in pandas DataFrames and pickled, but the same ideas extend to larger datasets using PyArrow tables, Anyscale datasets, or vector databases.

You may also use different techniques to compute document similarity. Here we provide four methods:

  • compute_similarities_by_mean: averages the chunk embeddings of each document and finds the database document whose mean embedding is most similar to the query's mean embedding (a sketch of this idea appears after this list).
  • compute_similarities_by_chunk: calculates pairwise cosine similarities between the query chunk embeddings and the document chunk embeddings, and returns the average similarity.
  • compute_mean_of_top_n_similarities: performs the same computation as compute_similarities_by_chunk, but only over the top n most similar chunks. The default value is n=3.
  • compute_mean_of_max_similarities: for each chunk in the query document, takes the highest similarity across all database document chunks, then averages these maxima. This can be changed so it is done for each chunk in the database documents instead.
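
To make the first method concrete, here is a hedged sketch of a mean-embedding similarity computed over the chunk DataFrames produced by chunk_single_doc and embed_all_chunks. It mirrors the idea behind compute_similarities_by_mean but is not the shipped implementation:

import numpy as np
import pandas as pd

def mean_embedding_similarity(
    query_data: pd.DataFrame, doc_data: pd.DataFrame
) -> pd.DataFrame:
    """Cosine similarity between per-document mean chunk embeddings.

    Sketch only; assumes both DataFrames have "doc_name" and "embedding" columns.
    Returns database documents as rows and query documents as columns.
    """
    def doc_means(df):
        return df.groupby("doc_name")["embedding"].apply(
            lambda e: np.mean(np.stack(e.to_list()), axis=0)
        )

    q_means, d_means = doc_means(query_data), doc_means(doc_data)
    sims = pd.DataFrame(index=d_means.index, columns=q_means.index, dtype=float)
    for q_name, q_vec in q_means.items():
        for d_name, d_vec in d_means.items():
            sims.loc[d_name, q_name] = np.dot(q_vec, d_vec) / (
                np.linalg.norm(q_vec) * np.linalg.norm(d_vec)
            )
    return sims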

There are many other methods for computing similarity: methods that don't require a neural network, such as BM25; hybrid methods such as Reciprocal Rank Fusion; and re-ranking methods, where sentence transformers gather promising candidates and a more expensive pretrained, task-specific cross-encoder re-ranks them. To use one of these techniques, define your function, register it in the SIMILARITY_METHODS dictionary (within document_similarity.py), and reference its key in your config.
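
For example, a BM25-style method could be registered roughly as follows. This sketch assumes the optional rank_bm25 package and assumes the downstream helpers only use the scores for ranking (BM25 scores are not bounded between 0 and 1 like cosine similarity):

import pandas as pd
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def compute_bm25_similarity(
    query_data: pd.DataFrame, doc_data: pd.DataFrame
) -> pd.DataFrame:
    """Hypothetical lexical method: BM25 scores of each query document against
    each database document, reassembled from their chunks."""
    db_docs = doc_data.groupby("doc_name")["text"].apply(" ".join)
    query_docs = query_data.groupby("doc_name")["text"].apply(" ".join)

    bm25 = BM25Okapi([doc.lower().split() for doc in db_docs])
    return pd.DataFrame(
        {q_name: bm25.get_scores(q_text.lower().split())
         for q_name, q_text in query_docs.items()},
        index=db_docs.index,
    )

# Register the new method and select it via "similarity_method" in your config:
SIMILARITY_METHODS["bm25"] = compute_bm25_similarity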