Building a Simple Semantic Search Engine with Sentence Embeddings

28 March 2026 by Suraj Barman

Semantic search replaces literal keyword matching with a focus on meaning, allowing systems to retrieve documents that share conceptual similarity even when exact words differ. By converting text into dense vectors, the engine can compare queries and passages using mathematical distance measures, delivering results that align with user intent. This article walks through the construction of a compact yet effective semantic search pipeline using Python.

Limitations of Keyword‑Based Retrieval

Traditional keyword search relies on exact term matching, which often ignores context and meaning behind user input. This approach can miss documents that use synonyms or alternative phrasing. As a result, relevance suffers in many real‑world scenarios.

Exact‑match algorithms treat each token as independent, preventing the system from recognizing that "puppy" and "young dog" convey the same idea. Users who phrase queries differently from the documents may receive fewer hits. Consequently, the experience feels rigid and incomplete.
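
To make the limitation concrete, here is a minimal sketch of strict keyword matching. The function name, the documents, and the query are all made up for illustration; real keyword engines add ranking and stemming, but the core blind spot is the same.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query token appears verbatim in the document."""
    doc_tokens = set(document.lower().split())
    return all(token in doc_tokens for token in query.lower().split())


# Hypothetical two-document corpus; both documents cover the same topic.
docs = [
    "How to train a young dog to sit",
    "Puppy training basics for new owners",
]

# Only the document that literally contains both query tokens is returned.
hits = [d for d in docs if keyword_match("puppy training", d)]
print(hits)
```

The first document is just as relevant to the query, yet it is dropped because it says "young dog" and "train" instead of "puppy" and "training".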

Another drawback is the inability to handle typographical errors or morphological variations without additional preprocessing. Stemming and lemmatization can mitigate some issues, yet they still cannot capture deeper semantic connections. This gap motivates the shift toward vector‑based methods.

Finally, keyword systems struggle with short or ambiguous queries because there is insufficient lexical evidence to rank documents accurately. Embedding techniques provide a richer representation that can infer intent from limited input. By moving beyond literal matching, search engines become more adaptable to diverse user language.

Fundamentals of Sentence Embeddings

Sentence embeddings map a piece of text to a fixed‑length numeric vector that encodes its semantic properties. The transformation is learned by neural language models trained on large corpora, enabling the vectors to capture subtle relationships. Similar sentences produce vectors that lie close together, whether closeness is measured by Euclidean distance or by cosine similarity.
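
The two closeness measures can be sketched in a few lines of plain Python. The three-dimensional vectors below are invented stand-ins for real embeddings, chosen only so that the two dog-related vectors point in similar directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.dist(a, b)


# Made-up 3-dimensional vectors standing in for real embeddings.
puppy = [0.9, 0.1, 0.0]
young_dog = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(puppy, young_dog), cosine_similarity(puppy, invoice))
print(euclidean_distance(puppy, young_dog), euclidean_distance(puppy, invoice))
```

With these toy values the related pair scores a high cosine similarity and a small distance, while the unrelated pair does the opposite.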

These vectors are typically high‑dimensional, ranging from 256 to 1024 components, which balances expressiveness with computational cost. The dimensionality influences how well the model distinguishes fine‑grained differences. Lower dimensions may lose nuance, while higher dimensions increase storage requirements.
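
A rough back-of-the-envelope calculation shows how dimensionality drives storage. The sketch assumes vectors are stored as 32-bit floats (4 bytes per component) with no compression or index overhead; the corpus size of one million vectors is an arbitrary illustration.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors, assuming float32 and no index overhead."""
    return num_vectors * dimensions * bytes_per_value


# Hypothetical corpus of one million passages at several common dimensions.
for dims in (256, 384, 768, 1024):
    size_mb = index_size_bytes(1_000_000, dims) / 1_000_000
    print(f"{dims} dims: {size_mb:,.0f} MB")
```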

Embedding generation is a forward pass through a neural network, meaning it can be performed quickly on modern hardware. Batch processing further accelerates the creation of vectors for large document collections. Once vectors exist, similarity can be measured with a single mathematical operation.

Importantly, sentence embeddings are language‑agnostic to a degree: multilingual models can produce comparable vectors across languages. This property allows a single index to serve queries in multiple languages without separate pipelines. The uniform representation simplifies downstream engineering.

Selecting a Transformer Model for Embedding Generation

Choosing a transformer model involves evaluating size, training data, and licensing constraints. Smaller models like MiniLM run efficiently on CPUs, making them suitable for modest hardware budgets. Larger models such as BERT‑base provide richer representations at the expense of speed.

Model architecture determines how contextual information is aggregated into a single vector. Some designs use the CLS token, while others average token embeddings. The averaging approach often yields more stable results for short sentences.
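
The difference between the two pooling strategies can be sketched on a toy example. The token vectors and attention mask below are invented; in a real pipeline they would come from the transformer's last hidden layer and its tokenizer.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector (the [CLS] position in BERT-style models)."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask excludes every token")
    return [t / count for t in total]


# Toy output for one sentence: three real tokens followed by one padding slot.
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pooling(tokens))         # first token only
print(mean_pooling(tokens, mask))  # average of the three unmasked tokens
```

Respecting the attention mask matters: without it, the padding vector would be averaged in and distort the sentence embedding.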

Open‑source repositories host many pre‑trained models, each accompanied by documentation describing expected input formats. Compatibility with the Hugging Face library ensures straightforward integration into Python code. Verify that the model's tokenization aligns with the language of your corpus.

Before committing, run a quick benchmark on a sample of your data to measure inference latency and memory footprint. Record the average time per embedding and the peak RAM usage. These metrics guide the trade‑off between accuracy and resource consumption.
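
A benchmark along these lines might look like the sketch below. The `fake_embed` function is a placeholder for whatever model call you are evaluating, and `tracemalloc` only sees memory allocated by Python itself, so native allocations made by a deep learning framework would need a separate tool.

```python
import time
import tracemalloc


def benchmark(embed_fn, texts, batch_size=32):
    """Time embed_fn over texts and report per-text latency and peak Python memory."""
    if not texts:
        raise ValueError("texts must not be empty")
    tracemalloc.start()
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        embed_fn(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds_per_text": elapsed / len(texts), "peak_bytes": peak}


def fake_embed(batch):
    """Placeholder for a real model call; returns one 8-dimensional vector per text."""
    return [[float(len(text))] * 8 for text in batch]


stats = benchmark(fake_embed, ["sample text"] * 100)
print(stats)
```

Running the same function against each candidate model on the same sample gives directly comparable numbers.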

Preparing Text Data for Vector Creation

Data preparation begins with cleaning raw text, removing HTML tags, and normalizing whitespace. Consistent preprocessing guarantees that the model receives input in a predictable shape. Preserve meaningful punctuation, as it can influence the embedding's semantic nuance.
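
A minimal cleaning step could be sketched as follows. The regular expression is a deliberately crude tag stripper and would not cope with malformed markup or embedded scripts; a proper HTML parser is the safer choice for messy input.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse runs of whitespace."""
    # Remove tags first so that escaped text such as "&lt;b&gt;" is kept as content.
    without_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return WHITESPACE_RE.sub(" ", decoded).strip()


sample = "<p>Semantic   search</p>\n<p>finds &amp; ranks meaning.</p>"
print(clean_text(sample))
```

Punctuation is left untouched; only markup and redundant whitespace are removed.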

Next, segment documents into logical units such as sentences or paragraphs, depending on the desired granularity of retrieval. Shorter units increase the chance of matching specific query intent, while longer units provide broader context. Decide based on the typical length of user queries.
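
Both granularities can be approximated with simple regular expressions, as in the sketch below. The sentence splitter is naive: it breaks on any period, question mark, or exclamation mark followed by whitespace, so abbreviations such as "e.g." will be split incorrectly. A dedicated sentence tokenizer is preferable for production use.

```python
import re


def split_paragraphs(text: str) -> list:
    """Split on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(text: str) -> list:
    """Naive split after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Dogs bark. Do cats purr? Yes!"))
print(split_paragraphs("First paragraph.\n\nSecond paragraph."))
```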

Apply the model's tokenizer to each segment, truncating or padding to the maximum sequence length supported by the transformer. Ensure that truncation does not cut off critical semantic cues; prefer a length that accommodates most sentences in the dataset. Padding should use the token designated by the model.
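
The sketch below illustrates both ideas on plain lists of token IDs: picking a length that covers most segments, then truncating or padding to it. The token IDs and the padding ID of 0 are made up, and real tokenizers usually handle this for you while also keeping the closing special token when they truncate, which this simplified version does not.

```python
import math


def length_covering(lengths, fraction=0.95):
    """Smallest observed length that is >= the given fraction of all lengths."""
    ordered = sorted(lengths)
    index = max(0, min(len(ordered) - 1, math.ceil(fraction * len(ordered)) - 1))
    return ordered[index]


def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token counts for six segments; one outlier is much longer.
lengths = [5, 7, 8, 9, 12, 40]
print(length_covering(lengths, fraction=0.8))

# Hypothetical token IDs, padded to length 6 with pad_id 0.
print(pad_or_truncate([101, 7, 8, 102], 6, 0))
```

Choosing the length from the data rather than defaulting to the model maximum avoids spending compute on padding for a handful of outliers.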

Finally, batch the tokenized inputs and feed them through the model to obtain raw embeddings. Post‑process these vectors by normalizing them to unit length, which simplifies cosine similarity calculations later. Store the normalized vectors alongside identifiers for later lookup.
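
The batching, normalization, and storage steps might be wired together as sketched here. `fake_embed` is a stand-in for the real model, and the dictionary store and document IDs are illustrative choices rather than a prescribed format.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def build_store(ids, texts, embed_fn, batch_size=32):
    """Embed texts in batches and map each identifier to its unit-length vector."""
    store = {}
    id_batches = batched(ids, batch_size)
    text_batches = batched(texts, batch_size)
    for id_batch, text_batch in zip(id_batches, text_batches):
        for doc_id, vector in zip(id_batch, embed_fn(text_batch)):
            store[doc_id] = l2_normalize(vector)
    return store


def fake_embed(batch):
    """Placeholder for a real model call; returns one 2-dimensional vector per text."""
    return [[float(len(text)), 1.0] for text in batch]


store = build_store(["d1", "d2", "d3"], ["aa", "bbbb", "c"], fake_embed, batch_size=2)
print(store)
```

Once every vector has unit length, the dot product of two vectors equals their cosine similarity, which is why normalizing up front simplifies later scoring.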

Constructing a Nearest‑Neighbor Index

With vectors in hand, build an index that can quickly retrieve the closest points to a query vector. Approximate nearest‑neighbor libraries such as FAISS or Annoy provide efficient search structures. Choose an index type that balances query speed with memory usage.

Index construction typically involves training any internal quantizers and then adding all document vectors to the data structure. For large collections, consider using IVF (inverted file) or HNSW (hierarchical navigable small world) configurations: IVF partitions the space into clusters, HNSW builds a layered proximity graph, and both reduce the number of comparisons per query. Adjust the number of clusters or graph connectivity based on empirical performance.
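
To show why partitioning reduces the number of comparisons, here is a toy inverted-file index in plain Python. It is a sketch of the general idea only, not how FAISS or any other library implements it: the k-means initialization, the parameter name `nprobe`, and the two-dimensional data are all simplifications chosen for readability.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def train_centroids(vectors, k, iterations=10):
    """Tiny k-means: start from the first k vectors, then refine the means."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivf(vectors, centroids):
    """Map each cluster to the positions of the vectors assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for position, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(position)
    return lists


def search_ivf(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]


# Toy data: two well-separated groups of points.
vectors = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.2], [9.8, 10.1], [0.2, 0.1], [10.2, 9.9]]
centroids = train_centroids(vectors, k=2)
lists = build_ivf(vectors, centroids)
print(search_ivf([9.9, 10.0], vectors, centroids, lists, k=2, nprobe=1))
```

With `nprobe=1` the query is compared against three vectors instead of six. Raising `nprobe` scans more clusters, trading speed for a lower chance of missing a true neighbor that sits just across a cluster boundary.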

After training, add the vectors to the index and persist the structure to disk for reuse. Verify that the index returns the expected number of nearest neighbors for a few test queries. Record the recall rate to ensure that approximations do not sacrifice too much accuracy.
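
Recall is usually measured by comparing the approximate results with those of an exact brute-force search over the same queries. A minimal sketch, with invented result lists:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate search returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def mean_recall(approx_results, exact_results):
    """Average recall over a set of test queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)


# Hypothetical top-2 results for two test queries.
approx = [[1, 2], [3, 4]]
exact = [[1, 2], [3, 5]]
print(mean_recall(approx, exact))
```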

When deploying, load the index into memory and expose a simple function that accepts a query vector and returns the identifiers of the top‑k closest documents. This function becomes the core of the semantic search service.
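
The shape of such a function is sketched below using an exact brute-force scan over an in-memory dictionary, which stands in for the loaded approximate index. It assumes the stored vectors and the query are already unit length, so the dot product serves as the cosine similarity; the document IDs and vectors are made up.

```python
import heapq


def top_k(query_vector, index, k=3):
    """Return [(doc_id, score)] for the k highest dot products, best first.

    index maps document identifiers to unit-length vectors.
    """
    scored = (
        (sum(q * d for q, d in zip(query_vector, vector)), doc_id)
        for doc_id, vector in index.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Hypothetical index with three unit-length document vectors.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
print(top_k([1.0, 0.0], index, k=2))
```

Keeping the same signature when the brute-force scan is swapped for an approximate index means the rest of the service does not need to change.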

Executing Queries and Interpreting Results

At query time, preprocess the user input using the same cleaning and tokenization steps applied to the corpus. Consistency guarantees that the query vector resides in the same space as the indexed vectors. Generate the embedding with the selected transformer model.

Pass the query embedding to the nearest‑neighbor index to retrieve a set of candidate document IDs. The index returns distances that can be transformed into similarity scores; for cosine distance, this is done by subtracting the distance from one. Rank the candidates by these scores to produce the final ordered list.
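
The right conversion depends on which distance the index reports. The sketch below covers two common cases: cosine distance, and squared Euclidean distance between unit-length vectors, where the identity ||a - b||^2 = 2 - 2 cos(a, b) gives the similarity as one minus half the distance. The candidate IDs and distances are invented; check your library's documentation to confirm which quantity it returns.

```python
def similarity_from_cosine_distance(distance):
    """Cosine distance is defined as 1 - cosine similarity."""
    return 1.0 - distance


def similarity_from_squared_l2(squared_distance):
    """Valid only when both vectors are unit length: ||a - b||^2 = 2 - 2 * cos."""
    return 1.0 - squared_distance / 2.0


def rank_by_similarity(ids, distances, to_similarity):
    """Convert distances to similarities and sort candidates best first."""
    scored = [(doc_id, to_similarity(d)) for doc_id, d in zip(ids, distances)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates with cosine distances as returned by an index.
ranked = rank_by_similarity(["a", "b", "c"], [0.40, 0.05, 0.20], similarity_from_cosine_distance)
print(ranked)
```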

Optionally, apply a re‑ranking step that incorporates additional signals such as recency or metadata relevance. This hybrid approach can improve user satisfaction when pure semantic similarity is insufficient. Present the top results to the user with highlighted snippets that contain query terms.
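
One simple way to blend in recency is a weighted sum of the semantic score and an exponentially decaying freshness score. Everything in this sketch is an illustrative assumption: the 30-day half-life, the 0.2 weight, and the candidate tuples would all need tuning against real user feedback.

```python
def recency_score(age_days, half_life_days=30.0):
    """1.0 for a brand-new document, halving every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def rerank(candidates, recency_weight=0.2, half_life_days=30.0):
    """Sort (doc_id, similarity, age_days) tuples by a blended score, best first."""
    def blended(candidate):
        _, similarity, age_days = candidate
        freshness = recency_score(age_days, half_life_days)
        return (1 - recency_weight) * similarity + recency_weight * freshness

    return sorted(candidates, key=blended, reverse=True)


# Hypothetical candidates: the older document is slightly more similar.
candidates = [("old", 0.80, 365), ("new", 0.78, 0)]
print(rerank(candidates))
```

Here the newer document overtakes the older one despite a marginally lower similarity; setting `recency_weight` to zero restores the pure semantic ordering.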

Monitor performance metrics like latency per query and the distribution of similarity scores. Use these insights to fine‑tune index parameters or to refresh the embedding model as newer versions become available. Continuous evaluation keeps the search experience aligned with evolving user expectations.
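
Latency tracking can start as simply as timing each call and summarizing the samples with percentiles. The sketch below uses the nearest-rank definition of a percentile and made-up latency values; a production service would more likely export these measurements to a metrics system.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling of pct% of the sample count, computed with integer arithmetic.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical per-query latencies in milliseconds.
latencies_ms = [10, 20, 30, 40, 50]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

The same `percentile` helper can be applied to the similarity scores of returned results to watch for drift in their distribution.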