Semantic search replaces exact keyword matching with meaning‑based retrieval. By converting text into sentence embeddings, each document is represented as a high‑dimensional vector that captures its semantic content. A similarity metric then ranks these vectors, allowing queries to return relevant results even when wording differs.
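To make the ranking step concrete, here is a minimal sketch of cosine similarity on toy vectors. The three-dimensional "embeddings" and the names doc_close and doc_far are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors with invented values, standing in for real embeddings.
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]   # points in nearly the same direction as the query
doc_far = [0.0, 0.1, 0.9]     # points in a very different direction

# Rank the documents from most to least similar to the query.
ranked = sorted(
    [
        ("doc_close", cosine_similarity(query, doc_close)),
        ("doc_far", cosine_similarity(query, doc_far)),
    ],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

The document whose vector points in nearly the same direction as the query ranks first, regardless of whether the two texts share any words.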
Deep Technical Analysis
The workflow begins by loading a textual dataset, such as the public AG News corpus, and extracting the article text. A pre-trained sentence-embedding model from the sentence-transformers library (e.g., all-MiniLM-L6-v2, a compact transformer rather than a large language model) encodes each document into a dense vector. These vectors are stacked into a matrix with one row per document, and a nearest-neighbor index built with sklearn.neighbors.NearestNeighbors is fitted on that matrix, using cosine distance to measure dissimilarity. At query time, the same model generates an embedding for the input text, the index returns the top-k closest vectors, and the corresponding documents are presented to the user.
Data Preparation
Load the dataset with datasets.load_dataset, limit it to a manageable size (e.g., the first 1,000 entries), and drop null or empty entries. Keep the cleaned texts in a single ordered list and embed that same list, so that the row indices returned by the index map directly back to the source texts.
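The preparation step might look like the sketch below. The helper names clean_texts and load_ag_news_texts are hypothetical, and the dataset identifier "ag_news" and column name "text" are assumptions about how the corpus is published on the Hugging Face Hub; adjust them if your copy differs.

```python
def clean_texts(raw_texts):
    """Drop null or blank entries and strip whitespace, preserving order."""
    return [t.strip() for t in raw_texts if isinstance(t, str) and t.strip()]


def load_ag_news_texts(limit=1000):
    """Load the first `limit` training articles and return cleaned texts.

    Assumes the dataset id "ag_news" and a "text" column (not confirmed
    by the article itself).
    """
    # Imported lazily so clean_texts can be used without the library installed.
    from datasets import load_dataset

    dataset = load_dataset("ag_news", split=f"train[:{limit}]")
    return clean_texts(dataset["text"])
```

Calling documents = load_ag_news_texts() downloads the data on first use. Because cleaning can remove entries, list positions refer to the cleaned list, not to row numbers in the original dataset.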
Embedding Generation
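Instantiate the model via SentenceTransformer('all-MiniLM-L6-v2') and call model.encode on the text list. encode already processes the texts in batches; tune batch_size to fit the available memory, and set show_progress_bar=True to monitor progress on larger collections (the progress bar is informational and does not speed up encoding). The result is an array with one row per document.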
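A minimal sketch of this step follows. The helper names load_model and embed_texts and the batch size of 64 are illustrative choices, not part of the original workflow; the encode keyword arguments are the ones documented by sentence-transformers.

```python
import numpy as np


def load_model(name="all-MiniLM-L6-v2"):
    """Load a pre-trained sentence-embedding model (downloads on first use)."""
    # Imported lazily so embed_texts can be exercised with any compatible model.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def embed_texts(model, texts, batch_size=64):
    """Encode texts into a 2-D float32 array with one row per text."""
    embeddings = model.encode(
        texts,
        batch_size=batch_size,       # assumed value; tune to available memory
        show_progress_bar=True,      # progress feedback only
        convert_to_numpy=True,
    )
    return np.asarray(embeddings, dtype=np.float32)
```

With all-MiniLM-L6-v2, embed_texts(load_model(), documents) should return an array of shape (len(documents), 384).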
Index Construction
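Create a NearestNeighbors object with n_neighbors=5 and metric='cosine', then fit it on the embedding matrix. Note that this search is exact, not approximate: scikit-learn's tree-based structures do not support the cosine metric, so the estimator falls back to brute-force comparison against every stored vector. That is fast enough for a few thousand documents; larger collections usually call for a dedicated approximate nearest-neighbor library.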
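The sketch below builds such an index. Random vectors stand in for real embeddings so the example runs without a model; build_index is a hypothetical helper name, and algorithm="brute" is spelled out only to make explicit what scikit-learn selects for the cosine metric anyway.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def build_index(embeddings, n_neighbors=5):
    """Fit an exact cosine-distance nearest-neighbor index on the embeddings."""
    index = NearestNeighbors(
        n_neighbors=n_neighbors,
        metric="cosine",
        algorithm="brute",  # cosine is not supported by the tree-based algorithms
    )
    index.fit(embeddings)
    return index


# Synthetic stand-in for an embedding matrix: 20 "documents", 8 dimensions.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(20, 8))

demo_index = build_index(demo_embeddings)

# Query with the first stored vector; its nearest neighbor should be itself.
distances, indices = demo_index.kneighbors(demo_embeddings[:1])
print(indices[0], distances[0])
```

kneighbors returns cosine distances in ascending order, so the first column holds the closest match.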
Query Function
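Define a function that accepts a plain-text query and a top_k parameter. Inside, encode the query with the same model used for the documents, then retrieve distances and indices via kneighbors, passing n_neighbors=top_k. The results come back ordered by ascending cosine distance, which is the same as descending similarity; convert each distance to a similarity score with 1 - distance and display the matched documents alongside their scores.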
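One way to write the query function is sketched here. The name search and its signature are illustrative; passing the model, index, and document list as arguments (rather than relying on globals) is a design choice of this sketch, not something the workflow prescribes.

```python
def search(query, model, index, documents, top_k=5):
    """Return up to top_k (document, similarity) pairs, most similar first.

    `model` needs an encode() method like SentenceTransformer's, `index` is a
    fitted NearestNeighbors using metric="cosine", and `documents` is the list
    the index was built from, in the same order.
    """
    # Encode the query as a single-row batch so the result is 2-D.
    query_embedding = model.encode([query], convert_to_numpy=True)

    # Distances come back in ascending order (closest first).
    distances, indices = index.kneighbors(query_embedding, n_neighbors=top_k)

    results = []
    for distance, doc_index in zip(distances[0], indices[0]):
        similarity = 1.0 - float(distance)  # cosine similarity = 1 - cosine distance
        results.append((documents[doc_index], similarity))
    return results
```

A call such as search("stock market rally", model, index, documents, top_k=3) could then be printed with a simple loop over the returned pairs. top_k must not exceed the number of indexed documents, otherwise kneighbors raises an error.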