
Embedding

An embedding is a dense, fixed-length numerical vector that represents data — such as text, an image, or audio — in a high-dimensional space where semantically similar items are located geometrically close to each other.

In machine learning, an embedding maps discrete or high-dimensional inputs into a continuous vector space, commonly with a few hundred to a few thousand dimensions (for example, 256 to 4096). The defining property is that geometric proximity in this space corresponds to semantic similarity: the vectors for "cat" and "feline" will be close, while the vectors for "cat" and "automobile" will be far apart. Embeddings can represent words, sentences, documents, images, code, molecular structures, or any other data type for which an encoder model has been trained.

Text embeddings are produced by encoder-based neural networks, such as BERT-style transformers, or by pooling the hidden states of large generative models. The encoder processes an input and returns a single vector, often the mean of the final hidden states or the representation of a special [CLS] token. Embedding models are trained with objectives such as contrastive learning on pairs of semantically similar and dissimilar examples, which teaches the model to place similar items close together and dissimilar items far apart in the vector space. Similarity between two embeddings is then measured by cosine similarity or dot product; the two give the same value when both vectors are normalized to unit length.
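The pooling and similarity steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any particular model: the function names, the attention-mask convention (1 for a real token, 0 for padding), and the toy two-dimensional hidden states are all assumptions made for the example.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors, skipping positions whose mask value is 0."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            for i, x in enumerate(vec):
                total[i] += x
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [x / count for x in total]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


# Toy final hidden states for three token positions (hypothetical values).
# The third position is padding and is excluded by the mask.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
```

With real models the hidden states come from the encoder's last layer and have hundreds or thousands of dimensions, but the arithmetic is the same.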

Embeddings are the foundational component of semantic search, retrieval-augmented generation, recommendation systems, and many classification and clustering pipelines. They allow systems to operate over meaning rather than surface form: a query about "vehicle maintenance" can match documents about "car repair" even if no common keywords are shared. Multimodal embeddings — which place text and images in the same vector space — power cross-modal search, such as querying an image database with a text description.

Leading embedding models as of 2026 include OpenAI's text-embedding-3-large (3072 dimensions), Cohere Embed v3, Google's Gecko embedding family, and open-source models ranked on the Massive Text Embedding Benchmark (MTEB) leaderboard. Quality is evaluated on MTEB across retrieval, classification, clustering, and semantic similarity tasks in multiple languages. Models trained with Matryoshka representation learning let practitioners truncate embedding vectors to their leading dimensions with graceful accuracy degradation, trading some quality for lower storage and search cost at inference time.
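Truncating a Matryoshka-style embedding amounts to keeping the first k components. The sketch below also rescales the result to unit length, which is a common practice so that dot product and cosine similarity stay consistent, but it is an assumption here rather than something the text above prescribes. The function name is hypothetical, and the approach is only meaningful for models actually trained with a Matryoshka objective.

```python
import math


def truncate_and_renormalize(vec, k):
    """Keep the first k dimensions of vec and rescale them to unit length."""
    if not 0 < k <= len(vec):
        raise ValueError("k must be between 1 and len(vec)")
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 3-dimensional vector (hypothetical values) shortened to 2 dimensions.
full = [3.0, 4.0, 12.0]
short = truncate_and_renormalize(full, 2)
```

In practice the same k must be used for both the indexed vectors and the query vectors, since vectors of different lengths cannot be compared.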

Example

An e-commerce platform encodes all product descriptions into 1536-dimensional embeddings at index time. When a user types "comfortable shoes for long walks", the query is embedded and the closest product vectors are returned, surfacing relevant results even if none contain those exact words.
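The lookup in this example can be sketched as a brute-force nearest-neighbor search. Everything below is illustrative: the product names, the three-dimensional toy vectors standing in for real 1536-dimensional embeddings, and the `top_k` helper are assumptions, and a production system would typically use an approximate nearest-neighbor index rather than scoring every product.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, index, k=2):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for name, vec in index.items():
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((score, name))
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical product embeddings, computed once at index time.
index = {
    "cushioned walking sneaker": [0.9, 0.1, 0.0],
    "leather dress shoe": [0.4, 0.8, 0.1],
    "cast-iron skillet": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of the query "comfortable shoes for long walks".
query = [0.8, 0.2, 0.0]
results = top_k(query, index, k=2)
```

Here the sneaker ranks first and the dress shoe second, while the skillet, whose vector points in a different direction, is left out.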
