Summary

Natural language is incredibly rich, but difficult for a computer to process. To a human, a sentence like “I loved the movie!” instantly communicates meaning, sentiment, and tone. But to a machine learning model, it is just a string of characters with no inherent meaning.

That’s where text embeddings come in.

In this post, we’ll walk through how to use embeddings to feed text into traditional machine learning models like linear regression, logistic regression, support vector machines (SVMs), k-Nearest-Neighbors (KNN), or decision trees, and how this enables those models to uncover patterns and make predictions from unstructured text data.


What Are Text Embeddings?

At a high level, text embeddings are numerical vector representations of text. They allow us to encode meaning, context, and even relationships between words into a format that machine learning models can understand.

For example, instead of representing the word “apple” as a simple one-hot vector, an embedding might encode it as

[0.15, -0.22, 0.98, 0.10, ..., -0.09]  # e.g., a 768-dimensional vector

This vector captures semantic similarities. So words like “apple” and “banana” will be closer together in embedding space than “apple” and “car”.
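As a quick illustration, that closeness is often measured with cosine similarity. Below is a minimal sketch with made-up, low-dimensional vectors standing in for real embeddings:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1.0 means more similar
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors, invented purely for illustration
apple  = [0.9, 0.1, 0.3, 0.0]
banana = [0.8, 0.2, 0.4, 0.1]
car    = [0.0, 0.9, 0.1, 0.8]

print(cosine_similarity(apple, banana))  # higher score: semantically similar
print(cosine_similarity(apple, car))     # lower score: semantically different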

Some embedding models include:

  • Word2Vec
  • GloVe
  • FastText
  • Transformer-based models (e.g., BERT, OpenAI or LLM-Factory’s embedding APIs)

Embeddings + Traditional ML = Pattern Recognition from Text

Once we have a vector representation of text (a sentence, paragraph, or document), we can feed it into any machine learning model that works with numbers. Here’s how:

Step 1: Convert Text into Embeddings

You can use a pre-trained model to transform raw text into embeddings. For example, using LLM-Factory, or another service with an OpenAI-compatible embeddings API endpoint, you can convert a sentence into a high-dimensional vector in Python:

from openai import OpenAI
client = OpenAI(
    api_key='<Your LLM-Factory API Key>',
    base_url='https://api-llm-factory.ai.uky.edu/v1',
)

def generate_embeddings(inputs, model):
    # Call the embeddings endpoint to generate embeddings for the input text
    response = client.embeddings.create(input=inputs, model=model)
    embeddings = response.data[0].embedding
    return embeddings

text = 'Embed this text please.'
embedding = generate_embeddings(text, model='<embedding model name>')  # Returns a vector of floats
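
In practice you will usually embed many documents at once and stack the results into a feature matrix for the next step. Here is a minimal sketch building on the client above (the model name is a placeholder; substitute whichever embedding model your endpoint exposes):

import numpy as np

texts = [
    'I loved the movie!',
    'The plot was slow and predictable.',
    'An instant classic.',
]

# The embeddings endpoint accepts a list of inputs; response.data preserves the input order
response = client.embeddings.create(input=texts, model='<embedding model name>')
X_embeddings = np.array([item.embedding for item in response.data])

print(X_embeddings.shape)  # (number of texts, embedding dimension)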

Step 2: Use Embeddings as Features in ML Models

These embeddings now act as features, just like age, income, or number of purchases might in a structured dataset.

You can now train:

  • Logistic regression to classify sentiment (positive vs negative)
  • Linear regression to predict numerical outcomes (e.g., star ratings)
  • Random forests to detect categories of news articles
  • K-means clustering to group similar documents

For example, fitting a sentiment classifier on the embedding matrix:

from sklearn.linear_model import LogisticRegression

# X_embeddings: one embedding vector per document (e.g., the matrix built in Step 1)
# y_labels: the class label for each document (e.g., 0 = negative, 1 = positive)
model = LogisticRegression()
model.fit(X_embeddings, y_labels)
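
To classify a new piece of text, embed it with the same embedding model and hand the vector to the trained classifier. A sketch reusing the generate_embeddings helper and placeholder model name from Step 1:

import numpy as np

new_text = 'The acting was wonderful.'
new_vector = np.array(generate_embeddings(new_text, model='<embedding model name>'))

# scikit-learn expects a 2D array: one row per sample
prediction = model.predict(new_vector.reshape(1, -1))
print(prediction)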

Why Use Embeddings with Traditional Models?

While deep learning models (like transformers) can do end-to-end processing of text, there are still compelling reasons to use traditional ML with embeddings:

  • Speed & Simplicity: Training an SVM or logistic regression model is fast and often doesn’t need GPUs.
  • Low-Resource Environments: Ideal for situations where training large models is impractical.
  • Explainability: Models like linear regression are easier to interpret than deep networks.
  • Feature Combinations: Embeddings can be combined with other structured data, like user profiles or timestamps (see the sketch after this list).
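
On that last point, combining embeddings with structured features can be as simple as concatenating columns. A minimal sketch with hypothetical shapes and made-up structured columns:

import numpy as np

# Hypothetical data: 3 documents with 768-dimensional embeddings
X_embeddings = np.random.rand(3, 768)
# Made-up structured columns, e.g., account age in days and purchase count
X_structured = np.array([[120, 4], [365, 12], [30, 1]])

# Concatenate columns so each row holds both text and structured features
X_combined = np.hstack([X_embeddings, X_structured])
print(X_combined.shape)  # (3, 770)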

Real-World Examples

Here are a few examples where this approach would work well:

  • Sentiment analysis: Use embeddings of product reviews to train a logistic regression model that classifies them as positive or negative.
  • Topic classification: Use embeddings of news headlines to train a random forest that tags the topic (e.g., sports, tech, politics).
  • Recommender systems: Use embeddings of article content alongside user features to predict which articles a user might like.

CAAI has used this approach in the One Good Choice project to predict Food Compass Scores, as well as specific nutritional quantities, from the text descriptions of food items. We have also applied these techniques in several other projects to classify specific features from patient case information.
