    Fusing LLM Embeddings, TF-IDF, and Metadata in a Scikit-learn Pipeline for Text Classification

    6 March 2026 by
    Suraj Barman

    Fusing Dense LLM Embeddings, Sparse TF-IDF, and Structured Metadata

    This pipeline merges three complementary feature streams (dense semantic vectors from a sentence transformer, sparse lexical weights generated by TF-IDF, and handcrafted metadata attributes such as length and digit ratio) into a single scikit-learn workflow that can be fitted with one call for news-article classification. It simplifies feature engineering while preserving each representation's strengths, enabling the classifier to exploit contextual meaning, exact lexical cues, and surface-level document statistics.

    End-to-End Pipeline Construction

    The process begins by loading the 20 Newsgroups dataset, then synthetically generating metadata from the raw text. Crucially, the data split occurs before any transformation so that the TF-IDF vocabulary and the metadata scaling statistics are fitted exclusively on training data, preventing leakage. The sentence transformer is pre-trained and only encodes text, so it learns nothing from either split. Three parallel branches are defined: one for TF-IDF, one for LLM embeddings, and one for standardized metadata. These branches are merged with a ColumnTransformer, followed by a LogisticRegression classifier. The final pipeline handles preprocessing, feature fusion, and model training in a single fit/predict call.
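
    The split-before-fit rule can be illustrated with a minimal sketch. The six-document corpus and its labels below are hypothetical stand-ins for 20 Newsgroups; the point is only that the vectorizer's vocabulary comes from the training split alone.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Tiny illustrative corpus (hypothetical, standing in for 20 Newsgroups).
docs = [
    "the rocket reached orbit",
    "the team won the hockey game",
    "nasa launched a new satellite",
    "the goalie stopped every shot",
    "astronauts repaired the telescope",
    "the playoffs start next week",
]
labels = [0, 1, 0, 1, 0, 1]

# Split first, so nothing fitted later can see the test documents.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=2, random_state=0, stratify=labels
)

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(X_train)  # vocabulary and IDF from train only
test_matrix = vectorizer.transform(X_test)        # reuses the fitted vocabulary

train_vocab = set(vectorizer.vocabulary_)
print(f"{len(train_vocab)} terms learned from {len(X_train)} training documents")
```

    Words that occur only in the test documents are simply ignored at transform time, which is the behavior a leakage-free evaluation needs.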

    Data Import and Preparation

    Using fetch_20newsgroups, we select a subset of categories and create X_raw (the raw documents) and y (the category labels). Synthetic metadata features (character length, word count, average word length, uppercase ratio, and digit ratio) are then computed from each document's text.
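
    A minimal sketch of this step follows. The category list, the remove=... argument, and the helper names load_newsgroups and compute_metadata are assumptions made for illustration; the article does not state which categories or options it uses.

```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Hypothetical category subset; the article does not say which ones it uses.
CATEGORIES = ["sci.space", "rec.sport.hockey", "talk.politics.misc"]

METADATA_COLUMNS = [
    "char_length", "word_count", "avg_word_length", "uppercase_ratio", "digit_ratio",
]


def load_newsgroups(categories=CATEGORIES):
    """Return raw documents and integer labels (downloads the data on first call)."""
    data = fetch_20newsgroups(
        subset="all",
        categories=categories,
        remove=("headers", "footers", "quotes"),  # assumption: strip easy giveaways
    )
    return data.data, data.target


def compute_metadata(texts):
    """Derive simple surface statistics from each raw document."""
    rows = []
    for text in texts:
        words = text.split()
        n_chars = len(text)
        rows.append({
            "char_length": n_chars,
            "word_count": len(words),
            "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
            "uppercase_ratio": sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
            "digit_ratio": sum(c.isdigit() for c in text) / n_chars if n_chars else 0.0,
        })
    return pd.DataFrame(rows, columns=METADATA_COLUMNS)


# Usage (requires a network connection the first time):
# X_raw, y = load_newsgroups()
# metadata = compute_metadata(X_raw)
```

    The guards for empty documents matter in practice, because stripping headers, footers, and quotes can leave some posts with no text at all.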

    Metadata Feature Engineering

    Metadata features are assembled into a DataFrame and scaled with StandardScaler to align their ranges with other feature types.
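
    As a rough illustration, the scaling step might look like the sketch below. The column names and values are made up; what matters is that the scaler's mean and variance are learned from the training rows and then reused for the test rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical metadata for three training documents and one test document.
meta_train = pd.DataFrame({
    "char_length": [100, 300, 500],
    "digit_ratio": [0.00, 0.02, 0.04],
})
meta_test = pd.DataFrame({
    "char_length": [300],
    "digit_ratio": [0.02],
})

scaler = StandardScaler()
meta_train_scaled = scaler.fit_transform(meta_train)  # learns per-column mean and variance
meta_test_scaled = scaler.transform(meta_test)        # applies the training statistics

print(meta_train_scaled.round(3))
print(meta_test_scaled.round(3))
```

    Without this step, a raw character count in the hundreds or thousands would dwarf TF-IDF weights and embedding components, which are typically well below 1 in magnitude.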

    TF-IDF Extraction

    A TfidfVectorizer converts raw text into a high‑dimensional sparse matrix, capturing term frequency and inverse document frequency information.
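
    The sketch below shows what the TF-IDF branch produces on a toy corpus. The hyperparameters (max_features, stop_words, sublinear_tf) are illustrative assumptions, not settings taken from the article.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for newsgroup posts.
docs = [
    "the rocket reached orbit",
    "the team won the hockey game",
    "nasa launched a new satellite into orbit",
]

# Assumed settings: cap the vocabulary, drop English stop words, dampen raw counts.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english", sublinear_tf=True)
tfidf = vectorizer.fit_transform(docs)

# The result is a SciPy sparse matrix: one row per document, one column per term.
print(type(tfidf).__name__, tfidf.shape, "stored values:", tfidf.nnz)

# With the default norm="l2", each non-empty row has unit Euclidean length.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(row_norms)
```

    Because each row is length-normalized, long and short documents contribute on a comparable scale, which helps when the matrix is later placed next to other feature blocks.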

    LLM Embedding Integration

    A custom transformer wraps a pre‑trained sentence transformer (e.g., all‑MiniLM‑L6‑v2) to generate dense embeddings for each document on‑the‑fly within the pipeline.
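
    One possible shape for such a wrapper is sketched below. The class name SentenceEmbedder and the batch_size parameter are hypothetical; the sketch assumes the sentence-transformers package is installed and that the model can be downloaded when fit is first called.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SentenceEmbedder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper that exposes a sentence transformer as a scikit-learn step."""

    def __init__(self, model_name="all-MiniLM-L6-v2", batch_size=32):
        # Store constructor arguments unchanged so get_params() and clone() work.
        self.model_name = model_name
        self.batch_size = batch_size

    def fit(self, X, y=None):
        # Imported lazily so the class can be defined without the package present.
        from sentence_transformers import SentenceTransformer

        # Nothing is learned from X; the pre-trained weights are only loaded here.
        self.model_ = SentenceTransformer(self.model_name)
        return self

    def transform(self, X):
        texts = [str(text) for text in X]  # accepts a list or a pandas Series
        embeddings = self.model_.encode(
            texts, batch_size=self.batch_size, show_progress_bar=False
        )
        return np.asarray(embeddings)


# Usage (downloads the model on first use):
# embedder = SentenceEmbedder().fit(["any text"])
# vectors = embedder.transform(["a post about rockets", "a post about hockey"])
```

    Loading the model in fit rather than in __init__ keeps the estimator cheap to construct and clone, which is convenient for cross-validation and grid search. Encoding is still the slowest part of the pipeline, so caching embeddings may be worth considering for repeated experiments.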

    ColumnTransformer Fusion

    The three branches are combined via a ColumnTransformer, assigning each transformer to its respective column set, producing a single feature matrix ready for classification.
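
    The fusion step could be sketched as follows. To keep the example runnable without downloading a model, the dense branch uses ToyDenseEncoder, a deliberately trivial placeholder that is not a real embedding model; in practice a sentence-transformer wrapper would take its place. The column names and the helper build_preprocessor are also assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

META_COLS = ["char_length", "digit_ratio"]  # hypothetical metadata column names


class ToyDenseEncoder(BaseEstimator, TransformerMixin):
    """Placeholder for the embedding branch: maps each text to two dense numbers."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(t), t.count(" ")] for t in X], dtype=float)


def build_preprocessor(embedder, meta_cols=META_COLS):
    return ColumnTransformer(
        transformers=[
            # A scalar column name hands the transformer a 1-D Series of strings.
            ("tfidf", TfidfVectorizer(), "text"),
            ("embed", embedder, "text"),
            # A list of names hands the transformer a 2-D DataFrame.
            ("meta", StandardScaler(), meta_cols),
        ]
    )


frame = pd.DataFrame({
    "text": ["the rocket reached orbit", "the team won the game", "nasa launched a probe"],
    "char_length": [24, 21, 21],
    "digit_ratio": [0.0, 0.0, 0.0],
})

preprocessor = build_preprocessor(ToyDenseEncoder())
features = preprocessor.fit_transform(frame)
print(features.shape)  # rows x (tf-idf terms + dense dims + metadata columns)
```

    Note the difference between "text" and ["text"] as a column selector: text vectorizers expect a one-dimensional sequence of strings, so the scalar form is the one to use for the two text branches.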

    Model Training and Evaluation

    The fused feature matrix feeds a LogisticRegression model. After fitting on the training split, predictions on the test set are evaluated with accuracy and classification‑report metrics.
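
    A minimal sketch of the training and evaluation step is shown below. The helper train_and_evaluate, the toy corpus, and max_iter=1000 are assumptions; to stay download-free the demo preprocessor has only the TF-IDF and metadata branches, and the embedding branch would be added as a third ColumnTransformer entry.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def train_and_evaluate(preprocessor, X_train, y_train, X_test, y_test):
    """Fit preprocessing plus classifier on the training split, then score the test split."""
    model = Pipeline([
        ("features", preprocessor),
        ("clf", LogisticRegression(max_iter=1000)),  # assumed iteration cap
    ])
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    report = classification_report(y_test, predictions, zero_division=0)
    return model, accuracy, report


def make_frame(texts):
    # One text column plus a single illustrative metadata column.
    return pd.DataFrame({"text": texts, "char_length": [len(t) for t in texts]})


# Toy data: label 0 = space, label 1 = hockey.
train = make_frame([
    "rocket orbit launch", "satellite orbit nasa",
    "hockey goalie puck", "hockey playoffs team",
])
test = make_frame(["orbit rocket nasa", "goalie hockey team"])

preprocessor = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),
    ("meta", StandardScaler(), ["char_length"]),
])

model, accuracy, report = train_and_evaluate(preprocessor, train, [0, 0, 1, 1], test, [0, 1])
print(f"accuracy: {accuracy:.3f}")
print(report)
```

    Because the preprocessor lives inside the Pipeline, calling predict on new data applies the already-fitted vocabulary and scaling statistics automatically, so the test split never influences them.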



    Copyright © 2026 TechStora. All Rights Reserved.