Fusing Dense LLM Embeddings, Sparse TF-IDF, and Structured Metadata
This workflow merges three complementary feature streams into a single scikit-learn pipeline for news-article classification: dense semantic vectors from a sentence transformer, sparse lexical weights generated by TF-IDF, and handcrafted metadata attributes such as length and digit ratio. The pipeline is fitted as one unit, which simplifies feature engineering while preserving each representation's strengths, so the classifier can draw on contextual meaning, exact term matches, and surface-level cues together.
End-to-End Pipeline Construction
The process begins by loading the 20 Newsgroups dataset, then deriving synthetic metadata from the raw text. Crucially, the data is split before any transformer is fitted, so the TF-IDF vocabulary and the metadata scaling statistics are learned exclusively from training data, preventing leakage. The sentence transformer is pre-trained and is not fitted on this data, so it only encodes documents. Three parallel branches are defined: one for TF-IDF, one for LLM embeddings, and one for standardized metadata. These branches are merged with a ColumnTransformer, followed by a LogisticRegression classifier. The final pipeline handles preprocessing, feature fusion, and model training in a single fit call, and predict applies the same fitted transformations to new documents.
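The split-before-fit ordering can be illustrated with a minimal sketch. The toy corpus, the labels, and the split parameters below are illustrative assumptions, not values from the original workflow; only the ordering of the steps is the point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the newsgroup posts (illustrative data only).
docs = [
    "the rocket reached orbit", "nasa launched a new probe",
    "the shuttle docked in space", "astronauts repaired the telescope",
    "the team won the hockey game", "the goalie made a great save",
    "playoffs start next week", "the coach praised the defense",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# 1. Split first, stratifying so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=42
)

# 2. Fit the vectorizer on the training documents only.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(X_train)

# 3. Reuse the fitted vocabulary on the test documents.
#    Words that occur only in the test split are ignored.
test_matrix = vectorizer.transform(X_test)

print(train_matrix.shape, test_matrix.shape)
```

Because the vocabulary and IDF weights come from the training split alone, nothing about the test documents influences the fitted features.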
Data Import and Preparation
Using fetch_20newsgroups, we select a subset of categories and create X_raw (the raw documents) and y (the integer class labels). Five synthetic metadata features are then computed from each document: character length, word count, average word length, uppercase ratio, and digit ratio.
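A minimal sketch of this step follows. The helper names (load_newsgroups, compute_metadata, build_frame), the column names, and the choice of subset="all" are assumptions made for illustration; the ratios are taken relative to the character count, which is one reasonable definition among several.

```python
import pandas as pd


def load_newsgroups(categories):
    """Download the selected categories (needs network access on first use)."""
    from sklearn.datasets import fetch_20newsgroups

    data = fetch_20newsgroups(subset="all", categories=categories)
    return data.data, data.target  # X_raw, y


def compute_metadata(texts):
    """Derive simple per-document statistics from raw text."""
    rows = []
    for text in texts:
        words = text.split()
        n_chars, n_words = len(text), len(words)
        rows.append({
            "char_length": n_chars,
            "word_count": n_words,
            # Guard against empty documents to avoid division by zero.
            "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
            "uppercase_ratio": sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
            "digit_ratio": sum(c.isdigit() for c in text) / n_chars if n_chars else 0.0,
        })
    return pd.DataFrame(rows)


def build_frame(texts):
    """Put the raw text and its metadata side by side in one DataFrame."""
    frame = compute_metadata(texts)
    frame.insert(0, "text", list(texts))
    return frame


frame = build_frame(["Hello World 42", ""])
print(frame)
```

With real data, the frame would be built from the documents returned by load_newsgroups, for example with categories such as "sci.space" and "rec.sport.hockey".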
Metadata Feature Engineering
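Metadata features are assembled into a DataFrame and scaled with StandardScaler, which standardizes each column to zero mean and unit variance. Without this step, large raw values such as character length would dominate the much smaller TF-IDF weights and embedding components.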
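The effect of the scaler can be seen in a short sketch. The numbers below are made-up metadata values, chosen only to show columns on very different scales.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up metadata for four training documents and one test document.
train_meta = pd.DataFrame({
    "char_length": [120, 950, 4300, 610],
    "word_count": [22, 170, 760, 105],
    "digit_ratio": [0.00, 0.02, 0.10, 0.01],
})
test_meta = pd.DataFrame({
    "char_length": [800], "word_count": [140], "digit_ratio": [0.03],
})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_meta)  # learns per-column mean and std
test_scaled = scaler.transform(test_meta)        # reuses the training statistics

print(train_scaled.round(2))
print(test_scaled.round(2))
```

After scaling, every training column has mean 0 and standard deviation 1, so no single metadata feature outweighs the others purely because of its units.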
TF-IDF Extraction
A TfidfVectorizer converts raw text into a high‑dimensional sparse matrix, capturing term frequency and inverse document frequency information.
LLM Embedding Integration
A custom transformer wraps a pre-trained sentence transformer (e.g., all-MiniLM-L6-v2) to generate a dense embedding for each document on the fly within the pipeline.
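One way such a wrapper might look is sketched below. The class name SentenceEmbedder, its parameters, and the lazy-loading approach are assumptions for illustration; the sketch relies on the sentence-transformers package and its SentenceTransformer.encode method, and the model is downloaded the first time transform runs.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SentenceEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper exposing a sentence transformer as a scikit-learn step."""

    def __init__(self, model_name="all-MiniLM-L6-v2", batch_size=32):
        # Only plain hyperparameters are stored so that sklearn.clone() works.
        self.model_name = model_name
        self.batch_size = batch_size

    def _get_model(self):
        # Load lazily: importing and downloading happen on first use only.
        if not hasattr(self, "model_"):
            from sentence_transformers import SentenceTransformer

            self.model_ = SentenceTransformer(self.model_name)
        return self.model_

    def fit(self, X, y=None):
        # The encoder is pre-trained, so there is nothing to learn here.
        return self

    def transform(self, X):
        texts = [str(t) for t in X]
        embeddings = self._get_model().encode(
            texts, batch_size=self.batch_size, show_progress_bar=False
        )
        return np.asarray(embeddings)  # shape: (n_documents, embedding_dim)
```

Because fit is a no-op, this branch cannot leak test information; it simply maps each document to a fixed-length vector. Calling SentenceEmbedder().fit_transform(list_of_texts) returns one row per document.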
ColumnTransformer Fusion
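The three branches are combined via a ColumnTransformer, which routes the text column to both the TF-IDF and embedding branches and the numeric metadata columns to the scaler. The branch outputs are concatenated horizontally, producing a single feature matrix ready for classification.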
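The fusion step might be sketched as follows. To keep the example runnable without downloading a model, the embedding branch here is a stand-in (toy_embed) that returns two simple numbers per document; in the real workflow that slot would hold the sentence-transformer wrapper. The function name build_preprocessor and the column names are likewise illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def toy_embed(texts):
    """Stand-in for a real embedding model: two dense values per document."""
    return np.array([[len(t), t.count(" ") + 1] for t in texts], dtype=float)


def build_preprocessor(embedder, meta_cols):
    return ColumnTransformer(
        transformers=[
            # A string selector ("text") passes a 1-D column, which text
            # transformers expect; a list selector passes a 2-D frame.
            ("tfidf", TfidfVectorizer(), "text"),
            ("embed", embedder, "text"),
            ("meta", StandardScaler(), meta_cols),
        ]
    )


df = pd.DataFrame({
    "text": ["Shuttle launch at 0900", "Great save by the goalie", "Probe reaches Mars orbit"],
    "char_length": [22, 24, 24],
    "digit_ratio": [0.18, 0.0, 0.0],
})

preprocessor = build_preprocessor(FunctionTransformer(toy_embed), ["char_length", "digit_ratio"])
fused = preprocessor.fit_transform(df)
print(fused.shape)  # (documents, tfidf terms + embedding dims + metadata columns)
```

The fused matrix may come back sparse or dense depending on its overall density and the ColumnTransformer sparse_threshold setting; LogisticRegression accepts either.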
Model Training and Evaluation
The fused feature matrix feeds a LogisticRegression model. After fitting on the training split, predictions on the test set are evaluated with overall accuracy and with the per-class precision, recall, and F1 scores from classification_report.
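A compact sketch of fitting and scoring is given below. It uses a toy corpus and, to stay runnable offline, only the TF-IDF and metadata branches; an embedding branch would be added to the ColumnTransformer in the same way. The data, split sizes, and max_iter value are assumptions, and the scores on such a tiny sample are not meaningful.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

texts = [
    "the rocket reached orbit", "nasa launched a new probe",
    "the shuttle docked in space", "astronauts repaired the telescope",
    "the lander touched down on mars", "a comet passed the space station",
    "the team won the hockey game", "the goalie made a great save",
    "playoffs start next week", "the coach praised the defense",
    "the forward scored twice tonight", "fans cheered the hockey team",
]
y = [0] * 6 + [1] * 6

df = pd.DataFrame({"text": texts})
df["char_length"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()

# Split before fitting so all learned statistics come from training rows.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

preprocessor = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),
    ("meta", StandardScaler(), ["char_length", "word_count"]),
])
model = Pipeline([
    ("features", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)       # fits vectorizer, scaler, and classifier
y_pred = model.predict(X_test)    # applies the fitted transforms, then predicts

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print(f"accuracy: {accuracy:.2f}")
print(report)
```

Wrapping the preprocessor and classifier in one Pipeline means the test documents pass through exactly the transformations learned during fit, with no manual bookkeeping.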