
Feature Engineering for Language Models

The researchers explored whether adding predefined, interpretable features to large language models could enhance performance, building on prior work with convolutional neural networks (CNNs). They conducted experiments using "parts of speech (POS) tagging applied to input data" to create supplementary input features.

Methodology

Dataset: WikiText-2 (2 million tokens from Wikipedia)

Key Components:

  • Used Byte Pair Encoding tokenization with NFKC normalization
  • Applied Penn Treebank POS tags via spaCy's model
  • Created a decoder-only transformer with three layers and two attention heads
  • Added an input adapter concatenating baseline inputs with POS embeddings
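Before POS tags can feed an embedding layer, they must be mapped to integer ids. A minimal sketch of that step, assuming a hypothetical subset of the Penn Treebank tag set and a reserved unknown-tag id (the post does not specify the exact vocabulary):

```python
# Illustrative subset of Penn Treebank tags; the actual tag inventory
# used in the experiments is not given in the post.
PTB_TAGS = ["NN", "NNS", "NNP", "VB", "VBD", "VBZ", "JJ", "RB", "DT", "IN"]
TAG2ID = {tag: i for i, tag in enumerate(PTB_TAGS)}
UNK_ID = len(PTB_TAGS)  # reserved id for tags outside the subset

def encode_tags(tags):
    """Convert POS tag strings (e.g. from spaCy's `token.tag_`) to ids."""
    return [TAG2ID.get(t, UNK_ID) for t in tags]

# Tags spaCy would assign to "The cat sat": determiner, noun, past-tense verb
print(encode_tags(["DT", "NN", "VBD"]))  # → [8, 0, 4]
```

In practice the tag strings would come from running spaCy's tagger over the raw text and aligning tags to BPE subword tokens, an alignment step this sketch omits.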

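The input adapter described above can be sketched as a pair of embedding lookups followed by concatenation along the feature axis. This is an assumed reconstruction in NumPy, not the authors' code; the 10-dimensional POS embedding matches the post, while the vocabulary size, model width, and tag count are placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 32   # assumed BPE vocab size and model width
n_tags, d_pos = 12, 10          # tag count assumed; 10-dim POS from the post

# Embedding tables (randomly initialized for illustration)
tok_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(n_tags, d_pos))

def adapt(token_ids, tag_ids):
    """Concatenate each token's embedding with its POS embedding."""
    return np.concatenate([tok_emb[token_ids], pos_emb[tag_ids]], axis=-1)

x = adapt(np.array([5, 7, 9]), np.array([8, 0, 4]))
print(x.shape)  # (3, 42): sequence length 3, d_model + d_pos features
```

The transformer's first projection layer would then consume vectors of width `d_model + d_pos` instead of `d_model`.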
Results

The approach achieved a 15.1% improvement in perplexity, reducing scores from 42.70 to 36.26.
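The 15.1% figure is the relative reduction in perplexity, which can be verified directly from the two reported scores:

```python
# Relative perplexity reduction from the reported baseline and improved scores
baseline, improved = 42.70, 36.26
reduction = (baseline - improved) / baseline
print(f"{reduction:.1%}")  # → 15.1%
```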

Analysis

The researchers examined a ten-dimensional POS embedding using topological data analysis. By filtering for middle-length sequences, they identified distinct sentence structure groups:

  • Group A: Narrative sequencing with temporal/causal phrases
  • Group B: Expository compression with formal, technical reporting

The team concluded that POS features capture "pure syntactic information" and could serve as lightweight fine-tuning tools for language models.