The researchers explored whether adding predefined interpretable features to large language models could enhance performance, building on prior work with CNNs. They conducted experiments using part-of-speech (POS) tagging of the input data to create supplementary features.
Methodology
Dataset: WikiText-2 (2 million tokens from Wikipedia)
Key Components:
- Used Byte Pair Encoding tokenization with NFKC normalization
- Applied Penn Treebank POS tags using a spaCy tagger
- Created a decoder-only transformer with three layers and two attention heads
- Added an input adapter concatenating baseline inputs with POS embeddings
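The input adapter can be sketched as follows. This is a minimal illustration, not the researchers' code: all names and dimensions are assumptions, except the 10-dimensional POS embedding mentioned later in the analysis. It concatenates baseline token embeddings with POS-tag embeddings and projects the result back to the model width.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's values,
# apart from the 10-dim POS embedding noted in the analysis).
d_model, d_pos, vocab_size, n_tags, seq_len = 64, 10, 1000, 45, 8

tok_emb = rng.normal(size=(vocab_size, d_model))       # baseline token embeddings
pos_emb = rng.normal(size=(n_tags, d_pos))             # POS-tag embeddings
W_proj = rng.normal(size=(d_model + d_pos, d_model))   # adapter projection

def input_adapter(token_ids, tag_ids):
    """Concatenate token and POS embeddings, then project to model width."""
    x = np.concatenate([tok_emb[token_ids], pos_emb[tag_ids]], axis=-1)
    return x @ W_proj  # shape: (seq_len, d_model)

token_ids = rng.integers(0, vocab_size, size=seq_len)
tag_ids = rng.integers(0, n_tags, size=seq_len)
out = input_adapter(token_ids, tag_ids)
print(out.shape)  # (8, 64)
```

The projection keeps the transformer's input width unchanged, so the adapter can be bolted onto the baseline model without modifying the attention layers.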
Results
The approach achieved a 15.1% relative improvement in perplexity, reducing it from 42.70 to 36.26.
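The reported figure follows directly from the two perplexity scores, assuming the improvement is measured as a relative reduction:

```python
# Relative perplexity improvement from the reported scores.
baseline, with_pos = 42.70, 36.26
improvement = (baseline - with_pos) / baseline
print(f"{improvement:.1%}")  # 15.1%
```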
Analysis
The researchers examined a ten-dimensional POS embedding using topological data analysis. By filtering for middle-length sequences, they identified distinct sentence structure groups:
- Group A: Narrative sequencing with temporal/causal phrases
- Group B: Expository compression with formal, technical reporting
The team concluded that POS features capture "pure syntactic information" and could serve as lightweight fine-tuning tools for language models.