NOVA-BERT

Non-invasive Self-Attention for Side Information Fusion in Sequential Recommendation

Table of Contents

  • Introduction
  • NOVA Algorithm
  • Installation
  • Quick Start
  • NOVA Model Usage
  • API Reference
  • Configuration
  • Datasets
  • Evaluation
  • Project Structure
  • Citation

Introduction

NOVA-BERT implements the NOVA (non-invasive self-attention) mechanism for sequential recommendation, as proposed in:

Non-invasive Self-attention for Side Information Fusion in Sequential Recommendation

Chang Liu, Xiaoguang Li, Guohao Cai, Zhenhua Dong, Hong Zhu, Lifeng Shang

arXiv:2103.03578 (2021)

The Problem

Traditional approaches to incorporating side information (e.g., item categories, brands, tags) into sequential recommendation models typically directly fuse this information into item embeddings:

```python
item_embedding = item_embedding + category_embedding  # direct fusion
```

This causes information overwhelming: the side information can dominate or distort the learned item representations, leading to suboptimal performance.
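A toy NumPy sketch of this distortion (the vectors are hypothetical, chosen only to illustrate the effect): two distinct items that share a category become nearly indistinguishable after direct fusion when the category embedding is large.

```python
import numpy as np

# Two distinct items that happen to share the same category.
item_a = np.array([1.0, 0.0, 0.0])
item_b = np.array([0.0, 1.0, 0.0])
category = np.array([0.0, 0.0, 5.0])  # large side-info embedding

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(item_a, item_b))                        # 0.0: items are distinct
print(cosine(item_a + category, item_b + category))  # ~0.96: fusion blurs them
```

The fused embeddings are dominated by the shared category direction, which is exactly the failure mode NOVA avoids.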

The NOVA Solution

NOVA takes a fundamentally different approach: instead of modifying item embeddings, it uses side information to improve the attention distribution. This is non-invasive because:

  1. Item embeddings remain unchanged
  2. Side information only affects how items attend to each other
  3. The model learns better attention patterns using auxiliary signals

NOVA Algorithm

Mathematical Formulation

Standard Self-Attention (BERT4Rec)

Attention(Q, K, V) = softmax(Q x K^T / sqrt(d)) x V

where:
Q = X x W_Q (Query projection)
K = X x W_K (Key projection)
V = X x W_V (Value projection)
X = item embeddings
d = dimension per head
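A direct NumPy transcription of this formula (single head, no mask or dropout; the shapes and random weights are illustrative, not the repository's TensorFlow implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Standard scaled dot-product self-attention over item embeddings X."""
    d = W_Q.shape[1]                          # dimension per head
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # query/key/value projections
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 positions, hidden size 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, W_Q, W_K, W_V)
print(out.shape)  # (5, 8)
```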

NOVA Self-Attention

NOVA_Attention(Q, K, V, Q_s, K_s) = softmax(
    Q x K^T / sqrt(d) + l x Q_s x K_s^T / sqrt(d)
) x V

where:
Q, K, V = projections from item embeddings (same as standard)
Q_s = S x W_Qs (Query projection from side info)
K_s = S x W_Ks (Key projection from side info)
S = side information embeddings
l = side_info_weight (controls influence, default 0.5)

Key insight: V still comes from item embeddings only!
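The NOVA formula can be sketched in NumPy the same way (single head, no mask or dropout; shapes and weights are illustrative, not the repository's TensorFlow code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nova_attention(X, S, W_Q, W_K, W_V, W_Qs, W_Ks, l=0.5):
    """Single-head NOVA attention: side info S only shifts the attention
    scores; the values V are computed from item embeddings X alone."""
    d = W_Q.shape[1]                                  # dimension per head
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # item projections
    Q_s, K_s = S @ W_Qs, S @ W_Ks                     # side-info projections
    scores = (Q @ K.T + l * (Q_s @ K_s.T)) / np.sqrt(d)
    return softmax(scores) @ V                        # non-invasive: V is item-only

rng = np.random.default_rng(0)
seq_len, hidden = 5, 8
X = rng.normal(size=(seq_len, hidden))                # item embeddings
S = rng.normal(size=(seq_len, hidden))                # side-info embeddings
W_Q, W_K, W_V, W_Qs, W_Ks = (rng.normal(size=(hidden, hidden)) for _ in range(5))
out = nova_attention(X, S, W_Q, W_K, W_V, W_Qs, W_Ks, l=0.5)
print(out.shape)  # (5, 8)
```

Setting `l=0` recovers standard self-attention, which makes the side-info term easy to ablate.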

Visual Comparison

```
+----------------------------------------------------------------+
|                       STANDARD ATTENTION                       |
+----------------------------------------------------------------+
|                                                                |
|  Item Embeddings --+-- Q --+                                   |
|                    +-- K --+-- Attention Scores -- Softmax --+ |
|                    +-- V ------------------------------------+ |
|                                                         Output |
+----------------------------------------------------------------+
```

```
+------------------------------------------------------------------+
|                          NOVA ATTENTION                          |
+------------------------------------------------------------------+
|                                                                  |
|  Item Embeddings --+-- Q --+                                     |
|                    +-- K --+-- Item Attention --+                |
|                    +-- V -----------------------|------+         |
|                                                 |      |         |
|  Side Info --------+-- Q_s -+                   |      |         |
|                    +-- K_s -+-- Side Attention -+      |         |
|                                                 |      |         |
|                                         Combined Score |         |
|                                                 |      |         |
|                                              Softmax --+- Output |
|                                                                  |
|  Note: V comes ONLY from item embeddings (non-invasive!)         |
+------------------------------------------------------------------+
```

Why Non-Invasive Works Better

| Approach | Method | Problem |
|---|---|---|
| Direct fusion | `emb = item_emb + side_emb` | Side info can overwhelm the item signal |
| Concatenation | `emb = concat(item_emb, side_emb)` | Increases dimension, adds parameters |
| Gating | `emb = gate * item_emb + (1 - gate) * side_emb` | Still modifies the item representation |
| NOVA | Side info affects attention scores only | None of the above: item embeddings preserved, better generalization |

Installation

Requirements

  • Python 2.7+ or Python 3.6+
  • TensorFlow 1.12+ (GPU recommended)
  • CUDA compatible with TensorFlow
  • NumPy, Six

Install

```bash
git clone https://github.com/chenxingqiang/NOVA-BERT.git
cd NOVA-BERT
pip install -r requirements.txt
```

Quick Start

1. Standard Mode (BERT4Rec)

```bash
./run_ml-1m.sh
```

2. NOVA Mode (with side information)

```bash
CUDA_VISIBLE_DEVICES=0 python -u run.py \
  --train_input_file=./data/ml-1m.train.tfrecord \
  --test_input_file=./data/ml-1m.test.tfrecord \
  --vocab_filename=./data/ml-1m.vocab \
  --user_history_filename=./data/ml-1m.his \
  --checkpointDir=./checkpoints/ml-1m-nova \
  --signature=-nova \
  --bert_config_file=./bert_train/bert_config_ml-1m_64.json \
  --do_train=True \
  --do_eval=True \
  --batch_size=256 \
  --max_seq_length=200 \
  --num_train_steps=400000 \
  --learning_rate=1e-4 \
  --use_nova=True \
  --use_side_info=True \
  --side_info_vocab_size=100 \
  --side_info_weight=0.5
```

NOVA Model Usage

Python API

1. Create NOVA Configuration

```python
import modeling

# Method 1: create the configuration programmatically
config = modeling.NOVABertConfig(
    vocab_size=3420,                   # number of items
    hidden_size=64,                    # hidden dimension
    num_hidden_layers=2,               # number of transformer layers
    num_attention_heads=2,             # number of attention heads
    intermediate_size=256,             # feed-forward dimension
    hidden_act="gelu",                 # activation function
    hidden_dropout_prob=0.2,           # hidden-layer dropout
    attention_probs_dropout_prob=0.2,  # attention dropout
    max_position_embeddings=200,       # max sequence length
    # NOVA-specific parameters
    use_side_info=True,                # enable side information
    side_info_vocab_size=100,          # number of categories/tags
    side_info_embedding_size=64,       # side-info embedding dim
    side_info_weight=0.5,              # l: side-info influence weight
)

# Method 2: load from a JSON file
config = modeling.NOVABertConfig.from_json_file("bert_config.json")
```

2. Create NOVA Model

```python
import tensorflow as tf

# Prepare inputs
batch_size = 32
seq_length = 200

input_ids = tf.placeholder(tf.int32, [batch_size, seq_length])      # item IDs
input_mask = tf.placeholder(tf.int32, [batch_size, seq_length])     # attention mask
side_info_ids = tf.placeholder(tf.int32, [batch_size, seq_length])  # category IDs

# Create the NOVA model
model = modeling.NOVABertModel(
    config=config,
    is_training=True,
    input_ids=input_ids,
    input_mask=input_mask,
    side_info_ids=side_info_ids,  # side information (optional)
    use_one_hot_embeddings=False,
)

# Get outputs
sequence_output = model.get_sequence_output()    # [batch, seq, hidden]
embedding_table = model.get_embedding_table()    # [vocab_size, hidden]
side_info_emb = model.get_side_info_embedding()  # [batch, seq, hidden] or None
```

3. Use NOVA Attention Layer Directly

```python
# For custom architectures
context = modeling.nova_attention_layer(
    from_tensor=query_tensor,          # [batch*seq, hidden]
    to_tensor=key_value_tensor,        # [batch*seq, hidden]
    side_info_tensor=side_embeddings,  # [batch*seq, side_hidden] (optional)
    attention_mask=mask,               # [batch, seq, seq]
    num_attention_heads=2,
    size_per_head=32,
    attention_probs_dropout_prob=0.1,
    side_info_weight=0.5,              # l parameter
    do_return_2d_tensor=True,
)
```

4. Use NOVA Transformer

```python
# Full NOVA transformer stack
outputs = modeling.nova_transformer_model(
    input_tensor=embeddings,           # [batch, seq, hidden]
    side_info_tensor=side_embeddings,  # [batch, seq, hidden] (optional)
    attention_mask=mask,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
    side_info_weight=0.5,
    do_return_all_layers=True,
)
```

API Reference

Classes

NOVABertConfig

Configuration class for NOVA-BERT model.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `vocab_size` | int | required | Item vocabulary size |
| `hidden_size` | int | 768 | Hidden layer dimension |
| `num_hidden_layers` | int | 12 | Number of transformer layers |
| `num_attention_heads` | int | 12 | Number of attention heads |
| `intermediate_size` | int | 3072 | Feed-forward layer size |
| `hidden_act` | str | "gelu" | Activation function |
| `hidden_dropout_prob` | float | 0.1 | Hidden layer dropout |
| `attention_probs_dropout_prob` | float | 0.1 | Attention dropout |
| `max_position_embeddings` | int | 512 | Maximum sequence length |
| `use_side_info` | bool | False | Enable side information |
| `side_info_vocab_size` | int | 0 | Side info vocabulary size |
| `side_info_embedding_size` | int | 64 | Side info embedding dimension |
| `side_info_weight` | float | 0.5 | Weight (l) for side info attention |

NOVABertModel

NOVA-enhanced BERT model for sequential recommendation.

| Method | Returns | Description |
|---|---|---|
| `get_sequence_output()` | Tensor `[B, S, H]` | Final hidden states |
| `get_all_encoder_layers()` | List of Tensors | All layer outputs |
| `get_embedding_output()` | Tensor `[B, S, H]` | Input embeddings |
| `get_embedding_table()` | Tensor `[V, H]` | Item embedding table |
| `get_side_info_embedding()` | Tensor or None | Side info embeddings |

Functions

nova_attention_layer

```python
nova_attention_layer(
    from_tensor,            # query source tensor
    to_tensor,              # key/value source tensor
    side_info_tensor=None,  # side information tensor (optional)
    attention_mask=None,    # attention mask
    num_attention_heads=1,  # number of heads
    size_per_head=512,      # dimension per head
    query_act=None,         # query activation
    key_act=None,           # key activation
    value_act=None,         # value activation
    attention_probs_dropout_prob=0.0,
    initializer_range=0.02,
    do_return_2d_tensor=False,
    batch_size=None,
    from_seq_length=None,
    to_seq_length=None,
    side_info_weight=0.5,   # l for side info influence
)
```

nova_transformer_model

```python
nova_transformer_model(
    input_tensor,           # input embeddings [B, S, H]
    side_info_tensor=None,  # side info embeddings [B, S, H]
    attention_mask=None,    # attention mask [B, S, S]
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    intermediate_act_fn=gelu,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    initializer_range=0.02,
    do_return_all_layers=False,
    side_info_weight=0.5,   # l for side info influence
)
```

Command Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `--use_nova` | bool | False | Enable NOVA model |
| `--use_side_info` | bool | False | Use side information in attention |
| `--side_info_vocab_size` | int | 0 | Size of side info vocabulary |
| `--side_info_embedding_size` | int | 64 | Side info embedding dimension |
| `--side_info_weight` | float | 0.5 | Weight (l) for side info attention |

Configuration

Standard BERT4Rec Config

```json
{
  "vocab_size": 3420,
  "hidden_size": 64,
  "num_hidden_layers": 2,
  "num_attention_heads": 2,
  "intermediate_size": 256,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.2,
  "attention_probs_dropout_prob": 0.2,
  "max_position_embeddings": 200,
  "type_vocab_size": 2,
  "initializer_range": 0.02
}
```

NOVA Config (with side information)

```json
{
  "vocab_size": 3420,
  "hidden_size": 64,
  "num_hidden_layers": 2,
  "num_attention_heads": 2,
  "intermediate_size": 256,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.2,
  "attention_probs_dropout_prob": 0.2,
  "max_position_embeddings": 200,
  "type_vocab_size": 2,
  "initializer_range": 0.02,
  "use_side_info": true,
  "side_info_vocab_size": 100,
  "side_info_embedding_size": 64,
  "side_info_weight": 0.5
}
```

Hyperparameter Guidelines

| Parameter | Recommendation | Notes |
|---|---|---|
| `side_info_weight` | 0.3 - 0.7 | Start with 0.5, tune on validation |
| `side_info_embedding_size` | 32 - 128 | Usually equal to `hidden_size` or smaller |
| `side_info_vocab_size` | Depends on data | Number of unique categories/tags |

Datasets

Supported Datasets

| Dataset | Items | Users | Interactions | File |
|---|---|---|---|---|
| ML-1M | 3,416 | 6,040 | 999,611 | `data/ml-1m.txt` |
| ML-20M | 26,744 | 138,493 | 20,000,263 | `data/ml-20m.zip` |
| Beauty | 12,101 | 22,363 | 198,502 | `data/beauty.txt` |
| Steam | 13,047 | 334,730 | 3,693,591 | `data/steam.txt` |

Data Format

Basic Format (user-item interactions)

```
user_id item_id
user_id item_id
...
```

Each line represents one interaction, sorted by timestamp per user.

Side Information Format

Create an additional mapping file for side information:

```
item_id category_id
1 5
2 3
3 5
...
```
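A small sketch of parsing such a mapping and aligning category IDs with an item sequence (pure Python; the sample data, function names, and the `unknown_id=0` convention are illustrative, not part of the repository):

```python
import io

SAMPLE = """\
1 5
2 3
3 5
"""

def load_side_info(lines):
    """Parse item_id -> category_id pairs from whitespace-separated lines."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            item_id, category_id = map(int, parts)
            mapping[item_id] = category_id
    return mapping

def sequence_side_info(item_seq, mapping, unknown_id=0):
    """Align category IDs with an item sequence; unseen items get unknown_id."""
    return [mapping.get(i, unknown_id) for i in item_seq]

mapping = load_side_info(io.StringIO(SAMPLE))
print(sequence_side_info([1, 2, 3, 99], mapping))  # [5, 3, 5, 0]
```

The resulting per-position category IDs are what would feed the model's `side_info_ids` input.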

Evaluation

Metrics

| Metric | Description |
|---|---|
| NDCG@K | Normalized Discounted Cumulative Gain at K |
| Hit@K | Hit Rate at K (1 if target in top-K, else 0) |
| MRR | Mean Reciprocal Rank |
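With a single held-out target item per user, each metric reduces to a simple function of the target's rank; a sketch of the standard definitions (the ranks below are hypothetical, and this is not the repository's evaluation code):

```python
import math

def hit_at_k(rank, k):
    """1.0 if the target item appears in the top-K list, else 0.0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """With one relevant item, DCG = 1/log2(rank + 1) and IDCG = 1."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def reciprocal_rank(rank):
    """Contribution of one user to MRR."""
    return 1.0 / rank

# Hypothetical ranks of the held-out target item for three users
ranks = [1, 3, 12]
print(sum(hit_at_k(r, 10) for r in ranks) / len(ranks))     # Hit@10 ≈ 0.667
print(sum(ndcg_at_k(r, 10) for r in ranks) / len(ranks))    # NDCG@10 = 0.5
print(sum(reciprocal_rank(r) for r in ranks) / len(ranks))  # MRR ≈ 0.472
```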

Expected Results

Results on ML-1M dataset (hidden_size=64, 2 layers):

| Model | NDCG@10 | Hit@10 |
|---|---|---|
| BERT4Rec | ~0.40 | ~0.60 |
| NOVA-BERT | ~0.42 | ~0.62 |

Note: Results may vary based on hyperparameters and random seeds.

Project Structure

```
NOVA-BERT/
+-- modeling.py                  # Model implementations
|   +-- BertConfig               #   Standard BERT config
|   +-- BertModel                #   Standard BERT model
|   +-- NOVABertConfig           #   NOVA config with side info params
|   +-- NOVABertModel            #   NOVA-enhanced BERT model
|   +-- nova_attention_layer     #   NOVA attention mechanism
|   +-- nova_transformer_model   #   NOVA transformer stack
+-- run.py                       # Training and evaluation
+-- optimization.py              # AdamW optimizer
+-- gen_data_fin.py              # Data preprocessing
+-- vocab.py                     # Vocabulary management
+-- util.py                      # Utilities
+-- bert_train/                  # Config files
+-- data/                        # Datasets
+-- requirements.txt             # Dependencies
+-- run_*.sh                     # Training scripts
```

Citation

If you use this code, please cite:

```bibtex
@article{liu2021noninvasive,
  title={Non-invasive Self-attention for Side Information Fusion in Sequential Recommendation},
  author={Liu, Chang and Li, Xiaoguang and Cai, Guohao and Dong, Zhenhua and Zhu, Hong and Shang, Lifeng},
  journal={arXiv preprint arXiv:2103.03578},
  year={2021}
}

@inproceedings{sun2019bert4rec,
  title={BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer},
  author={Sun, Fei and Liu, Jun and Wu, Jian and Pei, Changhua and Lin, Xiao and Ou, Wenwu and Jiang, Peng},
  booktitle={CIKM},
  pages={1441--1450},
  year={2019}
}
```

License

Apache License 2.0. See LICENSE.

Acknowledgments

  • NOVA paper authors for the non-invasive attention mechanism
  • Google AI Language Team for the original BERT implementation
  • BERT4Rec authors for sequential recommendation with BERT
