BibRec

This project is a recommender system for books. It was completed together with other students at university.

Users can rate a book on a scale of 1-10 stars. Based on the information the user provides at login, the system recommends books the user might like. When viewing a book, similar books are recommended as well.

A Random Forest is used as the model-based collaborative filtering algorithm. It predicts the rating a potential user would give a book, based on the user's age, country and state and on the book's year of publication and publisher.
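A minimal sketch of this idea with scikit-learn follows. The toy rows, the column names, the choice of a regressor (rather than a classifier), and the helper predict_rating are assumptions made for illustration, not the project's actual training code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up rows for illustration; the project trains on the cleaned dataset.
train = pd.DataFrame({
    "age": [25, 31, 47, 19, 52, 38],
    "country": ["usa", "germany", "usa", "austria", "germany", "usa"],
    "state": ["california", "bavaria", "texas", "tyrol", "berlin", "california"],
    "year_of_publication": [1999, 2001, 1995, 2003, 1988, 2000],
    "publisher": ["penguin", "other", "penguin", "other", "harpercollins", "penguin"],
    "rating": [8, 6, 9, 5, 7, 8],
})

CATEGORICAL = ["country", "state", "publisher"]

# One-hot encode the categorical features; numeric features pass through.
X = pd.get_dummies(train.drop(columns="rating"), columns=CATEGORICAL, dtype=float)
y = train["rating"]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)


def predict_rating(age, country, state, year_of_publication, publisher):
    """Predict a rating for one (user, book) feature combination (hypothetical helper)."""
    row = pd.DataFrame([{
        "age": age,
        "country": country,
        "state": state,
        "year_of_publication": year_of_publication,
        "publisher": publisher,
    }])
    encoded = pd.get_dummies(row, columns=CATEGORICAL, dtype=float)
    # Align with the training columns; categories unseen in training are dropped.
    encoded = encoded.reindex(columns=X.columns, fill_value=0.0)
    return float(model.predict(encoded)[0])


print(predict_rating(30, "usa", "california", 2000, "penguin"))
```

Because a regression forest averages training targets, the prediction always stays within the range of ratings seen during training.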

A content-based filtering approach is used to recommend similar books. Similarity is inferred by calculating the term frequency–inverse document frequency (TF-IDF) from each book's title and genres.
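The following sketch shows how such a similarity lookup might work with scikit-learn. The three example books, the "id" field, and the helper similar_books are hypothetical; the project's actual preprocessing may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini catalogue; the project uses titles and genres of real books.
books = [
    {"id": "A", "title": "The Hobbit", "genres": "fantasy adventure"},
    {"id": "B", "title": "The Fellowship of the Ring", "genres": "fantasy adventure"},
    {"id": "C", "title": "A Brief History of Time", "genres": "science physics"},
]

# One text document per book: title and genres concatenated.
corpus = [f"{b['title']} {b['genres']}" for b in books]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
similarity = cosine_similarity(tfidf)  # square matrix, one row per book


def similar_books(index, top_n=2):
    """Return ids of the top_n books most similar to books[index]."""
    scores = similarity[index]
    others = [i for i in range(len(books)) if i != index]
    ranked = sorted(others, key=lambda i: scores[i], reverse=True)
    return [books[i]["id"] for i in ranked[:top_n]]


print(similar_books(0, top_n=1))
```

Here the two fantasy titles share genre terms, so each is ranked as the other's nearest neighbour, while the physics book shares no terms with either.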

System Description

Frontend: Vite, React, Material UI, Axios
Backend: REST API with Flask
Algorithms: Random Forest & Content Based Filtering
Libraries
- Initial research with the CaseRecommender
- RS Algorithms used from Scikit-Learn
Python to train models
Jupyter Notebooks for experiments
Build: Makefile, Docker Compose

This project requires around 13GB of free RAM.

Dataset

The BookCrossing dataset was used as the base for this project. It is enhanced with genres taken from OpenLibrary. Normalized and one-hot-encoded versions are stored as files to speed up startup.

Feature Engineering

Data Cleaning

Remove invalid entries
- ISBN Duplicates & Conversion to ISBN-13 Standard
- Split Location into Country, State, City
Data reduction
- Ratings: reduced by 66.6% ⇒ from 1,149,780 to 383,962
- Books: reduced by 0.16% ⇒ from 271,379 to 270,944
- Users: remains the same
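The ISBN conversion and the location split from the cleaning steps above could look roughly like this. The function names are hypothetical, and the "city, state, country" ordering of the Location field is an assumption about the raw data.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit."""
    digits = isbn10.replace("-", "").strip()
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not a valid ISBN-10: {isbn10!r}")
    core = "978" + digits[:9]  # the old check digit is discarded
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


def split_location(location: str) -> dict:
    """Split 'city, state, country' into parts (assumed field order)."""
    parts = [p.strip() for p in location.rsplit(",", 2)]
    # Pad on the left so a lone value is treated as the country.
    parts = [""] * (3 - len(parts)) + parts
    city, state, country = parts
    return {"city": city, "state": state, "country": country}


print(isbn10_to_isbn13("0306406152"))        # 9780306406157
print(split_location("nyc, new york, usa"))
```

Converting every ISBN to one standard before deduplicating means the same book is not counted twice under two notations.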

Data Normalization

Replace missing values with the mean value (Age, Year_of_publication)
Offset the publication year by subtracting 2005
Keep only explicit ratings and correct for rating bias
Extend books by their average rating and number of ratings
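A small pandas sketch of these normalization steps is given below. The column names and toy values are assumptions, and rating bias correction is interpreted here as subtracting each user's mean rating, which may not match the project's exact method. In BookCrossing a rating of 0 marks an implicit interaction, so only ratings above 0 are kept.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "age": [20.0, None, 40.0]})
books = pd.DataFrame({"isbn": ["a", "b"], "year_of_publication": [1995.0, None]})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "isbn": ["a", "b", "a", "b", "a"],
    "rating": [8, 0, 6, 10, 0],  # 0 = implicit rating
})

# 1. Replace missing values with the column mean.
users["age"] = users["age"].fillna(users["age"].mean())
books["year_of_publication"] = books["year_of_publication"].fillna(
    books["year_of_publication"].mean()
)

# 2. Offset the publication year by 2005.
books["year_offset"] = books["year_of_publication"] - 2005

# 3. Keep explicit ratings only and subtract each user's mean (assumed bias correction).
explicit = ratings[ratings["rating"] > 0].copy()
user_mean = explicit.groupby("user_id")["rating"].transform("mean")
explicit["rating_centered"] = explicit["rating"] - user_mean

# 4. Extend books with their average rating and number of ratings.
stats = (
    explicit.groupby("isbn")["rating"]
    .agg(avg_rating="mean", rating_count="count")
    .reset_index()
)
books = books.merge(stats, on="isbn", how="left")

print(books)
print(explicit)
```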

One-Hot Encoding

Categorization of publisher/country/state into the most common and “other”
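One possible way to implement this categorization with pandas is sketched below. The helper one_hot_top_n, the cut-off of two categories, and the publisher names are assumptions for illustration.

```python
import pandas as pd


def one_hot_top_n(series: pd.Series, top_n: int, prefix: str) -> pd.DataFrame:
    """Keep the top_n most frequent values, map the rest to 'other', then one-hot encode."""
    top = series.value_counts().head(top_n).index  # value_counts sorts by frequency
    reduced = series.where(series.isin(top), "other")
    return pd.get_dummies(reduced, prefix=prefix, dtype=int)


publishers = pd.Series(
    ["penguin", "penguin", "tor", "tor", "tor", "small press", "indie"]
)
encoded = one_hot_top_n(publishers, top_n=2, prefix="publisher")
print(encoded)
```

Capping the number of categories keeps the encoded matrix narrow, which matters with hundreds of thousands of rows and limited RAM.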

Data extension

Genre, Subject from OpenLibrary data

Recommendation Strategy

Item Recommendations: Content Based: TF-IDF
- Data used: Title, Genres, Subjects
- Calculation: cosine similarity
User Recommendations: MBCF: Random Forest
- Features: Country, State, Age, Year-of-Publication, Publisher
Hybrid approach (Mixed): Collaborative Filtering + Content based
- Used in the evaluation API
- Ratio: 70% to 30%
Most Popular (Cold Start)
- 80% Most Rated and 20% Least Rated
- Merge and mix
Top in Country
- Top 50 most rated books in a country
- sorted by rating
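Both the hybrid approach (70% to 30%) and the cold-start list (80% to 20%) combine two ranked lists at a fixed ratio. A minimal sketch of such a merge follows; the function mix_by_ratio, the deduplication, and the final shuffle are assumptions about how "merge and mix" could be done.

```python
import random


def mix_by_ratio(primary, secondary, ratio, total, seed=None):
    """Take about ratio * total items from primary, fill up from secondary, then shuffle."""
    n_primary = round(total * ratio)
    picked = list(primary[:n_primary])
    seen = set(picked)
    for item in secondary:
        if len(picked) >= total:
            break
        if item not in seen:  # skip items already taken from the primary list
            picked.append(item)
            seen.add(item)
    random.Random(seed).shuffle(picked)  # mix the two sources
    return picked


# Hybrid: 70% collaborative filtering, 30% content based (placeholder ids).
cf_books = ["a", "b", "c", "d", "e", "f", "g", "h"]
cb_books = ["a", "x", "y", "z"]
print(mix_by_ratio(cf_books, cb_books, ratio=0.7, total=10, seed=1))

# Cold start: 80% most rated, 20% least rated.
print(mix_by_ratio(["m1", "m2", "m3", "m4", "m5"], ["l1", "l2"], ratio=0.8, total=5, seed=1))
```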

Lessons

Python was partly difficult
- Time-consuming (new territory, tinkering with DataFrames)
- Extraction into separate files partly didn’t work
- Jupyter Notebooks are great for interactive experiments + documentation
Performance issues with large amounts of data
- RAM limits cause problems (OOM, system crash)
- Data processing must be very precise (example: per-user split)
Development
- Introduce shared code base / normalization earlier
- Loading models isn’t feasible (MODEL_FILE_PKL)
- Start implementation earlier in the semester
  - Less is forgotten / overlooked / left out

Fabian Untermoser
