BibRec

BibRec is a recommender system for books. It was built together with other students as a university project.

Users can rate a book on a scale of 1 to 10 stars. Based on the information the user provides at login, the system recommends books the user might like. When viewing a book, similar books are recommended as well.

A Random Forest is used as the model-based collaborative filtering (MBCF) algorithm. It predicts the rating a potential user would give a book from the user's age, country and state, and from the book's year of publication and publisher.

A content-based filtering approach is used to recommend similar books. Similarity is computed from term frequency–inverse document frequency (TF-IDF) vectors built from each book's title and genres.
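The idea behind the similar-books recommendation can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the project's actual code. The example books, the function names `tfidf_vectors` and `cosine_similarity`, and the smoothed IDF formula (one common variant) are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (TF times smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical books: title and genres joined into one text per book.
books = [
    "the hobbit fantasy adventure",
    "the lord of the rings fantasy adventure",
    "a brief history of time science physics",
]
vectors = tfidf_vectors(books)
sim_fantasy = cosine_similarity(vectors[0], vectors[1])  # shared genre terms
sim_mixed = cosine_similarity(vectors[0], vectors[2])    # no shared terms
```

Books that share title or genre terms end up with a higher cosine similarity than books with no terms in common, which is what drives the "similar books" list.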

System Description

This project requires around 13GB of free RAM.

Dataset

The project is based on the BookCrossing dataset, which is enriched with genres taken from OpenLibrary. Normalized and one-hot encoded versions are stored as files to speed up startup.

Feature Engineering

Data Cleaning

  • Remove invalid entries
    • Remove duplicate ISBNs and convert ISBNs to the ISBN-13 standard
    • Split Location into Country, State, City
  • Data reduction
    • Ratings: reduced by 66.6%, from 1,149,780 to 383,962
    • Books: reduced by 0.16%, from 271,379 to 270,944
    • Users: remains the same
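Two of the cleaning steps above can be sketched as follows. This is an illustrative sketch, not the project's code. The function names are hypothetical, and the "city, state, country" order of the Location field is an assumption about the raw data.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13: prefix 978, recompute the check digit."""
    digits = isbn10.replace("-", "").strip()
    if len(digits) != 10:
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    core = "978" + digits[:9]  # the old ISBN-10 check digit is dropped
    if not core.isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

def split_location(location):
    """Split a 'city, state, country' string (assumed order) into its parts."""
    parts = [part.strip() for part in location.split(",")]
    parts += [""] * (3 - len(parts))  # pad entries with fewer than 3 parts
    return {"city": parts[0], "state": parts[1], "country": parts[2]}

isbn13 = isbn10_to_isbn13("0-306-40615-2")
location = split_location("nyc, new york, usa")
```

Converting every ISBN to one standard before deduplicating means the same book cannot survive twice under two different identifiers.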

Data Normalization

  • Replace missing values with the mean value (Age, Year_of_publication)
  • Offset the publication year by subtracting 2005
  • Keep only explicit ratings and apply rating bias correction
  • Extend with average rating and number of ratings
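The normalization steps could look roughly like the sketch below. It uses only the standard library and is not the project's code. Two points are assumptions: that a rating of 0 marks an implicit rating in the raw data, and that "rating bias correction" means subtracting each user's mean rating. All function names are hypothetical.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def offset_year(year, reference=2005):
    """Shift the publication year so that 2005 maps to 0."""
    return year - reference

def center_ratings_per_user(ratings):
    """Keep explicit ratings and subtract each user's mean rating.

    ratings: list of (user_id, isbn, rating); rating 0 is assumed implicit.
    """
    explicit = [r for r in ratings if r[2] > 0]
    by_user = {}
    for user, _, rating in explicit:
        by_user.setdefault(user, []).append(rating)
    user_mean = {user: mean(rs) for user, rs in by_user.items()}
    return [(u, isbn, rating - user_mean[u]) for u, isbn, rating in explicit]

ages = fill_missing_with_mean([20, None, 40])
years = [offset_year(y) for y in (1999, 2005)]
centered = center_ratings_per_user(
    [("u1", "a", 8), ("u1", "b", 10), ("u1", "c", 0), ("u2", "a", 5)]
)
```

Centering ratings per user removes the difference between users who rate generously and users who rate strictly, so the model sees preferences relative to each user's own baseline.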

One-Hot Encoding

  • Categorization of publisher/country/state into the most common values and “other”
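Grouping rare values into “other” before encoding keeps the number of columns small. The sketch below shows the idea with the standard library only. The function name `one_hot_top_n`, the `top_n` cutoff, and the sample data are assumptions for illustration.

```python
from collections import Counter

def one_hot_top_n(values, top_n):
    """One-hot encode values, keeping the top_n most common as categories
    and mapping everything else to a shared "other" column."""
    counts = Counter(values)
    top = [value for value, _ in counts.most_common(top_n)]
    categories = top + ["other"]
    rows = []
    for value in values:
        label = value if value in top else "other"
        rows.append([1 if label == category else 0 for category in categories])
    return categories, rows

# Hypothetical country column.
countries = ["usa", "usa", "germany", "usa", "germany", "spain", "peru"]
categories, rows = one_hot_top_n(countries, top_n=2)
```

Each row contains exactly one 1, and both "spain" and "peru" land in the same "other" column.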

Data Extension

  • Genre, Subject from OpenLibrary data

Recommendation Strategy

  • Item Recommendations: Content Based: TF-IDF
    • Data used: Title, Genres, Subjects
    • Calculation: cosine similarity
  • User Recommendations: Model-Based Collaborative Filtering (MBCF): Random Forest
    • Features: Country, State, Age, Year-of-Publication, Publisher
  • Hybrid approach (Mixed): Collaborative Filtering + Content-Based
    • Used in the evaluation API
    • Ratio: 70% to 30%
  • Most Popular (Cold Start)
    • 80% Most Rated and 20% Least Rated
    • Merge and mix
  • Top in Country
    • Top 50 most rated books in a country
    • Sorted by rating
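The mixing strategies in the list above can be sketched as follows. This is one possible reading, not the project's implementation. It assumes that the 70% share belongs to collaborative filtering, that "merge and mix" means shuffling the merged list, and that "sorted by rating" means sorting by average rating. All function names and sample data are hypothetical.

```python
import random
from collections import defaultdict

def mix_by_ratio(primary, secondary, n, primary_share):
    """Take about primary_share of n items from primary, fill the rest
    from secondary, and skip duplicates."""
    n_primary = round(n * primary_share)
    picked = list(dict.fromkeys(primary))[:n_primary]
    for item in secondary:
        if len(picked) >= n:
            break
        if item not in picked:
            picked.append(item)
    return picked

def cold_start(most_rated, least_rated, n, seed=None):
    """80% most rated plus 20% least rated books, merged and shuffled."""
    picked = mix_by_ratio(most_rated, least_rated, n, primary_share=0.8)
    random.Random(seed).shuffle(picked)
    return picked

def top_in_country(ratings, country, limit=50):
    """Most rated books in a country, sorted by average rating.

    ratings: iterable of (country, isbn, rating).
    """
    by_book = defaultdict(list)
    for rating_country, isbn, rating in ratings:
        if rating_country == country:
            by_book[isbn].append(rating)
    most_rated = sorted(by_book, key=lambda b: len(by_book[b]), reverse=True)
    return sorted(
        most_rated[:limit],
        key=lambda b: sum(by_book[b]) / len(by_book[b]),
        reverse=True,
    )

# Hypothetical recommendation lists, best first.
collaborative = [f"cf{i}" for i in range(10)]
content_based = [f"cb{i}" for i in range(10)]
hybrid = mix_by_ratio(collaborative, content_based, n=10, primary_share=0.7)
```

Including a share of rarely rated books in the cold-start list gives less popular titles a chance to collect ratings instead of showing only the same best-known books.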

Lessons

  • Python was difficult at times

    • Time-consuming (new territory, tinkering with DataFrames)
    • Extracting code into separate files did not always work
    • Jupyter Notebooks are great for interactive experiments + documentation
  • Performance issues with large amounts of data

    • RAM limits cause problems (OOM, system crash)
    • Data processing must be very precise (example: per-user split)
  • Development

    • Introduce shared code base / normalization earlier
    • Loading models isn’t feasible (MODEL_FILE_PKL)
    • Start implementation earlier in the semester
      • Less is forgotten / overlooked / left out