Rap Gender Data Prep

We are in a golden age of female rap!

Rap has historically been dominated by men, as reflected in the charts and the canon. This is not to downplay the contributions of numerous women (and female-identifying people) to the genre, but on numbers alone they have been in the minority. In the last decade, however, the trend has started to reverse. While still far from parity, women are significantly better represented practically everywhere within the genre.

Given the several-fold increase in rap by women, I got an idea for a data project a few months ago: Could I train a model to correctly guess the gender of the rapper who said a given line?

For the time being, the answer is “no”, or at least “not yet”.

I pulled over 200,000 verses by some 600 artists from Genius, then trained a few models to perform sentence classification on that data, with little success. The task may simply be impossible, but there are also plenty of ways I could improve on my work.

In the meantime, however, I have been searching for a job. So I figured I would write up the preliminary work I did to acquire, clean, standardize, and optimize the data before I started training models, as a case study in how I approach this kind of work.


  1. Intro
  2. A Note on the Data

Data Acquisition

  1. Data Acquisition Introduction
  2. Search
  3. Filter
  4. Refilter with alternate names
  5. Dates
  6. Output
  7. Lyrics
  8. Verses
  9. Additional cleaning and standardization
  10. Add metrics

Exploratory Data Analysis

  1. Imbalanced classes
  2. Overrepresentation
  3. Maximizing input data
  4. Checking for biased classes


  1. Some thoughts on optimization
  2. Removing recurring text
  3. Downsampling overrepresented artists
  4. Removing verses with low unique token percentages


  1. Review
  2. Imbalanced classes revisited
  3. Overrepresentation revisited
  4. Conclusion and future considerations