Risk Score Estimation

Eugenio "Jay" Zuccarelli
4 min read · May 21, 2020

Using machine learning to develop better risk scores

Assigning a numerical risk score is vital in many settings. For instance, hospitals assign a score to each arriving patient to prioritize those deemed most critical (the so-called “Triage System”). Financial services companies use scores such as FICO’s to estimate how likely we are to repay a loan. Car insurance firms use a wide variety of measures to assess the risk of drivers getting into road accidents.

However, many of these risk scores are developed manually and qualitatively, based on hand-crafted rules. As a result, they often fail to capture the actual risk fully and rest on assumptions that make them subject to bias. In contrast, machine learning has proven to be not just a viable alternative but often a better one, introducing quantitative methods that can outperform the existing techniques and are less prone to subjective decisions.

So, How Do You Develop a Risk Score?

Developing a risk score consists of a series of steps that, regardless of the application, are almost always the same:

1. Define a Binary Outcome

It starts with thinking about your objective in binary terms. In the context of loan repayment, this is easy: the outcome is the repayment of the loan itself, which has only two possible values. The person either repays the loan or defaults.

In hospital settings, this might be less intuitive. For instance, when a patient shows up at an Emergency Department, the outcome of interest could be the length of stay in the hospital for that patient. In that case, the binary outcome can be defined as whether the stay (expressed in days) will be shorter than average, or longer than average.
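As a rough illustration of the length-of-stay example, the sketch below turns a continuous number of days into a binary label by comparing each stay with the sample average. The data and variable names are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical lengths of stay, in days, for six past patients.
lengths_of_stay = [1, 2, 2, 3, 5, 11]

# Threshold: the average stay across the sample (4 days for this toy data).
average_stay = mean(lengths_of_stay)

# y = 1 if the stay is longer than average, y = 0 otherwise.
y = [1 if days > average_stay else 0 for days in lengths_of_stay]

print(average_stay, y)
```

The average is the threshold described above. Other cut-offs, such as the median, are possible design choices when stays are heavily skewed.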

2. Build a Classification Model

Once we have identified our outcome of interest, we can build a binary classification model, such as a logistic regression or a decision tree, to predict the outcome variable. The model takes as input the feature information available to us (X) and predicts whether the outcome happens (y=1) or does not (y=0).
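A minimal sketch of this step, assuming scikit-learn is available and using logistic regression on a toy loan dataset. The features, values, and variable names are hypothetical, not taken from any of the projects described here.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per borrower.
# Features (X): [annual income in thousands, number of past missed payments]
X = [
    [25, 4], [30, 3], [35, 5], [40, 2],
    [60, 0], [75, 1], [90, 0], [120, 0],
]
# Outcome (y): 1 = default, 0 = loan repaid.
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Fit a binary classifier on the available features.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predicted class (1 or 0) for two new, hypothetical applicants.
new_applicants = [[28, 4], [100, 0]]
predictions = model.predict(new_applicants)
print(predictions)
```

In practice the data would be split into training and test sets so the model can be evaluated on samples it has not seen. That step is omitted here to keep the sketch short.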

3. Extract the Sample’s Probability

Once the model is trained, we can use it to predict the probability of the outcome occurring for each sample in the available data. In particular, we extract the probability associated with the positive class (y=1).
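The sketch below shows one way to do this with scikit-learn, here using a shallow decision tree. The driver data is hypothetical. `predict_proba` returns one column per class, so the column for y=1 is looked up through `model.classes_` rather than assumed.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: [driver age, years holding a licence]
X = [[18, 0], [19, 1], [20, 1], [22, 2], [35, 15], [40, 20], [50, 30], [60, 40]]
# 1 = severe accident on record, 0 = none
y = [1, 1, 0, 1, 0, 0, 1, 0]

# A shallow tree usually has impure leaves, so the estimated probabilities
# are class fractions within each leaf rather than only 0 or 1.
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(X, y)

# predict_proba returns one column per class, ordered as in model.classes_.
positive_column = list(model.classes_).index(1)
probabilities = model.predict_proba(X)[:, positive_column]
print(probabilities)
```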

4. Map the Probability to a Scale

Finally, the probability can be mapped to the scale of interest. For instance, if we want a risk score ranging from 0, meaning no risk, to 5, the highest level of risk, we can simply multiply the probability by 5 to obtain the rescaled score.
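The rescaling can be written as a small helper. The function name and the `low` and `high` parameters below are illustrative choices. With `low=0` it reduces to multiplying the probability by the top of the scale, and a non-zero `low` covers scales that do not start at 0, such as 1 to 5.

```python
def to_risk_score(probability: float, low: float = 0.0, high: float = 5.0) -> float:
    """Linearly map a probability in [0, 1] onto the scale [low, high]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return low + probability * (high - low)


print(to_risk_score(0.5))           # 2.5 on the 0-to-5 scale
print(to_risk_score(0.5, low=1.0))  # 3.0 on a 1-to-5 scale
```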

Use Cases

Over the past year, I have worked on various projects which consisted of developing a risk score:

  1. Driver Score
    I developed a machine learning model that takes information about a driver and returns a risk score indicating how dangerous that driver is. I started by defining the outcome as whether or not the driver will get into a severe or fatal accident. Then, I built a CART-like decision tree trained on drivers’ biographical and vehicle information. Finally, I extracted the predicted probabilities and mapped them onto a 1 to 5 scale. This Driver Risk Score could be used not only by insurance companies to better assess the drivers they cover, but also by drivers themselves, who can use the quantitative feedback to improve. You can read more about it here.
  2. Patient’s Mortality Risk Score
    In my current research at MIT, I work with patient records of children affected by congenital heart disease to develop a model that estimates the risk of death associated with a given procedure. This is vital to ensure that such vulnerable patients are assigned the procedure most likely to safeguard their wellbeing. In this case, the binary outcome is simply whether or not a patient survives a particular procedure. After building a classification tree, I extracted the probabilities and assigned a score to each patient. This can help clinicians decide whether a certain procedure is indeed the best option given that patient’s unique set of characteristics. The model has now become a web tool and iPhone app that physicians can use to help guide decision making.
  3. Senior Facilities COVID-19 Risk Score
    As a member of the COVID-19 Policy Alliance, I helped develop the Senior Facilities Score, which measures the risk of a given facility having a COVID-19 outbreak. The model is based on multiple facility-level datasets and predicts whether or not a facility has at least one confirmed case. Once the model is trained, the resulting risk score can inform states about which facilities are most at risk, so that resources can be allocated to those facilities first. The model is currently being published and is available here.

Summary

To summarise, developing a risk score is a matter of following a few simple steps:

  1. Define a Binary Outcome
  2. Build a Classification Model
  3. Extract the Sample’s Probability
  4. Map the Probability to a Scale

This method can be used across disciplines to inform decision-makers intuitively by assigning a quantitative measure of risk to the aspect of interest, while limiting the human bias that manual, rule-based scores tend to introduce.

To read more articles like this, follow me on Twitter, LinkedIn or my Website.

Eugenio "Jay" Zuccarelli

Data Science for Fortune 100 | Forbes 30 Under 30 | Fortune 40 Under 40 | MIT, Harvard, Imperial College | Follow on Socials as @jayzuccarelli