Machine Learning, Supervised Learning

With self-driving cars, voice-controlled personal assistants, and recommendation engines built into your favorite streaming services, it seems like we are only at the tip of the iceberg when it comes to machine learning (ML). For many of us, it is increasingly governing much of how we live and choose to spend our time.

Ever passionate about the technology that defines our world, our engineers at Index Exchange have leaned heavily into ML, specifically in the realm of supervised learning. Our ML-enhanced pipeline, Supply Traffic Filter, is a good example of how we’ve invested in the space and we’re here to explain exactly how we did it.

Step 1: Pick the machine learning challenge

The first step in an ML project is defining the challenge. Not all problems are good candidates for ML. Consider:

Would solving the problem manually require a lot of time? Involve extensive rules?
Do outcomes change so often it is hard to maintain?
Is it difficult to get insights into the problem because data volume is too high?

At Index, one such challenge was managing our increasingly massive scale. For example, when COVID-19 hit, we instantaneously saw a huge increase in traffic to our edge at the same time as our data center technicians were grounded with travel restrictions. We needed to determine: could we continue scaling traffic while being more efficient with how we handled it, with little or no impact to the media owners and marketers who rely on our platform? Predicting the value of incoming auctions would be challenging for us due to our huge data volume. When combined, all of these factors suited an ML challenge, and specifically, a supervised machine learning approach.

Step 2: Select features

We now have a machine learning challenge to solve. Exciting! Next is choosing the best features. An ML feature is a measurable property or characteristic. Feature selection is crucial, as an ML model’s performance is highly dependent on it. The right features can significantly improve model performance and decrease training time. Conversely, irrelevant features can adversely affect model performance. So, how do you know if a feature is suitable or not for your needs?

Identifying features

Start by finding all possible features. When we first started working on Supply Traffic Filter, our researchers faced a multitude of potential features, so they reached out to domain experts within the company to help gather the list.

Pre-processing the feature set

In a perfect world we would have clean data, but in reality, there is a lot of noise and randomness. As a result, ML engineers typically need to spend a lot of time understanding and cleaning their data.

Some features are textual, but ML models expect numerical values. It is the engineers’ responsibility to translate these features into a numerical format, most commonly with one-hot encoding or integer encoding. One-hot encoding often performs better than integer encoding because it does not impose an artificial order on the categories, but it adds one column per category and so costs more training time. As a rule of thumb, one-hot encoding suits small datasets, while integer encoding is often the more practical choice for large ones.
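To make the two encodings concrete, here is a minimal sketch in plain Python. It is not our production code, and the `device_types` feature is a made-up example used only for illustration.

```python
def integer_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector that contains a single 1."""
    codes, mapping = integer_encode(values)
    width = len(mapping)
    rows = [[1 if i == code else 0 for i in range(width)] for code in codes]
    return rows, mapping


# Hypothetical textual feature: the device type of an incoming request.
device_types = ["mobile", "desktop", "mobile", "tablet"]

int_codes, int_mapping = integer_encode(device_types)
one_hot_rows, _ = one_hot_encode(device_types)

print(int_codes)     # one integer per row
print(one_hot_rows)  # one column per distinct category
```

Note the trade-off: integer encoding keeps a single column but suggests that "tablet" is somehow greater than "mobile", while one-hot encoding avoids that at the cost of one extra column per category.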

Reducing the feature set

Next, we try to reduce the number of features in a process known as dimensionality reduction. Having a lot of features does not necessarily improve your model’s performance. These are two key reasons why the number of features needs to be limited:

Generalize the model: While a feature may appear correlated with an outcome, this correlation might not generalize to future data.
Reduce training and prediction time: Fewer features enable faster model training and outcome prediction.

Okay, so we understand that the number of features must be limited, but how do we reduce our feature dimensionality? While there are many different techniques, we suggest evaluating feature importance. This technique scores the relevance of a feature to an outcome. In essence, this means that we should keep the high-scoring features and drop the low-scoring ones.

In our case, the Supply Traffic Filter project started out with 50 features, but we were able to narrow them down to just 11 to start.
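As a rough sketch of scoring features and keeping the best ones, the example below uses information gain as the importance score. That choice is an assumption made for illustration (this article does not say which scoring method we used), and the feature names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """How much knowing the feature reduces uncertainty about the label."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_features(columns, labels, keep):
    """Score every feature, then keep the names of the `keep` best ones."""
    scores = {name: information_gain(vals, labels) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Toy outcome: 1 = the auction received a bid, 0 = no bid.
labels = [1, 1, 0, 0, 1, 0]
columns = {
    "ad_format": ["video", "video", "banner", "banner", "video", "banner"],
    "day_parity": ["odd", "even", "odd", "even", "odd", "even"],
    "constant": ["x"] * 6,
}

kept, scores = select_features(columns, labels, keep=2)
print(kept)
```

In this toy data `ad_format` separates the outcomes perfectly, `day_parity` helps only slightly, and `constant` carries no information, so it is the one that gets dropped.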

Step 3: Select a machine learning model

At this stage we have to choose the learning algorithm. In supervised learning, the two main types of model algorithms are:

Algorithm Type	Predicts
Classification	Target classes
Regression	Value-likelihood

For our Supply Traffic Filter challenge, our goal was to classify the “no-bid” auctions before they happen, making it a classification problem. Interestingly, however, our research team ended up choosing a regression model. We want to filter traffic by a dynamic amount based on load. If our exchange nodes are heavily overloaded, we need more aggressive filtering to keep them operational and performant. The regression model scores each auction’s predicted relevance to our buyers. Based on this score, our servers can dynamically filter traffic in proportion to the volume each server is actively handling.
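One way to picture why a score is more useful than a yes/no class here: the cut-off can move with load. The sketch below is an illustration only. The linear ramp, the `start` and `max_threshold` values, and the function names are all assumptions, not a description of how our servers actually do it.

```python
def score_threshold(load, start=0.7, max_threshold=0.2):
    """Minimum score an auction needs in order to be processed.

    Assumption: no filtering while load (0.0 to 1.0) is at or below `start`,
    then the threshold ramps linearly up to `max_threshold` at full load.
    """
    if load <= start:
        return 0.0
    ramp = min(1.0, (load - start) / (1.0 - start))
    return max_threshold * ramp


def should_process(score, load):
    """Keep the auction if its predicted relevance clears the current bar."""
    return score >= score_threshold(load)


# Hypothetical (auction id, model score) pairs.
auctions = [("a", 0.91), ("b", 0.05), ("c", 0.40), ("d", 0.02)]

kept_when_idle = [name for name, score in auctions if should_process(score, load=0.5)]
kept_when_busy = [name for name, score in auctions if should_process(score, load=1.0)]

print(kept_when_idle)
print(kept_when_busy)
```

With a classifier, the model itself would have to decide where the line sits. With a score, the same model serves both the idle and the overloaded case, and only the threshold changes.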

Understanding machine learning algorithms

In supervised learning, there are many widely used learning algorithms. When picking one, it is important that we have a good understanding of feature characteristics and the pros and cons of each algorithm.

Supervised learning algorithm	Pros	Cons
Logistic regression	Simple to implement and understand; low cost on hyperparameter tuning; fast to train	Poor performance on non-linear data; sensitive to noise
Decision tree	Scales well with large training data size; easy to visualize and explain	Sensitive to overfitting; sensitive to data
Support vector machine	Robust against outliers; great performance with small training data size	Slow to train when data size is big; sensitive to hyperparameters
K-nearest neighbor	Simple to understand and implement; low cost on hyperparameter tuning; no assumptions about the data	Slow to infer for large datasets; doesn’t scale well with large datasets
Neural network	Works well with non-linear datasets; works well with high-dimensional datasets; large body of academic research	Computationally expensive to train; difficult to explain the model

The next step is to look at our dataset and find the algorithm that is best suited to it. It is not enough to just look at the pros and cons.

For example, Index transacts hundreds of billions of auctions every day, which means we can eliminate the support vector machine and k-nearest neighbor algorithms, as they are slow to train or slow to infer with large datasets. Next, many of the features in our dataset were categorical, and therefore non-linear. This allowed our research team to eliminate the logistic regression algorithm (optimized for linear datasets). Lastly, we wanted an algorithm that was easy to explain, which eliminated the neural network algorithm. This left us with a single choice of learning algorithm: decision tree.

Step 4: Train the model

If you followed along until this step, you now have the features and learning algorithm required to train a model! ML engineers spend much of their time cleaning up the features and finding the right training algorithm. With that out of the way, we can get to work training and testing the model.

Prior to training a model, we need to split the dataset into three different groups: training, validation, and testing. This is so that we can generalize the model and collect unbiased performance scores. A common split is for 70% of the dataset to be used for training, 15% for validation, and the remaining 15% for testing.
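A 70/15/15 split can be sketched in a few lines. This is a minimal illustration that assumes the rows can be shuffled freely; the function name and the fixed seed are our own choices for the example.

```python
import random


def split_dataset(rows, train=0.70, validation=0.15, seed=0):
    """Shuffle the rows, then cut them into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever remains is the test set
    )


rows = list(range(1000))  # stand-in for 1,000 labelled auctions
train_rows, validation_rows, test_rows = split_dataset(rows)

print(len(train_rows), len(validation_rows), len(test_rows))
```

The important property is that the three groups never overlap, so the test score is measured on rows the model has not seen during training or tuning.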

Training and tuning the model

The next step is training and optimizing the model to maximize performance. Model performance can vary depending on its hyperparameters. The learning algorithm defines the high-level learning style, whereas the hyperparameters define the details of how the model should learn. At Index, we used a decision tree model, which is prone to overfitting. Tuning its hyperparameters allows us to reduce statistical noise while maintaining the model’s ability to follow the statistical trends in the data. The key hyperparameters for a decision tree are:

Minimum leaf samples: If the minimum number of samples within a leaf node is too low, the model becomes overfitted (it memorizes the training data, including its noise). Too high, and the model becomes underfitted (it fails to capture the underlying trend of the training data).
Maximum depth: This sets the maximum depth of the decision tree. It prevents the tree from branching further once the depth reaches this value. If the maximum depth is too high, the model becomes overfitted, and if it is too low, the model becomes underfitted.

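How do we ensure that the hyperparameters are optimal and generalized? Grid search systematically tries combinations of hyperparameter values, and cross-validation scores each combination on data held out from training. Used together, they let us tune hyperparameters methodically rather than by guesswork.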
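To show how the two hyperparameters and the tuning loop fit together, here is a self-contained sketch: a toy one-feature regression tree, tuned with grid search and k-fold cross-validation. Everything in it is an assumption made for illustration (the tiny tree implementation, the synthetic data, the grid values, and the choice of four folds); a real project would normally rely on an established ML library instead.

```python
import itertools
import random


def mean(values):
    return sum(values) / len(values)


def squared_error(points):
    """Sum of squared distances of each y from the mean y."""
    center = mean([y for _, y in points])
    return sum((y - center) ** 2 for _, y in points)


def build_tree(points, max_depth, min_leaf_samples):
    """Fit a toy 1-D regression tree to (x, y) points sorted by x.

    Returns a float for a leaf, or (threshold, left, right) for a split.
    """
    leaf_value = mean([y for _, y in points])
    if max_depth == 0 or len(points) < 2 * min_leaf_samples:
        return leaf_value
    best = None  # (error, split index)
    for i in range(min_leaf_samples, len(points) - min_leaf_samples + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        error = squared_error(points[:i]) + squared_error(points[i:])
        if best is None or error < best[0]:
            best = (error, i)
    if best is None:
        return leaf_value
    i = best[1]
    threshold = (points[i - 1][0] + points[i][0]) / 2
    return (threshold,
            build_tree(points[:i], max_depth - 1, min_leaf_samples),
            build_tree(points[i:], max_depth - 1, min_leaf_samples))


def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def cross_val_mse(points, params, k):
    """Average validation mean squared error of the tree over k folds."""
    folds = [points[i::k] for i in range(k)]
    errors = []
    for i, validation in enumerate(folds):
        training = sorted(p for j, fold in enumerate(folds) if j != i for p in fold)
        tree = build_tree(training, **params)
        errors.append(mean([(predict(tree, x) - y) ** 2 for x, y in validation]))
    return mean(errors)


def grid_search(points, grid, k=4):
    """Try every hyperparameter combination and return the best one."""
    names = list(grid)
    best_params, best_error = None, float("inf")
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        error = cross_val_mse(points, params, k)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# Synthetic data: a noisy step, so a tree needs at least one split to fit it.
rng = random.Random(0)
points = []
for _ in range(120):
    x = rng.random()
    points.append((x, (1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.1)))

grid = {"max_depth": [0, 1, 3, 6], "min_leaf_samples": [1, 5, 20]}
best_params, best_error = grid_search(points, grid)
print(best_params, best_error)
```

A depth of 0 cannot split at all and underfits badly, which the cross-validated error exposes. Very deep trees with a single sample per leaf start chasing the noise instead. The search simply picks whichever combination scores best on data the tree was not trained on.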

Testing the model

We now have a trained model that generalizes best across the training and validation datasets, so it’s time for a test. For Supply Traffic Filter, as we wanted to filter out as many no-bid auctions as possible without losing predictive power, we used two KPIs to measure model performance:

Recall rate: How reliably the model identifies valuable transactions, that is, the share of auctions that would have received a bid that the model lets through. It is OK to let through irrelevant auctions, but not to filter out auctions that would actually have received a bid.
Auction reduction rate (predicted condition negative rate): This is the percentage of auctions the model is dropping, or the server resource savings we can expect.
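Both KPIs are simple ratios. The sketch below computes them on ten made-up auctions; the variable names and the data are hypothetical and only illustrate the definitions above.

```python
def recall_rate(would_bid, let_through):
    """Share of bid-receiving auctions that the model let through."""
    valuable = [kept for bid, kept in zip(would_bid, let_through) if bid]
    return sum(valuable) / len(valuable)


def auction_reduction_rate(let_through):
    """Share of all auctions that the model dropped."""
    dropped = sum(1 for kept in let_through if not kept)
    return dropped / len(let_through)


# Ground truth: True if the auction would have received a bid.
would_bid = [True, True, True, True, False, False, False, False, False, False]
# Model decision: True if the model let the auction through.
let_through = [True, True, True, False, True, False, False, False, False, False]

recall = recall_rate(would_bid, let_through)
reduction = auction_reduction_rate(let_through)
print(recall, reduction)
```

In this toy example the model drops 60% of auctions but also loses one of the four valuable ones, so its recall is only 75%. The two numbers pull against each other, which is why we track both.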

We created a model capable of dropping a double-digit percentage of auctions with at least 99% recall rate. This is noteworthy because it means we can identify a large number of unproductive auctions before they take place with near-zero impact to our customers.

Now, depending on the market, the characteristics of the data can rapidly change over time. What is valuable today may become non-valuable tomorrow, and vice-versa. ML engineers must always define the retraining frequency, especially if training is done in batches on offline servers. If a model’s predictive power decreases rapidly, we recommend dropping features that are sensitive to market changes.

Step 5: Go live

We have now built a model with great predictive power that generalizes well to the data. We are steps away from enabling it in production.

First, we need to build a pipeline to collect unbiased training data. Once the model is live in production, we want to prevent it from biasing its own training data (and baking in erroneous assumptions), so this data should be independent of the model’s effects. In the Supply Traffic Filter project, once the model starts filtering, the outcome of an auction depends on the prediction: a filtered auction never gets the chance to receive a bid. To remove this training bias, we created a separate pipeline that bypasses our filtering process. The bypassed auctions are then used for training new, future models.
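The bypass idea can be sketched as a small routing decision per auction. This is an illustration under stated assumptions: the 1% `BYPASS_RATE`, the score threshold, and the function name are all hypothetical, and random sampling is just one simple way to pick which auctions skip the filter.

```python
import random

BYPASS_RATE = 0.01  # assumption: share of traffic that skips the filter


def route_auction(score, threshold, rng):
    """Return (processed, is_training_sample) for one auction."""
    if rng.random() < BYPASS_RATE:
        # Bypassed: always processed, so its outcome is not shaped by the model.
        return True, True
    # Normal path: the model's score decides whether the auction runs.
    return score >= threshold, False


rng = random.Random(0)
processed = 0
training_samples = 0
for _ in range(100_000):
    score = rng.random()  # stand-in for the model's predicted relevance
    was_processed, is_sample = route_auction(score, threshold=0.2, rng=rng)
    processed += was_processed
    training_samples += is_sample

print(processed, training_samples)
```

Because the bypass decision is made before the score is consulted, the training samples include auctions the model would have dropped, which is exactly the data a future model needs in order to correct the current one.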

Lastly, we need to monitor model performance in real time. A sudden event can, for instance, trigger a market shift. When this happens, a new model may need to be redeployed on short notice. We use the bypassed auction traffic to monitor model performance in real time. This allows us to track model degradation and take immediate action if we see unexpected or problematic behavior.
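Monitoring on bypassed traffic can be as simple as a rolling recall with an alert threshold. The sketch below is a hedged illustration: the `RecallMonitor` class, the window size, and the alerting rule are assumptions, with the 99% threshold borrowed from the recall target mentioned earlier.

```python
from collections import deque


class RecallMonitor:
    """Rolling recall over recent bypassed auctions that received a bid."""

    def __init__(self, window=1000, alert_below=0.99):
        # Each entry is True if the model would have let the auction through.
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, received_bid, model_would_keep):
        # Only auctions that actually received a bid count toward recall.
        if received_bid:
            self.outcomes.append(model_would_keep)

    def recall(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        current = self.recall()
        return current is not None and current < self.alert_below


monitor = RecallMonitor(window=100, alert_below=0.99)

# Normal market: the model would have kept every auction that got a bid.
for _ in range(100):
    monitor.record(received_bid=True, model_would_keep=True)
healthy = monitor.degraded()

# Market shift: the model now wants to drop auctions that do receive bids.
for _ in range(5):
    monitor.record(received_bid=True, model_would_keep=False)
shifted = monitor.degraded()

print(healthy, shifted, monitor.recall())
```

Since bypassed auctions always run, we can compare what the model would have done with what actually happened, and raise a flag as soon as the two drift apart.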

Conclusion

Congratulations, you’re operational!

While there is much work involved, implementing the right machine learning development process means you can benefit from your own data in new ways. Machine learning remains a high priority for our engineers at Index, and we encourage you to spearhead new ML projects yourself!

Taihwa Song

Principal software engineer

Taihwa Song is a principal software engineer at Index Exchange focused on machine learning (ML) operations. Currently, he is building ML pipelines for data ingestion and model serving, and looks forward to optimizing model deployment paths to allow smoother integration of ML products with existing functions.
