
An AI bloodstock agent?

THIS YEAR has seen the arrival of AI technology into the mainstream and even those of us without a computer-literate brain can understand its potential, but can AI’s tendrils stretch to the world of bloodstock? Could computers really help buyers pick those yearlings who will become the next racecourse superstars?

Byron Rogers of Performance Genetics has spent a lot of time, effort, finance and brain power building a machine-learning application; he certainly believes technology has its place and sees the computer as a vital “co-pilot” for the production of a short-list of selections at a yearling sale.

In the first of a two-part series taken from his business blog, Rogers explains how he went about building his AI-based bloodstock agent

I HAVE BEEN WORKING at Performance Genetics for nearly a decade now. We’ve found many elite racehorses, faced some bitter disappointments, and learned valuable lessons from the data we have collected.

During that time we’ve explored pedigree data, biomechanics and kinematics, cardiovascular parameters, and DNA, continually refining our predictive approach based on the data we have collected and the outcomes on the racetrack.

The ground truth from that data is this – most horses are predictably slow, but the truly exceptional ones stand out. Our journey has brought us closer to understanding which variables are most critical in predicting racehorse performance.

The challenge of selecting yearlings

When we started to look at how to use data to better select yearlings, it took me, at least, a little while to understand what we were trying to do, or more specifically what we were trying to take advantage of.

It turns out that what we are doing is trying to exploit the learning weakness that occurs with every single person when they go to a yearling sale and try to buy what they hope will subsequently become an elite racehorse.

Consider this:

• About three out of every 10 yearlings anyone looks at at a sale don’t get to the racetrack. Those three don’t give you any meaningful information to form your selection process on (other than that they might have a conformation defect that meant racing wasn’t possible). As humans, we are unaware of this outcome when we observe a horse, so can we discard the input from those unraced horses as useless? Answer – we can’t.


• A further one out of 10 yearlings doesn’t make more than three lifetime starts, so they also fail to yield sufficient information on which to base a reasonable assumption of their potential ability.

• So only six of 10 yearlings at sales will give you information to learn from. How do you know which six do and which four don’t? (the answer is, you don’t know)

In addition to the imperfect samples to learn from at a sale we also have three further issues which make things even more difficult for us:

1. Retention of visual information

It’s nearly impossible to remember exactly what a yearling looked like and match that to an outcome two years later, especially when evaluating thousands of horses annually.

You might remember the odd one or two good ones each year, but the retention of all information is poor, especially of the examples that help you the most.

2. Bias in memory

We tend to remember those true positives (horses we liked who turned out to be good) and discount the difference between true positives and false positives (horses we thought would be elite but were in fact slow).

We also fail to remember false negatives (horses we thought would be slow but were fast).

The false positives and false negatives are the ones we learn the most from but we don’t recall them well enough.

3. Overwhelmed by negative cases

The truly elite racehorses make up only three to six per cent of all horses at a sale. Our brains are overwhelmed by the negative cases, making identification of the elite horse more difficult.

Don’t worry – this is the same problem most machine-learning models have when there is that type of imbalance in the dataset, unless they are specifically trained to find the rare case (as in credit card fraud detection).

Given the above (and setting aside the trainer’s effect on outcomes), it is no wonder that even those considered the best judges on the planet are striking at 12 per cent accuracy!

The role of AI in racehorse selection

Given these challenges, we realised that what we were building would act like a co-pilot at the yearling sales.

By knowing all the subsequent racetrack outcomes as the data matured and building predictive models iteratively, the models could, if fed data consistently, overcome human limitations, namely:

• The computer doesn’t “forget” what a horse looks like. It has the data for both good and bad horses.

• It wouldn’t be confused by horses that provide no data.

• It would learn from both false negatives (horses it thought were slow but were fast) and false positives (horses it thought were fast but were slow).

Building an AI/machine learning application

In mid-2019, I was approached by Google to beta test what was to become Vertex AI. Ultimately, I wanted to develop an application that could:

• Easily scale to handle tens of thousands of records.

• Be fairly “hands-off” without requiring much manual editing.

• Improve its predictive power over time as more horses that had their data collected as yearlings aged into the dataset.

I had been using Google’s custom video-recognition models for the year prior and that allowed me to train a model that predicted how “good” a cardio was [see Byron’s blog post on www.performancegenetics.com], but the Vertex AI platform was more what I was looking for – something that could be used as the backbone for a completely managed end-to-end application.

Google subsequently launched Vertex AI in May 2021 and, starting in June 2021 (thanks, Covid), I did a complete re-write of the Performance Genetics platform.

I started from the basics – getting the data in, training models, refining models based on which features/variables were found to be important, and expanding the dataset to be as large as I could make it.

It has been a two-year project to get what is now internally known as Velox (Latin for swift or rapid) up and running at the scale and level of operation I was after – it is now battle ready.

Much of what is discussed in this article will encapsulate the “secret sauce” on how we developed Velox and how we now go about racehorse selection at Performance Genetics. So let’s begin:

Understanding the problem – how do we try to define and predict elite horses?

To begin the process of trying to create a set of models to help us predict elite racehorses, we had to start by defining how we wanted to approach the problem, and how we planned to overcome some of the issues that come up along the way.

In data science terms there were two options available:

1. Binary classification task

We could set the task up as a binary classification where the goal is to classify instances into one of two possible classes – elite or non-elite racehorses.

2. Regression task

We could set it up as a regression task where the goal is to predict a continuous numerical value. So, we could take a rating of a horse (Timeform, Equibase, etc) and try to build models that predict the rating that a horse could achieve.

It’s crucial to identify whether the problem is a binary classification or regression task at the outset, as this will guide our choice of algorithms, evaluation metrics, and overall approach and outcomes.
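As a minimal sketch of the two framings – assuming a pandas DataFrame with a numeric rating column, and with the cutoff value purely illustrative – the only real difference is how the target is constructed:

```python
import pandas as pd

ELITE_THRESHOLD = 3.0  # illustrative cutoff for "elite"

def regression_target(df: pd.DataFrame) -> pd.Series:
    """Regression framing: predict the rating itself."""
    return df["rating"]

def classification_target(df: pd.DataFrame) -> pd.Series:
    """Binary framing: 1 = elite, 0 = non-elite."""
    return (df["rating"] >= ELITE_THRESHOLD).astype(int)
```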

We went around this first problem a few times and each of the approaches has its positives and negatives, but in the end we settled on a binary classification.

First, we did try it as a regression problem and tried to predict the exact rating of a horse (we used Timeform ratings to start with as we could get ratings in Europe and the US) but found that it was really difficult from a data science viewpoint as we ran into two major issues.

1. In a general commercial population of yearlings and their subsequent racetrack performance, the distribution of Timeform ratings is heavily skewed.

Most of the ratings are clustered around 50-70 with only a few yearlings getting to 100+ ratings. While we tried methodologies to address the imbalance within the dataset, even after doing this the models did not perform well for those outliers because they optimise for the majority distribution.

2. It also suffered from what is known as heteroscedasticity.

The models consistently predicted more accurately (less error) for lower-rated horses and less accurately for higher-rated horses.

I think that this comes down to the fact that when you look at a large population of horses, in performance/effort terms there isn’t that much difference between a 90-rated horse that is a good allowance/handicapper, and a 100-rated horse that is a Listed-level horse, but there is a big difference between that 90-rated horse and one rated 110.

Also, those rated 110+ are all different (leaders/backmarkers, Dirt/Turf, peaking as two-year-olds, three-year-olds or older) and a small portion of the dataset so, while there are some core variables that separate out elite and non-elite horses, getting a good number of samples to represent each possibility of elite performance is almost impossible.

Said more plainly, the regression models were really good at predicting the average horse correctly, but couldn’t predict the elite horses as accurately as they should.
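To make that failure mode concrete, here is a toy numpy illustration (nothing to do with our actual models): even the error-minimising constant prediction, the mean, has small errors on the bulk of ordinary horses and large errors on the rare 100+ ones.

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed "ratings": most clustered around 50-70, with a thin 100+ tail.
ratings = np.clip(rng.lognormal(mean=4.1, sigma=0.25, size=10_000), 30, 140)

prediction = ratings.mean()          # what MSE-optimised models drift toward
errors = np.abs(ratings - prediction)

bulk = errors[ratings < 100]
elite = errors[ratings >= 100]
print(f"mean abs error, sub-100 horses: {bulk.mean():.1f}")
print(f"mean abs error, 100+ horses:    {elite.mean():.1f}")  # far larger
```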

I think this is a major issue that competitors in our space who are trying to predict racehorse outcomes with regression models have failed to overcome.

Given the above, we then developed our solution as a binary classification task but, to understand that properly, we first need to discuss our own worldwide rating – how we built it and how we determined who was elite (1) and non-elite (0) in our datasets.

Creating a worldwide rating – the challenge of an international dataset

One of the other challenges we had is finding a rating that we could use to accurately describe performance. As I said earlier we initially used Timeform ratings. The reason for this is that, generally speaking, I have found the ratings are an accurate measure of performance and they are a metric that can be found in both Europe and the US and have a similar basis of comparison (so European Timeform ratings are roughly calibrated with US Timeform).

After using Timeform for a few years it became logistically very unwieldy to use. Not because the rating was wrong, rather:

1. We had a large portion of the data set racing in Australia/New Zealand/Asia that Timeform didn’t cover.

2. As the data set expanded it became very difficult to manually look up and insert ratings into the database. I know, I should have set up an API call to Timeform to update it automatically, but it is two companies (the US arm is separate from the UK one) and the lack of coverage in all racing jurisdictions mentioned above was the primary concern.

So, eventually we had to go about creating our own rating. To do this we partnered with a database that gets worldwide racing results and started to think about how to create a rating that reflected performance in every country that races around the world.

How did we do that?

It would seem simple to use the Pattern/graded stakes system as it is a worldwide structure with some boundaries on the status given to each race, so Group and Grade 1 races should be roughly comparable worldwide, but we also needed a way to rate the majority of horses that didn’t race in group races.

That part was more difficult as it is hard to compare allowance level races in the US, for example, with a benchmark 64 in Australia.

While some might say it is imperfect, we settled on a prize-money based rating as the best that was available.

Generally speaking prize-money is quite well distributed relative to racing class in each country, with better races having higher prize-money values and lesser races having less. There are some problems with races such as The Everest in Australia that skew things a little, but there are not many of them and those issues can be overcome.

First, we “normalised” prize-money per race. We took the full prize-money value of a race and distributed it proportionally across all runners, so even a horse finishing 10th of 10 got some prize-money from that race assigned to it for that performance.

This smoothed out the distribution of money somewhat across all horses that competed in races, but didn’t diminish the achievement of winning the race.
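As an illustration only (the sharing weights below are not our actual scheme), here is a minimal Python sketch in which every runner’s share declines with finishing position but never reaches zero:

```python
def distribute_purse(total_purse: float, field_size: int) -> list[float]:
    """Return a normalised prize-money share for each finishing position."""
    # Illustrative weights: 1, 1/2, 1/3, ... for positions 1..n, so the
    # winner earns the most and the last horse still earns something.
    weights = [1.0 / pos for pos in range(1, field_size + 1)]
    scale = total_purse / sum(weights)
    return [w * scale for w in weights]

shares = distribute_purse(total_purse=100_000, field_size=10)
print([round(s) for s in shares])  # 10th of 10 still gets a non-zero share
```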

After that, we then created an index. We did this by comparing:

• Horses of the same sex (i.e. only males compared to males and females compared to females)

• Horses of the same year of birth

• Horses racing the same year

• Horses racing in the same countries

So, for example, if Horse A is a filly born in 2019 who raced in 2023 in England, we get all the prize-money earned by horses in the same cohort as Horse A (fillies born in 2019 that raced in England in 2023) and find the average prize-money earned for that subset.

That value is given an index of 1.0.

If Horse A has earned three times the average it will have an index of 3.0. If she has earned 50 per cent of the average it will have an index of 0.5.
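A minimal pandas sketch of that cohort grouping, with illustrative column names, looks like this:

```python
import pandas as pd

def add_race_index(df: pd.DataFrame) -> pd.DataFrame:
    """Index each horse's earnings against its cohort average (= 1.0)."""
    cohort = ["sex", "year_of_birth", "year_raced", "country"]
    cohort_avg = df.groupby(cohort)["normalised_earnings"].transform("mean")
    df["race_index"] = df["normalised_earnings"] / cohort_avg
    return df
```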

Country weights

Once we have the value for each horse, we then had to weight it by the country that the horse performed in.

We had to do this because, left unweighted, you would see some absolutely silly results in small countries, where a high-class horse has astronomically high numbers compared to others but has been beaten out of sight when it competed in other racing jurisdictions.

A horse with a high unweighted race rating in, for example, Hungary, starts off with a similar value to one in France, but we weight the country so France is higher than Hungary and the adjusted rating more accurately reflects the actual level of performance.

Getting this weighting right took some time as we had to study horses that had raced in multiple countries and look at their relative performance.

We used some pairwise ranking techniques to make sure it was right (there is a good R package named SportR for those who are interested).

For example, when you consider all horses racing in all jurisdictions the data would suggest that the performances of horses in Argentina are roughly 60 per cent of the value of a performance in Australia.

This doesn’t mean that the best horses bred in Argentina aren’t good, it means that the average horse there isn’t as good as the average horse in Australia.

A lot of the smaller racing nations got a weight of zero (0) which meant that no matter how good the horse was in that country, its adjusted race rating was zero.

That might seem unfair, but there was enough evidence to show that even the best horses coming out of lower tier countries to compete in other countries were getting well beaten.
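Applying the weights is then a simple lookup. In this minimal sketch the weight values are placeholders rather than our actual figures, and unlisted countries default to zero, as described above:

```python
# Placeholder weights for illustration only: Argentina came out at roughly
# 60 per cent of an Australian performance; actual values are not published.
COUNTRY_WEIGHTS = {
    "AUS": 1.0,
    "ARG": 0.6,
    # ... larger jurisdictions omitted; smallest nations are left out entirely
}

def adjusted_rating(raw_index: float, country: str) -> float:
    """Weight the raw cohort index by racing jurisdiction.

    Countries not in the table default to 0.0, mirroring the zero weight
    assigned to the smallest racing nations.
    """
    return raw_index * COUNTRY_WEIGHTS.get(country, 0.0)
```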

There were some issues with countries such as South Africa/Zimbabwe and South Korea, whose racing is relatively isolated from outside competition (there is not much cross-pollination of horses with other racing jurisdictions), but after producing all the data, looking at what weights were assigned to each country, looking at the distributions of adjusted race ratings and consulting people such as Alan Porter, I believe we have it broadly right.

The result of all this is a Race Rating index that runs from as low as 0.01 to over 300 (for what it is worth, the highest rating we had when this was written is for an Argentine-bred but US-raced Candy Ride at 307.30), and which can be calculated and applied for any horse racing in any country.

It allows us to automate the process of assigning class to a horse without worrying where the horse ends up racing after we get data on it as a yearling.

Once we had this worldwide rating, we then went back and matched the Race Rating against Timeform Ratings and the Pattern/Graded system to verify its worth. What we found was:

1. A Race Rating of 3.0 on our rating is roughly equivalent to a horse that runs 100 Timeform rating – a Listed level horse.

2. About 4.5 per cent of all horses sold will run a Timeform of 100 or greater. About 6.5 per cent of all horses by the top one per cent of stallions will win any stakes race (Listed or better).

Horses we measured as yearlings that end up with a Race Rating of 3.0 or greater make up six per cent of our database, so it roughly aligns with Timeform (4.5 per cent) and with foals-to-stakes winners for the top one per cent of all stallions (6.5 per cent).

3. For horses with yearling measurements, the top two and a half per cent of Race Ratings in our database is a rating of 4.7 or better.

That rating equates to a Group/Graded stakes winner in our database.

Around 2.3 per cent of offspring by the top one per cent of stallions win a Graded/Group race, so if you look at the truly elite horses (the top 2.5 per cent), the database is broadly reflective of a commercial population.
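That calibration can be re-checked automatically whenever the database updates; a minimal sketch (with an illustrative column name) is:

```python
import pandas as pd

def calibration_check(df: pd.DataFrame) -> None:
    """Print the share of horses above each Race Rating landmark."""
    listed_plus = (df["adjusted_rating"] >= 3.0).mean()   # expect ~6 per cent
    group_grade = (df["adjusted_rating"] >= 4.7).mean()   # expect ~2.5 per cent
    print(f"Rating >= 3.0 (Listed level or better): {listed_plus:.1%}")
    print(f"Rating >= 4.7 (Group/Graded level):     {group_grade:.1%}")
```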

Building the data set – getting the balance

Once we had the rating sorted out, we also created an API (a programmable interface that can ping the results database for new information on command) so that every month, every horse in our database is first checked to see if it is 1,250 days old (three and a half years old) and, if so, updated with any changes to:

• Its name

• Its raw Race Rating (which we then convert with the country weight)

• Number of starts

• Date of last start

• Average distance raced

• Country it raced in (to lookup the country weight and convert the Race Rating)

We also use the average distance raced to form some basic distance categories (5-7f, 6-8f, 8-10f, 10f+) so we can predict optimal distance in addition to racing class.

This process allows us to build a new data set each month for any model that we are training, with an ever expanding number of horses, to retrain the models on.

Rogers believes that machine learning can complement the human eye

Depending on the time of year, in a given month we can have over 500 new records that age into the data set by turning 1,250 days old.

Additionally, in that month about 10 per cent of our database will have their starts updated to the point where they have at least three lifetime starts, so they can be used in a model data set, and a further 60 per cent of the records will have changes to their number of starts and Race Rating. This process starts to fulfil our desire for an automated system that is more “hands-off”.

The data sets change a fair bit each month, giving the models more information to learn from. We build the model datasets by:

• Finding records where the horse is at least 1,250 days old

• Finding records where the horse has had three or more starts

This means that any data set the models are asked to learn from contains only horses that have had the opportunity to make at least three lifetime starts and are three and a half years of age or older.
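In code, that eligibility filter is a one-liner; a minimal pandas sketch with illustrative column names:

```python
import pandas as pd

def model_ready(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only horses old enough, and raced often enough, to learn from."""
    return df[(df["age_days"] >= 1250) & (df["starts"] >= 3)]
```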

On average it takes a horse three starts to break its maiden, and seven starts to win a stakes race.

In our database the average number of starts by a horse that is included in any data set for the models to learn from is 15.21 with one record having 121 starts!

The API keeps updating the records every month until a record hasn’t had a change in its date of last start for 1,000 days (so, once the horse hasn’t raced for roughly three years), at which point it stops requesting the update as we presume the horse is retired.
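A minimal sketch of that monthly refresh logic, assuming a hypothetical fetch_latest() wrapper around the results-database API (the names here are illustrative, not Velox’s actual code):

```python
from datetime import date, timedelta

RETIRED_AFTER = timedelta(days=1000)  # no new start for roughly three years

def fetch_latest(horse_id: str) -> dict:
    """Hypothetical stand-in for the real results-database API call."""
    raise NotImplementedError("replace with the actual API client")

def refresh_record(record: dict, today: date) -> None:
    """Monthly update for one horse's record (sketch only)."""
    # Only horses at least 1,250 days old (about three and a half years)
    # are checked for updates.
    if (today - record["date_of_birth"]).days < 1250:
        return
    # Stop polling once the horse looks retired.
    last = record.get("date_of_last_start")
    if last is not None and (today - last) > RETIRED_AFTER:
        return
    latest = fetch_latest(record["horse_id"])
    record.update(
        name=latest["name"],
        raw_race_rating=latest["raw_race_rating"],  # weighted later by country
        starts=latest["starts"],
        date_of_last_start=latest["date_of_last_start"],
        avg_distance=latest["avg_distance"],        # also drives distance buckets
        country=latest["country"],
    )
```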

Building a data set

Once we have all the ratings for all the horses that have three or more starts, we then get down to the more important part of how we construct the data set for the models to learn from.

There were quite a few considerations to take into account when constructing the data sets and, to be honest, a lot of this was trial and error: we created the data sets, trained a model, and then tested it on unseen samples to examine its performance.

There was a lot of cost involved in this part but we found there were a few things to consider:

1. When building a data set, the size of the sample set plays a crucial role. Too few samples can lead to “underfitting”, where the model might not learn the underlying patterns of the data.

Conversely, too many samples without adding new information might lead to “overfitting”, where the model becomes too specific to the training data and performs poorly on new, unseen data. Intuitively, a larger data set generally provides a more comprehensive representation of the underlying distribution of the data.

However, the law of diminishing returns applies; after a certain point, adding more data may not lead to significant improvements in model performance.

2. Binary classification problems involve predicting one of two possible classes, in our case “elite” or “non-elite”.

The distribution of these classes in the database can significantly influence the performance of the model.

In our real-world database of all records, the prediction target – elite – is imbalanced and significantly underrepresented.

If we take all data as our data set, 95 per cent will be “non-elite” samples and five per cent “elite” samples; it is quite easy for a model to achieve 95 per cent accuracy by merely predicting “non-elite” all the time.

It’s like a bloodstock agent or trainer going to the sales and saying “this is a slow horse” about every horse in the sale: they are going to be right 95 per cent of the time, but they are no use at all when it comes to buying a fast one!
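A toy illustration of the trap in a few lines of numpy:

```python
import numpy as np

y_true = np.array([0] * 950 + [1] * 50)  # 95% non-elite, 5% elite
y_pred = np.zeros_like(y_true)           # "model" always predicts non-elite

accuracy = (y_pred == y_true).mean()
elite_found = int(y_pred[y_true == 1].sum())
print(f"accuracy: {accuracy:.0%}, elite horses identified: {elite_found}")
# -> accuracy: 95%, elite horses identified: 0
```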

We looked at different ways to tackle these two problems and built a lot of different data set sizes; we looked at splits of 60:40 non-elite:elite, 70:30, and other percentages.

We also looked at oversampling the minority class by resampling the elite samples, and at synthetically creating elite samples. We found the following:

1. As we found with the regression modelling we tried earlier, using samples of horses whose performance rating is too close to what is defined as elite is a mistake. The binary models can’t find information in the shape of the data to discriminate between those that are elite and those that are nearly elite.

2. To capture enough variation in the training samples to reflect the samples that are not included in the data set the model is trained on, you need at least 750, preferably 1,000, samples per class.

The reward for more samples caps out at about 1,500 samples as beyond that the models start to overfit the data.

3. It is best to get a 50:50 balance between elite and non-elite samples. A balanced data set provides a far more effective set to learn from and generalises well to samples not found in the data set.

So, if the maximum number of samples you need per class is 1,500, a data set of 3,000 in total is enough.

4. While 3,000 seems a small number, given that elite horses – those that rate 3.0 or better – are just six per cent of our database, and we need at least 750 elite samples (preferably 1,500), we had to get at least 12,500 samples overall.

This isn’t just 12,500 samples of horses, it is 12,500 samples of horses that have had three or more starts, so you actually need to get ~20,000 samples in the database to have at least 750 elite samples to use in a model.

There is a difference between what data you have, and what data you need, to collect the right number of samples for a data set.

5. As the models never learn off the full data set – rather they learn off the elite horses compared to the most non-elite – we use the Race Rating itself to weight each row of data, so that the model treats the highest-scoring elite horse a little more importantly than the lowest-scoring elite horse.

Likewise, the data for the worst horse in the data set is more important to learn from than the data for those that are less bad. This weighting of the rows is important (a sketch pulling these findings together follows below).
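As a rough sketch that pulls findings 1, 3 and 5 together – the cutoff and sample sizes mirror the numbers quoted above, while the near-elite buffer value is an assumption for illustration, not Velox’s actual implementation:

```python
import pandas as pd

ELITE_CUTOFF = 3.0        # elite = Race Rating of 3.0 or better
NEAR_ELITE_BUFFER = 2.0   # illustrative: exclude the "nearly elite" band
PER_CLASS = 1500          # returns cap out around 1,500 samples per class

def build_training_set(df: pd.DataFrame, seed: int = 7) -> pd.DataFrame:
    """Assemble a balanced, row-weighted data set (sketch only)."""
    elite = df[df["adjusted_rating"] >= ELITE_CUTOFF]
    # Drop the near-elite band: the models can't discriminate elite from
    # nearly elite, so the non-elite class uses clearly slower horses.
    non_elite = df[df["adjusted_rating"] < NEAR_ELITE_BUFFER]

    n = min(PER_CLASS, len(elite), len(non_elite))   # 50:50 balance
    sample = pd.concat([
        elite.sample(n, random_state=seed),
        non_elite.sample(n, random_state=seed),
    ])
    sample["label"] = (sample["adjusted_rating"] >= ELITE_CUTOFF).astype(int)
    # Weight rows by the rating itself: the best elite horses and the very
    # worst non-elite horses matter most. Distance from the cutoff is one
    # simple way to express that (with a floor to keep weights positive).
    sample["row_weight"] = (sample["adjusted_rating"] - ELITE_CUTOFF).abs() + 0.1
    return sample
```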

Building the data sets and getting the balance right took a lot of time, effort and money but, as you will see in the next issue, it was worth it.
