Hypothesis, if the crowd drives the odds pricing, how able is the crowd to determine a wining horse (the horse with the shortest odds i.e. 2 to 1)?
Assumptions
Created by Clive Bird January 2022.
bookmaker <- "Bet365"
racecourse <- "Chelmsford City"
odds <- "at 2.0 to 1"
model_paramter_tile <- "Wins"
# data vector represents all the race venue i.e horse race course.
# 1 = Win, 0 = Not Win
data = c(0,0,0,1,1,1,1,1,0,1,0,1,0,1,0,1,0,1,0,1,1,0,1,0,1)
# Actual race data shown below.
#Racecourse EventDateTime HorseName Odds Outcome WinBoolean FinalUnixTime
#Chelmsford City 2020-06-08 14:30:00.000 MARIA ROSA 2 LOSE 0 1591626600
#Chelmsford City 2020-06-17 17:50:00.000 NIGHT MOMENT 2 PLACED 0 1592416200
#Chelmsford City 2020-08-22 17:55:00.000 BRANWELL 2 LOSE 0 1598118900
#Chelmsford City 2020-10-08 19:30:00.000 INDIGO TIMES 2 WIN 1 1602185400
#Chelmsford City 2020-10-10 16:08:00.000 HIGHFIELD PRINCESS 2 WIN 1 1602346080
#Chelmsford City 2020-10-10 17:25:00.000 BULLACE 2 WIN 1 1602350700
#Chelmsford City 2020-11-12 17:00:00.000 TRUMBLE 2 WIN 1 1605200400
#Chelmsford City 2020-11-12 18:30:00.000 ABSOLUTE SCENES 2 WIN 1 1605205800
#Chelmsford City 2020-11-23 16:45:00.000 ORIENTALISM 2 PLACED 0 1606149900
#Chelmsford City 2020-11-27 19:45:00.000 FORTUNE FINDER 2 WIN 1 1606506300
#Chelmsford City 2020-12-17 16:55:00.000 MELODY OF LIFE 2 PLACED 0 1608224100
#Chelmsford City 2021-02-04 18:30:00.000 ELECTRIC BLUE 2 WIN 1 1612463400
#Chelmsford City 2021-02-18 20:30:00.000 COZONE 2 PLACED 0 1613680200
#Chelmsford City 2021-03-04 19:50:00.000 SERGEANT MAJOR 2 WIN 1 1614887400
#Chelmsford City 2021-03-18 17:55:00.000 SHOW ME A SUNSET 2 LOSE 0 1616090100
#Chelmsford City 2021-03-18 19:55:00.000 BELLISSIME 2 WIN 1 1616097300
#Chelmsford City 2021-08-10 20:20:00.000 MIDFIELD 2 LOSE 0 1628626800
#Chelmsford City 2021-08-21 21:00:00.000 MAHANAKHON POWER 2 WIN 1 1629579600
#Chelmsford City 2021-09-02 19:55:00.000 MOUNT MARCY 2 LOSE 0 1630612500
#Chelmsford City 2021-09-02 20:25:00.000 THE VEGAS RAIDER 2 WIN 1 1630614300
#Chelmsford City 2021-09-09 17:30:00.000 HARB 2 WIN 1 1631208600
#Chelmsford City 2021-09-09 17:30:00.000 IKHTIRAAQ 2 PLACED 0 1631208600
#Chelmsford City 2021-09-25 20:30:00.000 LA ROCA DEL FUEGO 2 WIN 1 1632601800
#Chelmsford City 2021-10-07 18:30:00.000 FLORA FINCH 2 PLACED 0 1633631400
#Chelmsford City 2021-10-07 20:30:00.000 LINDWALL 2 WIN 1 1633638600
install.packages('dplyr')
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.1/dplyr_1.0.7.zip'
Content type 'application/zip' length 1345624 bytes (1.3 MB)
downloaded 1.3 MB
package ‘dplyr’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Administrator\AppData\Local\Temp\2\Rtmp4sIJS4\downloaded_packages
install.packages('gridExtra')
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.1/gridExtra_2.3.zip'
Content type 'application/zip' length 1109502 bytes (1.1 MB)
downloaded 1.1 MB
package ‘gridExtra’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Administrator\AppData\Local\Temp\2\Rtmp4sIJS4\downloaded_packages
install.packages('tidyverse')
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.1/tidyverse_1.3.1.zip'
Content type 'application/zip' length 430268 bytes (420 KB)
downloaded 420 KB
package ‘tidyverse’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Administrator\AppData\Local\Temp\2\Rtmp4sIJS4\downloaded_packages
install.packages('ggridges')
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.1/ggridges_0.5.3.zip'
Content type 'application/zip' length 2259262 bytes (2.2 MB)
downloaded 2.2 MB
package ‘ggridges’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Administrator\AppData\Local\Temp\2\Rtmp4sIJS4\downloaded_packages
install.packages('ggExtra')
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.1/ggExtra_0.9.zip'
Content type 'application/zip' length 381276 bytes (372 KB)
downloaded 372 KB
package ‘ggExtra’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Administrator\AppData\Local\Temp\2\Rtmp4sIJS4\downloaded_packages
library( dplyr )
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library( ggplot2 )
library( gridExtra )
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
library( tidyverse )
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages ------------------------------------ tidyverse 1.3.1 --
v tibble 3.1.6 v purrr 0.3.4
v tidyr 1.1.4 v stringr 1.4.0
v readr 2.1.1 v forcats 0.5.1
-- Conflicts --------------------------------------- tidyverse_conflicts() --
x gridExtra::combine() masks dplyr::combine()
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library( ggridges )
library( ggExtra )
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
library( lattice )
library( RColorBrewer )
chart_title <- paste(racecourse, " using ", sep=" ")
chart_title <- paste(chart_title, bookmaker, sep=" ")
chart_title <- paste(chart_title, odds, sep=" ")
#The prop_model function
# This function takes a number of successes and failure coded as a TRUE/FALSE
# or 0/1 vector. This should be given as the data argument.
# The result is a visualization of the how a Beta-Binomial
# model gradually learns the underlying proportion of successes
# using this data. The function also returns a sample from the
# posterior distribution that can be further manipulated and inspected.
# The default prior is a Beta(1,1) distribution, but this can be set using the
# prior_prop argument.
# Make sure the packages tidyverse and ggridges are installed, otherwise run:
# install.packages(c("tidyverse", "ggridges"))
# Example usage:
# data <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
# prop_model(data)
prop_model <- function(data = c(),
prior_prop = c(1, 1),
n_draws = 1000000,
gr_name="Proportion graph",
model_paramter_tile,
chart_title
) {
data <- as.logical(data)
# data_indices decides what densities to plot between the prior and the posterior
# For 20 datapoints and less we're plotting all of them.
data_indices <- round(seq(0, length(data), length.out = min(length(data) + 1, 40)))
# dens_curves will be a data frame with the x & y coordinates for the
# denities to plot where x = proportion_success and y = probability
proportion_success <- c(0, seq(0, 1, length.out = 100), 1)
dens_curves <- map_dfr(data_indices, function(i) {
value <- ifelse(i == 0, "Prior", ifelse(data[i], "Win", "Not Win"))
label <- paste0("n=", i)
probability <- dbeta(proportion_success,
prior_prop[1] + sum(data[seq_len(i)]),
prior_prop[2] + sum(!data[seq_len(i)]))
probability <- probability / max(probability)
data_frame(value, label, proportion_success, probability)
})
# Turning label and value into factors with the right ordering for the plot
dens_curves$label <- fct_rev(factor(dens_curves$label, levels = paste0("n=", data_indices )))
dens_curves$value <- factor(dens_curves$value, levels = c("Prior", "Win", "Not Win"))
graph_label <- paste("Prior likelihood distribution Beta(a =",
as.character(prior_prop[1]),", b =",
as.character(prior_prop[2]),")")
p <- ggplot(dens_curves, aes(x = proportion_success, y = label,
height = probability, fill = value)) +
ggridges::geom_density_ridges(stat="identity", color = "white", alpha = 0.8,
panel_scaling = TRUE, size = 1) +
scale_y_discrete("", expand = c(0.01, 0)) +
scale_x_continuous(model_paramter_tile) +
scale_fill_manual(values = hcl(120 * 2:0 + 15, 100, 65), name = "", drop = FALSE,
labels = c("Prior ", "Win ", "Not Win ")) +
ggtitle(paste0(gr_name, ": ", sum(data), " Win, ", sum(!data), " Not Win"),
subtitle = graph_label) +
labs(caption = chart_title) +
theme_light() +
theme(legend.position = "top")
print(p)
# Returning a sample from the posterior distribution that can be further
# manipulated and inspected
posterior_sample <- rbeta(n_draws, prior_prop[1] + sum(data), prior_prop[2] + sum(!data))
invisible(posterior_sample)
}
# Extract and explore the posterior
posterior <- prop_model(data, model_paramter_tile=model_paramter_tile, chart_title=chart_title)
Warning: `data_frame()` was deprecated in tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
head(posterior)
[1] 0.5471468 0.6108983 0.6465365 0.6157755 0.5881064 0.6000661
hist(posterior, breaks = 30, xlim = c(0, 1), col = "gold")
# Inspect the posterior distribution model's parameters of interest.
# Median is the 'Best guess' point estimate.
# 90% & 95% credible interval (CI)
# Median is the 'Best guess' point estimate.
summary(posterior)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1312 0.4916 0.5571 0.5556 0.6210 0.9041
# Measure the credible interval. I.e. the probability expressed a %, that the model's posterior parameter of interest value will fall between this probability range.
# A 90% probability that the parameter of interest value is between this range. 90% credible interval
quantile(posterior, c(0.05, 0.95))
5% 95%
0.3983478 0.7080927
# A 95% probability that the parameter of interest value is between this range. 95% credible interval
quantile(posterior, c(0.05 - 0.025, 0.95 + 0.025))
2.5% 97.5%
0.3691813 0.7344128
# Calculate the probability that the 'win to not win' ratio is greater than or equal to more than half (50%).
# I.e. The probability I will win more than I loose if I bet on all like odds driven by the crowd using this book maker at this race course and odds combination.
cat("Probability that the 'win to not win' ratio is greater than or equal to 50% is ", sum( posterior >= 0.5 )/length(posterior) )
Probability that the 'win to not win' ratio is greater than or equal to 50% is 0.721641