The objective of the M5 forecasting competition is to advance the theory and practice of forecasting by identifying the method(s) that provide the most accurate point forecasts for each of the 43,204 time series of the competition, as well as the methods that elicit information to estimate the uncertainty distribution of the realized values of these series as precisely as possible.
To that end, the participants of M5 are asked to provide 28-day-ahead point forecasts (PFs) for all the series of the competition, as well as the corresponding median and 50%, 67%, 95%, and 99% prediction intervals (PIs).
The M5 differs from the previous four competitions in four important ways, some of them suggested by the discussants of the M4 competition, as follows:
- First, it uses hierarchical sales data, starting at the product-store level and being aggregated to that of product departments, product categories, stores, and three geographical areas: the States of California (CA), Texas (TX), and Wisconsin (WI).
- Second, besides the time series data, it includes explanatory variables such as sell prices, promotions, day of the week, and special events (e.g. Super Bowl, Valentine’s Day, and Orthodox Easter) that typically affect sales and could be used to improve forecasting accuracy.
- Third, in addition to point forecasts, it assesses the distribution of uncertainty, as the participants are asked to provide information on nine indicative quantiles.
- Fourth, for the first time it focuses on series that display intermittency, i.e., sporadic demand including zeros.
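The nine quantiles mentioned in the third point follow from the median plus the lower and upper bounds of the four requested prediction intervals. The sketch below derives them, assuming each PI is a central (equal-tailed) interval; the function name `interval_quantiles` is only illustrative.

```python
from fractions import Fraction

def interval_quantiles(coverages_pct):
    """Return sorted quantile levels for the median plus central PIs.

    coverages_pct: PI coverages as integer percentages, e.g. [50, 67, 95, 99].
    Assumes every interval is central, i.e. equal probability in both tails.
    """
    levels = {Fraction(1, 2)}  # the median
    for c in coverages_pct:
        levels.add(Fraction(100 - c, 200))  # lower bound of the c% interval
        levels.add(Fraction(100 + c, 200))  # upper bound of the c% interval
    return [float(q) for q in sorted(levels)]

quantiles = interval_quantiles([50, 67, 95, 99])
print(quantiles)
```

Exact fractions are used internally so that the resulting levels are not affected by floating-point subtraction.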
Dates and Hosting
The M5 will start on March 2, 2020 and finish on June 30 of the same year. The M5 dataset will become publicly available on the starting date of the competition.
The competition will be run on the Kaggle platform. We therefore expect a large number of submissions from forecasters with both statistical and machine learning backgrounds, expanding the field of forecasting and integrating its various approaches for improving accuracy and uncertainty estimation.
Note that, in contrast to what is typically done in Kaggle competitions, M5 will not involve a real-time leaderboard. This means that the participants will be free to (re)submit their forecasts on a daily basis but will not be aware of either their absolute or their relative performance. The ranks of the participating methods will be made available only at the end of the competition, when the organizers make the test sample of the dataset publicly available by sharing it with Kaggle. This is done in order for the competition to simulate reality as closely as possible, keeping in mind that in real life forecasters know little about the future.
The M5 dataset, generously made available by Walmart, involves the sales of various products sold in the USA, organized in the form of grouped time series. More specifically, the dataset involves the sales of 3,075 products, classified in 3 product categories (Hobbies, Foods, and Household) and 7 product departments, in which the above-mentioned categories are disaggregated. The products are sold across 10 stores, located in 3 States (CA, TX, and WI). In this respect, the bottom-level of the hierarchy, i.e., product-store sales, can be mapped either across product categories or geographical regions, as follows:
|Aggregation Level|Number of Series|
|---|---|
|Sales of all products, aggregated for all stores/states|1|
|Sales of all products, aggregated for each State|3|
|Sales of all products, aggregated for each store|10|
|Sales of all products, aggregated for each category|3|
|Sales of all products, aggregated for each department|7|
|Sales of all products, aggregated for each State and category|9|
|Sales of all products, aggregated for each State and department|21|
|Sales of all products, aggregated for each store and category|30|
|Sales of all products, aggregated for each store and department|70|
|Sales of product x, aggregated for all stores/states|3,075|
|Sales of product x, aggregated for each State|9,225|
|Sales of product x, aggregated for each store|30,750|
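The series counts in the table can be checked from the stated numbers of products, stores, States, categories, and departments. The sketch below is only a sanity check of that arithmetic; the dictionary keys are informal labels, not identifiers from the dataset.

```python
# Dimensions stated in the text.
N_PRODUCTS, N_STORES, N_STATES = 3075, 10, 3
N_CATEGORIES, N_DEPARTMENTS = 3, 7

# One entry per aggregation level of the table, in the same order.
series_per_level = {
    "total": 1,
    "state": N_STATES,
    "store": N_STORES,
    "category": N_CATEGORIES,
    "department": N_DEPARTMENTS,
    "state x category": N_STATES * N_CATEGORIES,
    "state x department": N_STATES * N_DEPARTMENTS,
    "store x category": N_STORES * N_CATEGORIES,
    "store x department": N_STORES * N_DEPARTMENTS,
    "product": N_PRODUCTS,
    "product x state": N_PRODUCTS * N_STATES,
    "product x store": N_PRODUCTS * N_STORES,
}

total_series = sum(series_per_level.values())
print(len(series_per_level), "levels,", total_series, "series")
```

The twelve levels sum to the 43,204 series referred to throughout the text.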
The historical data range from 2011-01-29 to 2016-06-19. Thus, the products have a (maximum) selling history of 1,941 days / 5.4 years (test data of h=28 days not included).
The M5 dataset consists of the following three (3) files:
Contains information about the dates the products are sold.
date: The date in a “y-m-d” format.
wm_yr_wk: The id of the week the date belongs to.
weekday: The type of the day (Saturday, Sunday, …, Friday).
wday: The id of the weekday, starting from Saturday.
month: The month of the date.
year: The year of the date.
event_name_1: If the date includes an event, the name of this event.
event_type_1: If the date includes an event, the type of this event.
event_name_2: If the date includes a second event, the name of this event.
event_type_2: If the date includes a second event, the type of this event.
snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.
Contains information about the price of the products sold per store and date.
store_id: The id of the store where the product is sold.
item_id: The id of the product.
wm_yr_wk: The id of the week.
sell_price: The price of the product for the given week/store. Price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week.
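Since prices are given per store, product, and week, a price lookup is naturally keyed on those three fields, with a missing entry meaning that the product was not sold in that week. The sketch below illustrates this with a plain dictionary; the ids and prices are made-up values for illustration, and `weekly_price` and `was_sold` are hypothetical helper names.

```python
# Hypothetical excerpt of the price data:
# (store_id, item_id, wm_yr_wk) -> sell_price. All values are made up.
sell_prices = {
    ("CA_1", "HOBBIES_1_001", 11101): 9.58,
    ("CA_1", "HOBBIES_1_001", 11103): 8.26,
}

def weekly_price(store_id, item_id, wm_yr_wk):
    """Return the weekly price, or None if no price is available."""
    return sell_prices.get((store_id, item_id, wm_yr_wk))

def was_sold(store_id, item_id, wm_yr_wk):
    """A missing price means the product was not sold during that week."""
    return weekly_price(store_id, item_id, wm_yr_wk) is not None

print(was_sold("CA_1", "HOBBIES_1_001", 11101))
print(was_sold("CA_1", "HOBBIES_1_001", 11102))
```

Daily records can be matched to a weekly price through the wm_yr_wk field that the calendar data provides for each date.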
Contains the historical daily sales data per product and store.
item_id: The id of the product.
dept_id: The id of the department the product belongs to.
cat_id: The id of the category the product belongs to.
store_id: The id of the store where the product is sold.
state_id: The State where the store is located.
d_1, d_2, …, d_i, …, d_1941: The number of units sold on day i, starting from 2011-01-29.
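The day columns can be mapped back to calendar dates by counting from 2011-01-29. The sketch below does this with the standard library; `day_to_date` is a hypothetical helper name. Under this mapping, d_1941 falls on 2016-05-22 and the 28th day after it falls on 2016-06-19, which suggests that the stated end date of the data covers the 28 held-out test days as well.

```python
from datetime import date, timedelta

FIRST_DAY = date(2011, 1, 29)  # the date of column d_1, as stated in the text

def day_to_date(i):
    """Return the calendar date of column d_i (1-indexed)."""
    if i < 1:
        raise ValueError("day index starts at 1")
    return FIRST_DAY + timedelta(days=i - 1)

print(day_to_date(1))          # first training day
print(day_to_date(1941))       # last training day
print(day_to_date(1941 + 28))  # last day of the 28-day horizon
```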
The number of forecasts required, both for point and probabilistic forecasts, is h=28 days (4 weeks ahead).
The performance measures are first computed for each series separately by averaging their values across the forecasting horizon and then averaged again across the series in a weighted fashion (see below) to obtain the final scores.
The accuracy of the point forecasts will be evaluated using the Root Mean Squared Scaled Error (RMSSE), which is a variant of the well-known Mean Absolute Scaled Error (MASE) proposed by Hyndman and Koehler (2006). The measure is calculated as follows:
$$\mathrm{RMSSE} = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t - \hat{Y}_t\right)^2}{\frac{1}{n-1}\sum_{t=2}^{n}\left(Y_t - Y_{t-1}\right)^2}}$$
where $Y_t$ is the actual future value of the examined time series at point $t$, $\hat{Y}_t$ the generated forecast, $n$ the length of the training sample (number of historical observations), and $h$ the forecasting horizon.
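A minimal sketch of the RMSSE for a single series, assuming the usual definition: the mean squared forecast error over the horizon, divided by the mean squared one-step difference of the training sample, under a square root. The function and argument names are illustrative, and the toy numbers are made up.

```python
import math

def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error of one series.

    train:    historical observations Y_1..Y_n
    actual:   realized values over the horizon, Y_{n+1}..Y_{n+h}
    forecast: point forecasts for the same h periods
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(train) < 2:
        raise ValueError("need at least two training observations")
    n, h = len(train), len(actual)
    mse = sum((y - f) ** 2 for y, f in zip(actual, forecast)) / h
    # In-sample mean squared error of the one-step naive forecast.
    scale = sum((train[t] - train[t - 1]) ** 2 for t in range(1, n)) / (n - 1)
    if scale == 0:
        raise ValueError("training series is constant; scale is zero")
    return math.sqrt(mse / scale)

# Toy intermittent series: the scale is 1, the forecast MSE is 4.
score = rmsse(train=[0, 1, 0, 1], actual=[2, 2], forecast=[0, 0])
print(score)
```

The explicit check for a zero scale covers the one case in which the denominator could vanish, namely a training sample with no variation at all.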
The choice of the measure is justified as follows:
- The M5 series are characterized by intermittency, involving lots of zeros. This means that absolute errors, which are optimized for the median, would assign lower scores to forecasting methods that derive forecasts close to zero. However, the objective of M5 is to accurately forecast the average demand. Thus, the accuracy measure used depends on squared errors, which are optimized for the mean.
- The measure is scale independent, meaning that it can be effectively used to compare forecasts across series with different scales.
- In contrast to other measures, it can be safely computed as it does not rely on divisions with values that could be equal to zero (e.g. as done in percentage errors when the actual value is zero, or in relative errors when the error of the benchmark used for scaling is zero).
- The measure penalizes positive and negative forecast errors, as well as large and small forecasts, equally, thus being symmetric.
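The first justification can be illustrated on a toy intermittent series. In the sketch below, whose data are made up, an all-zero forecast wins under the mean absolute error, while a forecast equal to the mean demand wins under the mean squared error.

```python
from statistics import mean, median

# Made-up intermittent demand: mostly zeros with occasional sales.
demand = [0, 0, 0, 5, 0, 0, 3, 0, 0, 0]

def mae(actual, forecast_value):
    """Mean absolute error of a constant forecast."""
    return mean(abs(y - forecast_value) for y in actual)

def mse(actual, forecast_value):
    """Mean squared error of a constant forecast."""
    return mean((y - forecast_value) ** 2 for y in actual)

zero_forecast = median(demand)  # the median of this series is 0
mean_forecast = mean(demand)    # the average demand

print("MAE:", mae(demand, zero_forecast), "vs", mae(demand, mean_forecast))
print("MSE:", mse(demand, zero_forecast), "vs", mse(demand, mean_forecast))
```

A method ranked by absolute errors would therefore be rewarded for forecasting no demand at all, which is not what a retailer planning average demand needs.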
After estimating the RMSSE for all the 43,204 time series of the competition, the participating methods will be ranked using the Weighted RMSSE (WRMSSE), with the weights described below. Once again, note that the weight of each series will be computed based on the test sample of the dataset, i.e., future sales and prices.
An indicative example for computing the WRMSSE will be available on the GitHub repository of the competition.
The precision of the probabilistic forecasts will be evaluated using the Scaled Pinball Loss (SPL) function, as follows:
$$\mathrm{SPL}(u) = \frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left[\left(Y_t - Q_t(u)\right)u\,\mathbf{1}\{Q_t(u) \le Y_t\} + \left(Q_t(u) - Y_t\right)(1-u)\,\mathbf{1}\{Q_t(u) > Y_t\}\right]}{\frac{1}{n-1}\sum_{t=2}^{n}\left|Y_t - Y_{t-1}\right|}$$
where $Y_t$ is the actual future value of the examined time series at point $t$, $Q_t(u)$ the generated forecast for quantile $u$, $h$ the forecasting horizon, $n$ the length of the training sample (number of historical observations), and $\mathbf{1}$ is the indicator function (being 1 if $Y$ is within the postulated interval and 0 otherwise). Given that forecasters will be asked to provide the median and the 50%, 67%, 95%, and 99% PIs, $u$ is set to 0.005, 0.025, 0.165, 0.25, 0.5, 0.75, 0.835, 0.975, and 0.995.
After estimating the SPL for all the 43,204 time series of the competition and for all the requested quantiles, the participating methods will be ranked using the Weighted SPL (WSPL), with the weights described below, divided by nine (average performance of nine quantiles across all series). Once again, note that the weight of each series will be computed based on the test sample of the dataset, i.e., future sales and prices.
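A minimal sketch of a scaled pinball loss for one series and one quantile level, assuming the standard pinball loss averaged over the horizon and scaled by the mean absolute one-step difference of the training sample. The function and argument names are illustrative, and the toy numbers are made up.

```python
def scaled_pinball_loss(train, actual, quantile_forecast, u):
    """Scaled pinball loss of one series for quantile level u (0 < u < 1)."""
    if not 0 < u < 1:
        raise ValueError("u must lie strictly between 0 and 1")
    if len(actual) != len(quantile_forecast):
        raise ValueError("actual and quantile_forecast must have the same length")
    n, h = len(train), len(actual)
    # In-sample mean absolute error of the one-step naive forecast.
    scale = sum(abs(train[t] - train[t - 1]) for t in range(1, n)) / (n - 1)
    if scale == 0:
        raise ValueError("training series is constant; scale is zero")
    total = 0.0
    for y, q in zip(actual, quantile_forecast):
        if y >= q:
            total += (y - q) * u        # actual at or above the quantile
        else:
            total += (q - y) * (1 - u)  # actual below the quantile
    return total / h / scale

# Toy example with a scale of 1: under-prediction costs more at u = 0.75.
loss = scaled_pinball_loss([0, 1, 0, 1], actual=[3, 0],
                           quantile_forecast=[1, 1], u=0.75)
print(loss)
```

At u = 0.5 the loss reduces to half the scaled mean absolute error, which is why the median forecast is evaluated by the same function.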
An indicative example for computing the WSPL will be available on the GitHub repository of the competition.
In contrast to the previous M competitions, M5 involves the sales of various products organized in a hierarchical fashion. This means that, businesswise, in order for a method to perform well, it must provide accurate forecasts across all hierarchical levels, especially for series of high aggregate sales (measured in US dollars). In other words, we expect the best-performing forecasting methods to yield lower forecasting errors for the series that are of more value to the company. To that end, the forecasting errors computed for each participating method will be weighted across the M5 series based on the aggregate sales that each series represents, i.e. a proxy of their actual value for the company in monetary terms.
Assume that two products of the same department, A and B, are sold in a store at WI. Product A, of price $1, displays 10 sales in the testing period, while product B, of price $2, displays 6 sales. The aggregate sales of product A will be $1*10=$10, while the aggregate sales of product B will be $2*6=$12. Assume also that a forecasting method was used to forecast the sales of product A, product B, and their aggregate sales, displaying errors EA, EB, and E, respectively. If the M5 dataset involved just those three series, the final score of the method would be
$$\mathrm{Score} = E_A \cdot \frac{10}{10+12+22} + E_B \cdot \frac{12}{10+12+22} + E \cdot \frac{22}{10+12+22}$$
This weighting scheme can be expanded in order to consider more stores, geographical regions, product categories, and product departments, as previously described. Note that, based on the considered scheme, all hierarchical levels are equally weighted. This is because the total sales of a product, measured across all three States, are equal to the sum of the sales of this product when measured across all ten stores, or similarly, because the total sales of a product category of a store are equal to the sum of the sales of the departments that this category consists of, as well as the sum of the sales of the products of the corresponding departments.
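The two-product example can be sketched in code, assuming that every hierarchical level receives an equal share of the total weight and that, within a level, each series is weighted by its share of the dollar sales. The error values below are made up, and `series_weights` is a hypothetical helper name.

```python
from collections import defaultdict

def series_weights(dollar_sales, level_of):
    """Weight each series by its sales share within its level,
    giving every level the same total weight."""
    level_totals = defaultdict(float)
    for name, sales in dollar_sales.items():
        level_totals[level_of[name]] += sales
    n_levels = len(level_totals)
    return {
        name: sales / level_totals[level_of[name]] / n_levels
        for name, sales in dollar_sales.items()
    }

# Product A: price $1, 10 units. Product B: price $2, 6 units.
dollar_sales = {"A": 1 * 10, "B": 2 * 6, "A+B": 1 * 10 + 2 * 6}
level_of = {"A": "product", "B": "product", "A+B": "aggregate"}
weights = series_weights(dollar_sales, level_of)

# Hypothetical errors E_A, E_B and E of some forecasting method.
errors = {"A": 0.8, "B": 1.1, "A+B": 0.5}
score = sum(weights[name] * errors[name] for name in errors)
print(weights)
print(score)
```

Because the aggregate series carries the same dollar sales as the two products combined, the weights come out as 10/44, 12/44, and 22/44, so each of the two levels contributes half of the final score.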
The prerequisite for the Full Reproducibility Prizes will be that the code used for generating the forecasts is put on GitHub no later than 10 days after the end of the competition (i.e., the 10th of July, 2020); companies providing forecasting services and those claiming proprietary software are exempt from this requirement. In addition, there must be instructions on how to exactly reproduce the submitted M5 forecasts. Individuals and companies will then be able to use the code and the instructions provided, crediting the person/group that has developed them, to improve their organizational forecasts.
Companies providing forecasting services and those claiming proprietary software will have to provide the organizers with a detailed description of how their forecasts were made and a source, or execution file for reproducing their forecasts. Given the critical importance of objectivity and replicability, such description and file will be mandatory for participating in the competition. An execution file can be submitted in case that the source program nee
Similar to the M3 and M4 competitions, there will be a special issue of the International Journal of Forecasting (IJF) exclusively devoted to all aspects of the M5 Competition with special emphasis on what we have learned and how we can use such learning to improve the field of forecasting and expand its usefulness and applicability.
As in the M4 competition, there will be fourteen (14) benchmark methods: ten (10) statistical and four (4) Machine Learning (ML) ones. As these methods are well known, readily available, and straightforward to apply, the new methods proposed in the M5 Competition must provide superior accuracy in order to be adopted and used in practice (taking also into account the computational time required to utilize a more accurate method versus the benchmarks, whose computational requirements are minimal).