Learning by example – the linear regression model
Imagine that you and a friend own a small ice cream shop. You are discussing how many kilograms (kg) of ice cream to produce each day and you both agree on the fact that the hotter the weather is, the more ice cream will be sold. You add that this is not the only factor to take into account, but there are other variables that can also affect the number of sales. As rational people and good analysts, you decide to run a small experiment by recording the mean temperature during the shop opening hours and the amount of ice cream that is sold. The summer turns out to be particularly rainy, and the temperature variation is high, which helps you to achieve a good range for the variables. The final dataset looks like the following table:
Your model states that the amount of ice cream sold is (directly) proportional to the mean temperature. In order to test this hypothesis, we can make a scatter plot of the collected data:
- Select the full range of cells containing the table, click on Insert menu, and select Charts:
- Now, click on Scatter, as follows:
After writing the names of the axis titles, you should get a chart that is similar to the following chart:
We see that there is indeed a linear correlation and that it is positive (the larger the temperature value, the more ice cream you sell). We can then represent the model using a linear equation, as follows:
IC = a * T + b (1)
Here, IC is the amount of ice cream sold, T is the mean temperature, and a and b are constant values to be calculated by a linear regression.
To obtain the values of a and b, we can use Excel's Analysis ToolPak data analysis add-in. If you have not enabled it, refer link https://support.office.com/en-ie/article/use-the-analysis-toolpak-to-perform-complex-data-analysis-6c67ccf0-f4a9-487c-8dec-bdb5a2cefab6 for instructions on how to do it.
- Select the data range in your worksheet, go to Data in the main menu and then select Data Analysis:
- In the pop-up menu, select Regression and click on OK:
- Make sure that the x and y ranges are correct (x is temperature and y is ice cream amount). Select Line Fit Plots to see the regression line on top of the data points in a new diagram:
Looking at the output, we see that the line that best fits the data can be written as follows:
IC = 1.5* T + 6 (2)
There is a standard error for a of ±0.2, and for b of ±4. The R2 value is 0.78, which means that the fit is not very good and only 78% of the variation in ice cream sales can be explained by the mean temperature. So, you and your friend were both right!
The following diagram shows the fitted line:
It is clear that the line represents the data quite well, but some points are a little bit off, showing that you need to take other factors into consideration when predicting ice cream consumption. In any case, given the mean forecasted temperature for one day, you can use equation (2) to have a rough estimation of how much ice cream to produce to cover the possible demand.
Keep the rest of the linear regression results to hand, as we are going to use some of them in the following sections.