A Data Science Christmas
Even though there is a lot to think about right now, from hygiene precautions to the upcoming holidays, we want to wish you all a Merry Christmas and give you something to smile about. Maybe you (or your children) have wondered whether it is going to snow on Christmas in Leipzig. For everyone who has asked themselves this question, a fellow researcher at ScaDS.AI Leipzig built a forecasting model based on open data from the German Weather Service. The model may be able to tell us whether it is really going to snow on Christmas. But how did he do it? What do gingerbread and weight loss have in common? And what does that have to do with this snow height prediction model? To explain all this, let's start with a small tutorial on how to build such a prediction model yourself.
Step 1: The Idea
Create a simple model that predicts the snow depth in Leipzig from the weather data of the respective day.
Step 2: Data Check
The German Weather Service offers historical and current data, including daily observations at weather stations (temperature, pressure, precipitation, sunshine duration, etc.), via https://opendata.dwd.de/.
For Leipzig, data from three weather stations are offered (Holzhausen, Halle/Leipzig, Mockau). The data go back to 1863 with gaps; almost continuous measurements for all three stations are available from 1935 onwards. From these data, the amount of snow in cm for Leipzig can be compared since 1935:
As you can see in the graph, the number of days with snow and the average snow height have decreased since 1935. Nevertheless, there is hope, because in some years snow did at least fall in December in Leipzig.
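The yearly comparison described above could be computed roughly as follows. This is only a sketch: the column names `date` and `snow_cm` are assumptions about how the cleaned data might be laid out, the four sample rows are made up purely to show the call, and a day counts as a "snow day" here whenever the measured snow height is above zero.

```python
import pandas as pd


def yearly_snow_summary(daily: pd.DataFrame) -> pd.DataFrame:
    """Count snow days and average the snow height per year.

    Expects a datetime column 'date' and a float column 'snow_cm'
    (both names are assumptions); rows without a reading are skipped.
    """
    snow = daily.dropna(subset=["snow_cm"]).copy()
    snow["year"] = snow["date"].dt.year
    snow["has_snow"] = snow["snow_cm"] > 0
    return snow.groupby("year").agg(
        snow_days=("has_snow", "sum"),
        mean_snow_cm=("snow_cm", "mean"),
    )


# Made-up rows, only to demonstrate the function (not real DWD values).
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["1935-12-24", "1935-12-25", "2019-12-24", "2019-12-25"]
        ),
        "snow_cm": [5.0, 3.0, 0.0, None],
    }
)
summary = yearly_snow_summary(sample)
print(summary)
```

Plotting `snow_days` and `mean_snow_cm` against the year would then give a picture comparable to the graph.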
Step 3: How to Build the Model
The goal is to train a model on the historical Leipzig data so that it predicts the snow height from the respective daily values.
Merging and cleaning the data sets of the three Leipzig weather stations results in a data set of more than 30,000 days with daily values for temperature (min, max, avg), atmospheric pressure, humidity, precipitation height, cloud cover, sunshine duration, vapor pressure, and snow height (the value to be predicted later). You then need to enrich the data with entries for the respective year, month and calendar week.
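A minimal sketch of the merging and enrichment step is shown below. How the three stations were actually combined is not stated here, so averaging the stations per day is an assumption, as are the column names (`date`, `temp_avg`, `snow_cm`) and the made-up sample values.

```python
import pandas as pd


def merge_stations(frames: list) -> pd.DataFrame:
    """Combine several station tables into one row per date.

    Assumption: stations are averaged per day. mean() skips NaN,
    so a gap at one station is covered by the other stations.
    """
    combined = pd.concat(frames, ignore_index=True)
    return combined.groupby("date", as_index=False).mean(numeric_only=True)


def add_calendar_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Add year, month and ISO calendar week derived from 'date'."""
    out = daily.copy()
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["calendar_week"] = out["date"].dt.isocalendar().week.astype(int)
    return out


# Two made-up station tables, only to demonstrate the calls.
days = pd.to_datetime(["2020-12-24", "2020-12-25"])
station_a = pd.DataFrame(
    {"date": days, "temp_avg": [2.0, None], "snow_cm": [0.0, 1.0]}
)
station_b = pd.DataFrame(
    {"date": days, "temp_avg": [4.0, 1.0], "snow_cm": [0.0, 3.0]}
)

merged = merge_stations([station_a, station_b])
enriched = add_calendar_features(merged)
print(enriched)
```

Averaging is only one option; picking a primary station and filling its gaps from the others would work just as well for a first model.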
We used xgboost, an open-source supervised machine learning library that uses decision-tree-based algorithms and gradient boosting, to build the predictive model. A benefit of xgboost is that it can handle the missing values that are still contained in the merged data set.
Now train the model on the data from 1935 to 12/2009. We then test which predictions it makes for the period from 01/2010 to 12/2020, which the model has not yet seen, and compare them with the measured snow heights.
As you can see, this first result is not very accurate. The model is simple: it predicts snow only from the daily values (temperature, air pressure, etc.) of the particular day in question. A more accurate forecast model would need further regional and supra-regional data, as the weather is not only a local phenomenon. The model could also be upgraded to a more comprehensive, and thus more accurate, forecast model by integrating the previous days into the forecast using a sliding window approach.
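One way to read the sliding window idea is to give each day the values of the preceding days as extra columns. The sketch below does this with `shift`; the helper name `add_lag_features`, the column names and the sample values are hypothetical, and it assumes one row per consecutive day with no missing dates.

```python
import pandas as pd


def add_lag_features(
    daily: pd.DataFrame, columns: list, window: int
) -> pd.DataFrame:
    """Append the values of the previous `window` days as new columns.

    Assumes one row per consecutive day; the first rows get NaN
    because they have no earlier days to look back on.
    """
    out = daily.sort_values("date").reset_index(drop=True)
    for col in columns:
        for lag in range(1, window + 1):
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out


# Made-up values, only to demonstrate the call.
sample = pd.DataFrame(
    {
        "date": pd.date_range("2020-12-21", periods=4, freq="D"),
        "temp_avg": [1.0, 2.0, 3.0, 4.0],
    }
)
lagged = add_lag_features(sample, ["temp_avg"], window=2)
print(lagged)
```

The NaN values in the first rows do not have to be removed, since xgboost can handle missing values.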
Although the model unfortunately doesn't give much hope for a white Christmas, there is still some chance of a snowball fight or a day on a sled in the snow during the rest of December. That may not be such a bad idea given the plentiful food we eat during the Christmas and pre-Christmas season, at least if you look at how the Google search statistics for gingerbread and losing weight in Germany correlate and overlap 🧐😉
Now it's up to you to decide whether it's just a coincidence or whether there is a real connection between these search queries. Either way, a more or less healthy diet during the holidays, combined with some exercise in the fresh air, also helps to keep our immune system active. So a day in the snow also helps us, in a broader sense, to stay healthy through the pandemic. In this spirit, we wish you a Merry Christmas and a Happy New Year. Stay healthy and happy 🙃