Tesla: Stock Price Prediction

Dale Wahl
Towards Data Science
7 min read · Dec 29, 2017


Quick Note: I will not be predicting the stock price of Tesla. But I will try.

I set out on this particular problem for two reasons. The first was simply to have some goal in mind while learning how to use ARIMA models and work with time series data. The second was to get filthy, stinking rich. The first goal seemed to work out great, and that’s what we’ll be focused on in this post. The second… well, lucky for you, I will not lose all incentive to keep up with writing!

My plan was to create an ARIMA model that tracks close enough to the closing price of TSLA each day and then feed that into a second model that incorporates natural language processing particularly on Elon Musk’s tweets (he’s pretty prolific on the ole Twitter), but also other news sources. Since, so far as I can tell, the price of Tesla stock is completely speculative, it ought to be based on the news (Tesla has never had a profitable year and only a couple of profitable quarters). The great thing about data science is that if you have a random hypothesis like this, it’s pretty easy to test it.

“When something is important enough, you do it even if the odds are not in your favour.”

Elon Musk

TSLA stock price since 2010

As you can see from the above, Tesla has been on the up and up since early 2013, which happens to be shortly after the release of their first car. I jumped in by buying a few shares (literally 3) in January 2017 and have seen returns of 50%. Unfortunately, I have no idea whether I ought to sell them or keep holding on. I was hoping a model might help me avoid all this up and down nonsense.

Alright, what is a time series problem anyway? A time series problem is one where any given observation depends on the observations made before it. This is pretty simple to think about if you imagine buying groceries. If you go into the store and find that cheese is five dollars more than the last time you bought it, you are not likely to find that reasonable. Of course, if the price had been increasing gradually for a few months, you might find its new price reasonable (if annoying). Outside factors are likely contributing as well, but for the average buyer, the decision to purchase comes mainly from the prices before. And this need not be only the last few prices. If we are talking about purchasing strawberries, we could say that the price follows a pattern based on historical prices. If strawberries are in season, we expect the price to go down, then increase as the season ends and they become rarer. The price today might depend more on the price 365 days ago than on yesterday’s. My friend Ben made an excellent post breaking these components down here.

As you have probably surmised, there is a strong time series component to the price of stocks. The ARIMA model I had in mind combines two features of time series: the AR stands for the autoregressive part, which looks at how strongly today’s price depends on its own past values, and the MA stands for the moving average part, which takes into account the effect of the moving average in price. (The I is for integrated.) The autoregressive component examines the correlation of today’s price with individual previous prices and describes that relationship. We can plot how much of an effect each prior day has on the current day.

This shows the correlation between the current price and each previous price up to 365 days

The above graph shows this correlation; as the lag in days increases, the correlation steadily decreases. This tells us that the most important correlation is the price immediately preceding the current price. If we saw other spikes, that would be a good indicator that there are other correlations (such as every week, quarter, or year). We can also look at the partial autocorrelation at different lags to see if there are additional correlations to consider.
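The post doesn’t include its plotting code, but the lagged correlations behind a chart like this are easy to compute yourself. Here is a minimal sketch with NumPy, using a synthetic random-walk series as a stand-in for the TSLA closes (so the exact numbers are illustrative, not the post’s):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic random-walk "closing price" standing in for the TSLA data.
price = 100 + np.cumsum(rng.normal(0, 2, 500))

def autocorr(x, lag):
    # Correlation between the series and itself shifted back by `lag` days.
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

acf = [autocorr(price, k) for k in range(1, 31)]
# For a trending series like this, the lag-1 correlation is close to 1
# and the correlation decays slowly as the lag grows.
```

Plotting `acf` against the lag reproduces the steadily decaying shape described above; a spike at, say, lag 7 or lag 365 would hint at a weekly or yearly pattern instead.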

This maps the partial autocorrelation in closing price over the same time period as above

The partial autocorrelation looks at the same relationship as above, but removes previous relationships. In other words, on lag 7, it looks at the relationship seven observations ago (seven days ago with our closing price data) while removing the effects of lag 1 through 6. This can show seasonal relationships like we might find in our strawberry example from before. These charts can be a little tricky to read, but the above one has no immediately obvious relationships and shows mostly noise.
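To show what “removing previous relationships” means mechanically, here is a hedged sketch that computes a single partial autocorrelation by regressing the intermediate lags out of both sides (in practice you would likely use a library routine such as statsmodels’ `plot_pacf`; this hand-rolled version is just for intuition, again on a synthetic stand-in series):

```python
import numpy as np

rng = np.random.default_rng(1)
price = 100 + np.cumsum(rng.normal(0, 2, 500))  # synthetic stand-in series

def pacf_at(x, lag):
    """Partial autocorrelation at `lag`: correlate x[t] with x[t-lag]
    after regressing the intermediate lags 1..lag-1 out of both."""
    n = len(x) - lag
    # Design matrix: intercept plus one column per intermediate lag.
    cols = [np.ones(n)] + [x[lag - j : lag - j + n] for j in range(1, lag)]
    X = np.column_stack(cols)
    resid = lambda y: y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(resid(x[lag:]), resid(x[:n]))[0, 1]

# Lag 1 carries nearly all the information; once it is accounted for,
# the higher lags look like noise, matching the chart above.
```

For this kind of series, `pacf_at(price, 1)` comes out close to 1 while `pacf_at(price, 7)` hovers near zero, which is exactly the “mostly noise” picture in the chart.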

That covers the autoregressive nature of the ARIMA model, but what about the moving average portion? That can be a bit trickier to find hidden inside the data, but we might think of it as a smoothing out of the change in our data to remove noise and random variance.

Comparing different rolling means over six months

You can see above that the rolling mean of Tesla’s stock price smooths out the sharp changes, which often represent noise and other fluctuations. This can help your model make more resilient predictions and not be entirely dependent on autoregression. How responsive you need your model to be can help you decide which time period for your rolling mean makes the most sense. You can see that as the number of days used to calculate the rolling mean increases, it takes longer for the mean to respond to changes.
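Rolling means like the ones in the chart are a one-liner in pandas. A small sketch, again on synthetic closes rather than the actual TSLA data, with the 7- and 30-day windows chosen just for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Roughly six months of synthetic daily closes standing in for TSLA.
close = pd.Series(300 + np.cumsum(rng.normal(0, 4, 180)))

smooth = pd.DataFrame({
    "close": close,
    "ma_7": close.rolling(7).mean(),    # responsive, still somewhat noisy
    "ma_30": close.rolling(30).mean(),  # smoother, slower to react
})
# The wider the window, the smaller the day-to-day moves of the average,
# which is the trade-off between smoothness and responsiveness.
```

Note that each rolling mean is undefined until its window fills up (the first 6 and 29 rows here are NaN), which is the gap you see at the left edge of charts like the one above.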

The last component to consider is the idea of “stationarity”. Stated simply, data is stationary if its mean and variance are the same over time. This is often a requirement for proper statistical analysis. As we saw earlier, the price of Tesla’s stock has been increasing over time, particularly since 2013, and thus is not stationary. There are various ways to transform data to make it stationary, which will allow us to model it and make predictions. We essentially find the trend in the data and any seasonality, remove those aspects from the data for modeling, make the predictions, and then add the trend and seasonality back into the predicted result.

We can see our observed stock prices on top, the overall trend in the price second, any yearly repetition third, and finally our residual which is whatever variation is still left over and needs explaining

Here is an attempt to look at any seasonal trends over a year and any trend over the life of the stock. The bottom three graphs represent the lifetime trend, the yearly variation, and the residual, which add up to create the observed graph on top! The residual at the bottom is all that is left to predict, and we hope that it is as stationary as possible. I also looked at taking what is called the first difference (subtracting the immediately preceding price from the current one) and conducted a test called the Dickey-Fuller test, which probably deserves a post in and of itself, to see if this helped us reach stationarity, which it did.

Combining all of these factors, we can come up with the parameters needed to train an ARIMA model on our data. We will want to train the model on a subset of our data, such as everything leading up to 2017, and then test the model on 2017.

Not too shabby, yea?

Ta-da! Here is a terrifyingly simple example of how to create this model in Python with the data here.

30 lines of code with half of it me blabbering and a quarter just reading in the data? Talk about standing on the shoulders of giants…

I hope this helped you understand the basics of time series modeling, when to use it, a bit of how to identify parameters for your model, and how to implement the model itself. This just scratches the surface; next up we need to figure out how to evaluate our model and see if it can make us rich!

As always, thanks for reading. Please let me know if you notice any errors or think I could clarify anything. Let me know too if you would like to hear more about how I started evaluating this model and built in the natural language components! You can find the rest of my code here.
