Stock Correlation Versus LSTM Prediction Error

Testing whether an LSTM trained on one stock can generalize to other stocks, and whether correlation predicts how well it does.

Gregory Janesch
7 min read · Sep 19, 2020

When looking for examples of LSTMs in Keras, I’ve found a lot that focus on using them to predict future stock prices. Most are pretty bare-bones though, consisting of little more than a basic LSTM network and a quick plot of the predictions. Though I think the utility of these models is a little questionable, it brought a question into my head: how accurate are the predictions made by a model trained on one stock if it’s predicting on another stock?

The full code can be found here.

Problem Description

Stocks are correlated with each other to varying degrees, so the behaviors of any given pair of stocks may or may not track each other. The correlation between stocks is usually measured as the correlation of their returns (or at least, that’s what I’ve seen), and it’s easy to compute those yourself.

In addition, there are an immense number of posts about predicting stock prices with neural networks. These examples usually don’t go too deep, though, and they invariably train and check the model using data from the same stock. That’s reasonable enough, but it raises the question of how generalizable these models are. It doesn’t seem likely that a model would produce good predictions if there were only weak correlation between the stock it was trained on and the one it’s predicting on, but maybe it would work well enough for stocks that are more strongly correlated.

So the goal here is:
- Get data on a large number of stocks (preferably hundreds).
- Compute the correlations between the stocks.
- Train an LSTM on a single, reference stock.
- Make predictions for the other stocks using that LSTM model.
- See how some error metric varies with correlation.

Getting the Data

Since I’m aiming to get data on a few hundred stocks, the first list that jumps to mind is the S&P 500. There are actually 505 tickers on there, but that’s because five of the companies have multiple share classes. I just discarded one class for each stock with multiple share classes — the list I ended up using is in the GitHub repo for this post.

I downloaded the data from Tiingo via the pandas_datareader library. Tiingo limits free accounts to 500 unique symbols per month, so it’s feasible to grab all of this at once, although you won’t be able to get data for any other ticker with that account for the remainder of the month.
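A minimal sketch of the download step, assuming the cleaned ticker list lives in a `tickers` variable and the API key is in an environment variable (both names are placeholders; the date range matches the window used later):

```python
import os

from pandas_datareader import data as pdr

# `tickers` is the cleaned list of ~500 S&P 500 symbols (placeholder name).
stock_data = pdr.get_data_tiingo(
    tickers,
    start="2001-01-01",
    end="2019-12-31",
    api_key=os.environ["TIINGO_API_KEY"],  # assumed key location
)
```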

This will take several minutes to execute. If you’re running this code yourself, I recommend saving the data immediately afterward — the file that my run produced was almost 300 megabytes and contained about 2.3 million rows, so it’s not something you want to repeatedly download.

Selecting & Scaling Data

Since we’re dealing with an LSTM, we want the data scaled down to a range that the network handles well. And since the stocks trade at different scales, each one needs its own scaler. Even though only the reference stock’s data will be used for training, I want all of the stocks to have complete data for the same timeframe: scaling the other stocks on just their test data would exaggerate how large some of their movements were.

It turns out that complete data from 2001-01-01 to 2019-12-31 exists for 370 of the stocks, so I opted to just filter down to those.

It didn’t matter to me what the reference stock was, so I just picked one using random.choice() from Python’s standard library. The result was ALL (The Allstate Corporation), so we can set that as a constant, along with the lengths of the inputs and outputs for the network.
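In code, roughly (the DataFrame name is an assumption, and the post doesn’t state the exact input/output lengths, so those values below are placeholders):

```python
import random

# `closes` is assumed to be a DataFrame of closing prices, one column per
# ticker, already restricted to 2001-01-01 through 2019-12-31.
counts = closes.count()
complete = closes.loc[:, counts == counts.max()]  # the 370 complete stocks

REFERENCE_STOCK = random.choice(list(complete.columns))  # 'ALL' in my run
INPUT_LENGTH = 30   # days fed into the network (placeholder value)
OUTPUT_LENGTH = 5   # days predicted at once (placeholder value)
```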

For scaling the data, most of the posts I saw used sklearn.preprocessing.MinMaxScaler() to get it into a range the LSTM works better with. I took it one step further, though: tracking a stock’s movements as percent changes seemed a bit more consistent than using the absolute price. For example, if we consider the somewhat extreme case of Apple (AAPL):

Daily changes in AAPL value by absolute change (difference in price) and percent change.

To deal with this, I made a child class of MinMaxScaler() which takes the logarithm of the data before applying the usual MinMaxScaler() functionality. As a result, equal percentage changes in price become equal absolute changes in the transformed data.

Making a child of MinMaxScaler() has several advantages over coding your own scaler. The biggest for me is that MinMaxScaler() already scales each column of the data independently and stores all the necessary information. That’s exactly what’s needed for these few hundred stocks, and this way I don’t need to reimplement it myself.
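A sketch of what such a class might look like (the actual implementation in the repo may differ):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

class LogMinMaxScaler(MinMaxScaler):
    """Min-max scaling on the log of the data, so equal percentage moves
    in price map to equal distances in the scaled space."""

    def fit(self, X, y=None):
        return super().fit(np.log(X), y)

    def transform(self, X):
        return super().transform(np.log(X))

    def inverse_transform(self, X):
        return np.exp(super().inverse_transform(X))
```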

We also need the correlations. Thankfully, pandas has pandas.DataFrame.corr() for this, so we just need to calculate the returns, take each stock’s correlation with the reference stock, and drop the reference stock’s trivial correlation with itself.
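Something along these lines, reusing the price DataFrame from the earlier sketch:

```python
# Correlate daily returns, then keep each stock's correlation with the
# reference stock (dropping the reference's self-correlation of 1.0).
returns = complete.pct_change().dropna()
correlations = returns.corr()[REFERENCE_STOCK].drop(REFERENCE_STOCK)
```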

The correlations do vary a decent amount, although I would describe the bulk of stocks as just being mildly correlated. The fact that they’re all positive probably reflects the general tendency for the market to go up over time, especially in the time window we’re considering here.

Finally, create the arrays to hold the training data and the other stock data to predict on.
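A possible shape for that windowing step; `scaled_reference` (the reference stock’s column after scaling) is an assumed name:

```python
import numpy as np

def make_windows(series, input_length, output_length):
    """Slice a 1-D array of scaled prices into (input, target) windows."""
    X, y = [], []
    for i in range(len(series) - input_length - output_length + 1):
        X.append(series[i:i + input_length])
        y.append(series[i + input_length:i + input_length + output_length])
    # Add a trailing axis: Keras LSTMs expect (samples, steps, features).
    return np.array(X)[..., np.newaxis], np.array(y)[..., np.newaxis]

X_train, y_train = make_windows(scaled_reference, INPUT_LENGTH, OUTPUT_LENGTH)
```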

The LSTM Model

LSTM models in the posts I saw typically used 50 nodes per hidden layer with two to four hidden layers. But they also only predicted one point at a time, and I wanted to see how well a sequence could be predicted. So I made the following model, largely based on an example from here.

The combination of RepeatVector() and TimeDistributed() is what allows the prediction of multiple points. The predictions don’t feed back into the model, though, so every point in the prediction is based on the same data.
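A sketch consistent with that description, written as a builder function since the layer width gets tuned below (the exact depth and optimizer are assumptions; only the RepeatVector()/TimeDistributed() structure comes from the text):

```python
from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed
from tensorflow.keras.models import Sequential

def build_model(n_units, input_length, output_length):
    """Encoder-decoder LSTM: the first LSTM summarizes the input window,
    RepeatVector() copies that summary once per output step, the second
    LSTM unrolls it, and TimeDistributed(Dense(1)) emits one value per step."""
    model = Sequential([
        LSTM(n_units, input_shape=(input_length, 1)),
        RepeatVector(output_length),
        LSTM(n_units, return_sequences=True),
        TimeDistributed(Dense(1)),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```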

Since I was a little unsure about the sizes of the LSTM layers in the model, I tried doing some grid search cross-validation. (I know random hyperparameter searches are more efficient, but since I only have two variables I didn’t think it would make much difference.) Of course, since this is temporal data, we need to split it appropriately, lest data leakage confuse things.
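For instance, with scikit-learn’s TimeSeriesSplit(), which keeps every validation fold after its training fold (the candidate sizes and epoch count below are guesses, not the post’s actual grid):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=4)
mean_losses = {}
for n_units in [60, 100, 140, 180]:  # assumed candidate grid
    fold_losses = []
    for train_idx, val_idx in tscv.split(X_train):
        model = build_model(n_units, INPUT_LENGTH, OUTPUT_LENGTH)
        model.fit(X_train[train_idx], y_train[train_idx],
                  epochs=20, verbose=0)  # epoch count is a guess
        fold_losses.append(model.evaluate(X_train[val_idx],
                                          y_train[val_idx], verbose=0))
    mean_losses[n_units] = np.mean(fold_losses)

best_units = min(mean_losses, key=mean_losses.get)
```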

The best model in the grid search had 140 neurons in each LSTM layer, so that’s what the final model uses.

Making Predictions

Making predictions on future stock prices means predicting the actual price, not just a scaled version of it. As such, the error metric shouldn’t be distorted by some stocks trading in the tens of dollars and others in the hundreds or thousands. Mean absolute percentage error (MAPE) seemed like a good metric that fits this requirement.
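A quick implementation of the metric (stock prices are strictly positive, so the division is safe):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, as a percentage."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))
```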

So first, a check on the reference stock. How well did the model do on it?

It’s okay: the predictions are off by 1–2% for most of the window, which isn’t too bad. Of course, the important bit is how accurate it is once the data is unscaled.

Once it’s unscaled, we end up with about a 4.1% MAPE. I’m not sure if that’s particularly good or not, but it brings us to the main question: How do the other stocks fare?

It’s hard to tell exactly what is or isn’t there. It seems like there are fewer extreme MAPE values at both high and low correlations, but maybe that’s just because there are fewer points out there. We can try something a little stricter by binning the data based on correlation and running an ANOVA on the bins (with a boxplot for visualization purposes).
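One way to set that up with scipy (the bin edges and variable names here are assumptions):

```python
import pandas as pd
from scipy import stats

# `mape_by_stock` is a Series of per-stock MAPEs indexed like `correlations`.
bins = pd.cut(correlations, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
groups = [mape_by_stock[bins == b] for b in bins.cat.categories]
groups = [g for g in groups if len(g) > 0]  # drop any empty bins
f_stat, p_value = stats.f_oneway(*groups)
```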

A p-value of around 0.51 is much larger than any typical significance level, so it looks like there are no statistically discernible differences between the above groups, despite the boxplot suggesting otherwise. But this is all on the scaled data. What if it’s unscaled?

The main takeaway from this, which could already be seen on the reference stock, is that unscaling the data increases the MAPE by a lot, to somewhat worrying levels in many cases. There’s still no strong pattern in this plot, especially with the MAPE values considerably more spread out than before. It again looks like the higher correlations might not have as much spread, but it’s still tenuous.

The ANOVA is a lot more suggestive this time around, though. With p=0.047, this would be statistically significant for some common significance levels (including 0.05), though not all (it’s still above 0.01, for instance).

Conclusions

With this basic LSTM model, there might be some relationship between prediction error and stock correlation. Given how the MAPE values for the unscaled predictions on the non-reference stocks looked, there’s clearly work to be done on creating a more accurate model. Running this a number of times would also be necessary to get a reliable picture, given the random initialization and training that come with neural networks.

