Create Historically Accurate Financial Data for Machine Learning

Since 2011, Valspresso, Inc. has invested heavily in cleaning and linking historical financial data in order to provide an accurate point-in-time view of data for analysis, backtesting, and machine learning training.

When we began our FinTech journey with our first prototype in 2009, it was clear early on that providers of financial data were not maintaining their data properly for historical analysis. Their focus was on providing accurate data as of today. This focus made sense, as they were serving analysts, traders, and advisors who are focused on making decisions today. But this focus on today’s accuracy meant that they were neglecting the historical accuracy of their data. In many cases, providers overwrite their database records with the latest information and thereby destroy the previous version of that record, forever losing that historical value. This neglect of historical accuracy means that there is no reliable way to perform historical analysis, backtesting, or machine learning training. As these activities have driven many of the algorithms behind automated trading, it is not surprising how many of them fail once implemented in daily trading operations.

A simple inspection of a provider’s data schema reveals the issues. For example, many providers have a “company file” which contains information about a particular company, such as company name, sector, location, and ticker symbol, each of which can and often does change over the lifetime of a company. If the schema does not allow for historical versions of the company record, then the data provider, by necessity, overwrites data and destroys history. While company name may not be significant, sector, location, and ticker are critical pieces of information for understanding the historical behavior of the company’s fundamentals and pricing. When those historical values are destroyed, historical context is destroyed. One dramatic example surfaced during the initial research that led to our automated fundamental analysis. We noticed an unusual number of fundamentally healthy companies failing (delisted or bankrupt) in 2005 and 2006. It turns out those companies were based in New Orleans when Hurricane Katrina hit on August 29, 2005.
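As an illustration of a schema that does allow for historical versions, here is a minimal sketch in Python. It is not a description of any provider's schema or of our own database: the field names, the `valid_from`/`valid_to` columns, and the sample company are all hypothetical. The idea is simply that a change closes the old version and adds a new one, so a lookup for any past day still finds what was true then.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CompanyVersion:
    """One version of a company record (hypothetical layout)."""
    company_id: int
    name: str
    sector: str
    location: str
    ticker: str
    valid_from: date           # first day this version was in effect
    valid_to: Optional[date]   # first day it was no longer in effect; None = current


def company_as_of(versions, company_id, day):
    """Return the version of the company record in effect on `day`, or None."""
    for v in versions:
        if v.company_id != company_id:
            continue
        if v.valid_from <= day and (v.valid_to is None or day < v.valid_to):
            return v
    return None


# Hypothetical company that relocates and changes ticker in 2006.
versions = [
    CompanyVersion(1, "Example Corp", "Energy", "New Orleans, LA", "EXM",
                   date(2000, 1, 3), date(2006, 3, 1)),
    CompanyVersion(1, "Example Corp", "Energy", "Houston, TX", "EXMP",
                   date(2006, 3, 1), None),
]

print(company_as_of(versions, 1, date(2005, 8, 29)).location)
print(company_as_of(versions, 1, date(2010, 1, 4)).ticker)
```

With a single-row company file, the 2005 lookup would return the 2006 location and the historical context would be lost.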

The overwriting of ticker symbols is probably one of the biggest issues we’ve had to resolve. Over a company’s lifespan, its common stock ticker can change. Depending on how the data provider handles this change, it can cause major historical integrity issues when trying to link (join) data across different files. This results in companies appearing and disappearing during backtesting, which makes the backtesting results unreliable. This is such a common problem that the industry has given it a name: survivorship bias. From a data quality perspective, the only reason that survivorship bias exists is that some data provider destroyed the history of failed companies, leaving only the survivors with their data intact.

Valspresso performs historical analysis, trains machine learning models, and executes accurate backtesting; therefore, we require each company’s data to exist, be complete, and be accurate for every trading day. We express this requirement as a question: “What did the market know on a given day?” Expressed more precisely:

“For any given day in the past, what was the market’s understanding of the financial history and stock pricing history of a company?” 

Some refer to this as “point-in-time” data, but we have discovered there is no industry standard definition for “point-in-time”, so different data providers define it differently and provide different levels or flavors of “point-in-time”.  In fact, whether data is “point-in-time” very much depends on how you plan to analyze the data.

There are many more classes of data problems beyond the few easy-to-understand examples mentioned above. In addition, the point-in-time issue reveals a set of interwoven problems. What follows is a list of some of the data problems we have encountered over the past eight years and the processes that we put in place to resolve them.

Missing Field Values

When a publicly traded company files its financial statements with the Securities and Exchange Commission (SEC), it has some flexibility in what values it reports. While not a complete list, we have encountered blank values in basic fields such as earnings per share, share count, current assets, total assets, current liabilities, and total liabilities. To resolve missing-value issues, we developed routines to calculate those values from other values on the balance sheet or income statement.
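A minimal sketch of what such a routine might look like follows. The field names and the two formulas used here (basic EPS as net income divided by share count, and total assets as total liabilities plus shareholders' equity) are assumptions chosen for illustration; they simplify real accounting rules and are not a description of our production routines.

```python
def fill_missing_fundamentals(row):
    """Return a copy of `row` with blank (None) fields derived where possible.

    `row` is a dict of statement values; the keys are hypothetical.
    Values that are already present are never changed.
    """
    filled = dict(row)

    # Assumed simplification: basic EPS = net income / shares outstanding
    # (ignores preferred dividends and dilution).
    if (filled.get("eps") is None
            and filled.get("net_income") is not None
            and filled.get("shares")):  # also skips a zero share count
        filled["eps"] = filled["net_income"] / filled["shares"]

    # Accounting identity: total assets = total liabilities + equity.
    if (filled.get("total_assets") is None
            and filled.get("total_liabilities") is not None
            and filled.get("equity") is not None):
        filled["total_assets"] = filled["total_liabilities"] + filled["equity"]

    return filled


row = {"eps": None, "net_income": 200.0, "shares": 100.0,
       "total_assets": None, "total_liabilities": 300.0, "equity": 500.0}
print(fill_missing_fundamentals(row))
```

If the inputs needed for a derivation are themselves blank, the field is left blank rather than guessed.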

Incorrect Field Values

We have encountered clearly wrong data values. We have seen this in everything from a stock’s closing price to a company’s earnings per share. Catching these types of errors requires cross-reference checking, human review, and/or heuristics (rules of thumb). We employ cross-checking and heuristics for the bulk of our quality checks, supplemented by human spot checking.
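To give a flavor of what a rule of thumb can look like, the sketch below flags closing prices that are non-positive or that move implausibly far from the previous close. The function name and the 50% threshold are hypothetical, chosen only for illustration; a flagged value is a candidate for cross-checking or human review, not proof of an error.

```python
def suspicious_closes(closes, max_daily_move=0.5):
    """Return the indexes of closing prices that look wrong.

    A close is flagged if it is not positive, or if it differs from the
    previous (positive) close by more than `max_daily_move` as a fraction.
    """
    flagged = []
    for i, cur in enumerate(closes):
        if cur <= 0:
            flagged.append(i)
            continue
        prev = closes[i - 1] if i > 0 else None
        if prev is not None and prev > 0 and abs(cur - prev) / prev > max_daily_move:
            flagged.append(i)
    return flagged


# Hypothetical series with a misplaced decimal point on the third day.
print(suspicious_closes([10.0, 10.2, 102.0, 10.1]))
```

Note that this simple rule flags both the bad value and the day after it, since the return to a normal price is also a large move. Telling the two apart is where cross-reference checks against a second source help.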

Missing Records

A daily stock pricing record can be missing due to error or simply because a stock is not traded on a given day. We have also encountered missing financial filings either due to error or regulatory exceptions.  Because we require every company to have records every day, our system patches these holes with the latest available information when appropriate.
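The simplest form of this patching is a forward fill: a day with no record inherits the most recent known value. The sketch below assumes prices are held in a dict keyed by date and that a list of trading days is available; both the function name and the data layout are hypothetical.

```python
from datetime import date


def forward_fill(prices, trading_days):
    """Return a dict with an entry for every trading day.

    `prices` maps date -> close. Days without a record get the most recent
    earlier close. Days before the first known close are left out, because
    there is nothing earlier to carry forward.
    """
    filled = {}
    last = None
    for day in sorted(trading_days):
        if day in prices:
            last = prices[day]
        if last is not None:
            filled[day] = last
    return filled


days = [date(2018, 3, 12), date(2018, 3, 13), date(2018, 3, 14)]
prices = {date(2018, 3, 12): 10.0, date(2018, 3, 14): 11.0}  # no record on the 13th
print(forward_fill(prices, days)[date(2018, 3, 13)])
```

Carrying a value forward is only one possible policy, and it never uses information from after the day being patched, which keeps the result consistent with what the market knew on that day.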

Overwritten Data

As described previously, when data is overwritten, its history is destroyed. While data providers may be overwriting their data and destroying their history, we never overwrite ours; we keep every version. We have been doing this since 2011 and therefore have a more accurate historical record than our data providers.

Changes That Affect History

There are some corporate actions that propagate into the past, thereby changing history. The best example is a stock split. In a 2:1 stock split, the price will halve AND the number of shares will double. Because the price changed, adjusted prices in the past must be changed. More subtly, the change in the number of shares also affects items on the financial statements such as earnings per share (EPS). The earnings are constant, but the number of shares (the denominator) changed; therefore, historical EPS needs to be adjusted in order to do any kind of trend analysis on EPS. At the time of the split, companies usually amend their filings, typically going back three years. If all goes well, all of these changes happen instantaneously and correctly. In our experience it does not always go well, which then requires intervention. This intervention can itself get confusing, since we have several moving targets: the split event, adjusted pricing, and financial filings. Determining which of those have been correctly adjusted, and when to intervene, depends on which day we are asking the question “What did the market know?” and how far back the market is looking. It is a rather complicated process.
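The arithmetic of the adjustment itself is straightforward, as the sketch below shows with made-up numbers. The function name is hypothetical, and the example covers only the back-adjustment of per-share values for a single split; it does not attempt the harder part described above, which is deciding which sources have already been adjusted.

```python
def adjust_for_split(values, split_index, ratio):
    """Back-adjust per-share values for a split.

    Values before `split_index` are divided by `ratio` (2.0 for a 2:1 split);
    values from the split onward are already on the new share basis.
    """
    return [v / ratio if i < split_index else v for i, v in enumerate(values)]


# Hypothetical closing prices; the 2:1 split takes effect on the third day.
closes = [100.0, 102.0, 51.5, 52.0]
print(adjust_for_split(closes, split_index=2, ratio=2.0))

# EPS: earnings are constant, but the share count doubles at the split.
earnings = 1_000_000.0
eps_as_reported = [earnings / 500_000, earnings / 500_000, earnings / 1_000_000]
print(adjust_for_split(eps_as_reported, split_index=2, ratio=2.0))
```

After adjustment the EPS series is flat, which is the correct trend: the company earned the same amount in every period, and only the denominator changed.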

Missing or Invalid Foreign Keys

A foreign key is a field in one file (or table) that refers to the unique primary key in another file (or table). If this foreign key is missing or invalid, then the data becomes orphaned (detached) from related data, which makes it unusable. The most critical broken foreign key we have encountered and resolved is the ticker symbol in the company file, which relates the company to the proper ticker in the stock file. Some data providers store only the current ticker in the company file, which means that if the ticker ever changes, history is broken. We have developed a heuristic algorithm to reconstruct a company’s ticker and price history. The algorithm deals with missing tickers, wrong tickers, and changed tickers. It searches through the list of possible tickers in a way similar to how a human would try to resolve these issues. Humans are good at looking at inconsistently labeled data and inferring the correct answer. They do this by applying a set of heuristics (rules of thumb) to find the most likely match between company and ticker. Our algorithm applies those same “rules of thumb” to resolve issues across all companies every day.
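The sketch below illustrates the general idea of scoring candidate tickers with rules of thumb. It is not our algorithm: the two rules used here (similarity of names, and whether the ticker traded during the period the company was filing), the equal weighting, the field names, and the sample data are all assumptions made for the example. Name similarity uses `difflib.SequenceMatcher` from the Python standard library.

```python
from datetime import date
from difflib import SequenceMatcher


def score(company, candidate):
    """Score how plausibly `candidate` is the company's ticker (higher is better)."""
    # Rule 1: the names should look alike (0.0 to 1.0).
    name_sim = SequenceMatcher(
        None, company["name"].lower(), candidate["name"].lower()).ratio()
    # Rule 2: the ticker should have traded while the company was filing.
    overlaps = (candidate["first_day"] <= company["last_filing"]
                and candidate["last_day"] >= company["first_filing"])
    return name_sim + (1.0 if overlaps else 0.0)


def best_ticker(company, candidates):
    """Return the ticker of the highest-scoring candidate, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(company, c))["ticker"]


# Hypothetical company whose ticker link is missing, and three candidates.
company = {"name": "Example Holdings Inc",
           "first_filing": date(2004, 1, 1), "last_filing": date(2008, 12, 31)}
candidates = [
    {"ticker": "EXH", "name": "Example Holdings",
     "first_day": date(2003, 6, 2), "last_day": date(2009, 1, 30)},
    {"ticker": "ZZZ", "name": "Zenith Zinc",
     "first_day": date(2003, 6, 2), "last_day": date(2009, 1, 30)},
    {"ticker": "EXHO", "name": "Example Holdings Inc",
     "first_day": date(2012, 1, 3), "last_day": date(2015, 12, 31)},
]
print(best_ticker(company, candidates))
```

Here the exact name match loses because that ticker did not trade during the company's filing period. No single rule decides the match, which is the point of combining several.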

Point-in-Time Data

In order for our system to understand the state of a public company at a point in time, it must have financial data and pricing information that were accurate as of that time. This means that we need to know which version of the financial statement (which can be amended) is associated with which pricing date. This turns out to be an extremely complicated problem. The nature of the problem is that history can change. Companies can file amendments or restatements to their historical filings, thereby giving investors a new version of history. This version of history can potentially change every day. Typically, it only changes quarterly or during corporate events, but because we don’t know when those can happen, we have to design our systems to allow for each day to have its own perspective on history. This is the only way that we can accurately ask the question “What did the market know on a given day?”

Let’s say that we are interested in the quarterly trailing twelve-month EPS history for a company over the past three years. That’s 12 quarters. Each one of those quarters can have multiple amendments or restatements. Each one of those quarters may have official filings (e.g. 10-Q) or preliminary filings (e.g. 8-K). EPS is especially prone to historical change because it must be adjusted when a stock splits. So we can’t simply ask the question “What was the EPS for Q2 of 2015?” because we will get many answers. We must always ask these questions from the perspective of a particular day. So we can ask the question “What was the EPS for Q2 of 2015 as of March 16, 2018?” Once we fix our temporal perspective, we can then travel back in time from that point, applying all the amendments that had been filed up to that point. Because we are interested in analyzing three years, we need to do this for each of the 12 quarters. To make things even more complicated, some algorithms may consider only 10-Qs, while others may allow 8-Ks to be considered.
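A stripped-down version of this kind of “as of” question can be sketched as follows. The tuple layout, the function name, and all EPS figures and filing dates are made up for illustration, and the sketch ignores split adjustment entirely. It shows only the two choices described above: the day from which we look back, and which filing types are allowed.

```python
from datetime import date

# (fiscal_period, filed_on, form, eps) -- hypothetical filings for one company
filings = [
    ("2015Q2", date(2015, 7, 20), "8-K", 1.10),     # preliminary figure
    ("2015Q2", date(2015, 8, 5), "10-Q", 1.08),     # official filing
    ("2015Q2", date(2016, 2, 10), "10-Q/A", 1.02),  # later amendment
]


def eps_as_of(filings, period, as_of, allowed_forms):
    """Return the EPS for `period` as known on `as_of`, or None if nothing was filed.

    Only filings of an allowed form filed on or before `as_of` are visible;
    among those, the most recently filed one wins.
    """
    known = [f for f in filings
             if f[0] == period and f[1] <= as_of and f[2] in allowed_forms]
    if not known:
        return None
    return max(known, key=lambda f: f[1])[3]


all_forms = {"8-K", "10-Q", "10-Q/A"}
official_only = {"10-Q", "10-Q/A"}

print(eps_as_of(filings, "2015Q2", date(2015, 7, 31), all_forms))
print(eps_as_of(filings, "2015Q2", date(2015, 7, 31), official_only))
print(eps_as_of(filings, "2015Q2", date(2018, 3, 16), all_forms))
```

The same quarter yields three different answers depending on the day and the filing types allowed, so the as-of date has to be part of every query.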

Point-in-time data is a multi-dimensional problem: the number of combinations to inspect is the number of years and/or quarters times the number of amendments times the number of types of filings. So in theory, each day can have a different perspective on history.


Financial data is inherently messy because history can change with each amendment, stock split, and ticker change. Data providers are only making the problem worse by not properly maintaining historical values. Valspresso has invested heavily in resolving these issues in its own databases. It’s been a long, expensive, and complicated undertaking, but well worth the investment, as demonstrated by the efficacy of our machine learning models and backtesting results.
