Data mismatch between Fundamental data sources

Course Name: [Data

Hi,

  1. I compared the fundamental data across two or three data sources and I see differences. How should we handle such cases?

  2. Many features in the fundamental data have a lot of missing values (more than 10-20%). How should we handle this?

  • Does forward filling help?

Regards
Shreyas Balakrishna

Hi Shreyas,

Approach to Resolve Discrepancies

A. Prioritize Official Sources (If Accuracy is Critical)
  • Verify with SEC filings (10-K, 10-Q) or equivalent official reports.
  • If discrepancies exist, trust official filings over third-party data vendors.
B. Ballpark Estimate (If Precision is Not Critical, e.g., for ML modeling or quick analysis)

For each fundamental metric (e.g., revenue, EPS) for a single company and period:

  1. Gather fundamental values from different sources (typically 3-4 values).
  2. Compute the max/min ratio: Ratio = max(values) / min(values).
  3. Set a threshold (e.g., 1.1 or 1.2 based on acceptable variance).
  4. Compare the ratio with the threshold:
    • If Ratio > Threshold, verify with official filings.
    • If Ratio ≤ Threshold, take the mean of available values as a reasonable estimate.
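
As a rough sketch, the ballpark check above could be written in Python as follows. The function name `reconcile`, the default threshold of 1.1, and the decision to flag zero or negative values for manual review (a max/min ratio is not meaningful when a value is zero or the signs differ, e.g. negative EPS) are assumptions made for illustration.

```python
from statistics import mean


def reconcile(values, threshold=1.1):
    """Reconcile one metric for one company and one period.

    Returns (estimate, needs_review). `values` holds the figures reported
    by the different sources; None marks a source with no value.
    """
    available = [v for v in values if v is not None]
    if not available:
        return None, True  # nothing to compare, check the filing

    lowest, highest = min(available), max(available)
    if lowest <= 0:
        # Assumption: the ratio test only makes sense for positive values,
        # so anything else goes to manual verification.
        return None, True

    ratio = highest / lowest
    if ratio > threshold:
        return None, True  # sources disagree too much, verify officially

    # Sources agree within the tolerance: use their mean as the estimate.
    return mean(available), False


# Hypothetical revenue figures from three vendors (made-up numbers).
estimate, needs_review = reconcile([100.0, 102.0, 101.0], threshold=1.1)
```

Here the three values are within 2% of each other, so the mean is used and nothing is flagged. With values such as 100.0 and 130.0 the ratio would be 1.3, and the metric would be flagged for a check against the official filing.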

Ok thanks

Hi Shreyas,

For query 2:
If the fundamental data has many missing values, here’s how to handle it:

  • Try to update missing values using official sources.

  • Perform exploratory data analysis (EDA) to look for patterns:

    • Which companies have missing values?
    • Are there specific periods when this happens?
    • Are certain features consistently missing across companies?
  • If a feature has more than X% missing data, consider dropping it unless it’s crucial.

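  • Forward fill (ffill) is not ideal for this case: with this many missing values, it would carry old values forward across long gaps, so the filled numbers can be stale.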

  • Some models like XGBoost can handle missing values internally, so imputation may not always be necessary.

  • Create a custom score using the Z-score approach within the same period across the industry.

    • Standardize key metrics (e.g., ROE, EBITDA margin) using Z-scores.
    • Skip NA values and compute the average Z-score of the available metrics.
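  • Consider using IterativeImputer from scikit-learn (it is still experimental, so it has to be enabled explicitly before the import):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer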
  • When imputing, include additional features like:
    • % change in stock price for the respective period (e.g., quarter).
    • Industry averages for the same and other related metrics.
    • Financial ratios to improve imputation accuracy.
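
A minimal sketch of how the missing-value steps above might fit together: measure the share of missing values per feature, drop features above a cutoff, and impute the rest with `IterativeImputer`. The function name `drop_and_impute`, the 50% default cutoff (standing in for the "X%" above), and the assumption that every column is numeric are illustrative choices. Helper columns such as the price change or industry averages would simply be extra columns in the same DataFrame.

```python
import pandas as pd
# IterativeImputer is experimental in scikit-learn; this import enables it
# and must come before the import from sklearn.impute.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def drop_and_impute(df, max_missing=0.5, random_state=0):
    """Drop sparse features, then impute the remaining gaps.

    df: numeric DataFrame, one row per company-period, one column per
        feature (fundamentals plus any helper features).
    max_missing: features with a larger share of NaN are dropped.
    Returns (imputed DataFrame, share of missing values per original column).
    """
    if not 0 <= max_missing < 1:
        raise ValueError("max_missing must be in [0, 1)")

    # EDA step: which features are missing most often?
    missing_share = df.isna().mean()

    # Drop features that are missing too often to be imputed sensibly.
    kept = df.loc[:, missing_share <= max_missing]

    # Each feature with gaps is modelled from the other columns in turn.
    imputer = IterativeImputer(max_iter=10, random_state=random_state)
    imputed = pd.DataFrame(
        imputer.fit_transform(kept), columns=kept.columns, index=kept.index
    )
    return imputed, missing_share
```

The same `missing_share` idea can be applied per company or per period (for example with `df.isna().groupby(...).mean()`) to look for the patterns mentioned above. If a sparse feature is crucial, it would need to be handled separately rather than dropped.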

Also, the best approach depends on the model you’re building.
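
For the Z-score approach mentioned above, a minimal pandas sketch could look like this. The column names `period` and `industry`, the function name `composite_zscore`, and the sample numbers are assumptions for illustration. Each metric is standardized within its period and industry group, and the score is the average of whichever Z-scores are available for that row.

```python
import numpy as np
import pandas as pd


def composite_zscore(df, metrics, group_cols=("period", "industry")):
    """Average Z-score of the available metrics for each row.

    Z-scores are computed within each (period, industry) group.
    Missing metrics are skipped rather than imputed.
    """
    metrics = list(metrics)
    grouped = df.groupby(list(group_cols))[metrics]

    group_mean = grouped.transform("mean")  # NaN values are ignored
    # A zero spread would make the Z-score undefined, so treat it as missing.
    group_std = grouped.transform("std").replace(0, np.nan)

    z = (df[metrics] - group_mean) / group_std
    # Row-wise mean over the metrics that are present.
    return z.mean(axis=1, skipna=True)


# Made-up example: three companies, same period and industry.
sample = pd.DataFrame({
    "period": ["2023Q4"] * 3,
    "industry": ["tech"] * 3,
    "roe": [1.0, 2.0, 3.0],
    "ebitda_margin": [10.0, None, 30.0],
})
sample["score"] = composite_zscore(sample, ["roe", "ebitda_margin"])
```

The second company has no EBITDA margin, so its score is based on ROE alone. Note that a group with only one company yields NaN, since a standard deviation cannot be estimated from a single value.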

Sure. Thanks a lot!