Query on scaling and centering

Shrirang_Ashok_Yadav_BD6F1 · November 3, 2022, 6:46pm

In the machine learning "Scale the data" module, what does it mean that our dataset has feature values?

What are feature values? What is their significance, and what do they do?

What is the meaning of "Centring reduces the mean value of the features to 0. Scaling refers to dividing each entry by the standard deviation of the data. This transforms the standard deviation of the features to 1"?

varun_kumar_pothula · November 4, 2022, 6:30pm

Hello Shrirang,

In general, machine learning algorithms/models take input data to generate output. This input data is known as features. The features can be raw data or data extracted from the raw data.

For example, consider using a machine learning model to predict stock prices.

The OHLCV (Open, High, Low, Close, Volume) data can be the raw data. In this case, you can either use the raw OHLCV data directly as features, or you can extract features such as moving averages, technical indicators, etc. from it. Creating features this way pulls out information from the raw data that helps to train the model and improves the price prediction.
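As a rough illustration of extracting a feature from raw data, here is a minimal Python sketch that computes a 3-day simple moving average from close prices. The prices and the helper name `simple_moving_average` are made up for this example.

```python
def simple_moving_average(values, window):
    """Trailing moving average; None until `window` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Made-up close prices, used only for illustration.
close = [100.0, 102.0, 101.0, 105.0, 107.0]

# The extracted feature: one value per day, usable as a model input.
sma_3 = simple_moving_average(close, 3)
print(sma_3)
```

The first two entries are `None` because a 3-day average needs three days of history.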

These features can be divided into three types: quantitative, ordinal, and categorical.

For example, using close price data, you can calculate the returns and extract a feature. This is a quantitative feature since it is a numerical value.

Similarly, you can categorise the day as 'positive' if returns are greater than zero and 'negative' if returns are less than zero to create a feature that represents the daily sentiment. This is a categorical feature since there is no intrinsic ordering to the values of this feature.

Along similar lines, you can also create a feature that represents the degree of sentiment, like 'highly positive' when returns are greater than 2% and 'highly negative' when returns are less than -2%. This is an ordinal feature, since the categories have a clear ordering even though the values are non-numerical.
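The three feature types above can be sketched in plain Python as follows. The prices are made up, the function names are hypothetical, and two choices are my own assumptions for the sketch: a zero return is lumped in with 'negative', and returns between -2% and 2% get the intermediate labels 'negative' and 'positive'.

```python
# Made-up close prices, used only for illustration.
close = [100.0, 103.0, 102.0, 99.0, 99.5]

# Quantitative feature: simple daily returns (numerical values).
returns = [close[i] / close[i - 1] - 1 for i in range(1, len(close))]


def daily_sentiment(r):
    """Categorical feature: the label only, with no ordering implied."""
    # Assumption: a zero return is treated as 'negative' here.
    return "positive" if r > 0 else "negative"


def sentiment_degree(r, threshold=0.02):
    """Ordinal feature: labels with a clear ordering."""
    if r > threshold:
        return "highly positive"
    if r > 0:
        return "positive"
    if r < -threshold:
        return "highly negative"
    return "negative"


# Rank of each ordinal label, from lowest to highest.
DEGREE_RANK = {
    "highly negative": 0,
    "negative": 1,
    "positive": 2,
    "highly positive": 3,
}

sentiments = [daily_sentiment(r) for r in returns]
degrees = [sentiment_degree(r) for r in returns]
ranks = [DEGREE_RANK[d] for d in degrees]

print(returns)
print(sentiments)
print(degrees, ranks)
```

The `DEGREE_RANK` mapping is what makes the last feature ordinal: the labels can be sorted, whereas the 'positive'/'negative' labels of the categorical feature are only names here.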

Different types of features are created to extract more information from the raw data.

Take the simple case where the input features are the OHLCV data. Volume values are generally in the range of 100,000 to millions, whereas O, H, L, and C are generally below 1,000. So, as you can guess, the volume data can have a much higher influence on the model than the open, high, low, and close prices, simply because its values are of a much larger magnitude.

To avoid this situation, where different features have different scales, you need to scale (standardize) the features before you input the data into the model. The standard way of scaling is to subtract the feature's mean from each data point and divide the result by the feature's standard deviation. Subtracting the mean centres the data, so the mean of the resulting data is zero, and dividing by the standard deviation makes the standard deviation of the resulting data 1. This takes away the effect of the scale/unit of measurement on the features.
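Here is a minimal sketch of centring and scaling in plain Python, assuming made-up close and volume values. The function name `standardize` is hypothetical, and I use the population standard deviation (`pstdev`); the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Centre to mean 0 and scale to standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - mu) / sigma for v in values]


# Made-up data: close prices are in the hundreds, volume is in the millions.
close = [100.0, 102.0, 101.0, 105.0, 107.0]
volume = [1_200_000, 950_000, 1_500_000, 2_100_000, 1_750_000]

close_z = standardize(close)
volume_z = standardize(volume)

# After standardizing, both features live on the same scale.
print(close_z)
print(volume_z)
print(mean(close_z), pstdev(close_z))    # approximately 0 and 1
print(mean(volume_z), pstdev(volume_z))  # approximately 0 and 1
```

Each feature is standardized separately with its own mean and standard deviation. In practice these are usually computed on the training data only and then reused to transform the test data.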

Please read this article for a much deeper understanding of feature scaling.

Hope this helps!