Automatic Predictive Model Constructor (APMC) Documentation
- GENERAL WORKFLOW
- DATA FORMAT AND PREPARATION
- ACCURACY THRESHOLD
- EXTRAPOLATION RISK
- STATISTICAL REPORT
- EXAMPLE DATA SETS
APMC is a module of the Gene-Calc application, available from www.gene-calc.pl/apmc.
The aim of APMC is to automatically train, develop and evaluate supervised machine learning models for regression and/or classification purposes. An additional functionality is the generation of a report containing descriptive statistics, correlation charts and, where applicable, a decision tree graph.
2/ GENERAL WORKFLOW
Technically, the module's workflow is split into two steps. In the first step users need to upload a data set [data set format requirements in section no. 3] and then select:
- type of model purpose [classification or regression]
- normalization [True or False] [more information in section no. 4]
Subsequently, pre-models are trained in parallel. In the case of classification models, these are the algorithms:
- K-nearest neighbours
- Logistic regression
- Support vector machines
- Random forest classification
whereas in case of regression models, these are algorithms:
- Simple linear regression
- Lasso linear regression
- Ridge linear regression
- Random forest regression
The last part of the first step is model selection: users need to select one model from the above-mentioned lists. The decision is based on model metrics; APMC suggests which model is the most accurate and should be chosen, based on the CV metric [more information in the metrics section below].
For classification models, the metrics comprise:
- Mean of cross-validation scores
- Classification report
while for regression models, the metrics comprise:
- Mean of cross-validation scores
- Mean squared error (MSE)
- Mean absolute error (MAE)
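As a sketch of how this comparison works, assuming scikit-learn (which the metric definitions later reference); the toy data and the reduced candidate list are illustrative, not APMC's actual code:

```python
# Compare candidate classifiers by mean 5-fold cross-validation score;
# the candidate with the highest mean is the one APMC would suggest.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

candidates = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}

scores = {name: cross_val_score(est, X, y, cv=5).mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)  # suggested model
```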
Once users select the model considered best for their purposes, the second step of the workflow begins. The chosen model is passed to a grid search for hyperparameter optimization. The best hyperparameters obtained are used to train the final model, which is evaluated and then used to make predictions.
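The second step can be sketched as follows, assuming scikit-learn's GridSearchCV; the model, parameter grid and toy data are illustrative only:

```python
# Grid search over hyperparameters of the chosen model, then evaluation
# of the refit best estimator on a held-out validation split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X_tr, y_tr)                 # finds the best hyperparameters
final_model = grid.best_estimator_   # refit on the full training split
accuracy = final_model.score(X_val, y_val)
```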
3/ DATA FORMAT AND PREPARATION
Uploaded data must have a *.csv, *.xls or *.xlsx extension; the data in the file [data set] need to be constructed as shown in the pictures below.
Predictors are the values used to predict the target value. Column names are required and are used as the names of the predictor and target variables. Only numbers are allowed as predictors; in the case of categorical data, users must convert it to corresponding numeric labels.
The target value may include strings in the case of classification models; in the case of regression, only numbers are allowed.
The allowed number of predictor columns is 10, the allowed number of target value columns is 1, and the allowed number of rows/records is 5000.
If the data structure and size conform to the rules above, the data set is split into two parts: 70% is used to train the models and obtain the cross-validation score, and 30% is used as a validation set on which all other metrics (excluding the cross-validation score) are calculated.
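The 70/30 split can be sketched with scikit-learn's train_test_split (an assumption; the documentation does not name the implementation):

```python
# Hold out 30% of the records as the validation set; the remaining 70%
# is used for training and cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy predictors, 50 records
y = np.arange(50)                   # toy target

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=0)
```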
4/ ACCURACY THRESHOLD WARNING
When choosing the model to enter the grid search, APMC suggests which model is probably the most suitable. At the same time, it indicates which models are probably not acceptable: models that have a lower cross-validation score than dummy models are considered not useful.
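A minimal sketch of this baseline rule, assuming scikit-learn; the models and toy data are illustrative:

```python
# A model whose mean cross-validation score does not beat a dummy
# (majority-class) baseline is flagged as not useful.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

model_cv = cross_val_score(LogisticRegression(max_iter=1000),
                           X, y, cv=5).mean()
dummy_cv = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X, y, cv=5).mean()

acceptable = model_cv > dummy_cv
```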
In some cases normalization may have a positive influence on model training. If normalization is True, predictor data is transformed by the following formula:
z = (x - u) / s
z - normalized value
x - raw value
u - mean
s - standard deviation
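Applied to a column of predictor values, the formula can be sketched as follows; scikit-learn's StandardScaler implements the same per-column transformation:

```python
# z-score normalization: subtract the mean, divide by the standard
# deviation, so the result has mean 0 and standard deviation 1.
import numpy as np

def normalize(x):
    """Return (x - u) / s for a 1-D array of predictor values."""
    u, s = x.mean(), x.std()
    return (x - u) / s

x = np.array([2.0, 4.0, 6.0, 8.0])
z = normalize(x)
```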
6/ EXTRAPOLATION RISK
If the predictor value(s) used to predict the target value are extrapolated, a warning is displayed. The implemented extrapolation definition:
if x > m + 3 * s
if x < m - 3 * s
x is extrapolated
x - input predictor value
m - mean of predictor values in data set
s - standard deviation of predictor values in data set
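A small helper illustrating this rule (names are illustrative, not APMC's internal code):

```python
# Flag an input value as extrapolated when it lies more than 3 standard
# deviations from the mean of that predictor's column in the data set.
import numpy as np

def is_extrapolated(x, column):
    """True when x > m + 3*s or x < m - 3*s for the given column."""
    m, s = column.mean(), column.std()
    return x > m + 3 * s or x < m - 3 * s

col = np.array([10.0, 11.0, 9.0, 10.5, 9.5])
inside = is_extrapolated(10.2, col)   # within the data range
outside = is_extrapolated(50.0, col)  # far beyond 3 sigma
```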
7/ METRICS
Metrics are measures of the model's prediction accuracy. The metric definitions follow the scikit-learn documentation:
-> Cross validation score (CV, k=5)
This metric is the average accuracy of the model trained k times, each time on (k-1) folds and evaluated on the remaining test fold.
-> Classification report
This report contains several of the more important metrics for classification model evaluation, including precision, recall and F1.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F-beta (F-1) score weights recall more than precision by a factor of beta. Beta == 1.0 means recall and precision are equally important.
The support is the number of occurrences of each class in y_true.
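These definitions can be checked by hand on a toy pair of label vectors:

```python
# Precision, recall, F1 and support computed directly from the counts
# of true positives (tp), false positives (fp) and false negatives (fn).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

precision = tp / (tp + fp)                          # 2/3
recall = tp / (tp + fn)                             # 2/3
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = sum(t == 1 for t in y_true)  # occurrences of the positive class
```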
-> R2 score
The R2 score is a regression score function. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R2 score of 0.0.
-> MAE - Mean absolute error
-> MSE - Mean squared error
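Both errors follow directly from their definitions; sklearn.metrics provides the same values as mean_absolute_error and mean_squared_error:

```python
# MAE and MSE on a toy validation set: the mean of the absolute errors
# and the mean of the squared errors, respectively.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

errors = [t - p for t, p in zip(y_true, y_pred)]
mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e * e for e in errors) / len(errors)   # mean squared error
```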
8/ STATISTICAL REPORT
A statistical report may be generated after model training; it contains:
a) descriptive statistics tables with:
- Standard deviation
- Min and Max values
b) Correlation plots with the Pearson correlation coefficient and p-value
c) Categorical violin plots in the case of classification models
d) A decision tree graph in the case of random forest models
e) Predictor coefficients in the case of random forest classification and regression models
[Figure: descriptive statistics tables]
[Figure: violin plot - features a kernel density estimation [kde] of the underlying distribution]
[Figure: coefficients of predictors]
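The Pearson coefficient used in the correlation plots can be computed from its definition; APMC's actual implementation, and the accompanying p-value, are assumed to come from a statistics library such as scipy.stats.pearsonr:

```python
# Pearson's r: covariance of the two variables divided by the product
# of their standard deviations; +1 means a perfect linear relationship.
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # perfectly linear data
```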
9/ EXAMPLE DATA SETS
Data used to generate the examples [source: Kaggle]:
for classification purposes:
for regression purposes:
--> USA Housing