class preprocess ( )

The class that preprocesses the input variables of each model group and initializes model properties such as the cross-validation and brute-force searching settings.


Parameters

(__init__ method of preprocess class)

tunedpars_rfr dic default={"min_weight_fraction_leaf":[0,0.02,0.04],"n_estimators":[50,100,150,200,250,300],"criterion": ["mse","mae"],"min_samples_split":[2,5] }

The dictionary that determines the brute-force searching parameters of Random Forest regression. For more details on the regression method inputs, refer to the sklearn library.
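As a rough illustration of what such a dictionary implies, the sketch below expands the default tunedpars_rfr grid into every parameter combination a brute-force search would have to score. The helper name expand_grid is hypothetical and only the standard library is used; how the class itself runs the search is not shown here.

```python
from itertools import product

# Default grid quoted from the tunedpars_rfr parameter above.
tunedpars_rfr = {
    "min_weight_fraction_leaf": [0, 0.02, 0.04],
    "n_estimators": [50, 100, 150, 200, 250, 300],
    "criterion": ["mse", "mae"],
    "min_samples_split": [2, 5],
}


def expand_grid(grid):
    """Hypothetical helper: list every parameter combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(tunedpars_rfr)
print(len(candidates))  # 3 * 6 * 2 * 2 = 72 combinations
print(candidates[0])    # first combination in grid order
```

Each extra list entry multiplies the number of fits, so the grid is worth trimming when data or time is limited. With this helper, an empty dictionary expands to a single empty combination.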


tunedpars_svr dic default={"kernel":[ "poly", "rbf", "sigmoid"],"C":np.logspace(-1, 1, 3),"gamma":np.logspace(-3, 1, 3) }

The dictionary that determines the brute-force searching parameters of Support Vector regression. For more details on the regression method inputs, refer to the sklearn library.


tunedpars_nusvr dic default={"kernel":["linear", "poly", "rbf", "sigmoid"] }

The dictionary that determines the brute-force searching parameters of Nu Support Vector regression. For more details on the regression method inputs, refer to the sklearn library.


tunedpars_mlp dic default={"activation" : [ "logistic", "tanh"],"solver" : ["lbfgs", "sgd", "adam"],"alpha":[0.0001,0.0003],"hidden_layer_sizes":[(50,)*2,(50,)*3,(50,)*4,(100,)*2,(100,)*3,(100,)*4],"max_iter":[1000],"n_iter_no_change":[10]}

The dictionary that determines the brute-force searching parameters of Multilayer Perceptron regression. For more details on the regression method inputs, refer to the sklearn library.


tunedpars_lr dic default={}

The dictionary that determines the brute-force searching parameters of Linear regression. For more details on the regression method inputs, refer to the sklearn library.


tunedpars_br dic default={}

The dictionary that determines the brute-force searching parameters of Bayesian Ridge regression. For more details on the regression method inputs, refer to the sklearn library.


tunedpars_ard dic default={}

The dictionary that determines the brute-force searching parameters of Bayesian ARD regression. For more details on the regression method inputs, refer to the sklearn library.


tunedpars_omp dic default={}

The dictionary that determines the brute-force searching parameters of Orthogonal Matching Pursuit regression. For more details on the regression method inputs, refer to the sklearn library.


tunedpars_elnet dic default={"l1_ratio":[.1, .5, .7,.9,.99]}

The dictionary that determines the brute-force searching parameters of ElasticNet regression (linear regression with combined L1 and L2 priors as regularizer). For more details on the regression method inputs, refer to the sklearn library.


tunedpars_muelnet dic default= {"l1_ratio":[.1, .5, .7,.9,.99]}

The dictionary that determines the brute-force searching parameters of Multi-task ElasticNet regression (trained with L1/L2 mixed-norm as regularizer). For more details on the regression method inputs, refer to the sklearn library.


which_regs dic default= {"muelnet":True, "rfr":True, "mlp":True, "elnet":True, "omp":True, "br":True, "ard":True, "svr":True, "nusvr":False}

The dictionary that determines which regression models are included in the cross-validation and brute-force searching process.


apply_on_log Boolean default=True

If True, the models are fitted both to the original values and to log(1 + x) (natural logarithm) of the data. If the regression scores on the log-transformed data are higher than those on the original values, the chosen method always calculates log(1 + x) of the data before fitting the models. In that case, the outputs are converted back to the original scale with exp(x) - 1, the inverse of log(1 + x).
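The transform pair described above can be checked with a short round trip. This is a minimal sketch using the standard library's math.log1p and math.expm1 on made-up sample values; it is not the class's own code.

```python
import math

# Made-up sample values, including a zero.
values = [0.0, 3.5, 120.0]

# Forward transform applied before fitting: log(1 + x).
logged = [math.log1p(v) for v in values]

# Inverse transform applied to model outputs: exp(y) - 1.
restored = [math.expm1(y) for y in logged]

print(logged)
print(restored)  # matches the original values up to floating-point error
```

Unlike a plain logarithm, log(1 + x) is defined at x = 0, so zero values survive the transform.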


cv int or str default="auto"

If cv="auto", the number of cross-validation folds is calculated automatically (max=10, min=2). This is beneficial when little data is available. If cv is an integer, it determines the number of folds of the cross-validation.


Methods

  • __init__ (self)

  • fit(self, inp_var, var_name, fields, direc, remove_outliers=True, write_outliers_input=True, year_type="all", inc_zeros_inp_var=False, write_integrated_data=True, q1=0.05, q3=0.95, IQR_inp_var=True, IQR_rat_inp_var=3, mean_mode_inp_var="arithmetic", elnino=None, lanina=None)

  • model_pars(self, **kwargs)


Attributes

model_pars_name_dic dic

A dictionary that stores the brute-force searching parameters of all models. The keys are the dictionary names and the values are the corresponding parameter dictionaries.

EXAMPLE:


key: "tunedpars_rfr"

Value: tunedpars_rfr= {"min_weight_fraction_leaf":[0,0.04],"n_estimators":[150,200],"criterion": ["mse"] }

IMPORTANT NOTE: The 'model_pars' method has to be used to change the brute-force searching parameters.


db_input_args_dics dic

A dictionary that stores input parameters of the fit method


direc str

The directory of the data preprocessing class output


month_grouped_inp_var list

A list of monthly grouped data. Each element of the list is a dataframe containing the grouped data of a specific month.


preprocess.fit ( )

preprocess.fit(self, inp_var, var_name, fields, direc, remove_outliers=True, write_outliers_input=True, year_type="all", inc_zeros_inp_var=False, write_integrated_data=True, q1=0.05, q3=0.95, IQR_inp_var=True, IQR_rat_inp_var=3, mean_mode_inp_var="arithmetic", elnino=None, lanina=None)

The method to preprocess the input data of the models


Parameters

inp_var Pandas dataframe

Input pandas dataframe containing the dependent value, a unique ID for each sample, and the date. The dataframe must contain the columns "Value", "ID" and "Date".


var_name str

Name of the dependent variable.


fields list of strings

List of independent variable names. Each name must exist as a column in the inp_var dataframe.
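To make the expected layout concrete, the sketch below builds a small dataframe with the three required columns and checks that every name in fields is present. The sample values, the independent-variable names ("temperature", "elevation") and the datetime type of the "Date" column are assumptions made for illustration only.

```python
import pandas as pd

# Made-up sample data; only "Value", "ID" and "Date" are required column names.
inp_var = pd.DataFrame({
    "ID": ["st01", "st01", "st02"],
    "Date": pd.to_datetime(["2015-01-15", "2015-02-15", "2015-01-15"]),  # dtype assumed
    "Value": [12.4, 9.8, 15.1],
    "temperature": [4.2, 5.0, 3.1],   # hypothetical independent variable
    "elevation": [1200, 1200, 1850],  # hypothetical independent variable
})
fields = ["temperature", "elevation"]

# Check the columns before handing the dataframe to fit().
required = ["Value", "ID", "Date"] + fields
missing = [col for col in required if col not in inp_var.columns]
print(missing)  # an empty list means every expected column is present
```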


direc str

The directory of the data preprocessing class output


remove_outliers Boolean default=True

Whether to remove outliers based on the outlier-related parameters below. If False, the parameters related to removing outliers are ignored.


write_outliers_input Boolean default=True

Effective if remove_outliers=True. Generates an .xls file in the directory folders of the class after removing outliers.


year_type str default="all"

One of "all", "elnino" or "lanina". Use "elnino" or "lanina" when only the elnino or lanina years have to be selected from the database.


inc_zeros_inp_var Boolean default=False

Effective if remove_outliers=True. Outlier removal can be done without considering the zero values in the database. (The zero values are separated, outliers are removed, and then the zero values are added back to the database.)


write_integrated_data Boolean default=True

Generates two .xls outputs: the integrated data and the quantity of data in each month.


q1 float default=0.05

Effective if remove_outliers=True and IQR_inp_var=False. Lower percentile limit to determine the outliers.


q3 float default=0.95

Effective if remove_outliers=True and IQR_inp_var=False. Upper percentile limit to determine the outliers


IQR_inp_var Boolean default=True

Effective if remove_outliers=True. Determines the upper limit of the outliers using this formula: $X_{q_{0.75}} + IQR_{rat} \cdot |X_{q_{0.25}} - X_{q_{0.75}}|$

Lower limit = q1


IQR_rat_inp_var float default=3

Effective if remove_outliers=True and IQR_inp_var=True. This parameter is used in $X_{q_{0.75}} + IQR_{rat} \cdot |X_{q_{0.25}} - X_{q_{0.75}}|$ to determine the upper boundary limit.
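The upper-limit formula can be worked through on a small sample. The sketch below is only an illustration: the function name upper_outlier_limit is hypothetical, the quartiles come from the standard library's statistics.quantiles with the "inclusive" method (the interpolation rule the class actually uses is not stated here), and treating values equal to the limit as valid is likewise an assumption.

```python
import statistics


def upper_outlier_limit(values, iqr_ratio=3):
    """Hypothetical helper: Q3 + iqr_ratio * |Q1 - Q3| for the given values."""
    # Quartile interpolation method is an assumption, not taken from the class.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q75 + iqr_ratio * abs(q25 - q75)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
limit = upper_outlier_limit(data, iqr_ratio=3)

# Keep only the values at or below the upper limit.
kept = [v for v in data if v <= limit]
print(limit)
print(kept)
```

A larger IQR_rat_inp_var widens the accepted range, so fewer high values are treated as outliers.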


mean_mode_inp_var str default="arithmetic"

Data averaging method. Available options are "arithmetic" and "geometric".
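The difference between the two averaging options shows up clearly on skewed data. The sketch below compares them with the standard library's statistics module on made-up values; it does not reproduce how the class groups or averages its data.

```python
import statistics

# Made-up, strongly skewed sample values.
sample = [1.0, 10.0, 100.0]

arithmetic = statistics.mean(sample)            # sum of values / count
geometric = statistics.geometric_mean(sample)   # n-th root of the product

print(arithmetic)
print(geometric)
```

The geometric mean is pulled less by the largest value, and it is only meaningful for positive data.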


elnino None type or list of integers default=None

List of elnino years


lanina None type or list of integers default=None

List of lanina years


Attributes

db_input_args_dics dic

A dictionary that stores input parameters of the fit method


direc str

The directory of the data preprocessing class output


month_grouped_inp_var list

A list of monthly grouped data. Each element of the list is a dataframe containing the grouped data of a specific month.


preprocess.model_pars ( )

preprocess.model_pars(self, **kwargs)

Changes the brute-force searching parameters that are already defined in the class.


EXAMPLE:

# Define the new brute-force searching parameter dictionaries:

# brute-force searching parameters for random forest
brutesearch_ran_for_dic = {"min_weight_fraction_leaf": [0, 0.04], "n_estimators": [150, 200], "criterion": ["mse"]}

# brute-force searching parameters for elastic net
brutesearch_elasticnet_dic = {"l1_ratio": [0.5, 0.9]}

# change the brute-force searching parameters
# (prep_class is an existing instance of the preprocess class;
# keyword argument names are written without quotes)
prep_class.model_pars(tunedpars_rfr=brutesearch_ran_for_dic, tunedpars_elnet=brutesearch_elasticnet_dic)
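For readers unfamiliar with **kwargs, the sketch below shows one way a method with this signature could route keyword arguments into the stored dictionaries. PreprocessSketch is a hypothetical stand-in written for illustration, not the library's implementation, and the error raised for unknown names is an assumption.

```python
class PreprocessSketch:
    """Hypothetical stand-in for the real class; illustrates **kwargs only."""

    def __init__(self):
        # Mirrors the idea of model_pars_name_dic: names mapped to grids.
        self.model_pars_name_dic = {
            "tunedpars_rfr": {"n_estimators": [50, 100]},
            "tunedpars_elnet": {"l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.99]},
        }

    def model_pars(self, **kwargs):
        # Each keyword name selects a stored dictionary; its value replaces it.
        for name, grid in kwargs.items():
            if name not in self.model_pars_name_dic:
                raise KeyError(f"unknown parameter dictionary: {name}")
            self.model_pars_name_dic[name] = grid


prep = PreprocessSketch()

# Keyword form: the argument name is a bare identifier, not a quoted string.
prep.model_pars(tunedpars_elnet={"l1_ratio": [0.5, 0.9]})

# Dictionary form: string keys are unpacked into keyword arguments with **.
new_grids = {"tunedpars_rfr": {"n_estimators": [150, 200]}}
prep.model_pars(**new_grids)

print(prep.model_pars_name_dic)
```

The dictionary form is convenient when the names of the grids to change are only known at run time.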