Model complexity

!wget --no-cache -O init.py -q https://raw.githubusercontent.com/jdariasl/ML_2020/master/init.py
import init; init.init(force_download=False); 

Julián D. Arias Londoño

Associate Professor
Department of Systems Engineering
Universidad de Antioquia, Medellín, Colombia
julian.ariasl@udea.edu.co

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Suppose we have a dataset for a classification problem

from library.regularization import plot_ellipse
Cov = np.identity(2) * 1.1
Cov2 = np.array([[1.1,0.5],[0.5,1.1]])
Mean = [1.1,2.1]
Mean2 = [4.1,4.1]
ax = plt.subplot(111)
x, y  = np.random.multivariate_normal(Mean, Cov, 100).T
x2, y2  = np.random.multivariate_normal(Mean2, Cov2, 100).T
ax.plot(x,y,'o',alpha= 0.5)
ax.plot(x2,y2,'o',alpha= 0.5)
ax.axis('equal')

plot_ellipse(ax,Mean,Cov,color='b')
plot_ellipse(ax,Mean2,Cov2)
plt.grid()

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_5_0.png

What would the classification boundary look like using a Gaussian discriminant function (FDG)?

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
N = 100
x, y  = np.random.multivariate_normal(Mean, Cov, N).T
x2, y2  = np.random.multivariate_normal(Mean2, Cov2, N).T
X = np.r_[np.c_[x,y],np.c_[x2,y2]]
Y = np.r_[np.ones((N,1)),np.zeros((N,1))]
clf = QuadraticDiscriminantAnalysis()
clf.fit(X,Y.flatten())
plt.scatter(X[:,0],X[:,1],c=Y.flatten(), cmap='Set2',alpha=0.5)

h = .02  # step size in the mesh
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, cmap=plt.cm.Blues)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.grid()

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_7_0.png

from library.regularization import Fronteras
clf = QuadraticDiscriminantAnalysis()
Fronteras(clf,100,10)

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_8_0.png

And if we change the type of model…

from library.regularization import Kernel_classifier
clf = Kernel_classifier(bandwidth=0.5)
Fronteras(clf,100,5)

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_10_0.png
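`Kernel_classifier` comes from the course's `library.regularization` module and its implementation is not shown here. As a rough illustration of how a kernel-based classifier depends on its bandwidth, the sketch below builds a Parzen-window (kernel density) classifier on top of scikit-learn's `KernelDensity`. The class name `ParzenClassifier`, the random seed, and the bandwidth values are assumptions made for this example, and the course's classifier may work differently.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

class ParzenClassifier:
    """Hypothetical kernel density classifier: one KDE per class plus class priors."""
    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One Gaussian kernel density estimate per class
        self.models_ = [KernelDensity(bandwidth=self.bandwidth).fit(X[y == c])
                        for c in self.classes_]
        self.log_priors_ = np.log([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        # log p(x | class) + log p(class), then pick the largest
        log_post = np.column_stack([m.score_samples(X) for m in self.models_])
        log_post = log_post + self.log_priors_
        return self.classes_[np.argmax(log_post, axis=1)]

# Same two Gaussian classes as in the example above (seed chosen arbitrarily)
rng = np.random.default_rng(0)
N = 100
X = np.r_[rng.multivariate_normal([1.1, 2.1], np.identity(2) * 1.1, N),
          rng.multivariate_normal([4.1, 4.1], [[1.1, 0.5], [0.5, 1.1]], N)]
Y = np.r_[np.ones(N), np.zeros(N)]

train_acc = {}
for bw in (0.05, 0.5, 2.0):
    clf = ParzenClassifier(bandwidth=bw).fit(X, Y)
    train_acc[bw] = np.mean(clf.predict(X) == Y)
    print(f"bandwidth={bw}: training accuracy = {train_acc[bw]:.3f}")
```

A small bandwidth lets the estimated densities follow individual training points, which gives a more irregular boundary; a large bandwidth smooths the densities and the boundary with them.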

Look up: criteria for selecting models according to their complexity:

Akaike Information Criterion
Bayesian Information Criterion
Minimum Description Length

What limitations do these criteria have?
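As a starting point for that exercise, here is a minimal sketch of the first two criteria using their standard definitions, AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of fitted parameters, n the number of samples and L the maximized likelihood. It assumes a polynomial regression with Gaussian noise fitted by least squares with `np.polyfit`; the helper name `gaussian_aic_bic`, the data, and the seed are assumptions made for this example.

```python
import numpy as np

def gaussian_aic_bic(y, y_hat, n_coefs):
    """AIC and BIC for a least-squares fit, assuming i.i.d. Gaussian residuals."""
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    sigma2 = rss / n                      # maximum-likelihood noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = n_coefs + 1                       # polynomial coefficients + noise variance
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

# Noisy sine data (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x = np.linspace(0, 4.5, 50)
y = 2 * np.sin(1.5 * x) + rng.normal(size=x.size)

scores = {}
for degree in range(1, 9):
    coefs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coefs, x)
    scores[degree] = gaussian_aic_bic(y, y_hat, degree + 1)
    print(f"degree {degree}: AIC = {scores[degree][0]:.1f}, BIC = {scores[degree][1]:.1f}")
```

Both criteria reward a good fit and penalize the number of parameters; with n = 50 the BIC penalty per parameter, ln(50), is larger than the AIC penalty of 2, so BIC leans toward simpler models.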

Bias vs Variance

Bias: the difference between our model's average prediction and the correct value it is trying to predict. A high-bias model fails to capture the relevant relationships between the input features and the target outputs; it pays little attention to the training data and oversimplifies the problem.

Variance: the error due to high sensitivity to small fluctuations in the training set. High variance can cause the model to fit the noise in the data rather than the desired output. Models dominated by variance error tend to perform well on the training set but show high error rates on test sets.
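A quick way to see both behaviours is to compare training and test error for models of different complexity. The sketch below is only an illustration: it uses `np.polyfit` rather than the course's `PolynomialLinearRegression`, and the degrees, sample sizes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(x):
    # Same target as the example further down: 2*sin(1.5x) plus unit Gaussian noise
    return 2 * np.sin(1.5 * x) + rng.normal(size=x.size)

x_train = np.linspace(0, 4.5, 30)
y_train = noisy_sine(x_train)
x_test = np.linspace(0, 4.5, 200)
y_test = noisy_sine(x_test)

train_mse, test_mse = {}, {}
for degree in (1, 4, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

The training error can only go down as the degree grows, because each polynomial family contains the previous one. A high-bias model shows up as a large error on both sets, and a high-variance model as a widening gap between training and test error.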

Formally:

The system we want to model is given by:

\[y=f({\bf{x}}) + e\]

where \(e\) is the error term, assumed to be normally distributed with mean 0 and variance \(\sigma_e^2\).

The expected squared error of an estimate \(\hat{f}\) at a point \({\bf{x}}\) is:

\[Err({\bf{x}}) = E\left[ \left(y - \hat{f}({\bf{x}})\right)^2 \right]\]

Using properties of the expected value:

\[Err({\bf{x}}) = \left( E[\hat{f}({\bf{x}})] - f({\bf{x}})\right)^2 + E\left[\left(\hat{f}({\bf{x}}) - E\left[\hat{f}({\bf{x}})\right]\right)^2\right] + \sigma_e^2\]

\[Err({\bf{x}}) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]
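The step between the two expressions for \(Err({\bf{x}})\) can be filled in as follows. This is a sketch under the usual assumptions, which the text does not state explicitly: \(e\) has mean 0 and variance \(\sigma_e^2\), and it is independent of \(\hat{f}({\bf{x}})\). Substituting \(y = f({\bf{x}}) + e\), the cross term with \(e\) vanishes:

\[Err({\bf{x}}) = E\left[\left(f({\bf{x}}) - \hat{f}({\bf{x}})\right)^2\right] + 2\,E[e]\,E\left[f({\bf{x}}) - \hat{f}({\bf{x}})\right] + E[e^2] = E\left[\left(f({\bf{x}}) - \hat{f}({\bf{x}})\right)^2\right] + \sigma_e^2\]

Adding and subtracting \(E[\hat{f}({\bf{x}})]\) inside the remaining square gives:

\[E\left[\left(f({\bf{x}}) - \hat{f}({\bf{x}})\right)^2\right] = \left(E[\hat{f}({\bf{x}})] - f({\bf{x}})\right)^2 + E\left[\left(\hat{f}({\bf{x}}) - E[\hat{f}({\bf{x}})]\right)^2\right] - 2\left(E[\hat{f}({\bf{x}})] - f({\bf{x}})\right) E\left[\hat{f}({\bf{x}}) - E[\hat{f}({\bf{x}})]\right]\]

The last term is zero because \(E\left[\hat{f}({\bf{x}}) - E[\hat{f}({\bf{x}})]\right] = 0\), which leaves the squared bias, the variance and the noise variance \(\sigma_e^2\).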

from IPython.display import Image
Image("./Images/biasVVariance.png", width = 600)

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_14_0.png

Image("./Images/tradeoff.png", width = 600)

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_15_0.png

Let's look at an example:

from library.regularization import PolynomialLinearRegression
def f(size):
    '''
    Returns a sample with 'size' instances without noise.
    '''
    x = np.linspace(0, 4.5, size)
    y = 2 * np.sin(x * 1.5)
    return (x,y)

def sample(size):
    '''
    Returns a sample with 'size' instances.
    '''
    x = np.linspace(0, 4.5, size)
    y = 2 * np.sin(x * 1.5) + np.random.randn(x.size)
    return (x,y)
    
size = 50
# Noise-free curve and one noisy sample of the same size
f_x, f_y = f(size)
x, y = sample(size)
# Fit a degree-8 polynomial to the noisy sample
model = PolynomialLinearRegression(degree=8)
model.fit(x,y)
p_y = model.predict(x)
plt.plot(f_x, f_y, label="true function")
plt.plot(x, y, 'k.', label="data")
plt.plot(x, p_y, label="polynomial fit")
plt.legend();
plt.grid();

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_17_0.png

plt.figure(figsize=(18,3))
for k, degree in enumerate([3, 5, 10, 18]):
    plt.subplot(1,4,k+1)
    n_samples = 20
    n_models = 20
    avg_y = np.zeros(n_samples)
    for i in range(n_models):
        (x,y) = sample(n_samples)
        model = PolynomialLinearRegression(degree=degree)
        model.fit(x,y)
        p_y = model.predict(x)
        avg_y = avg_y + p_y
        plt.plot(x, p_y, 'k-', alpha=.1)
    avg_y = avg_y / n_models
    plt.plot(x, avg_y, 'b--', label="average model")
    plt.plot(x, f(len(x))[1], '--', color="red", lw=3, alpha=.5, label="actual function")
    plt.legend();
    plt.grid();
    plt.title("degree %d"%degree)

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_18_0.png

from numpy.linalg import norm
n_samples = 20
f_x, f_y = f(n_samples)
n_models = 100
max_degree = 15
var_vals = []
bias_vals = []
error_vals = []
for degree in range(1, max_degree):
    avg_y = np.zeros(n_samples)
    models = []
    # Fit n_models polynomials, each on an independent noisy sample
    for i in range(n_models):
        (x,y) = sample(n_samples)
        model = PolynomialLinearRegression(degree=degree)
        model.fit(x,y)
        p_y = model.predict(x)
        avg_y = avg_y + p_y
        models.append(p_y)
    avg_y = avg_y / n_models
    # Squared bias: mean squared gap between the average prediction and f
    bias_2 = norm(avg_y - f_y)**2 / f_y.size
    bias_vals.append(bias_2)
    # Variance: mean squared deviation of each model from the average prediction
    variance = 0
    for p_y in models:
        variance += norm(avg_y - p_y)**2
    variance /= f_y.size * n_models
    var_vals.append(variance)
    error_vals.append(variance + bias_2)
plt.plot(range(1, max_degree), bias_vals, label='bias^2')
plt.plot(range(1, max_degree), var_vals, label='variance')
plt.plot(range(1, max_degree), error_vals, label='error = bias^2 + variance')
plt.legend()
plt.xlabel("polynomial degree")
plt.grid();

_images/Clase 06 - Complejidad de modelos, sobreajuste y metodologías de validación_19_0.png

	name	MDVP:Fo(Hz)	MDVP:Fhi(Hz)	MDVP:Flo(Hz)	MDVP:Jitter(%)	MDVP:Jitter(Abs)	MDVP:RAP	MDVP:PPQ	Jitter:DDP	MDVP:Shimmer	...	Shimmer:DDA	NHR	HNR	status	RPDE	DFA	spread1	spread2	D2	PPE
0	phon_R01_S01_1	119.992	157.302	74.997	0.00784	0.00007	0.00370	0.00554	0.01109	0.04374	...	0.06545	0.02211	21.033	1	0.414783	0.815285	-4.813031	0.266482	2.301442	0.284654
1	phon_R01_S01_2	122.400	148.650	113.819	0.00968	0.00008	0.00465	0.00696	0.01394	0.06134	...	0.09403	0.01929	19.085	1	0.458359	0.819521	-4.075192	0.335590	2.486855	0.368674
2	phon_R01_S01_3	116.682	131.111	111.555	0.01050	0.00009	0.00544	0.00781	0.01633	0.05233	...	0.08270	0.01309	20.651	1	0.429895	0.825288	-4.443179	0.311173	2.342259	0.332634
3	phon_R01_S01_4	116.676	137.871	111.366	0.00997	0.00009	0.00502	0.00698	0.01505	0.05492	...	0.08771	0.01353	20.644	1	0.434969	0.819235	-4.117501	0.334147	2.405554	0.368975
4	phon_R01_S01_5	116.014	141.781	110.655	0.01284	0.00011	0.00655	0.00908	0.01966	0.06425	...	0.10470	0.01767	19.649	1	0.417356	0.823484	-3.747787	0.234513	2.332180	0.410335
5	phon_R01_S01_6	120.552	131.162	113.787	0.00968	0.00008	0.00463	0.00750	0.01388	0.04701	...	0.06985	0.01222	21.378	1	0.415564	0.825069	-4.242867	0.299111	2.187560	0.357775
6	phon_R01_S02_1	120.267	137.244	114.820	0.00333	0.00003	0.00155	0.00202	0.00466	0.01608	...	0.02337	0.00607	24.886	1	0.596040	0.764112	-5.634322	0.257682	1.854785	0.211756
7	phon_R01_S02_2	107.332	113.840	104.315	0.00290	0.00003	0.00144	0.00182	0.00431	0.01567	...	0.02487	0.00344	26.892	1	0.637420	0.763262	-6.167603	0.183721	2.064693	0.163755
8	phon_R01_S02_3	95.730	132.068	91.754	0.00551	0.00006	0.00293	0.00332	0.00880	0.02093	...	0.03218	0.01070	21.812	1	0.615551	0.773587	-5.498678	0.327769	2.322511	0.231571
9	phon_R01_S02_4	95.056	120.103	91.226	0.00532	0.00006	0.00268	0.00332	0.00803	0.02838	...	0.04324	0.01022	21.862	1	0.547037	0.798463	-5.011879	0.325996	2.432792	0.271362

	MDVP:Fo(Hz)	MDVP:Fhi(Hz)	MDVP:Flo(Hz)	MDVP:Jitter(%)	MDVP:Jitter(Abs)	MDVP:RAP	MDVP:PPQ	Jitter:DDP	MDVP:Shimmer	MDVP:Shimmer(dB)	...	NHR	HNR	status	RPDE	DFA	spread1	spread2	D2	PPE	Subject
0	119.992	157.302	74.997	0.00784	0.00007	0.00370	0.00554	0.01109	0.04374	0.426	...	0.02211	21.033	1	0.414783	0.815285	-4.813031	0.266482	2.301442	0.284654	S01
1	122.400	148.650	113.819	0.00968	0.00008	0.00465	0.00696	0.01394	0.06134	0.626	...	0.01929	19.085	1	0.458359	0.819521	-4.075192	0.335590	2.486855	0.368674	S01
2	116.682	131.111	111.555	0.01050	0.00009	0.00544	0.00781	0.01633	0.05233	0.482	...	0.01309	20.651	1	0.429895	0.825288	-4.443179	0.311173	2.342259	0.332634	S01
3	116.676	137.871	111.366	0.00997	0.00009	0.00502	0.00698	0.01505	0.05492	0.517	...	0.01353	20.644	1	0.434969	0.819235	-4.117501	0.334147	2.405554	0.368975	S01
4	116.014	141.781	110.655	0.01284	0.00011	0.00655	0.00908	0.01966	0.06425	0.584	...	0.01767	19.649	1	0.417356	0.823484	-3.747787	0.234513	2.332180	0.410335	S01
5	120.552	131.162	113.787	0.00968	0.00008	0.00463	0.00750	0.01388	0.04701	0.456	...	0.01222	21.378	1	0.415564	0.825069	-4.242867	0.299111	2.187560	0.357775	S01
6	120.267	137.244	114.820	0.00333	0.00003	0.00155	0.00202	0.00466	0.01608	0.140	...	0.00607	24.886	1	0.596040	0.764112	-5.634322	0.257682	1.854785	0.211756	S02
7	107.332	113.840	104.315	0.00290	0.00003	0.00144	0.00182	0.00431	0.01567	0.134	...	0.00344	26.892	1	0.637420	0.763262	-6.167603	0.183721	2.064693	0.163755	S02
8	95.730	132.068	91.754	0.00551	0.00006	0.00293	0.00332	0.00880	0.02093	0.191	...	0.01070	21.812	1	0.615551	0.773587	-5.498678	0.327769	2.322511	0.231571	S02
9	95.056	120.103	91.226	0.00532	0.00006	0.00268	0.00332	0.00803	0.02838	0.255	...	0.01022	21.862	1	0.547037	0.798463	-5.011879	0.325996	2.432792	0.271362	S02
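In the second table the `name` column is gone and a `Subject` column identifies the speaker of each recording. The code that produced it is not shown; a minimal sketch, assuming the subject identifier is the third underscore-separated field of `name`, could look like this. It also shows how scikit-learn's `GroupKFold` keeps all recordings of one subject on the same side of a split. The tiny DataFrame below only reproduces the ten names listed above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# The ten recording names shown in the first table
names = [f"phon_R01_S01_{i}" for i in range(1, 7)] + \
        [f"phon_R01_S02_{i}" for i in range(1, 5)]
df = pd.DataFrame({"name": names})

# Assumption: the subject id is the third field of the name, e.g. "S01"
df["Subject"] = df["name"].str.split("_").str[2]
df = df.drop(columns="name")
print(df["Subject"].value_counts())

# Group-aware splitting: no subject appears in both train and test
X = np.zeros((len(df), 1))          # placeholder features, only the shape matters here
folds = list(GroupKFold(n_splits=2).split(X, groups=df["Subject"]))
for train_idx, test_idx in folds:
    print("train subjects:", set(df["Subject"].iloc[train_idx]),
          "test subjects:", set(df["Subject"].iloc[test_idx]))
```

Splitting by subject matters because several recordings of the same person are not independent samples, so a plain random split can place near-duplicates of a test recording in the training set.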

2021 Introducción al Machine Learning
