Welcome to analyzefit’s documentation!

Contents:

The main class for the analysis of a given fit.

class analyzefit.analyze.analysis(X, y, model, predict=None, testing=True)

The main class for the analysis of a given fit.

Args:

model (object): The fitting model (the model must have a predict method).

X (numpy.ndarray): The X values to be used for plots.

predict (str): The name of the method that is equivalent to the sklearn predict
function. Default = ‘predict’.

y (numpy.ndarray): The y values to be used for plots.

Attributes:
validate (object): Creates a residual vs fitted plot, a quantile plot, a
spread vs location plot, and a leverage plot and prints the accuracy score to the screen.
res_vs_fit (object): Creates a plot of the residuals vs the fitted
values in an interactive bokeh figure.
quantile (object): Creates a quantile plot for the fitted values in an
interactive bokeh figure.
spread_loc (object): Creates a plot of the spread in residuals vs the fitted
values in an interactive bokeh figure.
leverage (object): Creates a plot of the Cook's distance and the influence vs
the standardized residuals in an interactive bokeh figure.
Examples:

The following examples show how to validate the fit of sklearn’s LinearRegression on the housing dataset. They show how to generate each of the plots that can be used to verify the accuracy of a fit.

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> from analyzefit import analyze
>>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header=None, sep="\s+")
>>> df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
>>> X = df.iloc[:,:-1].values
>>> y = df[["MEDV"]].values
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> an = analyze.analysis(X_train, y_train, slr)
>>> an.validate()
>>> an.validate(X=X_test, y=y_test, metric=[mean_squared_error, r2_score])
>>> an.res_vs_fit()
>>> an.quantile()
>>> an.spread_loc()
>>> an.leverage()

leverage(X=None, y=None, pred=None, interact=True, show=True, title=None, ax=None)

The leverage plot of the data: the Cook's distance and the influence plotted against the standardized residuals.

Args:
X (numpy.ndarray, optional): The dataset to make the plot for
if different than the dataset used to initialize the method.
y (numpy.ndarray, optional): The target values to make the plot for
if different than the dataset used to initialize the method.
pred (numpy.ndarray, optional): The predicted values to make the plot for
if y and X are different than the dataset used to initialize the method.

interact (bool, optional): True if the plot is to be interactive.

show (bool, optional): True if plot is to be displayed.

ax (matplotlib.axes._subplots.AxesSubplot, optional): The subplot on which to
draw the plot.

title (str, optional): The title of the plot.

Raises:
ValueError: if the number of predictions is not the same as the number of
target values or if the number of rows in the feature matrix is not the same as the number of targets.
Returns:
fig (matplotlib.figure.Figure or bokeh.plotting.figure): An
object containing the plot if show=False.
Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> from analyzefit import analyze
>>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header=None, sep="\s+")
>>> df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
>>> X = df.iloc[:,:-1].values
>>> y = df[["MEDV"]].values
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> an = analyze.analysis(X_train, y_train, slr)
>>> an.leverage()
>>> an.leverage(X=X_test, y=y_test, title="Test values")
>>> an.leverage(X=X_test, pred=slr.predict(X_test), y=y_test, title="Test values")

quantile(data=None, dist=None, interact=True, show=True, title=None, ax=None)

Makes a quantile plot of the predictions against the desired distribution.

Args:
data (numpy.ndarray, optional): The user supplied data for the quantile plot.
If None then the model predictions will be used.
dist (str or numpy.ndarray, optional): The distribution to be compared to. Either
'Normal', 'Uniform', or a numpy array of a user-defined distribution.

interact (bool, optional): True if the plot is to be interactive.

show (bool, optional): True if plot is to be displayed.

ax (matplotlib.axes._subplots.AxesSubplot, optional): The subplot on which to
draw the plot.

title (str, optional): The title of the plot.

Raises:
ValueError: if data and the distribution are of different lengths.
Returns:
fig (matplotlib.figure.Figure or bokeh.plotting.figure): An
object containing the plot if show=False.
Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> from analyzefit import analyze
>>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header=None, sep="\s+")
>>> df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
>>> X = df.iloc[:,:-1].values
>>> y = df[["MEDV"]].values
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> an = analyze.analysis(X_train, y_train, slr)
>>> an.quantile()
>>> an.quantile(data=y_test, dist="uniform", title="Test values vs uniform distribution")
>>> an.quantile(data=y_test, dist=np.random.random_sample(len(y_test)))

res_vs_fit(X=None, y=None, pred=None, interact=True, show=True, ax=None, title=None)

Makes the residual vs fitted values plot.

Args:
X (numpy.ndarray, optional): The dataset to make the plot for
if different than the dataset used to initialize the method.
y (numpy.ndarray, optional): The target values to make the plot for
if different than the dataset used to initialize the method.
pred (numpy.ndarray, optional): The predicted values to make the plot for
if y and X are different than the dataset used to initialize the method.

interact (bool, optional): True if the plot is to be interactive.

show (bool, optional): True if plot is to be displayed.

ax (matplotlib.axes._subplots.AxesSubplot, optional): The subplot on which to
draw the plot.

title (str, optional): The title of the plot.

Returns:
fig (matplotlib.figure.Figure or bokeh.plotting.figure): An
object containing the plot if show=False.
Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> from analyzefit import analyze
>>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header=None, sep="\s+")
>>> df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
>>> X = df.iloc[:,:-1].values
>>> y = df[["MEDV"]].values
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> an = analyze.analysis(X_train, y_train, slr)
>>> an.res_vs_fit()
>>> an.res_vs_fit(X=X_test, y=y_test, title="Test values")
>>> an.res_vs_fit(pred=slr.predict(X_test), y=y_test, title="Test values")

spread_loc(X=None, y=None, pred=None, interact=True, show=True, title=None, ax=None)

The spread-location, or scale-location, plot of the data.

Args:
X (numpy.ndarray, optional): The dataset to make the plot for
if different than the dataset used to initialize the method.
y (numpy.ndarray, optional): The target values to make the plot for
if different than the dataset used to initialize the method.
pred (numpy.ndarray, optional): The predicted values to make the plot for
if y and X are different than the dataset used to initialize the method.

interact (bool, optional): True if the plot is to be interactive.

show (bool, optional): True if plot is to be displayed.

ax (matplotlib.axes._subplots.AxesSubplot, optional): The subplot on which to
draw the plot.

title (str, optional): The title of the plot.

Returns:
fig (matplotlib.figure.Figure or bokeh.plotting.figure): An
object containing the plot if show=False.
Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> from analyzefit import analyze
>>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header=None, sep="\s+")
>>> df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
>>> X = df.iloc[:,:-1].values
>>> y = df[["MEDV"]].values
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> an = analyze.analysis(X_train, y_train, slr)
>>> an.spread_loc()
>>> an.spread_loc(X=X_test, y=y_test, title="Test values")
>>> an.spread_loc(pred=slr.predict(X_test), y=y_test, title="Test values")

validate(X=None, y=None, pred=None, dist=None, metric=None, testing=False)

Creates a residual vs fitted plot, a quantile plot, a spread vs location plot, and a leverage plot, and prints the accuracy score to the screen.

Args:
X (numpy.ndarray, optional): The dataset to make the plot for
if different than the dataset used to initialize the method.
y (numpy.ndarray, optional): The target values to make the plot for
if different than the dataset used to initialize the method.
pred (numpy.ndarray, optional): The predicted values to make the plot for
if y and X are different than the dataset used to initialize the method.
dist (str or numpy.ndarray, optional): The distribution to be compared to. Either
'Normal', 'Uniform', or a numpy array of a user-defined distribution.
metric (function or list of functions, optional): The functions used to
determine how accurate the fit is.

testing (bool, optional): True if this is a unit test.

Returns:
score (list of float): The scores from each of the metrics if in testing mode.
Examples:

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> from analyzefit import analyze
>>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data", header=None, sep="\s+")
>>> df.columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
>>> X = df.iloc[:,:-1].values
>>> y = df[["MEDV"]].values
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> an = analyze.analysis(X_train, y_train, slr)
>>> an.validate()
>>> an.validate(X=X_test, y=y_test, metric=[mean_squared_error, r2_score])

Tools for data manipulation.

analyzefit.manipulate.cooks_dist(y, pred, features)

Finds the Cook's distance for the data. See: https://en.wikipedia.org/wiki/Cook%27s_distance

Args:
y (numpy.ndarray): An array containing the correct values of the model.
pred (numpy.ndarray): An array containing the predicted values of the model.
features (numpy.ndarray): An array containing the features of the regression model.
Returns:
dist (numpy.ndarray): An array containing the Cook's distance for each point in the input data.
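
A minimal usage sketch, not taken from the package's own examples: the synthetic feature matrix, targets, and LinearRegression model below are illustrative assumptions, used only to produce predictions to pass to cooks_dist.

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from analyzefit import manipulate
>>> X = np.random.rand(50, 3)                 # illustrative feature matrix
>>> y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * np.random.rand(50)
>>> model = LinearRegression().fit(X, y)
>>> pred = model.predict(X)
>>> dist = manipulate.cooks_dist(y, pred, X)  # one Cook's distance per observation
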
analyzefit.manipulate.hat_diags(X)

Finds the diagonals of the hat matrix for the features in X.

Args:
X (numpy.ndarray): An array containing the features of the regression model.
Returns:
hat_diags (numpy.ndarray): The diagonals of the hat matrix.
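
For orientation, the sketch below compares hat_diags against the diagonal of the standard least-squares hat matrix H = X (X^T X)^{-1} X^T computed directly with numpy. The textbook formula is an assumption about the definition used, not a quotation of the library's source.

>>> import numpy as np
>>> from analyzefit import manipulate
>>> X = np.random.rand(20, 4)                 # illustrative feature matrix
>>> H = X @ np.linalg.inv(X.T @ X) @ X.T      # textbook hat matrix
>>> manual_diags = np.diag(H)
>>> lib_diags = manipulate.hat_diags(X)
>>> np.allclose(manual_diags, lib_diags)      # expected to agree if the same definition is used
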
analyzefit.manipulate.residual(y, pred)

Finds the residual of the actual vs the predicted values.

Args:
y (numpy.ndarray): An array containing the correct values of the model.
pred (numpy.ndarray): An array containing the predicted values of the model.
Returns:
residual (numpy.ndarray): The residual of the data (y - pred).
Raises:
ValueError: Raises a value error if y and pred don’t have the same number of elements.
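
A small illustrative call (the array values are arbitrary); per the docstring the result is simply the element-wise difference y - pred.

>>> import numpy as np
>>> from analyzefit import manipulate
>>> y = np.array([3.0, 2.5, 4.1])
>>> pred = np.array([2.8, 2.7, 4.0])
>>> manipulate.residual(y, pred)              # element-wise y - pred
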
analyzefit.manipulate.std_residuals(y, pred)

Finds the standardized residual of the actual vs the predicted values.

Args:
y (numpy.ndarray): An array containing the correct values of the model.
pred (numpy.ndarray): An array containing the predicted values of the model.
Returns:
standardized_residual (numpy.ndarray): The standardized residual of the data (y - pred).
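
A companion sketch for std_residuals. The comparison against residuals scaled by their standard deviation reflects one common convention and is an assumption, since the docstring does not spell out the exact scaling used internally.

>>> import numpy as np
>>> from analyzefit import manipulate
>>> y = np.array([3.0, 2.5, 4.1, 5.2, 1.9])
>>> pred = np.array([2.8, 2.7, 4.0, 5.5, 2.0])
>>> std_res = manipulate.std_residuals(y, pred)
>>> res = y - pred
>>> res / np.std(res)                         # one common scaling, shown for comparison
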
analyzefit.plotting.scatter(x, y, show_plt=True, x_label=None, y_label=None, label=None, title=None, fig=None, ax=None)

Make a standard matplotlib style scatter plot.

Args:
x (numpy.ndarray): The data for the x-axis.
y (numpy.ndarray): The data for the y-axis.
show_plt (bool, optional): True if the plot is to be shown.
x_label (str, optional): The x-axis label.
y_label (str, optional): The y-axis label.
label (str, optional): The data trend label.
title (str, optional): The plot title.
fig (matplotlib.figure.Figure, optional): An initial figure to add points to.
ax (matplotlib.axes._subplots.AxesSubplot, optional): A subplot object to plot on.
Returns:
fig (matplotlib object): Returns the matplotlib object if show_plt = False.
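
A minimal usage sketch with arbitrary data. Passing show_plt=False assumes, per the Returns note above, that the matplotlib figure is handed back instead of being displayed.

>>> import numpy as np
>>> from analyzefit import plotting
>>> x = np.linspace(0, 10, 100)
>>> y = 2.0 * x + np.random.rand(100)
>>> fig = plotting.scatter(x, y, show_plt=False, x_label="x", y_label="y",
...                        label="data", title="Example scatter plot")
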
analyzefit.plotting.scatter_with_hover(x, y, in_notebook=True, show_plt=True, fig=None, name=None, marker='o', fig_width=500, fig_height=500, x_label=None, y_label=None, title=None, color='blue', **kwargs)

Plots an interactive scatter plot of x vs y using bokeh, with tooltips shown on hover. Modified from: http://blog.rtwilson.com/bokeh-plots-with-dataframe-based-tooltips/

Args:

x (numpy.ndarray): The data for the x-axis.
y (numpy.ndarray): The data for the y-axis.

fig (bokeh.plotting.Figure, optional): Figure on which to plot
(if not given then a new figure will be created)

name (str, optional): Series name to give to the scattered data.
marker (str, optional): Name of the marker to use for the scatter plot.

Returns:
fig (bokeh.plotting.Figure): Figure (the same as given, or the newly created figure)
if show_plt is False.
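
A minimal usage sketch with arbitrary data; in_notebook=False is an assumption for running outside a Jupyter notebook, and show_plt=False returns the bokeh figure for later display or embedding.

>>> import numpy as np
>>> from analyzefit import plotting
>>> x = np.random.rand(200)
>>> y = 3.0 * x + np.random.rand(200)
>>> fig = plotting.scatter_with_hover(x, y, in_notebook=False, show_plt=False,
...                                   x_label="x", y_label="y",
...                                   title="Interactive scatter", color="red")
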
