Subsetting the data to the digits “1” and “8” and plotting the first 20 observations of each, we can see the variation in the handwritten data; this is shown in Graph 1 below.
Graph 1: First 20 observations of “1” and “8”
From Graph 1 we can see why developing classifiers that accurately identify the handwritten digits can be difficult: right-handed individuals tend to slant their digits to the right, and left-handed individuals to the left. The variety of writing styles is also evident. The digits are not easy to recognize at first sight; many are not in a standard position or are too blurry, which makes the classification problem harder.
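As a reference, here is a minimal sketch of how this subsetting and plotting might be done; the object name train, its label column, and MNIST-style 28 x 28 pixel columns are assumptions, not details given above.

## Hypothetical sketch: assumes `train` is a data frame with a `label`
## column followed by 784 pixel columns (28 x 28 grey-scale images).
ones   <- as.matrix(train[train$label == 1, -1])
eights <- as.matrix(train[train$label == 8, -1])

## Draw the first 20 observations of each digit on a 4 x 10 grid.
par(mfrow = c(4, 10), mar = c(0.1, 0.1, 0.1, 0.1))
for (digit in list(ones, eights)) {
  for (i in 1:20) {
    img <- matrix(digit[i, ], nrow = 28, byrow = TRUE)
    ## Reverse the rows so the image is drawn upright.
    image(t(img[28:1, ]), col = gray(seq(1, 0, length.out = 256)), axes = FALSE)
  }
}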
For testing, there are multiple tests that can help assess whether a set of observations has a multivariate normal distribution. Performing a basic Shapiro-Wilk Multivariate Normality Test on each digit we get the following:
Error in solve.default(R %*% t(R), tol = 1e-18) :
Lapack routine dgesv: system is exactly singular: U[1,1] = 0
Error in solve.default(R %*% t(R), tol = 1e-18) :
Lapack routine dgesv: system is exactly singular: U[1,1] = 0
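The error trace matches mvnormtest::mshapiro.test(), so the calls presumably looked something like the sketch below, reusing the hypothetical ones and eights matrices from above:

## mshapiro.test() expects variables in rows and observations in columns,
## hence the transpose. With many constant (always-blank) pixel columns,
## the matrix R %*% t(R) built inside the test is rank-deficient, and
## solve() aborts with the "system is exactly singular" error shown above.
library(mvnormtest)
mshapiro.test(t(ones))
mshapiro.test(t(eights))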
For a Henze-Zirkler Multivariate Normality Test we obtain the same result as above. We get this error in every test because the system is singular: many pixel columns are constant (always-blank background) or highly collinear, so the sample covariance matrix cannot be inverted. So, up to this point in the analysis, we cannot confirm that either training set has a multivariate normal distribution.
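If the Henze-Zirkler test is run through the MVN package (an assumption; argument names vary slightly between package versions), the calls might look like:

## Henze-Zirkler test; it fails in the same way because the sample
## covariance matrix of the raw pixels is singular and cannot be inverted.
library(MVN)
mvn(data = as.data.frame(ones),   mvnTest = "hz")
mvn(data = as.data.frame(eights), mvnTest = "hz")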
Obtaining the covariance matrices for both “1” and “8”, we can see that both matrices are too large to compare directly and check whether they are equivalent. One thing we can do is use PCA to project the vectors onto the first two principal components and check whether the two classes share the same covariance structure. Plotting the projected vectors in two dimensions we obtain Graph 2 below.
Graph 2: PCA to Plot Projected vectors in 2 dimensions
From Graph 2 we can see that the two classes do not have the same structure, so they cannot share a common covariance matrix, and perhaps only one of them...
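A minimal sketch of how Graph 2 might be produced, again reusing the hypothetical ones and eights matrices: compute the per-class covariance matrices, project the pooled data onto its first two principal components, and plot the two clouds.

## 784 x 784 covariance matrices -- far too large to compare entry by entry.
S1 <- cov(ones)
S8 <- cov(eights)

## Pool both digits and project onto the first two principal components.
combined <- rbind(ones, eights)
labels   <- factor(c(rep("1", nrow(ones)), rep("8", nrow(eights))))

## Default prcomp() settings: centred but not scaled, since constant
## pixel columns have zero variance and cannot be rescaled.
pca    <- prcomp(combined)
scores <- pca$x[, 1:2]

cols <- ifelse(labels == "1", "blue", "red")
plot(scores, col = cols, pch = 19, xlab = "PC1", ylab = "PC2",
     main = "Projected vectors in the first two components")
legend("topright", legend = levels(labels), col = c("blue", "red"), pch = 19)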