
# Exploratory Factor Analysis In Python

Python is one of the easiest and most user-friendly programming languages to work with. Factor analysis is a popular method for discovering underlying factors and latent variables in data. What do I mean by latent variables? Let's find out. Say we have a dataset with 6 variables:

1. Age
2. Weight
3. Height
4. Salary
5. Bank Balance
6. Number of Credit Cards

And we have collected these 6 data points for 1000 users. After thorough analysis, we find that there exist two latent or hidden variables, each dependent on some of the data points present here.

Fig: A chart showing latent variables derived from the original data points

The figure above gives a good understanding of how factor analysis helps find hidden variables. In this specific example, I chose data points that one can easily separate and analyze, but when you have a dataset with headings like x1, x2, y1, y2, z1, z2, you will need the proper tools to get the job done. When it comes to exploratory factor analysis, which is by far the most popular factor analysis approach among researchers, the basic assumption made at the very beginning is that every variable at hand is directly associated with a factor.

### What Is The Aim Of Factor Analysis In Python?

The main aim of factor analysis is to reduce the number of data points by extracting unobservable variables. The new variables make it easier for a market researcher or statistician to complete a study, and they also make the data more consumable. This conversion of

Observable variables  →  Unobservable variables

needs to be done through Factor Extraction followed by Factor Rotation.

In Factor Extraction, we decide the number of factors we want and how we want to extract them. This is done using variance-partitioning methods, the most common being common factor analysis and principal component analysis. Next comes Factor Rotation, which is done to reduce the complexity of the solution.
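To get a feel for the idea behind extraction, here is a minimal sketch (synthetic data, not the article's dataset) of the eigen-decomposition of a correlation matrix, which is the computation that principal-component extraction rests on: variables driven by one shared hidden factor produce one dominant eigenvalue.

```python
import numpy as np

# Synthetic illustration: three observed variables driven by one hidden factor
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 1))                       # the hidden factor
observed = latent @ np.ones((1, 3)) + 0.5 * rng.normal(size=(1000, 3))

corr = np.corrcoef(observed, rowvar=False)                # 3x3 correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]              # sorted, largest first
print(eigenvalues)  # one eigenvalue dominates, suggesting a single factor
```

Because all three observed variables share the same latent driver, the first eigenvalue soaks up most of the variance while the remaining two stay well below one.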

### How to find these latent variables using Python?

Python has multiple third-party libraries that can help you with statistical analysis, and today we will be using the factor_analyzer library. We will also use pandas to handle our data and convert it into a data frame, which makes it easier to work with, and matplotlib for creating graphs. One point to note: certain commands in this code may not print their result directly if you run them in a terminal. I recommend using a Jupyter notebook, since it helps visualize data well.

In the first step, we import the libraries required for the task and load our dataset. Once that is done, we will view the columns. Once you see the columns, you will get a fair idea of the number of features. You can also use df.head() to get a snapshot of the data.
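A minimal sketch of this loading step follows. In practice the data would come from disk, e.g. `df = pd.read_csv("bfi.csv")` (the file name is an assumption); here a two-row excerpt with a few of the columns is inlined via StringIO so the snippet is self-contained, and the gender/education/age values are placeholders.

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("bfi.csv"); two rows with a subset of the columns.
# gender/education/age values below are placeholders, not real survey data.
csv_text = io.StringIO(
    "Unnamed: 0,A1,A2,A3,gender,education,age\n"
    "1,2,4,3,1,,16\n"
    "2,2,4,5,2,,18\n"
)
df = pd.read_csv(csv_text)
print(df.columns)   # lists every feature in the frame
print(df.head())    # snapshot of the first few rows
```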

```
Index(['Unnamed: 0', 'A1', 'A2', 'A3', 'A4', 'A5', 'C1', 'C2', 'C3', 'C4',
       'C5', 'E1', 'E2', 'E3', 'E4', 'E5', 'N1', 'N2', 'N3', 'N4', 'N5', 'O1',
       'O2', 'O3', 'O4', 'O5', 'gender', 'education', 'age'],
      dtype='object')
```

Before we try an exploratory analysis of the dataset, we need to prepare the data. For this, we will drop the first column and the last three columns of the dataset and also remove all NaN values. Once data preparation is complete, we shall view the schema of the data.

```python
# Steps required for data preparation
df.drop(["gender", "education", "age"], axis=1, inplace=True)
df = df.iloc[0:, 1:26]
df.dropna(inplace=True)
df.info()
```

Fig: Output of df.info()

Once you have viewed the schema of the data, you can also see the data itself – a few rows of it.

```
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4 O5
 2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4  3
 2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3  3
 5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5  2
 4  4  6  5  5  4  4  3  5  5  5  3  4  4  4  2  5  2  4  1  3  3  4  3  5
 2  3  3  4  5  4  4  5  3  2  2  2  5  4  5  2  3  4  4  3  3  3  4  3  3
```

After this, we shall create a FactorAnalyzer object and extract the eigenvalues. These are used to create a scree plot.

```python
# Create a FactorAnalyzer object with 6 factors and varimax rotation
fa = FactorAnalyzer(n_factors=6, rotation='varimax')
fa.fit(df)

# Extract eigenvalues and create the scree plot
eigen_values, vectors = fa.get_eigenvalues()
plt.scatter(range(1, df.shape[1] + 1), eigen_values)
plt.plot(range(1, df.shape[1] + 1), eigen_values)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
```

Fig: Scree Plot

From the scree plot, we can see that the number of eigenvalues greater than one is 6, so our guess of setting n_factors to 6 was reasonable. In case of a mismatch, we could have re-initialized the FactorAnalyzer object before continuing to the next step. It is still possible that a different value of n_factors performs better, but we will only know that after generating the loading values.

In the final step, we will extract the factor loadings for each variable corresponding to each factor. We have converted the final 2d array to a Pandas data frame so that it is easier to understand.
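With the fitted object, the loadings live in the `loadings_` attribute (a 25×6 NumPy array here), and the conversion is simply `pd.DataFrame(fa.loadings_, index=df.columns)`. The sketch below uses a 3×2 excerpt of the loading values from this run as a stand-in so it runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-in for fa.loadings_ (the full run yields a 25x6 array); these are
# the N1, N2 and E1 loadings on the first two factors.
loadings_array = np.array([
    [0.8058, 0.0680],    # N1
    [0.7898, 0.0230],    # N2
    [0.0223, -0.5905],   # E1
])
loadings = pd.DataFrame(loadings_array, index=["N1", "N2", "E1"])
print(loadings.round(2))
```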

```
          0            1           2            3            4           5
A1     0.0952197    0.0407832   0.0487339   -0.530987    -0.113057    0.161216
A2     0.0331313    0.235538    0.133714     0.661141     0.0637338  -0.00624354
A3    -0.00962088   0.343008    0.121353     0.605933     0.0339903   0.160106
A4    -0.0815176    0.219717    0.23514      0.404594    -0.125338    0.0863557
A5    -0.149616     0.414458    0.106382     0.469698     0.0309766   0.236519
C1    -0.0043584    0.0772478   0.554582     0.0075107    0.190124    0.095035
C2     0.0683301    0.0383704   0.674545     0.057055     0.0875926   0.152775
C3    -0.0399937    0.0318673   0.551164     0.101282    -0.0113381   0.00899628
C4     0.216283    -0.0662408  -0.638475    -0.102617    -0.143846    0.318359
C5     0.284187    -0.180812   -0.544838    -0.0599548    0.0258371   0.132423
E1     0.0222798   -0.590451    0.0539149   -0.130851    -0.0712046   0.156583
E2     0.233624    -0.684578   -0.0884971   -0.116716    -0.045561    0.115065
E3    -0.000895006  0.556774    0.10339      0.179396     0.24118     0.267291
E4    -0.136788     0.658395    0.113798     0.241143    -0.107808    0.158513
E5     0.0344896    0.507535    0.309813     0.0788043    0.200821    0.0087473
N1     0.805806     0.0680113  -0.0512638   -0.174849    -0.0749771  -0.0962662
N2     0.789832     0.0229583  -0.0374769   -0.141134     0.00672646 -0.139823
N3     0.725081    -0.0656869  -0.0590394   -0.0191838   -0.0106636   0.0624953
N4     0.578319    -0.345072   -0.162174     0.000403125  0.0629165   0.147551
N5     0.523097    -0.161675   -0.025305     0.0901248   -0.161892    0.120049
O1    -0.020004     0.225339    0.133201     0.00517794   0.479477    0.21869
O2     0.15623     -0.00198152 -0.0860468    0.0439891   -0.49664     0.134693
O3     0.011851     0.325954    0.0938796    0.0766416    0.566128    0.210777
O4     0.207281    -0.177746   -0.00567147   0.133656     0.349227    0.178068
O5     0.0632344   -0.0142211  -0.0470592   -0.0575608   -0.576743    0.135936
```

### What do the results tell us?

So you got a 2-D array of 25 rows and 6 columns as the final output. What does that mean? If you take a closer look at the data, you will spot that specific factors have high loading values for specific variables.

| Factor | High-loading variables |
|---|---|
| Factor 0 | N1, N2, N3, N4, N5 |
| Factor 1 | E1, E2, E3, E4, E5 |
| Factor 2 | C1, C2, C3, C4, C5 |
| Factor 3 | A1, A2, A3, A4, A5 |
| Factor 4 | O1, O2, O3, O4, O5 |
| Factor 5 | No high loading value for any variable |
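The grouping above can be automated with pandas by keeping, for each factor, the variables whose absolute loading exceeds a cut-off; 0.4 is a conventional (assumed) threshold, and a four-variable excerpt of the loading table stands in for the full 25×6 frame so this runs on its own.

```python
import pandas as pd

# Excerpt of the loading table: two N items and two E items on factors 0 and 1
loadings = pd.DataFrame(
    {0: [0.8058, 0.7898, 0.0223, 0.2336],
     1: [0.0680, 0.0230, -0.5905, -0.6846]},
    index=["N1", "N2", "E1", "E2"],
)

# For each factor, keep variables whose absolute loading exceeds 0.4,
# a conventional (assumed) cut-off for calling a loading "high"
for factor in loadings.columns:
    high = loadings.index[loadings[factor].abs() > 0.4].tolist()
    print(f"Factor {factor}: {high}")
```

On the full loading table the same loop reproduces the variable groups listed above, with the sixth factor collecting nothing.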

So in a way, our estimate missed the mark, and the results should improve if the analysis is redone with just 5 factors instead of 6. You can try that out on your system and share the results!