
Exploratory Factor Analysis In Python

September 2, 2020 | Category: Blog

Last updated by Tarun

 

Python is one of the easiest and most user-friendly programming languages to write code in. Factor analysis is a popular method for discovering underlying factors, also called latent variables, in data. What do I mean by latent variables? Let's find out. Say we have a dataset with 6 variables:

 

  1. Age
  2. Weight
  3. Height
  4. Salary
  5. Bank Balance
  6. Number of Credit Cards

 

Say we have collected these 6 variables for 1000 users. After thorough analysis, we find that there are two latent (hidden) variables, each of which drives some of the variables listed here.

 


Fig: A chart showing latent variables derived from original data points
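To make the idea of a latent variable concrete, here is a small simulation using only the standard library. It is a sketch under assumptions that are not in the article: the two hidden factors are given the hypothetical names physique and wealth, the grouping of variables under them is assumed, and the loading of 0.8 and noise level of 0.6 are arbitrary choices in standardized units. Variables that share a hidden factor should come out strongly correlated, while variables driven by different factors should not.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def observe(factor):
    """Observed variable = 0.8 * hidden factor + independent noise."""
    return [0.8 * f + random.gauss(0, 0.6) for f in factor]


# Two hidden factors per user (hypothetical names, assumed for illustration).
n_users = 1000
physique = [random.gauss(0, 1) for _ in range(n_users)]
wealth = [random.gauss(0, 1) for _ in range(n_users)]

# Assumed grouping: weight and height follow physique, salary follows wealth.
weight = observe(physique)
height = observe(physique)
salary = observe(wealth)

same_factor = pearson(weight, height)
different_factor = pearson(weight, salary)
print(f"weight vs height: {same_factor:.2f}")
print(f"weight vs salary: {different_factor:.2f}")
```

Factor analysis works in the opposite direction: it starts from correlation patterns like these and tries to recover the hidden factors behind them.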

 

The figure above gives a good understanding of how factor analysis helps find hidden variables. In this specific example, I chose variables that one can easily separate and analyze, but when you have a data set with headings like x1, x2, y1, y2, z1, z2, you will need the proper tools to get this job done. Exploratory factor analysis is by far the most popular factor analysis approach among researchers. Its basic assumption, made at the very beginning, is that any observed variable may be directly associated with any factor.
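The assumption that every observed variable can be associated with every factor is usually written as the common factor model. The notation below is the standard textbook form rather than anything specific to this article: x_i is the i-th observed variable (standardized), f_1 to f_m are the latent factors, lambda_ij is the loading of variable i on factor j, and epsilon_i is the part of x_i that no factor explains. The second line additionally assumes uncorrelated factors with unit variance.

```latex
x_i = \lambda_{i1} f_1 + \lambda_{i2} f_2 + \dots + \lambda_{im} f_m + \varepsilon_i,
\qquad i = 1, \dots, p

\operatorname{Var}(x_i)
  = \underbrace{\sum_{j=1}^{m} \lambda_{ij}^{2}}_{\text{communality}}
  + \underbrace{\psi_i}_{\text{uniqueness}},
\qquad \psi_i = \operatorname{Var}(\varepsilon_i)
```

A loading close to zero simply means that the factor contributes little to that variable.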

 

What Is The Aim Of Factor Analysis In Python?

 

The main aim of factor analysis is to reduce the number of variables by extracting a smaller set of unobservable variables. The new variables make it easier for a market researcher or a statistician to complete a study, and also make the data more consumable. This conversion of

 

Observable variables  →  Unobservable variables

 

needs to be done through Factor Extraction followed by Factor Rotation.

 

In Factor Extraction, we decide how many factors we want and how we want to extract them. This is done using variance partitioning methods, the most common of which are common factor analysis and principal components analysis. Next comes Factor Rotation, which reduces the complexity of the solution so that each variable loads strongly on as few factors as possible and the factors become easier to interpret.
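To give a feel for what rotation does, here is a minimal NumPy sketch of the widely used SVD-based varimax iteration. It is an illustration only, not the implementation inside any particular library: the function name, the defaults, and the toy loading matrix are all assumptions made for this example. The toy matrix is a clean two-factor structure that has been deliberately mixed by a 30 degree rotation, so that varimax has something to undo.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a (variables x factors) loading matrix using the varimax criterion.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        gradient = loadings.T @ (
            rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n_vars
        )
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt  # closest orthogonal matrix to the gradient
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break  # no meaningful improvement any more
        criterion = new_criterion
    return loadings @ rotation, rotation


# Toy example (made-up numbers): a clean structure mixed by 30 degrees.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(unrotated, 2))
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, the total variance each variable shares with the factors stays the same. Only the way that variance is split across factors changes.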

 

How to find these latent variables using Python?

 

Python has multiple third-party libraries that can help you with statistical analysis. Today we will be using the factor_analyzer library. We will also be using pandas to load our data into a data frame, which makes it easier to work with, and matplotlib to create graphs. One point to note is that certain commands in this code may not print their result directly if you are running it in the terminal. I would recommend using Jupyter notebooks, since they help visualize data well.

 

In the first step, we import the libraries that are required for the task and load our dataset. Once that is done, we will view the columns.
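The original code for this step is not reproduced here, so the following is only a sketch of what it might look like. The file name mentioned in the comment and the three rows of sample values are made up for illustration. An in-memory CSV stands in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# The later steps also need these two imports. They are commented out here
# so that this snippet runs with only pandas installed.
# from factor_analyzer import FactorAnalyzer
# import matplotlib.pyplot as plt

# Hypothetical stand-in for the real CSV file: three made-up rows and only a
# subset of the columns. The empty first header mimics a saved index column.
SAMPLE_CSV = """,A1,A2,N1,gender,education,age
1,2,4,3,1,3,21
2,2,4,3,2,2,18
3,5,4,4,2,4,17
"""

# With the real data this would be something like
# pd.read_csv("your_dataset.csv"), where the file name is an assumption.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# View the columns; the unnamed index column shows up as "Unnamed: 0"
print(df.columns)
```

The "Unnamed: 0" column appears because the first header cell of the file is empty, which is why it gets dropped during data preparation.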


Once you see the columns, you will get a fair idea of the number of features. You can also use df.head() to get a snapshot of the data.

Index(['Unnamed: 0', 'A1', 'A2', 'A3', 'A4', 'A5', 'C1', 'C2', 'C3', 'C4',
      'C5', 'E1', 'E2', 'E3', 'E4', 'E5', 'N1', 'N2', 'N3', 'N4', 'N5', 'O1',
      'O2', 'O3', 'O4', 'O5', 'gender', 'education', 'age'],
      dtype='object')

Before we try out an exploratory analysis of the dataset, we need to prepare the data. For this, we will drop the first column and the last three columns of the dataset and remove all rows that contain NaN values. Once the data is prepared, we will view its schema.

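# Steps required for data preparation
# Drop the demographic columns, which are not questionnaire items
df.drop(["gender", "education", "age"], axis=1, inplace=True)
# Drop the first column ("Unnamed: 0") and keep the 25 item columns
df = df.iloc[:, 1:26]
# Remove rows with missing values (reassigning avoids editing a slice in place)
df = df.dropna()
df.info()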

Fig: Output of df.info()

Once you have viewed the schema of the data, you can also see the data itself – a few rows of it.

#View the data
df.head()

 

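A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O2  O3  O4  O5
 2   4   3   4   4   2   3   3   4   4   3   3   3   4   4   3   4   2   2   3   3   6   3   4   3
 2   4   5   2   5   5   4   4   3   4   1   1   6   4   3   3   3   3   5   5   4   2   4   3   3
 5   4   5   4   4   4   5   4   2   5   2   4   4   4   5   4   5   4   2   3   4   2   5   5   2
 4   4   6   5   5   4   4   3   5   5   5   3   4   4   4   2   5   2   4   1   3   3   4   3   5
 2   3   3   4   5   4   4   5   3   2   2   2   5   4   5   2   3   4   4   3   3   3   4   3   3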

Fig: Output of df.head()

After this, we create a FactorAnalyzer object and extract the eigenvalues. These are used to create a scree plot.

# Create a FactorAnalyzer object with 6 factors and varimax rotation
fa = FactorAnalyzer(n_factors=6, rotation='varimax')
fa.fit(df)
# Extract the eigenvalues of the correlation matrix. The second return value
# holds the common-factor eigenvalues, which are not needed for the plot.
eigen_values, _ = fa.get_eigenvalues()

# Create the scree plot
plt.scatter(range(1, df.shape[1] + 1), eigen_values)
plt.plot(range(1, df.shape[1] + 1), eigen_values)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()

Fig: Scree Plot

From the scree plot, we can see that the number of eigenvalues greater than one is 6. Hence our guess of setting n_factors to 6 looks reasonable. In case of a mismatch, we could have reinitialized the FactorAnalyzer object before continuing to the next step. It is still possible that a different value of n_factors performs better, but we will know that only after generating the loading values.
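Counting the eigenvalues greater than one does not have to be done by eye. The sketch below applies the same rule with plain NumPy to synthetic data, since the article's dataset is not bundled here. The two-factor structure, the loading of 0.8, and the noise level of 0.6 are assumptions chosen for illustration. The eigenvalues are taken from the correlation matrix of the data, which is also what the scree plot shows.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic data (assumed structure): 6 variables driven by 2 hidden factors.
n_samples = 1000
factors = rng.normal(size=(n_samples, 2))
true_loadings = np.array([[0.8, 0.0], [0.8, 0.0], [0.8, 0.0],
                          [0.0, 0.8], [0.0, 0.8], [0.0, 0.8]])
noise = rng.normal(scale=0.6, size=(n_samples, 6))
data = factors @ true_loadings.T + noise

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(data, rowvar=False)
eigen_values = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Keep the factors whose eigenvalue is greater than one.
n_retained = int((eigen_values > 1).sum())
print(np.round(eigen_values, 2))
print("Factors to keep:", n_retained)
```

On a real data frame, the same count could be obtained from the eigen_values array returned by get_eigenvalues(). The rule is only a heuristic, which is why the loadings still need to be inspected afterwards.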

# Extract factor loadings for each variable and convert the 2D array to a DataFrame
loadings = fa.loadings_
# Label the rows with the variable names so the table is easier to read
df_new = pd.DataFrame(loadings, index=df.columns)

In this final step, we extract the factor loadings of each variable on each factor. We convert the resulting 2D array to a pandas DataFrame so that it is easier to read.

            0          1          2          3          4          5
A1   0.095220   0.040783   0.048734  -0.530987  -0.113057   0.161216
A2   0.033131   0.235538   0.133714   0.661141   0.063734  -0.006244
A3  -0.009621   0.343008   0.121353   0.605933   0.033990   0.160106
A4  -0.081518   0.219717   0.235140   0.404594  -0.125338   0.086356
A5  -0.149616   0.414458   0.106382   0.469698   0.030977   0.236519
C1  -0.004358   0.077248   0.554582   0.007511   0.190124   0.095035
C2   0.068330   0.038370   0.674545   0.057055   0.087593   0.152775
C3  -0.039994   0.031867   0.551164   0.101282  -0.011338   0.008996
C4   0.216283  -0.066241  -0.638475  -0.102617  -0.143846   0.318359
C5   0.284187  -0.180812  -0.544838  -0.059955   0.025837   0.132423
E1   0.022280  -0.590451   0.053915  -0.130851  -0.071205   0.156583
E2   0.233624  -0.684578  -0.088497  -0.116716  -0.045561   0.115065
E3  -0.000895   0.556774   0.103390   0.179396   0.241180   0.267291
E4  -0.136788   0.658395   0.113798   0.241143  -0.107808   0.158513
E5   0.034490   0.507535   0.309813   0.078804   0.200821   0.008747
N1   0.805806   0.068011  -0.051264  -0.174849  -0.074977  -0.096266
N2   0.789832   0.022958  -0.037477  -0.141134   0.006726  -0.139823
N3   0.725081  -0.065687  -0.059039  -0.019184  -0.010664   0.062495
N4   0.578319  -0.345072  -0.162174   0.000403   0.062916   0.147551
N5   0.523097  -0.161675  -0.025305   0.090125  -0.161892   0.120049
O1  -0.020004   0.225339   0.133201   0.005178   0.479477   0.218690
O2   0.156230  -0.001982  -0.086047   0.043989  -0.496640   0.134693
O3   0.011851   0.325954   0.093880   0.076642   0.566128   0.210777
O4   0.207281  -0.177746  -0.005671   0.133656   0.349227   0.178068
O5   0.063234  -0.014221  -0.047059  -0.057561  -0.576743   0.135936

Fig: Output of fa.loadings_ (values rounded to six decimal places)

 

What do the results tell us?

 

So you got a 2D array of 25 rows and 6 columns as the final output, with one row per variable and one column per factor. What does that mean? If you take a closer look at the data, you will spot that specific factors have high loading values for specific variables.

Factor 0: N1, N2, N3, N4, N5
Factor 1: E1, E2, E3, E4, E5
Factor 2: C1, C2, C3, C4, C5
Factor 3: A1, A2, A3, A4, A5
Factor 4: O1, O2, O3, O4, O5
Factor 5: No high loading value for any variable.
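The grouping above can also be read off programmatically instead of by eye. The sketch below copies the loadings from the table of fa.loadings_ output, rounded to two decimals, and assigns each variable to the factor on which its absolute loading is largest. The helper name group_by_dominant_factor is made up for this example, and "largest absolute loading" is one simple reading of "high loading", not the only possible one.

```python
# Loadings copied from the fa.loadings_ table, rounded to two decimals.
LOADINGS = {
    "A1": [0.10, 0.04, 0.05, -0.53, -0.11, 0.16],
    "A2": [0.03, 0.24, 0.13, 0.66, 0.06, -0.01],
    "A3": [-0.01, 0.34, 0.12, 0.61, 0.03, 0.16],
    "A4": [-0.08, 0.22, 0.24, 0.40, -0.13, 0.09],
    "A5": [-0.15, 0.41, 0.11, 0.47, 0.03, 0.24],
    "C1": [0.00, 0.08, 0.55, 0.01, 0.19, 0.10],
    "C2": [0.07, 0.04, 0.67, 0.06, 0.09, 0.15],
    "C3": [-0.04, 0.03, 0.55, 0.10, -0.01, 0.01],
    "C4": [0.22, -0.07, -0.64, -0.10, -0.14, 0.32],
    "C5": [0.28, -0.18, -0.54, -0.06, 0.03, 0.13],
    "E1": [0.02, -0.59, 0.05, -0.13, -0.07, 0.16],
    "E2": [0.23, -0.68, -0.09, -0.12, -0.05, 0.12],
    "E3": [0.00, 0.56, 0.10, 0.18, 0.24, 0.27],
    "E4": [-0.14, 0.66, 0.11, 0.24, -0.11, 0.16],
    "E5": [0.03, 0.51, 0.31, 0.08, 0.20, 0.01],
    "N1": [0.81, 0.07, -0.05, -0.17, -0.07, -0.10],
    "N2": [0.79, 0.02, -0.04, -0.14, 0.01, -0.14],
    "N3": [0.73, -0.07, -0.06, -0.02, -0.01, 0.06],
    "N4": [0.58, -0.35, -0.16, 0.00, 0.06, 0.15],
    "N5": [0.52, -0.16, -0.03, 0.09, -0.16, 0.12],
    "O1": [-0.02, 0.23, 0.13, 0.01, 0.48, 0.22],
    "O2": [0.16, 0.00, -0.09, 0.04, -0.50, 0.13],
    "O3": [0.01, 0.33, 0.09, 0.08, 0.57, 0.21],
    "O4": [0.21, -0.18, -0.01, 0.13, 0.35, 0.18],
    "O5": [0.06, -0.01, -0.05, -0.06, -0.58, 0.14],
}


def group_by_dominant_factor(loadings):
    """Map each factor index to the variables that load most strongly on it."""
    n_factors = len(next(iter(loadings.values())))
    groups = {factor: [] for factor in range(n_factors)}
    for variable, row in loadings.items():
        # The sign only gives the direction, so compare absolute values.
        dominant = max(range(n_factors), key=lambda j: abs(row[j]))
        groups[dominant].append(variable)
    return groups


groups = group_by_dominant_factor(LOADINGS)
for factor, variables in groups.items():
    print(f"Factor {factor}: {', '.join(variables) or 'no dominant variable'}")
```

The last factor ends up with no variables at all, which is the signal that one factor too many was requested.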

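So although six eigenvalues were greater than one, the last factor does not carry a high loading for any variable. In a way, our estimate missed the mark, and the results should be cleaner if the analysis is redone with just 5 factors instead of 6. You can try that out on your system and share the results!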

© Promptcloud 2009-2020 / All rights reserved.