Exploring Enron Corpus Using Machine Learning

By Michael Phillips

1. Background and goals

The Enron corpus is a collection of hundreds of thousands of e-mails sent back and forth between Enron employees before investigations began into corporate fraud at the company. This information was combined with financial data and a manually created list of Persons of Interest (POI's). The goal of the project was to feed the data into machine learning algorithms and classify each person as a POI or non-POI, matching the manually created list as closely as possible. Machine learning is useful here because the dataset, while not huge, is still far too big for a person to make much sense of by hand. Machine learning algorithms can comb through the data and find connections within it automatically, and much faster than a person could.

Some exploration of the dataset was helpful to understand what data was available and what state it was in. Some characteristics of the data that were found are:

  • Shape of data: 146 rows, 21 variables
  • Number of POI's: 18
  • Outliers: 2 observations were removed from the data, the 'TOTAL' row and the 'TRAVEL AGENCY IN THE PARK' row. Neither is a person: the first is a spreadsheet summary row that made its way into the data, and the second is a business.
  • There were some features with many missing values, namely 'deferral_payments' with 107 NaN's, 'director_fees' with 129, 'loan_advances' with 142, and 'restricted_stock_deferred' with 128. I chose to leave these in and decide whether to remove them after running some machine learning algorithms and testing their performance; ultimately, I left them in the dataset used for testing. A sketch of how these counts can be produced follows this list.
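
A minimal sketch of the missing-value count above, assuming the data lives in the project's pickled dict-of-dicts (the file name final_project_dataset.pkl, the exact outlier keys, and the 'NaN' string convention are assumptions here, not taken from this report):

import pickle
from collections import Counter

with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# Drop the two non-person rows noted above (exact dictionary keys assumed).
for outlier in ("TOTAL", "TRAVEL AGENCY IN THE PARK"):
    data_dict.pop(outlier, None)

# Count how many people have 'NaN' recorded for each feature.
nan_counts = Counter()
for person, features in data_dict.items():
    for feature_name, value in features.items():
        if value == "NaN":
            nan_counts[feature_name] += 1

for feature_name, count in nan_counts.most_common():
    print(f"{feature_name}: {count} NaN's")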

2. Feature Selection

I decided to use a combination of PCA and SelectKBest to choose my features. All of the features except 'POI' and each person's e-mail address were fed into PCA so that the algorithm could examine them and find the most useful mix for classifying POI's. I was curious whether this approach of giving the algorithm as much data as possible would work as well as I intuitively assumed it would.

Also included was the feature I engineered: the ratio of e-mails sent to POI's to the total e-mails a person sent. I thought this would be a way to find out which Enron employees were sending larger or smaller proportions of their e-mail to POI's, regardless of how many e-mails they had sent in total. As it turned out the results were quite good, so I stuck with my original intuition. Additionally, I scaled the features using a MinMaxScaler. PCA assumes the variables are on similar scales, since it works by maximizing the variance captured across the features. Scaling the data before PCA ensures that the algorithm can compare variance between items whose raw scales are wildly different; for example, PCA will compare the number of e-mails sent with a person's salary, and that only works if both are on the same scale.
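
A short sketch of how that ratio feature can be computed, continuing from the data_dict loaded in the earlier sketch. The key names 'from_this_person_to_poi' and 'from_messages', and the choice to fall back to 0 when values are missing, are my assumptions rather than details taken from this report:

def poi_email_ratio(person):
    # Ratio of e-mails this person sent to POI's over all e-mails they sent.
    to_poi = person.get("from_this_person_to_poi", "NaN")
    total_sent = person.get("from_messages", "NaN")
    if to_poi == "NaN" or total_sent in ("NaN", 0):
        return 0.0
    return float(to_poi) / float(total_sent)

for person in data_dict.values():
    person["poi_email_ratio"] = poi_email_ratio(person)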

SelectKBest was used after PCA to select the most relevant feature(s) to use in my decision tree classifier. I used GridSearchCV across the steps to find the optimal parameters. My input parameters were 3, 4, or 5 PCA components, and 1, 2, or 3 features selected by KBest. I decided on these options for the grid search because there were 18 features to start with, and I expected that after reducing dimensions with PCA most of the variance in the data would be captured in 3, 4, or 5 principal components; GridSearchCV chose 3 in my eventual model. For the SelectKBest parameters I wanted to give the grid search some options in deciding what produced the best results. I was curious what it would pick, and was surprised that only 1 principal component gave the best results in my decision tree.
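
A sketch of the scaling / PCA / SelectKBest / decision tree pipeline described above, with the grid of options given to GridSearchCV. Building the feature matrix is omitted (X and y are assumed to hold the numeric features, minus 'POI' and e-mail address, and the POI labels), and the F1 scoring choice and f_classif score function are assumptions rather than the report's exact settings:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("scale", MinMaxScaler()),          # put all features on a common scale
    ("pca", PCA()),                     # mix the raw features into components
    ("kbest", SelectKBest(f_classif)),  # keep only the strongest component(s)
    ("tree", DecisionTreeClassifier()),
])

param_grid = {
    "pca__n_components": [3, 4, 5],
    "kbest__k": [1, 2, 3],
}

grid = GridSearchCV(pipeline, param_grid, scoring="f1")
grid.fit(X, y)  # X, y assumed to be prepared elsewhere
print(grid.best_params_)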

Some descriptive statistics of the best fit model as chosen by GridSearchCV:

PCA Components - the most important features were 'salary', 'bonus', and 'shared_receipt_with_poi'. My engineered feature scored decently at .2177 in the first principal component:

[ 0.41583361  0.14516597  0.11322756  0.35413973  0.22300055  0.21847095
  0.17837527  0.22242261 -0.26732407  0.24575159  0.13860938  0.39270083
  0.21776259 -0.20748451  0.10157953  0.18626908 -0.02688082  0.21802625]

SelectKBest Scores - only the first principal component was chosen in the final model:

[ 3.17762766e+01  1.18702204e+00  1.29861970e-02]

Decision Tree Feature Importance - the model only had 1 feature, which had a feature importance of 1.

3. Choose an Algorithm

The final algorithm I decided on was a decision tree. I chose it because the algorithm was performant on my machine, taking about 5-10 minutes for a full validated run through the data. This was important because it allowed me to tweak parameters and test different outputs (such as feature scores, PCA scores, etc.) much faster than a slower algorithm would have. Decision trees also happened to produce impressive results, easily beating much slower algorithms that would have made iteration and troubleshooting more laborious. The combination of the decision tree's classification performance and the speed with which it produced that performance met every need I had for this project.

Final Decision Tree evaluation metrics:

Accuracy: 0.83847
Precision: 0.38411
Recall: 0.35050
F1: 0.36654
F2: 0.36674
Total predictions: 15000
True positives: 701
False positives: 1124
False negatives: 1299
True negatives: 11876

I did try several other algorithms, which I will briefly describe below.

SVM – I concluded that SVM would certainly work for this task, but it ran too slowly (approximately 20 to 30 minutes for a full run) for me to realistically iterate on parameters enough to produce satisfactory results.

Example SVM evaluation metrics:

Accuracy: 0.79767
Precision: 0.22977
Recall: 0.22000
F1: 0.22478
F2: 0.22189
Total predictions: 15000
True positives: 440
False positives: 1475
False negatives: 1560
True negatives: 11525

Gaussian Naïve Bayes – Speed was impressive here, but performance, while surprisingly good, was not quite good enough. With no parameters to tune, I quickly moved on from GaussianNB.

Example GaussianNB evaluation metrics:

Accuracy: 0.86847
Precision: 0.51346
Recall: 0.25750
F1: 0.34299
F2: 0.28602
Total predictions: 15000
True positives: 515
False positives: 488
False negatives: 1485
True negatives: 12512

RandomForest – If one decision tree is very good, would 1000 be great? Sadly, no. To be fair, it's likely RandomForest could be tuned to perform very well on the Enron data, but again I didn't like the time-value proposition of tuning given the very long training and testing times (~30 minutes).

Example RandomForest evaluation metrics:

Accuracy: 0.81920
Precision: 0.29540
Recall: 0.25700
F1: 0.27487
F2: 0.26386
Total predictions: 15000
True positives: 514
False positives: 1226
False negatives: 1486
True negatives: 11774
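
The comparisons above can be run by dropping each candidate classifier into the same pipeline. A rough sketch, reusing the imports and X, y from the section 2 sketch; the cross-validation settings and the RandomForest size shown here are illustrative assumptions, not the exact setup behind the numbers above:

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidates = {
    "svm": SVC(),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=1000),
}

for name, clf in candidates.items():
    candidate = Pipeline([
        ("scale", MinMaxScaler()),
        ("pca", PCA(n_components=3)),
        ("kbest", SelectKBest(f_classif, k=1)),
        ("clf", clf),
    ])
    # Quick cross-validated F1 comparison; the full evaluation used the
    # StratifiedShuffleSplit procedure described in section 5.
    scores = cross_val_score(candidate, X, y, scoring="f1", cv=5)
    print(name, scores.mean())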

4. Parameter Tuning

Parameter tuning is the process of adjusting the settings that control how a machine learning algorithm runs, how it makes its decisions, and how it produces results. This can have a dramatic influence on the results of running the algorithm. If it is not done correctly it can lead to overfit or underfit models that do not classify or predict values accurately. To take some of the guesswork out of the process, I used GridSearchCV to find the optimal parameters for the algorithms I experimented with.

As mentioned earlier, in my final pipeline I gave GridSearchCV several options: 3, 4, or 5 principal components to choose from during the PCA step, 1 or 2 of those components for SelectKBest to keep, and 10, 20, 30, or 40 as values for the decision tree's 'min_samples_split'. 'min_samples_split' is the minimum number of samples a tree node must contain before an additional split can be performed. GridSearchCV tried all of the different combinations and decided that 3 PCA components, 1 KBest feature, and a min_samples_split of 10 for the decision tree produced the best results.
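
A sketch of that grid, extending the pipeline and GridSearchCV setup from the section 2 sketch; the cv and scoring choices are assumptions:

param_grid = {
    "pca__n_components": [3, 4, 5],
    "kbest__k": [1, 2],
    "tree__min_samples_split": [10, 20, 30, 40],
}

grid = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
grid.fit(X, y)

# Best combination reported above: 3 PCA components, k=1, min_samples_split=10.
print(grid.best_params_)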

5. Validation

Validation in supervised learning is the practice of splitting the data into train and test sets, where the class labels are known for both. In the training step, the labels and the feature data are combined to build a model that can predict labels for unseen data. In the test step, the known labels are hidden while the model makes its classifications, and the predicted labels are then compared to the actual ones. Evaluation metrics are then used to determine how effective the algorithm was at predicting these 'unknown' classes.

Since the positive class was sparse (only 18 actual POI's in the dataset), I chose to use Stratified Shuffle Split as my validation method. With a different method such as a single Train Test Split, the sparseness of the data means a split might contain only 1 or even 0 POI's, in which case the evaluation metrics would be skewed and very misleading.

Stratified Shuffle Split works a bit differently than Train Test Split. Instead of splitting the data once based on a percentage, SSS produces a number of train/test index sets (set by a parameter), each preserving the same percentage of samples from each class as the full dataset. These index sets are then used to run many separate tests of the data, and the evaluation metrics averaged over all of the tests are reported at the end.
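
A sketch of that procedure, reusing the fitted-pipeline setup from section 2 and assuming the common approach of pooling predictions across all splits before computing the metrics. The split count, test size, and use of NumPy arrays for X and y are illustrative assumptions:

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import precision_score, recall_score

sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)

y_true, y_pred = [], []
for train_idx, test_idx in sss.split(X, y):
    # Each split keeps the same POI / non-POI proportions as the full dataset.
    pipeline.fit(X[train_idx], y[train_idx])
    preds = pipeline.predict(X[test_idx])
    y_true.extend(y[test_idx])
    y_pred.extend(preds)

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))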

6. Evaluation Metrics

Two evaluation metrics that I used to evaluate my chosen algorithm's performance were precision and recall. Generally speaking, precision is the number of true positives divided by the number of true positives plus false positives. Recall is the number of true positives divided by the number of true positives plus false negatives. Precision can also be thought of as the probability that a positive classification is a true positive, and recall as the probability that an actual positive in the data is correctly classified as positive.

The final results for the chosen decision tree algorithm were as follows:

Precision: 0.38411
Recall: 0.35050
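
These two numbers follow directly from the true positive, false positive, and false negative counts reported for the final decision tree in section 3; a quick arithmetic check:

tp, fp, fn = 701, 1124, 1299   # counts from the final decision tree run

precision = tp / (tp + fp)     # 701 / 1825
recall = tp / (tp + fn)        # 701 / 2000

print(f"Precision: {precision:.5f}")   # 0.38411
print(f"Recall:    {recall:.5f}")      # 0.35050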

For this particular project, precision can be thought of as how 'strict' the algorithm is: with high precision, when a person is flagged as a POI it is very likely that person actually is a POI (and not a false positive), although the algorithm is then more likely to play it safe and miss edge cases that really are POI's. From the precision score of .38, you could say that of all the people the algorithm flagged as POI's, 38% actually were POI's. Recall can be thought of as how wide a net you are casting: with a high recall score most (or all) POI's will be correctly classified, but there will likely be false positives mixed in as well. From the recall score of .35, you could say the algorithm correctly flagged 35% of the actual POI's in the dataset.