This project applies Exploratory Data Analysis (EDA) and Natural Language Processing (NLP) techniques to a large corpus of emails from the Enron Email Dataset.
The primary goals are to:
- Identify key discussion topics
- Measure topic frequency
- Analyze sentiment patterns within corporate email communications
By combining traditional NLP preprocessing with topic modeling and sentiment analysis, this project provides insights into the dominant themes present in the Enron email corpus.
Objectives:
- Perform exploratory data analysis on email text data
- Clean and preprocess unstructured text
- Extract latent topics using topic modeling
- Evaluate topic quality using coherence scores
- Analyze sentiment using VADER
- Summarize dominant themes across the dataset
Source: Kaggle – Enron Email Dataset
https://www.kaggle.com/datasets/wcukierski/enron-email-dataset
File used: emails.csv
This dataset contains roughly half a million real emails exchanged by Enron employees prior to the company's collapse.
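Each email in the file is stored as raw message text, with headers and body together, so a parsing step is usually needed before analysis. The sketch below uses only Python's standard `email` module on a made-up message; the helper name `parse_email` and the choice of extracted fields are illustrative assumptions, not the project's code.

```python
from email import message_from_string

# A made-up message in the same raw "headers, blank line, body" layout.
raw = (
    "Message-ID: <123.example@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Gas forecast\n"
    "\n"
    "Here is our forecast.\n"
)


def parse_email(raw_message):
    """Split a raw message string into a few header fields and the body text."""
    msg = message_from_string(raw_message)
    body = msg.get_payload()
    return {
        "date": msg["Date"],
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        # Multipart messages return a list payload; this sketch keeps plain text only.
        "body": body.strip() if isinstance(body, str) else "",
    }


parsed = parse_email(raw)
print(parsed["subject"], "|", parsed["body"])
```

With pandas, the same helper could be applied to the raw text column after loading the CSV, for example `df["message"].apply(parse_email)`, assuming the column is named `message`.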
Tools and libraries:
- Python
- pandas
- numpy
- re, string
- warnings
- nltk
- spaCy
- gensim
- scikit-learn (NMF)
- TF-IDF Vectorization
- Gensim CoherenceModel
- NLTK VADER SentimentIntensityAnalyzer
- matplotlib
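As an illustration of how the `re` and `string` modules listed above typically fit into text cleaning, here is a minimal sketch. The stopword set is a tiny hand-written stand-in (a real run would more likely use the NLTK or spaCy stopword lists), and the function name `clean_text` and the specific cleaning rules are assumptions for the example.

```python
import re
import string

# Tiny illustrative stopword set; not the list used by the project.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "for", "is", "in", "on", "at", "we"}

_URL_OR_EMAIL = re.compile(r"(https?://\S+|www\.\S+|\S+@\S+)")
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lowercase, drop URLs/addresses, punctuation and digits, then filter tokens."""
    text = text.lower()
    text = _URL_OR_EMAIL.sub(" ", text)      # remove links and email addresses
    text = text.translate(_PUNCT_TO_SPACE)   # punctuation becomes whitespace
    text = re.sub(r"\d+", " ", text)         # strip digit runs
    # Keep tokens longer than two characters that are not stopwords.
    return [tok for tok in text.split() if tok not in STOPWORDS and len(tok) > 2]


tokens = clean_text(
    "Please review the Q3 gas contract at http://example.com, "
    "and reply to bob@example.com by 5pm!"
)
print(tokens)  # ['please', 'review', 'gas', 'contract', 'reply']
```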
Key package versions used:
numpy 1.26.4
scipy 1.13.1
gensim 4.3.3
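To confirm that an environment matches the versions listed above, a small check along these lines may help. It is a sketch using only the standard library; the `PINNED` dictionary simply repeats the versions listed, and the helper name `check_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above.
PINNED = {"numpy": "1.26.4", "scipy": "1.13.1", "gensim": "4.3.3"}


def check_versions(pinned):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == expected)
    return report


for name, (installed, ok) in check_versions(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={installed} expected={PINNED[name]} [{status}]")
```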