saad-sharif/Enron-Emails-NLP

Enron NLP – Topic Modeling & Email Analysis

Project Overview

This project applies Exploratory Data Analysis (EDA) and Natural Language Processing (NLP) techniques to a large corpus of emails from the Enron Email Dataset.

The primary goals are to:

  • Identify key discussion topics
  • Measure topic frequency
  • Analyze sentiment patterns within corporate email communications

Combining conventional NLP preprocessing with topic modeling and sentiment analysis surfaces the dominant themes in the Enron email corpus and the tone in which they are discussed.


Objectives

  • Perform exploratory data analysis on email text data
  • Clean and preprocess unstructured text
  • Extract latent topics using topic modeling
  • Evaluate topic quality using coherence scores
  • Analyze sentiment using VADER
  • Summarize dominant themes across the dataset
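
Of the objectives above, coherence scoring is the least self-explanatory, so here is a small from-scratch sketch of one coherence measure, UMass. The README does not say which measure the project used; UMass is picked here only because it needs nothing beyond document co-occurrence counts. Gensim's `CoherenceModel` offers it as `coherence="u_mass"`, but its exact numbers will differ from this sketch, which uses add-one smoothing and averages over word pairs. The function name and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs):
    """Average UMass-style pair score for one topic (closer to 0 is better).

    topic_words must be ordered from most to least important.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    # combinations() preserves order, so `earlier` always outranks `later`.
    for earlier, later in combinations(topic_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # the word never occurs, so the pair is undefined
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus, invented for illustration only.
toy_docs = [
    ["gas", "pipeline", "volume"],
    ["gas", "pipeline", "price"],
    ["meeting", "schedule", "monday"],
    ["meeting", "agenda", "monday"],
]
coherent = umass_coherence(["gas", "pipeline"], toy_docs)
mixed = umass_coherence(["gas", "meeting"], toy_docs)
print(coherent > mixed)
```

Words that co-occur in the same documents score higher than words drawn from unrelated threads, which is the property used to compare models with different numbers of topics.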

Dataset

Source: Kaggle – Enron Email Dataset
https://www.kaggle.com/datasets/wcukierski/enron-email-dataset

File Used:

  • emails.csv

This dataset contains approximately 500,000 real emails exchanged by Enron employees before the company's collapse.
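
Each record in the dataset is a raw email with headers and body in one string, so a first step is usually to split the two. The following is a minimal sketch using only the standard library; `parse_message` and the sample message are hypothetical, and the repository's own parsing code may differ.

```python
from email import message_from_string


def parse_message(raw: str) -> dict:
    """Split one raw email into a few headers and its plain-text body."""
    msg = message_from_string(raw)
    # Most messages are single-part plain text; walk() also covers multipart ones.
    parts = [
        part.get_payload()
        for part in msg.walk()
        if part.get_content_type() == "text/plain"
    ]
    return {
        "from": msg.get("From"),
        "to": msg.get("To"),
        "subject": msg.get("Subject"),
        "date": msg.get("Date"),
        "body": "\n".join(parts).strip(),
    }


# Invented sample in the same header-then-blank-line-then-body layout.
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Gas schedule\n"
    "\n"
    "Please confirm the volumes for Friday.\n"
)
parsed = parse_message(sample)
print(parsed["subject"])
print(parsed["body"])
```

Assuming the CSV stores the raw text in a column named `message`, the same function could be applied row by row after loading, for example with `pd.read_csv("emails.csv")["message"].apply(parse_message)`.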


Technologies & Libraries Used

Core Libraries

  • Python
  • pandas
  • numpy
  • re, string
  • warnings

NLP & Text Processing

  • nltk
  • spaCy
  • gensim
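
As an illustration of what cleaning unstructured email text can involve, the sketch below lowercases the text, drops URLs and email addresses, strips punctuation and digits, and filters short tokens and stop words. It relies only on `re` and `string`; the tiny stop-word set is a stand-in for the fuller lists that NLTK or spaCy provide, and `clean_text` is a hypothetical name rather than a function from the repository.

```python
import re
import string

# Tiny illustrative stop-word set; a real run would likely use NLTK's or spaCy's list.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "for", "in", "on", "is", "are", "please"}

URL_OR_EMAIL = re.compile(r"https?://\S+|www\.\S+|\S+@\S+")
# Map every punctuation character to a space so words stay separated.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str, min_len: int = 3) -> list[str]:
    """Return lowercase tokens with URLs, addresses, punctuation and digits removed."""
    text = text.lower()
    text = URL_OR_EMAIL.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)
    return [t for t in text.split() if len(t) >= min_len and t not in STOP_WORDS]


tokens = clean_text(
    "Please confirm the 3 gas volumes for Friday: http://example.com/x or mail bob@enron.com!"
)
print(tokens)  # ['confirm', 'gas', 'volumes', 'friday', 'mail']
```

Lemmatization, for instance with spaCy, would normally follow this step; it is left out here to keep the sketch dependency-free.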

Topic Modeling

  • scikit-learn (NMF)
  • TF-IDF Vectorization
  • Gensim CoherenceModel

Sentiment Analysis

  • NLTK VADER SentimentIntensityAnalyzer
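
VADER returns four scores per text (`neg`, `neu`, `pos` and a normalized `compound` value between -1 and 1). A common convention is to label a text positive when `compound` is at least 0.05 and negative when it is at most -0.05. The sketch below applies that convention; whether this project used the same cut-offs is not stated, and the helper names are hypothetical.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def build_analyzer():
    """Create a VADER analyzer, fetching the lexicon on first use."""
    # Imported lazily so label_from_compound works without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def score_email(text: str, analyzer) -> dict:
    """Return VADER's score dictionary plus a derived label."""
    scores = analyzer.polarity_scores(text)
    scores["label"] = label_from_compound(scores["compound"])
    return scores


print(label_from_compound(0.6), label_from_compound(0.0), label_from_compound(-0.3))
```

Typical use would build the analyzer once and reuse it, for example `analyzer = build_analyzer()` followed by `score_email(body, analyzer)` for each email body.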

Visualization

  • matplotlib

Environment & Versions

Key package versions used:

numpy    1.26.4
scipy    1.13.1
gensim   4.3.3
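
Before running the analysis, it can help to confirm that the local environment matches these pins. The following is a small sketch using the standard library's `importlib.metadata`; `check_versions` is a hypothetical helper, not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README.
PINNED = {"numpy": "1.26.4", "scipy": "1.13.1", "gensim": "4.3.3"}


def check_versions(pinned: dict) -> dict:
    """Return {package: (wanted, found)} for every package that does not match."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in check_versions(PINNED).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means all three packages are installed at the listed versions.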
