The Advanced Data Analysis Toolkit is a powerful solution for analyzing and visualizing relational databases. It allows users to connect to databases, explore schemas, perform in-depth data analysis, and generate reusable reports. The toolkit is built with advanced Python techniques and modern development practices for scalability, maintainability, and ease of use.
Reports are composed of the following sections:
- Schema Diagram: Visualizes tables, primary keys, foreign keys, and their relationships.
- Data Quality Overview: Summarizes missing values, null distributions, and column completeness.
- Descriptive Statistics: Provides ranges, percentiles, distributions, and outlier detection for numerical fields.
- Prompt-Driven Queries: User prompts are transformed into SQL queries using a Retrieval-Augmented Generation (RAG) model.
  - Includes a random question generator for exploratory analysis.
- Result Tables: Displays outputs from executed queries.
- Visualizations: Presents graphical representations of query results.
Example dynamic analysis questions for the Chinook database:
- How many employees are there in each age group?
- Which are the top 10 most frequently used genres (with usage counts)?
- Which genre has generated the highest total revenue?
- What are the total revenues and number of tracks sold for each genre (subplots/bar plots with two axes)?
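To illustrate the kind of SQL a prompt such as "Which genre has generated the highest total revenue?" might be translated into, here is a sketch against a minimal in-memory subset of the Chinook schema. The sample rows are invented for the example; the real Chinook database contains far more data.

```python
import sqlite3

# Build a tiny in-memory subset of the Chinook schema with invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, GenreId INTEGER);
    CREATE TABLE InvoiceLine (
        InvoiceLineId INTEGER PRIMARY KEY,
        TrackId INTEGER, UnitPrice REAL, Quantity INTEGER);
    INSERT INTO Genre VALUES (1, 'Rock'), (2, 'Jazz');
    INSERT INTO Track VALUES (10, 1), (11, 2);
    INSERT INTO InvoiceLine VALUES (100, 10, 0.99, 3), (101, 11, 0.99, 1);
""")

# SQL of the kind a RAG model might emit for the "highest total revenue" prompt.
query = """
    SELECT g.Name, SUM(il.UnitPrice * il.Quantity) AS revenue
    FROM InvoiceLine il
    JOIN Track t ON t.TrackId = il.TrackId
    JOIN Genre g ON g.GenreId = t.GenreId
    GROUP BY g.Name
    ORDER BY revenue DESC
    LIMIT 1;
"""
top_genre, revenue = conn.execute(query).fetchone()
```

With the toy data above, the query returns `Rock` as the top genre.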
Currently, the toolkit supports:
- SQLite
Database type and connection URL can be configured in the settings.
Diagrams and result tables can be expanded into preview windows for enhanced visibility. The preview supports zooming, panning, and uses SVG rendering for high-quality graphics.
All images and tables can be exported as SVG or CSV files.
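The CSV side of the export can be sketched with the standard library; the `header` and `rows` data below are illustrative, and the toolkit's actual export path may differ:

```python
import csv
import io

# Illustrative query result: a header plus data rows.
header = ["Genre", "Revenue"]
rows = [("Rock", 826.65), ("Latin", 382.14)]

# Write to an in-memory buffer; a real export would stream this to a file download.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerows(rows)
csv_text = buf.getvalue()
```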
The justfile is used for task automation. Key tasks include:
- Linting:
  - `just lint`: Runs code style checks using `ruff`, `black`, `mypy`, `vulture`, and `pip-audit`.
  - `just lint-fix`: Automatically fixes code style issues.
  - `just lint-full`: Runs both `lint-fix` and `lint`.
- Testing:
  - `just test *args`: Runs tests with optional arguments.
  - `just coverage`: Generates a coverage report.
- Serving the Application:
  - `just serve`: Starts the server in production mode.
  - `just serve-dev`: Starts the server in development mode with auto-reload.
  - `just serve-presentation`: Starts the server with presentation settings (optimized for generative models).
- Fetching Sample Data:
  - `just fetch-chinook`: Downloads and sets up the Chinook sample database.
The toolkit can be deployed using Docker Compose. The application can be started in three modes:
```shell
docker compose up toolkit-server
docker compose up toolkit-dev
docker compose up toolkit-presentation
```

These correspond to the `serve`, `serve-dev`, and `serve-presentation` just commands, respectively.
All the commands will launch the server and make it accessible at the configured port.
The toolkit can also be run from the command line. Example usage:
```shell
python ./src/driver.py --help
```

The CLI supports standard logging and verbosity flags. Use `-v` to increase verbosity:

- No `-v`: warnings and errors only
- `-v`: info messages
- `-vv` or more: debug messages
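The `-v` counting described above can be implemented with argparse's `count` action. This is a minimal sketch of that mapping, not necessarily the toolkit's exact flag handling:

```python
import argparse
import logging

def log_level(argv: list[str]) -> int:
    """Map repeated -v flags to a logging level:
    none -> WARNING, -v -> INFO, -vv or more -> DEBUG."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", action="count", default=0)
    args = parser.parse_args(argv)
    if args.verbose >= 2:
        return logging.DEBUG
    if args.verbose == 1:
        return logging.INFO
    return logging.WARNING
```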
For the sake of clarity, logging-related dependencies have been omitted from the dependency graph.
The frontend and backend communicate via a RESTful API. The frontend sends requests to the backend for data retrieval, report generation, and other operations. The backend processes these requests, interacts with the database, and returns the results to the frontend for display. Frontend components are hosted as static files served by the backend.
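The split between REST endpoints and static frontend assets can be sketched as a pure routing function. The paths, payload shape, and `frontend/dist` location here are invented for illustration; the actual backend framework and routes may differ:

```python
import json
from pathlib import Path

# Assumed location of the built frontend assets.
STATIC_ROOT = Path("frontend/dist")

def handle_request(path: str) -> tuple[str, str]:
    """Return (content_type, body) for a request path -- a sketch of the
    backend's split between REST API endpoints and static frontend files."""
    if path.startswith("/api/"):
        # A real backend would query the database here; this returns a stub payload.
        payload = {"endpoint": path, "status": "ok"}
        return ("application/json", json.dumps(payload))
    # Everything else falls through to static file serving; the SPA root
    # maps to index.html.
    name = "index.html" if path == "/" else path.lstrip("/")
    return ("text/html", f"<serve {STATIC_ROOT / name}>")
```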
This project employs several advanced Python techniques and tools:
- Unit Testing: Comprehensive test coverage using `pytest`.
- Pytest Fixtures: Reusable, isolated test setup using `@pytest.fixture`.
- Mocking: Controlled dependency isolation with `pytest` monkeypatching.
- Dataclasses: Structured, lightweight data containers using `@dataclass`.
- Type Hinting: Explicit static typing with Python type annotations to improve readability and tooling support.
- Exception Handling: Explicit and robust error handling to ensure reliability.
- Separation of Concerns: Clear boundaries between API access, retry logic, and application logic.
- Configuration via Environment Variables: Runtime configuration using environment variables and `.env` files.
- Logging: Structured logging for debugging and operational visibility.
- Python Environment Management: Dependency and virtual environment management using `uv`.
- Containerization: Application containerization using Docker and Docker Compose for reproducible environments.
- CI Pipeline: Automated linting checks executed as part of a continuous integration pipeline.
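Several of these techniques fit together in a short sketch. The names (`Settings`, `load_settings`, `DB_URL`) are invented for illustration, not taken from the toolkit's codebase:

```python
import os
from dataclasses import dataclass

import pytest

@dataclass
class Settings:
    """Lightweight configuration container (dataclass technique)."""
    db_url: str

def load_settings() -> Settings:
    # Configuration via environment variables, with a safe default.
    return Settings(db_url=os.environ.get("DB_URL", "sqlite:///chinook.db"))

@pytest.fixture
def fake_env(monkeypatch):
    # Fixture + monkeypatching: isolate the test from the real environment.
    monkeypatch.setenv("DB_URL", "sqlite:///:memory:")

def test_load_settings_reads_env(fake_env):
    assert load_settings().db_url == "sqlite:///:memory:"
```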
The development environment for the Advanced Data Analysis Toolkit is designed to ensure consistency, ease of use, and scalability. Below are the key components and tools used in the development process:
The project uses uv for package management, which provides:
- Isolated Environments: Each project has its own virtual environment to avoid dependency conflicts.
- Dependency Caching: Speeds up installation by caching dependencies.
- Reproducible Builds: Ensures that the same dependencies are installed across different environments.
The project includes a .devcontainer configuration, which provides:
- Pre-installed Dependencies: Ensures all required libraries and tools are available.
- Docker-based Isolation: Guarantees a consistent environment across different machines.
- Simplified Onboarding: New contributors can quickly set up their environment.
The justfile is used for task automation, allowing developers to run common tasks such as linting, testing, and serving the application with simple commands. This streamlines the development workflow and reduces the potential for errors. The available commands were detailed in the "Usage" section above.
The Dockerfile defines a multi-stage build process:
- Builder Stage:
  - Uses the `uv` image for dependency management and bytecode compilation.
  - Installs necessary system dependencies like `libcairo2` and `graphviz`.
  - Caches dependencies for faster builds.
- Runtime Stage:
  - Uses a slim Python image for a lightweight runtime environment.
  - Copies the application and dependencies from the builder stage.
  - Exposes port `8000` for the application.
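A multi-stage build along these lines could look as follows. This is a sketch consistent with the stages described above, not the project's actual Dockerfile; the base image tags, file paths, and entry command are assumptions:

```dockerfile
# Builder stage: uv base image (tag is an assumption) for dependency management.
FROM ghcr.io/astral-sh/uv:python3.12-bookworm AS builder
RUN apt-get update && apt-get install -y libcairo2 graphviz
WORKDIR /app
COPY pyproject.toml uv.lock ./
# Cache the uv download cache between builds and compile bytecode up front.
RUN --mount=type=cache,target=/root/.cache/uv uv sync --frozen --compile-bytecode
COPY . .

# Runtime stage: slim Python image; copy the app and environment from the builder.
FROM python:3.12-slim AS runtime
WORKDIR /app
COPY --from=builder /app /app
EXPOSE 8000
CMD ["python", "./src/driver.py"]
```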
- `ruff`: Ensures code adheres to style guidelines.
- `black`: Automatically formats code for consistency.
- `mypy`: Performs static type checking.
- `vulture`: Identifies unused code.
- `pip-audit`: Checks for vulnerabilities in dependencies.
- Git Hooks: Pre-commit hooks are used for linting and testing.
- Pull Requests: All changes are introduced via pull requests, requiring approval from at least one other team member.
- Copilot Auto-Reviews: GitHub Copilot is configured to provide automated code reviews, assisting in maintaining code quality and consistency.