Local Research Project

🕸️ Project Clarity

Code Intelligence & Knowledge Graph Visualisation — transforming opaque Python repositories into interactive, human-readable dependency maps with semantic decomposition and zero-loss reconstruction.

100% Code Reconstruction

V5 Current Version

3 Abstraction Levels

1000+ Node Graph Scale

Python, AST, Pyvis / vis.js, Streamlit, Docker, GitPython, Radon, Pydantic

TL;DR — Quick Summary

Treats source code not as flat text, but as a dynamic, interconnected knowledge graph — decomposing Python repositories into CodeChunk objects (functions, classes, module-level scopes) and mapping their dependencies as a weighted directed graph.
Custom AST NodeVisitor performs multi-pass analysis: identification → dependency mapping → data linkage. Resolves aliased imports (import pandas as pd) to reflect true dependency origins.
Three levels of abstraction: Function-level (granular), File-level (structural), and Directory-level (architectural) — all rendered as interactive physics-based force-directed graphs via vis.js.
Zero-Loss Rebuild Engine: takes granular CodeChunk objects, sorts them by original line number, and reconstructs the entire repository from scratch. A rebuilt repo that runs identically to the original is taken as evidence that the decomposition lost nothing.
Dockerised for zero-config deployment; analysed real-world research codebases including protein structure analysis and data engineering pipelines.

The Core Idea

Most code-understanding tools work by searching text: grep, regex, static text analysis. Project Clarity takes a different approach. It treats a Python codebase as a graph of logical entities, where every function, class, and module is a node, and every call or import relationship is an edge weighted by how many times A calls B.

This graph representation supports analyses that text search does not offer directly: visualising hidden circular dependencies, identifying "god functions" with hundreds of dependents, finding dead code islands, simulating execution flows, and, most distinctively, demonstrating that the system has captured the code completely by reconstructing the entire repository from the extracted graph.
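As a rough illustration of the graph queries described above, the sketch below runs three of them on a tiny hand-written call graph. The graph contents and the helper names (`find_cycle`, `dependents`, `isolated`) are hypothetical and are not taken from the project's code. It shows only the general technique: depth-first search for a cycle, in-degree counting for heavily used functions, and a check for nodes with no edges at all.

```python
from collections import Counter

# Hypothetical weighted call graph: caller -> Counter({callee: call_count}).
graph = {
    "load": Counter({"parse": 2}),
    "parse": Counter({"validate": 1, "load": 1}),  # load and parse form a cycle
    "validate": Counter(),
    "report": Counter({"validate": 3}),
    "orphan": Counter(),  # calls nothing and is called by nothing
}

def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

def dependents(graph):
    """Count how many distinct callers each node has (its in-degree)."""
    counts = Counter()
    for callees in graph.values():
        for callee in callees:
            counts[callee] += 1
    return counts

def isolated(graph):
    """Nodes with no outgoing and no incoming edges: dead-code candidates."""
    incoming = dependents(graph)
    return sorted(n for n, out in graph.items() if not out and incoming[n] == 0)

print(find_cycle(graph))
print(dependents(graph).most_common(1))
print(isolated(graph))
```

A node with an unusually high in-degree is a candidate "god function", and an isolated node is only a candidate for dead code, since dynamic dispatch or external entry points would not appear in a statically extracted graph.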

Technical Architecture

📥 Data Acquisition Layer

Two input modes: (1) Remote repositories — clones any public/private GitHub repo via GitPython into secured temporary storage; (2) Local directories — direct filesystem analysis with MD5-based change detection to manage state across incremental runs.
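The MD5-based change detection for local directories could look something like the sketch below. The function names and the shape of the saved state (a dict mapping relative paths to digests) are assumptions made for illustration; the project's actual state format is not described here.

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Hash a file's bytes in blocks so large files are not read at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def changed_files(root: Path, previous: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return (paths whose digest differs from `previous`, new state).

    `previous` is the state from the last run (hypothetical format:
    relative POSIX path -> MD5 hex digest). New files count as changed.
    """
    current = {
        path.relative_to(root).as_posix(): file_md5(path)
        for path in sorted(root.rglob("*.py"))
    }
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    return changed, current
```

On the first run `previous` is empty, so every file is reported as changed; later runs only need to re-parse the files returned in the first element. MD5 is used here purely as a content fingerprint, not for security.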

🧠 Semantic Intelligence Engine

A custom AST NodeVisitor performs three passes. First pass (Identification): locates all functions and classes, with their source segments and line numbers. Second pass (Dependency Mapping): resolves aliased imports and tracks inter-function calls. Third pass (Data Linkage): tracks variables that are assigned from function return values.
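To make the alias resolution and call counting concrete, here is a minimal sketch built on the standard library's `ast.NodeVisitor`. The class name `CallGraphVisitor` and its attributes are hypothetical, and for brevity it folds the passes into a single traversal, which works only because the imports in the sample appear before the code that uses them. It is not the project's visitor.

```python
import ast
from collections import Counter, defaultdict

class CallGraphVisitor(ast.NodeVisitor):
    def __init__(self):
        self.aliases = {}                   # local name -> true origin
        self.calls = defaultdict(Counter)   # enclosing function -> callee counts
        self.data_links = []                # (scope, variable, producing call)
        self._scope = ["<module>"]

    def visit_Import(self, node):
        for alias in node.names:
            # "import pandas as pd" binds pd; "import os.path" binds os.
            local = alias.asname or alias.name.split(".")[0]
            self.aliases[local] = alias.name if alias.asname else local

    def visit_ImportFrom(self, node):
        prefix = f"{node.module}." if node.module else ""
        for alias in node.names:
            self.aliases[alias.asname or alias.name] = prefix + alias.name

    def visit_FunctionDef(self, node):
        self._scope.append(node.name)
        self.generic_visit(node)
        self._scope.pop()

    visit_AsyncFunctionDef = visit_FunctionDef

    def _resolve(self, func):
        """Turn a call target such as pd.read_csv into pandas.read_csv."""
        parts = []
        while isinstance(func, ast.Attribute):
            parts.append(func.attr)
            func = func.value
        if not isinstance(func, ast.Name):
            return None  # e.g. a method called on another call's result
        root = self.aliases.get(func.id, func.id)
        return ".".join([root, *reversed(parts)])

    def visit_Call(self, node):
        name = self._resolve(node.func)
        if name:
            self.calls[self._scope[-1]][name] += 1
        self.generic_visit(node)

    def visit_Assign(self, node):
        # Data linkage: record "variable = some_call(...)" assignments.
        if isinstance(node.value, ast.Call):
            producer = self._resolve(node.value.func)
            for target in node.targets:
                if producer and isinstance(target, ast.Name):
                    self.data_links.append((self._scope[-1], target.id, producer))
        self.generic_visit(node)

SOURCE = '''
import pandas as pd
from os.path import join as pjoin

def load(path):
    frame = pd.read_csv(pjoin("data", path))
    return clean(clean(frame))

def clean(frame):
    return frame.dropna()
'''

visitor = CallGraphVisitor()
visitor.visit(ast.parse(SOURCE))
print(dict(visitor.calls["load"]))
print(visitor.data_links)
```

Note that `frame.dropna` is recorded under the local variable's name, because this sketch does no type inference; attributing method calls to their defining class would need more analysis than is shown here.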

🎨 Visualisation Layer

Streamlit dashboard with three abstraction levels (Function/File/Directory). Interactive physics-based force-directed graphs via Pyvis (vis.js under the hood) — nodes repel and edges attract based on dependency weight. Optimised for 1000+ node graphs.
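One plausible way to derive the File and Directory views from the Function-level graph is to group nodes by path and sum edge weights, as sketched below. The node ID format (`path::function`), the sample edges, and the choice to drop edges internal to a group are assumptions for illustration; how the project actually computes the coarser levels is not stated.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical function-level edges: (caller, callee) -> call count.
function_edges = {
    ("pkg/io/reader.py::load", "pkg/io/reader.py::_open"): 2,
    ("pkg/io/reader.py::load", "pkg/core/clean.py::clean"): 1,
    ("pkg/core/clean.py::clean", "pkg/core/rules.py::check"): 4,
    ("pkg/cli.py::main", "pkg/io/reader.py::load"): 1,
}

def file_of(node_id: str) -> str:
    """'pkg/io/reader.py::load' -> 'pkg/io/reader.py'"""
    return node_id.split("::", 1)[0]

def dir_of(node_id: str) -> str:
    """'pkg/io/reader.py::load' -> 'pkg/io'"""
    return str(PurePosixPath(file_of(node_id)).parent)

def roll_up(edges, group):
    """Aggregate edge weights after mapping each endpoint through `group`."""
    rolled = Counter()
    for (src, dst), weight in edges.items():
        a, b = group(src), group(dst)
        if a != b:  # edges inside one file or directory vanish at that level
            rolled[(a, b)] += weight
    return rolled

file_edges = roll_up(function_edges, file_of)
dir_edges = roll_up(function_edges, dir_of)
print(file_edges)
print(dir_edges)
```

Each resulting edge dictionary could then be handed to a graph renderer, with the summed weight driving edge thickness or spring strength.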

🔁 Rebuild Engine

The repo_builder module takes CodeChunk objects, sorts them by original line numbers, and reconstructs the entire repository from scratch. Integration with Black for automated code formatting during rebuild. Serves as both a proof-of-correctness and a codebase-wide refactoring tool.
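The sort-and-concatenate idea behind the rebuild can be sketched in a few lines. `Chunk`, `split_into_chunks`, and `rebuild` below are simplified, hypothetical stand-ins rather than the real `repo_builder` API, and the sketch assumes the chunks of a file cover every line of it, including module-level code between functions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:  # simplified stand-in for a CodeChunk
    file_path: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    source: str

def split_into_chunks(file_path, text, boundaries):
    """Cut `text` into contiguous chunks starting at the given 1-based lines."""
    lines = text.splitlines(keepends=True)
    starts = sorted(set(boundaries) | {1})
    ends = [start - 1 for start in starts[1:]] + [len(lines)]
    return [
        Chunk(file_path, start, end, "".join(lines[start - 1:end]))
        for start, end in zip(starts, ends)
    ]

def rebuild(chunks):
    """Reassemble files by ordering chunks on (file path, original start line)."""
    files = {}
    for chunk in sorted(chunks, key=lambda c: (c.file_path, c.start_line)):
        files[chunk.file_path] = files.get(chunk.file_path, "") + chunk.source
    return files

ORIGINAL = (
    "import math\n\n\n"
    "def area(r):\n    return math.pi * r * r\n\n\n"
    "def double(x):\n    return 2 * x\n"
)

chunks = split_into_chunks("geometry.py", ORIGINAL, boundaries=[4, 8])
rebuilt = rebuild(reversed(chunks))  # input order should not matter
print(rebuilt["geometry.py"] == ORIGINAL)
```

This checks byte-for-byte equality, which is a stricter test than "runs identically". Once a formatter such as Black is applied during the rebuild, the text will generally differ, so equivalence would have to be checked some other way, for example by running the test suite of the rebuilt repository.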

The CodeChunk Data Model

Every logical block of code is encapsulated as a CodeChunk Pydantic object with:

Unique ID — deterministic hash from file path + function signature

Source Segment — exact code with start_line, end_line for perfect reconstruction

Dependency Counter — weighted adjacency list (not just who, but how many times)

Complexity Metrics — Cyclomatic Complexity via Radon + Lines of Code

Inputs/Outputs — extracted from function signatures and return statements

Hash-based Identity — detects code changes across incremental runs
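The fields above might map onto a model along the following lines. This sketch uses a standard-library dataclass as a stand-in for the Pydantic model so that it runs without third-party packages; the class name, field names, and the choice of SHA-256 are all assumptions, and the Radon complexity metric is left out.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class CodeChunkSketch:
    """Illustrative stand-in for CodeChunk; not the project's actual schema."""
    file_path: str
    signature: str            # e.g. "def load(path)"
    source: str               # exact source segment
    start_line: int
    end_line: int
    dependencies: Counter = field(default_factory=Counter)  # callee -> call count
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    @property
    def chunk_id(self) -> str:
        # Deterministic: depends only on where the chunk lives and its signature,
        # so it stays stable when the body is edited.
        key = f"{self.file_path}::{self.signature}"
        return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

    @property
    def content_hash(self) -> str:
        # Changes whenever the source segment changes, which is what
        # incremental runs would compare.
        return hashlib.sha256(self.source.encode("utf-8")).hexdigest()

    @property
    def lines_of_code(self) -> int:
        return self.end_line - self.start_line + 1

before = CodeChunkSketch("pkg/io.py", "def load(path)", "def load(path):\n    return 1\n", 10, 11)
after = CodeChunkSketch("pkg/io.py", "def load(path)", "def load(path):\n    return 2\n", 10, 11)
after.dependencies["pkg.core.clean"] += 3
print(before.chunk_id == after.chunk_id, before.content_hash == after.content_hash)
```

Keeping the identity hash separate from the content hash lets an edited function keep its place in the graph while still being flagged for re-analysis.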

The V1 → V5 Development Journey

Version	Milestone
V1.0	Parser Foundation — first recursive AST visitor for Python files. Proved the CodeChunk concept.
V2.0	Visual Revolution — Pyvis + Streamlit integration. Interactive force-directed graphs became the core UI.
V3.0	Intelligence Layer — Radon for complexity metrics, GitPython for remote repo handling, aliased import resolution.
V4.0	Rebuild Engine — repo_builder proved semantic completeness by reconstructing repos from graph data.
V5.0 (Current)	Containerisation & Scale — full Docker support, optimised rendering for 1000+ node graphs, DataLink model for data flow tracking.

Outcomes

🔁

100% Semantic Reconstruction Accuracy

Rebuilt repos run identically to originals — proving complete semantic decomposition

🔬

X-Ray Code Analysis

Revealed hidden circular imports and high-complexity hotspots in real-world research codebases

🚀

Zero-Config Deployment

Docker: any developer can analyse their entire codebase with a single command

📚

Educational Use

Used as a pedagogical tool to teach code modularity and dependency management principles
