MolBridge — Shobhit Vats Sharma

TL;DR — Quick Summary

A production-ready Streamlit + FastAPI platform for analysing noncovalent interactions (NCIs) in protein PDB structures, born from IISER Pune Masters research on pnictogen and tetrel bonding.
Detects 15+ NCI families including hydrogen bonds, halogen bonds, chalcogen, pnictogen, tetrel bonds, π–π stacking, cation–π, salt bridges, metal coordination, and more.
High-performance core: KD-tree spatial pruning, vectorised geometry, shared-memory parallelism, optional Numba/Rust acceleration — achieving a 60% reduction in total computational runtime on the ParamBrahma HPC cluster.
Outputs interactive 3D visualisations, Ramachandran plots, interaction network graphs, PDF/PPTX/Excel reports, and a full REST API for programmatic access.
Reduced data exploration time by 40% for early users by centralising fragmented datasets and integrating LLM-assisted interpretation.

Problem & Motivation

Noncovalent interactions (NCIs) are the molecular forces that determine protein folding, stability, enzyme catalysis, drug binding, and macromolecular assembly. Yet, existing tools for studying them are fragmented — one tool for hydrogen bonds, another for π-stacking, none with a unified, reproducible analysis pipeline for all 15+ known NCI families simultaneously.

During my Masters research at IISER Pune on chalcogen, pnictogen, and tetrel bonding in protein structures, I repeatedly faced this gap. Analysing a single protein required stitching together outputs from multiple incompatible tools, with no provenance tracking and inconsistent geometric criteria. MolBridge was built to be the unified, literature-grounded platform I wished had existed.

Architecture

MolBridge follows a layered architecture separating detection logic, computation strategy, and interfaces.

🔬 Detection Layer

A decorator-based detector registry allows each NCI family to define its own geometric criteria, parameter presets, and detection logic independently — making the system fully extensible by adding new detectors without touching core code.

⚡ Performance Layer

Vector geometry fast-paths (NumPy), KD-tree spatial pruning, adaptive threshold tuning, task graph precompute stage, shared-memory parallelism (POSIX), and optional Numba/Rust kernels. Auto-profile mode selects the right strategy per structure size.

📡 API Layer

FastAPI REST backend with async job execution, progress tracking, and multiple output formats (JSON, CSV, PDF, Excel). Full Swagger documentation at /docs. Supports programmatic batch processing of hundreds of PDB structures.

🎨 UI Layer

Streamlit web interface with interactive py3Dmol 3D viewer, Plotly heatmaps and distribution charts, Ramachandran plots, force-directed interaction network graphs, command palette (Ctrl+K), and scenario profiles as YAML templates.

Interaction Families Detected

→ Hydrogen Bonds (conventional, low-barrier, C5-type)

→ Halogen Bonds (Cl, Br, I, F — sigma hole)

→ π–π Stacking (face-to-face & edge-to-face)

→ Cation–π Interactions

→ Anion–π Interactions

→ CH–π Interactions

→ Sulfur–π Interactions

→ n→π* Orbital Interactions

→ Chalcogen Bonds (S, Se, Te)

→ Pnictogen Bonds (N, P, As)

→ Tetrel Bonds (C, Si, Ge)

→ Salt Bridges (ARG/LYS vs ASP/GLU)

→ Hydrophobic Contacts

→ London Dispersion Forces

→ Metal Coordination (Zn, Fe, Mg, Ca, Cu…)

→ H-bond Subtype Classification (5 classes)

Performance Engineering

Key innovation: an adaptive auto-profile system that inspects atom count, detector count, and estimated workload — then selects the optimal combination of performance features automatically.

Technique	What it does	Impact
Vector Geometry Fast-Paths	Batched distance/angle math via NumPy matrices instead of nested loops	2-5x speedup on angular calculations
KD-Tree Spatial Pruning	Partitions 3D space to eliminate irrelevant atom pairs in O(N log N)	Eliminates billions of irrelevant calculations
Adaptive Threshold Tuning	Dynamically relaxes/tightens distance cutoffs per detector based on candidate density	Balances accuracy and speed per structure
Shared Memory Parallelism	POSIX shared memory blocks for coords, ring centroids, H-bond donor/acceptors across process pool	Near-zero copy overhead for large structures
Numba JIT Kernels	JIT-compiled pairwise distance and geometry primitives	C-speed execution without C++ complexity
Rust Geometry Extension (opt-in)	PyO3 pairwise_sq_dists — fastest available backend	Maximum performance for massive proteins
Task Graph Precompute	Extracts aromatic rings, centroids, donors/acceptors once per structure, fans out to all detectors	Eliminates redundant recomputation across 15 detectors

Reproducibility & Scientific Rigour

MolBridge treats reproducibility as a first-class concern — something often absent in research-grade bioinformatics tools.

🔐 Provenance Hashing

Every export embeds a provenance digest: structure signature + parameters + detector set + version tag. Researchers can cite this in manuscripts for full reproducibility.

📊 Golden Regression Framework

A curated set of PDB structures forms a golden baseline. CI enforces that interaction counts and timing don't deviate beyond 5% between versions.

📐 Literature-Anchored Criteria

All geometric thresholds are derived from published crystallographic literature (CSD studies). Three presets: Conservative, Literature Default, and Exploratory.

🔁 Normalised Records

Interaction records are normalised to a common schema regardless of which detector produced them, enabling cross-family analysis and consistent CSV/Excel exports.

Outcomes & Impact

📉

60% Runtime Reduction

Redesigned HPC data workflows on IISER Pune's ParamBrahma cluster, delivering 60% reduction in total computational runtime and enabling higher-throughput protein analysis.

⚡

40% Faster Data Exploration

By centralising fragmented datasets and automating schema validation, reduced data exploration time by 40% for early users compared to previous ad hoc methods.

🌐

Live Deployed Application

The only project in the portfolio with a publicly accessible live deployment. Accessible to the global structural biology community at molbridge.streamlit.app.

🔬

Research Integration

Directly integrates into ongoing Masters research on pnictogen and tetrel bonding — serving as both the analytical tool and the web-based validation platform for the thesis.

Roadmap

→ AI-Agentic Interpretations: LLM synthesis of complex NCI profiles into plain-English narratives
→ Distributed Execution: Ray/Dask scaffold for multi-node large-batch processing
→ Ligand & Small Molecule Support: Extending detection beyond protein-only structures
→ Expanded Rust Kernel Coverage: Broaden geometry acceleration to all 15 detector families

🌐 Open Live App ← All Projects

🌿 MolBridge: NonCovalent Atlas

TL;DR — Quick Summary

Problem & Motivation

Architecture

🔬 Detection Layer

⚡ Performance Layer

📡 API Layer

🎨 UI Layer

Interaction Families Detected

Performance Engineering

Reproducibility & Scientific Rigour

🔐 Provenance Hashing

📊 Golden Regression Framework

📐 Literature-Anchored Criteria

🔁 Normalised Records

Outcomes & Impact

60% Runtime Reduction

40% Faster Data Exploration

Live Deployed Application

Research Integration

Roadmap