Ask anything from your lecture books. Every answer is retrieved from your PDFs, generated by a local LLM, and then verified claim by claim against the source by an NLI model. You see a calibrated confidence band, per-claim entailment scores, and a self-explanation prompt for active learning. Out-of-scope questions are refused. Everything runs locally - no cloud, no telemetry.
What this system does that no production assistant does
Modern Q&A systems (ChatGPT, Perplexity, NotebookLM) are fluent, but they rarely verify answers against retrieved sources at the claim level and rarely surface calibrated uncertainty to the user. Educational deployments compound the problem with the privacy risks of cloud inference. This system addresses five gaps documented in the recent literature:
- Faithfulness gap - Manakul et al. EMNLP 2023 (SelfCheckGPT) and Es et al. EACL 2024 (RAGAS) show that RAG fluency does not imply faithfulness; we run a DeBERTa-v3 NLI cross-encoder on every generated claim against the retrieved excerpt and surface per-claim entailment to the user (verification sketch after this list).
- Calibration gap - Kadavath et al. 2022 (Language Models (Mostly) Know What They Know) document systematic LLM over-confidence; we publish a composite confidence band combining retrieval similarity and faithfulness (confidence sketch below).
- Mid-context fact-drop - Liu et al. TACL 2024 (Lost in the Middle) show that long contexts cause fact loss; our hybrid retriever (semantic + TF-IDF) keeps top-k tight (k=5), and claim verification flags any drift (retriever sketch below).
- Pedagogical scaffolding gap - self-explanation is well established in the learning sciences (Chi et al. 1989), but production LLM tutors rarely close the loop; every answer here ships with a self-explanation prompt and a probing follow-up question.
- Educational AI privacy gap - Khosravi et al. Computers and Education: AI 2022 (Trustworthy AI in Education) flag FERPA / GDPR concerns with cloud-based tutors; this entire pipeline runs on a local 6 GB consumer GPU, with no network calls during inference.
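
To make the verification step concrete, here is a minimal sketch of per-claim entailment scoring. The checkpoint name and the `entailment_scores` helper are illustrative assumptions; any DeBERTa-v3 NLI cross-encoder with (contradiction, entailment, neutral) labels behaves the same way.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; swap in whichever DeBERTa-v3 NLI cross-encoder you use.
MODEL = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailment_scores(excerpt: str, claims: list[str]) -> list[float]:
    """Return P(entailment) for each generated claim given the retrieved excerpt."""
    batch = tokenizer([excerpt] * len(claims), claims,
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits            # shape: (n_claims, 3)
    probs = logits.softmax(dim=-1)
    ent = model.config.label2id.get("entailment", 1)  # index 1 on this checkpoint
    return probs[:, ent].tolist()
```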
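The composite confidence band can be as simple as a weighted blend of retrieval similarity and faithfulness. The weights, the min() aggregation, and the cut points below are assumptions for illustration, not the shipped calibration.

```python
def confidence_band(retrieval_sim: float, entailments: list[float],
                    w_ret: float = 0.4, w_faith: float = 0.6) -> str:
    """Blend top-chunk similarity with the weakest claim's entailment.

    Weights and the 0.75 / 0.50 thresholds are illustrative assumptions.
    """
    faithfulness = min(entailments) if entailments else 0.0
    score = w_ret * retrieval_sim + w_faith * faithfulness
    if score >= 0.75:
        return "high"
    elif score >= 0.50:
        return "medium"
    return "low"
```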
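And a sketch of the hybrid retriever: dense cosine similarity blended with TF-IDF lexical overlap, top-k kept tight at 5. The `all-MiniLM-L6-v2` encoder and the 50/50 blend are placeholder choices, not the production configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class HybridRetriever:
    """Dense cosine similarity blended with TF-IDF lexical overlap."""

    def __init__(self, chunks: list[str], alpha: float = 0.5, k: int = 5):
        self.chunks, self.alpha, self.k = chunks, alpha, k
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
        self.dense = self.encoder.encode(chunks, normalize_embeddings=True)
        self.tfidf = TfidfVectorizer().fit(chunks)
        self.sparse = self.tfidf.transform(chunks)

    def search(self, query: str) -> list[tuple[str, float]]:
        q = self.encoder.encode([query], normalize_embeddings=True)
        sem = (self.dense @ q.T).ravel()                    # cosine on unit vectors
        lex = cosine_similarity(self.tfidf.transform([query]), self.sparse).ravel()
        blend = self.alpha * sem + (1 - self.alpha) * lex
        top = np.argsort(blend)[::-1][: self.k]             # tight top-k (k=5)
        return [(self.chunks[i], float(blend[i])) for i in top]
```

Roughly, the retrieved chunks feed the local LLM, the verification sketch scores each generated claim against them, and the blend above produces the band the user sees.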