What would you ask W.E.B. Du Bois if you could talk to his writing?
Enter the Archive
Not a chatbot. Not a guess. An answer grounded in his own words, with citations you can check.
Type your question and press Enter
From your question to a citation-backed answer — every step is transparent.
Four simple steps: You ask → It searches → It finds passages → It answers with quotes. Click each step below to learn more.
You type a question
Archive searches Du Bois's works
Embedding + Vector Search
Finds matching passages
Top-K Passage Retrieval
Writes an answer with quotes
LLM Synthesis + Citation
You type a question in plain English. "What did Du Bois think about the color line?" The system takes your words and prepares them for search.
Technically, your question is converted into a vector embedding — a list of numbers that captures the meaning of your question. Two questions with similar meaning (like "What about the color line?" and "How did Du Bois describe racial division?") will have similar embeddings, even though they use different words.
The system doesn't search for keywords — it searches for meaning. Your question's embedding is compared to every passage in the archive to find the most relevant matches.
The archive uses pgvector (a PostgreSQL extension) for similarity search. Each passage in the corpus is an 800-token chunk with 200-token overlap. The embedding model converts both your question and each passage into vectors, then finds the closest matches using cosine similarity. We retrieve the top results whose similarity score is 0.7 or higher.
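In production this ranking happens inside PostgreSQL via pgvector, but the logic it performs can be sketched in a few lines. The sketch below uses tiny toy vectors (real embeddings have hundreds of dimensions) and the 0.7 threshold mentioned above; the function names are illustrative, not the archive's actual API.

```typescript
// Minimal sketch of the ranking step pgvector performs in SQL.
// Embeddings here are toy 2-D vectors; real ones have hundreds of dimensions.

type Passage = { id: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank passages by similarity to the question embedding,
// keeping only those at or above the threshold.
function rankPassages(question: number[], passages: Passage[], threshold = 0.7) {
  return passages
    .map((p) => ({ id: p.id, score: cosineSimilarity(question, p.embedding) }))
    .filter((r) => r.score >= threshold)
    .sort((a, b) => b.score - a.score);
}
```

Cosine similarity measures the angle between vectors, not their length, which is why two differently worded questions about the same idea can score as near-identical.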
The archive pulls the most relevant passages from Du Bois's works. These are the raw source materials — his actual words.
Retrieval quality depends on chunk size and overlap. Chunks that are too small lose context; chunks that are too large dilute the signal with irrelevant text. The 800-token chunk with 200-token overlap was validated across the full corpus. The system returns the top matches, but always applies .slice(0, topK) as a guard: never trust the result count reported by the vector database.
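The chunking scheme described above can be sketched as follows. This is an illustrative approximation: a real pipeline counts model tokens with a tokenizer, while here whitespace-separated words stand in for tokens. The `.slice(0, topK)` guard from the text is also shown.

```typescript
// Sketch of overlapping chunking (800-token chunks, 200-token overlap).
// Real systems count model tokens; words stand in for tokens here.

function chunkText(text: string, chunkSize = 800, overlap = 200): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const step = chunkSize - overlap; // each chunk starts 600 tokens after the last
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // final chunk reached the end
  }
  return chunks;
}

// Defensive cap on the result count, as the text notes:
// never trust the vector database to honor topK.
function capResults<T>(results: T[], topK: number): T[] {
  return results.slice(0, topK);
}
```

The 200-token overlap means a sentence falling at a chunk boundary still appears intact in at least one chunk, so it remains retrievable.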
A language model reads the retrieved passages and composes a coherent answer. The key: it must cite which passage each claim comes from. No citation = no trust.
The LLM (Claude) receives the retrieved passages as context, along with instructions to ground every claim in the source material and provide citations. This is the "generation" in Retrieval Augmented Generation. The model doesn't use its training data about Du Bois — it synthesizes answers only from the retrieved passages. This constrains hallucination but doesn't eliminate it entirely, which is why citation verification matters.
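The grounding step described above amounts to careful prompt assembly. The sketch below shows the general shape, not the archive's actual prompt (which is not reproduced in this text): numbered passages go in as context, and the model is instructed to cite them and to refuse when they are insufficient.

```typescript
// Hypothetical sketch of how retrieved passages become LLM context.
// The real system's instructions differ; this shows the RAG grounding pattern.

type Retrieved = { work: string; location: string; text: string };

function buildPrompt(question: string, passages: Retrieved[]): string {
  // Number each passage so the model can cite it as [1], [2], ...
  const context = passages
    .map((p, i) => `[${i + 1}] ${p.work}, ${p.location}:\n${p.text}`)
    .join("\n\n");
  return [
    "Answer ONLY from the passages below. Cite each claim as [n].",
    "If the passages do not answer the question, say so instead of guessing.",
    "",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Because the passages arrive numbered, a downstream check can verify that every [n] in the answer points at a passage that actually supports the claim.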
Not all AI responses are created equal. Can you tell which one to trust?
Your task: Three responses to the same question. One has real citations, one has no citations, and one has fake citations. Click the one you trust most.
Hallucination in RAG systems: Even with retrieval augmentation, LLMs can generate plausible but unsourced claims. They may also attribute real quotes to wrong works or fabricate citations entirely. The minimum standard for trust is a verifiable citation — a passage you can look up in the original text.
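The "verifiable citation" standard above can be enforced mechanically. A minimal sketch, assuming quotes and passages are plain strings: a quoted claim counts as verifiable only if it actually occurs in one of the retrieved source passages. (Real systems may use fuzzier matching; exact substring matching after normalization is the simplest version.)

```typescript
// Minimal check behind "no citation = no trust": a quote is verifiable
// only if it actually occurs in a retrieved source passage.

function normalize(s: string): string {
  return s.toLowerCase().replace(/[^\w\s]/g, "").replace(/\s+/g, " ").trim();
}

function quoteIsVerifiable(quote: string, passages: string[]): boolean {
  const q = normalize(quote);
  return passages.some((p) => normalize(p).includes(q));
}
```

This is exactly the check the quiz above invites you to run by hand: the fake-citation response fails it, because its quote appears nowhere in the cited work.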
Question: "What did Du Bois say about education?"
Du Bois argued that education was the key to racial uplift. He wrote: "The Talented Tenth of the Negro race must be made leaders of thought and missionaries of culture among their people."
Source: "The Talented Tenth" (1903), paragraph 3
Has citation
Du Bois believed deeply in the power of education and thought that every Black person should have access to university-level learning. He spent his entire life fighting for educational equality across the country.
No citation
Du Bois stated: "Education is the great equalizer of the conditions of men, the balance wheel of the social machinery."
Source: The Souls of Black Folk (1903), Chapter 12
Fake citation
Five questions. Five real archive responses. His words, his works, verifiable.
Think about it: Du Bois wrote this in 1903. Was he right? Is the color line still a problem in the 21st century? What evidence would you use to argue your position?
Retrieval analysis: This passage is one of the most frequently retrieved chunks in the archive because it appears in multiple works, so many related queries land near it in embedding space. A good RAG system should retrieve this passage for many related queries. But does high retrieval frequency mean it's always the best passage for a given question? Consider: if you asked "How did Du Bois's views on race change over time?", would this same passage be the right answer?
What stands out: Du Bois calls the church the "social centre" — not just a place of worship. What institutions in your community serve a similar role today?
Methodology note: The Philadelphia Negro (1899) was the first major sociological case study of an African American community. Du Bois combined door-to-door surveys, census data, and ethnographic observation. The archive retrieves this passage from a work that pioneered American sociology — a fact that enriches the citation beyond the quote itself.
Debate it: Du Bois vs. Washington — should education focus on practical skills or on developing leaders and thinkers? Is this still a relevant debate today?
Retrieval note: The Du Bois-Washington debate is well-represented in the corpus across multiple works. A nuanced question about this debate should retrieve passages from both The Souls of Black Folk (Chapter III: "Of Mr. Booker T. Washington and Others") and "The Talented Tenth." Evaluate: does the system retrieve from multiple relevant works, or does it over-index on the most famous passage?
Amazing fact: Du Bois and his students drew 63 data visualizations by hand — no computers, no software. These plates are now in the Library of Congress and are considered masterpieces of data visualization.
Archive limitation: The Paris Exposition plates are visual artifacts, not text. A text-based RAG system can retrieve Du Bois's writing about the exposition but not the plates themselves. This highlights a structural limitation of text-only archives when the primary sources are visual. A multimodal archive would retrieve the actual plates as evidence.
Think about it: Du Bois believed that if you showed people the data, they would have to change their minds. Did it work? Does data change minds today?
Evaluation exercise: This response draws from multiple sources in the corpus. Rate the retrieval: did the system find the best passages to answer this question, or just the most frequently matched ones? What passage would you have retrieved if you were building the search?
The most honest thing an archive can do is tell you what it doesn't know.
What happens when you ask a question the archive can't answer from its sources? That's not a failure — that's honesty. A system that admits its limits is more trustworthy than one that always has an answer.
Corpus limitations as a feature, not a bug. The archive contains 27 works from 1896–1920. The decision about what to include — and what to exclude — is a curatorial choice that shapes every answer the system can give. What's IN the archive is a statement about what matters.
The archive's answer: It shouldn't answer this confidently. Du Bois died in 1963. Social media didn't exist. A responsible RAG system should flag this as speculative and refuse to generate a definitive answer from a corpus that predates the question by a century.
The problem: An unconstrained LLM would happily generate "Du Bois would have viewed social media as a tool for..." — plausible speculation that looks like a sourced answer. RAG should prevent this, but only if the system is designed to say "I cannot answer this from the available sources" when retrieval confidence is low.
The archive's answer: The 27-work corpus (1896–1920) captures Du Bois's early career. His relationship with communism evolved over decades — he joined the Communist Party in 1961 at age 93. The archive can speak to his early socialist leanings but cannot address his full political evolution.
Temporal bias: A corpus limited to 1896–1920 will systematically underrepresent Du Bois's later radicalization. Any answer the system gives about his political beliefs will be incomplete by design. Users must understand the temporal boundaries of the archive to evaluate answers correctly.
The archive's answer: Du Bois's autobiography is not in the corpus — The Autobiography of W.E.B. Du Bois (published posthumously, 1968) is likely still under copyright. Some childhood references appear in Darkwater (1920), but the archive cannot provide a comprehensive answer about his early life.
Copyright as curation: The decision to exclude The Autobiography and Dusk of Dawn (1940) is a copyright decision that has intellectual consequences. A user asking about Du Bois's life gets a corpus shaped by legal constraints, not by scholarly completeness. This is true of every digital archive — the legal landscape shapes what knowledge is accessible.
Now it's your turn. Craft a question, predict the answer, and evaluate what a good response looks like.
You've thought critically about what makes an archive response trustworthy. A good answer cites specific works, provides quotable passages, and acknowledges what it doesn't know. You're ready to use the Living Archive like a scholar.
You've mapped the full pipeline from question to evaluation. This is how archivists and AI researchers think about retrieval systems — not just "does it answer?" but "does it answer well, from the right sources, with verifiable claims?" You're ready to contribute to the conversation about how we build ethical AI archives.
"The true college will ever have one goal — not to earn meat, but to know the end and aim of that life which meat nourishes."— W.E.B. Du Bois, The Souls of Black Folk (1903)