Dataset card: publicvcons/us-federal-vcons
Stable URL: https://policy.publicvcons.org/dataset-card
Summary
IETF vCons of US federal government conversations (floor sessions, committee hearings, agency/mission audio), live and historical, with transcripts, speaker diarization, a neutral local-LLM analysis, a lawful-basis attachment, and a SCITT lifecycle chain with transparency receipts.
Provenance & licensing
Sources are US government works in the public domain (17 U.S.C. § 105), obtained from official channels, the National Archives, and the Internet Archive (CC0 / public-domain mark). The corpus is released under CC0 1.0. Each record cites its source identifier and binds the SHA-256 of the source media.
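The binding between a record and its source media can be checked by recomputing the digest. The following is a minimal Python sketch, assuming the recorded digest is a hex string; the helper names sha256_hex and sha256_matches and the chunk size are illustrative and not part of the dataset tooling.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_matches(stream: BinaryIO, expected_hex: str) -> bool:
    """Compare a stream's digest with the value recorded in a vCon.

    Assumption: the record stores the digest as hex. Adapt the
    comparison if the dataset uses another encoding (e.g. base64url).
    """
    return sha256_hex(stream) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Stand-in bytes for a downloaded media blob (synthetic, not dataset content).
    payload = b"example media bytes"
    recorded = hashlib.sha256(payload).hexdigest()
    print(sha256_matches(io.BytesIO(payload), recorded))
```

Reading in chunks keeps memory use flat for long hearing recordings; a mismatch means the fetched media is not the file the record was built from.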
Structure
One vCon per line (JSONL mirror of publicvcons/vcons): vcon.json (spec 0.4.0), an inlined + sidecar lawful_basis, transcript and summary analyses, and a scitt/ chain (signed statements + inclusion-proof receipts). Media blobs are referenced by URL + hash, not inlined.
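The one-vCon-per-line layout lends itself to a streaming reader. The sketch below is a minimal Python example under stated assumptions: the top-level keys uuid, dialog, analysis, and attachments follow the IETF vCon draft, and the type values on analyses and attachments are not confirmed by this card. iter_vcons and summarize_vcon are hypothetical helper names.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_vcons(lines: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed vCon per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize_vcon(vcon: Dict[str, Any]) -> Dict[str, Any]:
    """Count the main sections of a vCon (key names are assumptions)."""
    analyses = vcon.get("analysis") or []
    return {
        "uuid": vcon.get("uuid"),
        "dialogs": len(vcon.get("dialog") or []),
        "analysis_types": sorted({a.get("type", "unknown") for a in analyses}),
        "attachments": len(vcon.get("attachments") or []),
    }


if __name__ == "__main__":
    # Synthetic record for illustration only; not taken from the dataset.
    sample = io.StringIO(
        json.dumps({
            "vcon": "0.4.0",
            "uuid": "example-uuid",
            "dialog": [{"url": "https://example.org/a.mp3"}],
            "analysis": [{"type": "transcript"}, {"type": "summary"}],
            "attachments": [{"type": "lawful_basis"}],
        }) + "\n"
    )
    for record in iter_vcons(sample):
        print(summarize_vcon(record))
```

Parsing line by line avoids loading the whole mirror into memory, and because media is referenced rather than inlined, each line stays small enough to inspect directly.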
How it was produced (reproducible)
All compute is local, with no paid APIs: whisper.cpp large-v3 (transcription), pyannote.audio 3.1 (diarization), and a 4-bit quantized local LLM (analysis). The pipeline, source profiles, model choices, and pinned environment are public in publicvcons/conserver.
Known limitations
Speakers are anonymous unless identified from public metadata. The ASR transcripts and the 8B-class LLM analysis contain errors; the neutral summary is descriptive only. Some artifacts are segment-capped, and the cap is recorded in the vCon. The authoritative record is always the cited primary source.
Verification
Integrity is independently checkable: see /verify. Lawful basis: see /lawful-basis.