PublicVCons — dataset card

Dataset card: publicvcons/us-federal-vcons

Stable URL: https://policy.publicvcons.org/dataset-card

Summary

IETF vCons of US federal government conversations (floor sessions, committee hearings, agency/mission audio), live and historical, with transcripts, speaker diarization, a neutral local-LLM analysis, a lawful-basis attachment, and a SCITT lifecycle chain with transparency receipts.

Provenance & licensing

Sources are US government works in the public domain (17 U.S.C. § 105), obtained from official channels, the National Archives, and the Internet Archive (CC0 / public-domain mark). The corpus is released under CC0 1.0. Each record cites its source identifier and binds the SHA-256 of the source media.
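As a minimal sketch of how a reader might check a downloaded media file against the SHA-256 bound in its record: the function names below are hypothetical, and the expected digest is assumed to be available as a hex string copied from the record (where exactly it sits in the record is not specified here).

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in fixed-size chunks so large media files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def media_matches(path, expected_hex):
    """Compare a local file's digest with the hex digest recorded for it.

    `expected_hex` is assumed to be a plain hex string; other encodings
    (for example base64url) would need decoding first.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch means the local copy differs from the media the record was built from; the cited primary source remains the reference.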

Structure

One vCon per line (a JSONL mirror of publicvcons/vcons). Each record comprises vcon.json (spec 0.4.0), a lawful_basis attachment that is both inlined and provided as a sidecar, transcript and summary analyses, and a scitt/ chain (signed statements plus inclusion-proof receipts). Media blobs are referenced by URL and hash, not inlined.
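Because the mirror is one vCon per line, it can be streamed without loading the whole file. The sketch below is illustrative only: the helper names are hypothetical, and it assumes each record carries an "analysis" array whose entries have a "type" field, as in the IETF vCon drafts.

```python
import json
from collections import Counter


def iter_vcons(lines):
    """Yield one parsed vCon (a dict) per non-empty JSONL line."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number} is not valid JSON: {err}") from err


def analysis_types(vcon):
    """Count a record's analysis entries by "type" (assumed field name)."""
    entries = vcon.get("analysis") or []
    return Counter(entry.get("type", "unknown") for entry in entries)
```

Typical use would be to open the JSONL file as text and pass the file object directly to iter_vcons, since a file object iterates line by line.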

How it was produced (reproducible)

All compute runs locally, with no paid APIs: whisper.cpp large-v3 (transcription), pyannote.audio 3.1 (diarization), and a 4-bit local LLM (analysis). The pipeline, source profiles, model choices, and pinned environment are public in publicvcons/conserver.

Known limitations

Speakers are anonymous unless identified from public metadata. The ASR transcripts and the 8B-class LLM analysis contain errors; the neutral summary is descriptive only. Some artifacts are segment-capped, and the cap is recorded in the vCon. The authoritative record is always the cited primary source.

Verification

Integrity is independently checkable; see /verify. For the lawful basis, see /lawful-basis.