words_to_data parses US Code titles and Public Laws (bills) from USLM XML format, providing structured access to legislative text and the ability to track changes between document versions. All data structures are JSON-serializable for easy integration with other systems.
- Parse USC and Public Law documents - Extract hierarchical structure from USLM XML files
- Bill amendment extraction - Identify USC references and amending actions from bills
- Hierarchical diffing - Compute word-level differences between document versions
- Parallel processing - Parse multiple documents concurrently using Rayon
- Dual path system - Track both structural paths and official USLM identifiers
- Rich text content - Capture heading, chapeau, proviso, content, and continuation fields
- JSON serialization - All structures implement Serde traits
Add to your Cargo.toml:
[dependencies]
words-to-data = "0.1.1"Or clone and build locally:
git clone https://github.com/yourusername/words_to_data
cd words_to_data
cargo build --lib --releaseTitle data can be retrieved from https://uscode.house.gov/download/download.shtml
Bill data is available on https://congress.gov
use words_to_data::uslm::parser::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let document = parse("tests/test_data/usc/2025-07-18/usc07.xml", "2025-07-18")?;
println!("Parsed: {}", document.data.verbose_name);
println!("USLM ID: {:?}", document.data.uslm_id);
println!("Children: {}", document.children.len());
Ok(())
}use std::fs;
use words_to_data::{diff::TreeDiff, uslm::parser::parse};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let doc_old = parse("tests/test_data/usc/2025-07-18/usc07.xml", "2025-07-18")?;
let doc_new = parse("tests/test_data/usc/2025-07-30/usc07.xml", "2025-07-30")?;
let diff = TreeDiff::from_elements(&doc_old, &doc_new);
words_to_data::utils::write_json_file(&diff, "diff.json")?;
Ok(())
}use words_to_data::uslm::bill_parser::parse_bill_amendments;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let data = parse_bill_amendments("tests/test_data/bills/hr-119-21.xml")?;
println!("Bill {}: {} amendments found", data.bill_id, data.amendments.len());
for amendment in &data.amendments {
println!("\nAmendment at: {}", amendment.source_path);
println!(" USC sections modified: {}", amendment.target_paths.len());
println!(" Actions: {:?}", amendment.action_types);
}
Ok(())
}use words_to_data::utils::parse_uslm_directory;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let documents = parse_uslm_directory("tests/test_data/usc/2025-07-18", "2025-07-18")?;
println!("Parsed {} documents in parallel", documents.len());
for doc in documents.iter().take(5) {
println!(" - {} ({})", doc.data.verbose_name, doc.data.path);
}
Ok(())
}Documents are represented as trees of USLMElement structures. Each element contains:
- ElementData: Metadata, text content, and identification
- Children: Nested child elements forming the document hierarchy
The library uses two types of paths:
-
Structural Path: Full hierarchy including all elements Example:
uscodedocument_26/title_26/subtitle_A/chapter_1/section_174 -
USLM ID: Official USLM identifier (excludes structural-only elements) Example:
/us/usc/t26/s174/a/1
Each element can contain up to five distinct text fields:
- Heading: Section or subsection title
- Chapeau: Opening text before enumerated items
- Proviso: Conditional or qualifying clauses
- Content: Main body text
- Continuation: Text appearing after child elements
The TreeDiff structure mirrors the element hierarchy and tracks:
- Field changes: Word-level differences in text content fields
- Added elements: New child elements in the newer version
- Removed elements: Elements that existed in the older version
- Child diffs: Recursive diffs for matching child elements
Diffs are computed using word-level granularity via the similar crate.
Generate and view the full API documentation:
cargo doc --openThe documentation includes detailed information about all public types, functions, and examples.