COVID-19 Case Data Generator
Generate semi-realistic COVID-19 testing event data with stateful patient lifecycles, weighted outcomes, and exportable NDJSON or JSON output.
About
Testing pipelines, streaming demos, and healthcare dashboards require volumetric event data that respects epidemiological state transitions. A random JSON blob fails here. Each patient record must follow a lifecycle: a NEW event (test administered, result pending) must precede any UPDATED event (confirmed, negative, recovered, dead). Confirmed cases must have a non-zero probability of generating a subsequent recovery or death event within a realistic time window of 5 - 30 days. This generator implements a weighted Markov state machine where transition probabilities approximate CDC aggregate ratios: roughly 40% positivity, 2% case fatality among confirmed. Dates, names, locations, and demographics are synthesized from embedded dictionaries without external API calls. Output is valid NDJSON or JSON array, suitable for piping into Kafka, curl, or jq.
Limitations: demographic distributions are US-centric. Age-weighted severity is simplified to two tiers (<60 and ≥60). Geographic data is not correlated to real outbreak hotspots. The tool approximates aggregate statistics and does not model transmission dynamics, R-number, or hospital capacity. For load-testing a streaming consumer or populating a dashboard prototype, this level of fidelity is sufficient. For epidemiological research, it is not.
Formulas
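Each patient follows a Markov chain with weighted transitions. The probability of terminal state S given a confirmed case depends on age bracket:
$$
P(\text{dead} \mid \text{confirmed}) = \begin{cases} 0.02, & \text{age} < 60 \\ 0.08, & \text{age} \ge 60 \end{cases}
$$
$$
P(\text{recovered} \mid \text{confirmed}) = 0.85, \qquad P(\text{hospitalized} \mid \text{confirmed}) = 1 - 0.85 - P(\text{dead} \mid \text{confirmed})
$$
The hospitalized remainder is 0.13 for age < 60. Applying the same remainder rule with the 8% fatality weight leaves 0.07 for age ≥ 60.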
Age generation uses the Box-Muller transform to approximate a normal distribution:
$$
z = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2), \qquad \text{age} = \mu + \sigma z
$$
where μ = 45 (mean age), σ = 18 (standard deviation), and u₁, u₂ are independent uniform random values in (0, 1). The result is clamped to [1, 99].
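The age sampling described above can be sketched in JavaScript as follows. This is a minimal illustration, not the tool's actual source: the function names, the injectable `rng` parameter, and the rounding to whole years are assumptions. Only μ, σ, and the clamp range come from the text.

```javascript
const MEAN_AGE = 45;    // μ from the text
const AGE_STD_DEV = 18; // σ from the text

// Draw one standard normal sample with the Box-Muller transform.
// `rng` is a hypothetical injectable source of uniform values in [0, 1).
function standardNormal(rng = Math.random) {
  // 1 - rng() maps [0, 1) to (0, 1], so Math.log never receives 0.
  const u1 = 1 - rng();
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Scale the sample to the age distribution and clamp it to [1, 99].
function generateAge(rng = Math.random) {
  const raw = MEAN_AGE + AGE_STD_DEV * standardNormal(rng);
  // Rounding to a whole year is an assumption; the clamp is from the text.
  return Math.min(99, Math.max(1, Math.round(raw)));
}
```

Passing a custom `rng` makes the function deterministic in tests, while the default uses `Math.random`.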
UUID v4 generation uses crypto.getRandomValues to fill 16 random bytes, then sets version bits (0100) at byte 6 and variant bits (10) at byte 8 per RFC 4122.
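A minimal sketch of the UUID v4 steps just described. The function name `uuidV4` and the Node.js fallback are assumptions added for illustration. `crypto.getRandomValues` and the bit positions come from the text.

```javascript
// Web Crypto is global in browsers and recent Node.js releases.
// The require fallback for older Node.js versions is an assumption.
const cryptoApi = globalThis.crypto ?? require("node:crypto").webcrypto;

function uuidV4() {
  const bytes = new Uint8Array(16);
  cryptoApi.getRandomValues(bytes);

  bytes[6] = (bytes[6] & 0x0f) | 0x40; // version 4: high nibble set to 0100
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // variant: top two bits set to 10

  // Render as lowercase hex in the 8-4-4-4-12 layout.
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}
```

With these masks the third group always starts with `4` and the fourth group always starts with `8`, `9`, `a`, or `b`.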
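The state machine transitions, with weights taken from the reference table:
pending → confirmed (40%) | negative (55%) | probable (5%)
confirmed → recovered (85%) | dead (2% for age < 60, 8% for age ≥ 60) | hospitalized (remainder)
The statuses negative, probable, recovered, and dead are terminal. A hospitalized status has no follow-up event in the output.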
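One way to walk these transitions in code is a small weighted-pick loop. This is a sketch under stated assumptions, not the generator's own implementation: the names `TRANSITIONS`, `weightedPick`, and `walkLifecycle` are hypothetical, and it assumes every confirmed patient receives exactly one outcome event, which may not match the tool's actual event counts.

```javascript
// Transition weights assumed from the reference table in this document.
const TRANSITIONS = {
  pending: () => [["confirmed", 0.40], ["negative", 0.55], ["probable", 0.05]],
  confirmed: (age) => {
    const dead = age >= 60 ? 0.08 : 0.02; // two-tier age weighting
    return [["dead", dead], ["recovered", 0.85], ["hospitalized", 1 - 0.85 - dead]];
  },
};

// Pick one state from [state, weight] pairs using a single uniform roll.
function weightedPick(options, rng) {
  let roll = rng();
  for (const [state, weight] of options) {
    if (roll < weight) return state;
    roll -= weight;
  }
  return options[options.length - 1][0]; // guard against rounding drift
}

// Return the sequence of statuses for one patient, starting at pending.
// `rng` is a hypothetical injectable source of uniform values in [0, 1).
function walkLifecycle(age, rng = Math.random) {
  const path = ["pending"];
  let state = "pending";
  while (TRANSITIONS[state]) {
    state = weightedPick(TRANSITIONS[state](age), rng);
    path.push(state);
  }
  return path;
}
```

States without an entry in `TRANSITIONS` end the walk, so each path has either two or three statuses.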
Total event count per patient ranges from 2 (NEW + single UPDATED) to 3 (NEW + confirmed + outcome). The user-specified generation amount N refers to total events, not patients. Approximate patient count: P ≈ N / 2.2, where 2.2 is the mean number of events per patient lifecycle.
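As a quick arithmetic check, taking the estimate as P ≈ N / 2.2, the conversion between events and patients can be sketched like this. The function names are hypothetical, and rounding to the nearest whole number is an assumption.

```javascript
const MEAN_EVENTS_PER_PATIENT = 2.2; // figure stated in the text

// Approximate number of patients produced by a run of `totalEvents` events.
function estimatePatients(totalEvents) {
  return Math.round(totalEvents / MEAN_EVENTS_PER_PATIENT);
}

// Approximate number of events to request for a target patient count.
function estimateEvents(patients) {
  return Math.round(patients * MEAN_EVENTS_PER_PATIENT);
}

console.log(estimatePatients(1000)); // 455
console.log(estimateEvents(500));    // 1100
```

Because lifecycles vary between 2 and 3 events, the actual patient count of any single run will differ somewhat from this estimate.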
Reference Data
| Event Type | Status Value | Probability Weight | Follows | Delay Range | Description |
|---|---|---|---|---|---|
| NEW | pending | 100% (initial) | - | 0 days | Patient symptomatic, test administered, awaiting result |
| UPDATED | confirmed | 40% | pending | 1 - 14 days | Positive test result returned |
| UPDATED | negative | 55% | pending | 1 - 14 days | Negative test result returned |
| UPDATED | probable | 5% | pending | 1 - 14 days | Probable case, clinical diagnosis without conclusive test |
| UPDATED | recovered | 85% of confirmed | confirmed | 7 - 30 days | Patient recovered and released |
| UPDATED | dead | 2% (age < 60), 8% (age ≥ 60) | confirmed | 5 - 25 days | Patient deceased |
| UPDATED | hospitalized | 13% of confirmed (remainder) | confirmed | 2 - 10 days | Patient hospitalized, outcome pending further events |

Output Field Reference
| Field | Description |
|---|---|
| event_id | UUID v4 - unique per event record |
| patient_id | UUID v4 - consistent across all events for same patient |
| event_type | NEW or UPDATED |
| status | One of: pending, confirmed, negative, probable, recovered, dead, hospitalized |
| patient.first_name | Synthesized from embedded dictionary (~200 entries) |
| patient.last_name | Synthesized from embedded dictionary (~200 entries) |
| patient.age | Weighted normal distribution, μ = 45, σ = 18, clamped 1 - 99 |
| patient.gender | M, F, or X (weighted 48/48/4) |
| patient.phone | US format: (XXX) XXX-XXXX |
| location.city | From embedded US city list (~120 entries) |
| location.state | US state abbreviation |
| location.zip | 5-digit code matching state range |
| timestamp | ISO 8601 within configured date range |
| symptomatic | Boolean - 70% true for confirmed, 30% for negative |
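
To make the field layout concrete, here is a sketch of one record and the two output formats. Every value in `sampleEvent` is made up for illustration. Treating the dotted field names as nested `patient` and `location` objects is an assumption, as are the helper names `toNdjson` and `toJsonArray`.

```javascript
// Illustrative record only; all values are invented for this example.
const sampleEvent = {
  event_id: "3f2b8c1e-5d4a-4c7b-9e1f-0a1b2c3d4e5f",
  patient_id: "9a8b7c6d-1e2f-4a3b-8c4d-5e6f7a8b9c0d",
  event_type: "NEW",
  status: "pending",
  patient: {
    first_name: "Alex",
    last_name: "Rivera",
    age: 47,
    gender: "F",
    phone: "(312) 480-0147",
  },
  location: { city: "Chicago", state: "IL", zip: "60614" },
  timestamp: "2021-03-14T09:26:53Z",
  symptomatic: true,
};

// NDJSON: one compact JSON object per line, newline-terminated.
function toNdjson(events) {
  return events.map((event) => JSON.stringify(event)).join("\n") + "\n";
}

// JSON array: a single pretty-printed document.
function toJsonArray(events) {
  return JSON.stringify(events, null, 2);
}
```

NDJSON suits line-oriented consumers because each line parses on its own, while the array form must be read in full before parsing.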
Frequently Asked Questions
Each patient is assigned a patient_id (UUID v4) at creation. The NEW event is always generated first with status pending. Subsequent UPDATED events reference the same patient_id and carry monotonically increasing timestamps. The generator processes all lifecycle stages for a patient before moving to the next, so piping the output preserves causal ordering. If you shuffle the output, you can re-sort by patient_id then timestamp to restore the correct event sequence.
Outcome events are generated only for patients who reach confirmed (approximately 40% of all patients). Among those, the fatality rate is age-dependent: 2% for age < 60, 8% for age ≥ 60. The hospitalized category captures the remainder after subtracting dead and recovered percentages. This means roughly 13% of confirmed cases appear as hospitalized with no final resolution, simulating an open-ended data stream where not all outcomes are reported.
NDJSON output works with kafka-console-producer, jq (jq -c "." reads the records one at a time), and curl POST loops. The JSON Array format is better suited for file-based import into databases or REST API bulk endpoints. Copy the output or download the file and pipe it directly.
Phone numbers use the format (XXX) XXX-XXXX with area codes drawn from a realistic range (200-999, excluding reserved prefixes like 555). ZIP codes are generated within plausible ranges for each state (e.g., New York ZIPs start with 100xx-149xx). They are structurally valid but not guaranteed to correspond to real addresses. For privacy reasons, no real PII is embedded in the dictionaries.
The configured date range bounds NEW event timestamps. Each NEW event receives a random timestamp within this range. Subsequent UPDATED events are offset forward by their delay range (1 - 14 days for test results, 5 - 30 days for outcomes). This means UPDATED events can have timestamps that exceed the configured end date, which mirrors real-world data where a test administered on the last day of a reporting period returns results days later.
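
The re-sort by patient_id then timestamp mentioned above can be sketched as follows. The helper names `parseNdjson` and `restoreOrder` are hypothetical, and the sketch assumes each NDJSON line carries top-level `patient_id` and ISO 8601 `timestamp` fields as listed in the field reference.

```javascript
// Parse NDJSON text into an array of event objects, skipping blank lines.
function parseNdjson(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Return a new array ordered by patient_id, then by timestamp.
function restoreOrder(events) {
  return [...events].sort((a, b) => {
    if (a.patient_id !== b.patient_id) {
      return a.patient_id < b.patient_id ? -1 : 1;
    }
    // Date.parse turns ISO 8601 strings into comparable millisecond values.
    return Date.parse(a.timestamp) - Date.parse(b.timestamp);
  });
}
```

This groups each patient's events in causal order. It does not recover the order in which patients were originally generated, since patient_id values are random.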