Total events (NEW + UPDATED), not patients. Range: 1–100,000
Earliest timestamp for NEW events
Latest timestamp for NEW events
NDJSON for streaming/piping, JSON Array for file import
Percentage of tests returning confirmed positive
Percentage of confirmed cases resulting in death (age-adjusted internally)

About

Testing pipelines, streaming demos, and healthcare dashboards need high-volume event data that respects epidemiological state transitions; a random JSON blob does not. Each patient record follows a lifecycle: a NEW event (test administered, result pending) must precede any UPDATED event (confirmed, negative, probable, recovered, dead, hospitalized). A confirmed case has a non-zero probability of generating a subsequent recovery or death event within a realistic window of 5 to 30 days. This generator implements a weighted Markov state machine whose transition probabilities approximate CDC aggregate ratios: roughly 40% positivity and 2% case fatality among confirmed cases. Dates, names, locations, and demographics are synthesized from embedded dictionaries without external API calls. Output is valid NDJSON or a JSON array, suitable for piping into Kafka, curl, or jq.

Limitations: demographic distributions are US-centric. Age-weighted severity is simplified to two tiers (<60 and ≥60). Geographic data is not correlated to real outbreak hotspots. The tool approximates aggregate statistics and does not model transmission dynamics, R-number, or hospital capacity. For load-testing a streaming consumer or populating a dashboard prototype, this level of fidelity is sufficient. For epidemiological research, it is not.


Formulas

Each patient follows a Markov chain with weighted transitions. The probability of terminal state S given a confirmed case depends on age bracket:

$$
P(S = \text{dead} \mid \text{confirmed}) =
\begin{cases}
0.02 & \text{if } \text{age} < 60 \\
0.08 & \text{if } \text{age} \ge 60
\end{cases}
$$

Age generation uses the Box-Muller transform to approximate a normal distribution:

$$
\text{age} = \operatorname{round}\!\left(\mu + \sigma \sqrt{-2 \ln u_1}\, \cos(2\pi u_2)\right)
$$

Where μ = 45 (mean age), σ = 18 (standard deviation), and u1, u2 are uniform random values in (0, 1). Result is clamped to [1, 99].
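
The age formula can be sketched in a few lines of Python. This is an illustration of the Box-Muller step described above, not the tool's own code; the function name `sample_age` and the use of `1 - random()` to keep the logarithm's argument away from zero are assumptions made for the example.

```python
import math
import random

MU = 45.0     # mean age, from the text
SIGMA = 18.0  # standard deviation, from the text

def sample_age(rng: random.Random) -> int:
    """Draw one age via the Box-Muller transform, clamped to [1, 99]."""
    # random() returns values in [0, 1), so 1 - random() lies in (0, 1]
    # and log() never receives zero.
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return max(1, min(99, round(MU + SIGMA * z)))

rng = random.Random(42)  # seeded only to make the example repeatable
ages = [sample_age(rng) for _ in range(10_000)]
```

Over a large sample the mean should sit close to 45, with the clamp trimming only the far tails.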

UUID v4 generation uses crypto.getRandomValues to fill 16 random bytes, then sets version bits (0100) at byte 6 and variant bits (10) at byte 8 per RFC 4122.
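
The bit manipulation is easier to follow in code. The sketch below is a Python analogue that substitutes `os.urandom` for the browser's `crypto.getRandomValues`; the helper name `uuid4_from_bytes` is hypothetical. The standard-library `uuid` module is used only to check the result.

```python
import os
import uuid

def uuid4_from_bytes(raw: bytes) -> str:
    """Format 16 random bytes as an RFC 4122 version-4 UUID string."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    b = bytearray(raw)
    b[6] = (b[6] & 0x0F) | 0x40  # high nibble of byte 6 becomes 0100 (version 4)
    b[8] = (b[8] & 0x3F) | 0x80  # top two bits of byte 8 become 10 (variant)
    h = b.hex()
    return f"{h[0:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:32]}"

event_id = uuid4_from_bytes(os.urandom(16))
print(uuid.UUID(event_id).version)  # prints 4
```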

The state machine transition diagram:

NEW(pending) → UPDATED(confirmed | negative | probable) → UPDATED(recovered | dead | hospitalized)
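
A consumer can use the same diagram to check that a patient's events arrive in a legal order. The following Python sketch encodes the transitions as a lookup table; the names `ALLOWED` and `is_valid_lifecycle` are hypothetical, and the sketch assumes that only confirmed cases receive a second UPDATED event, as the diagram implies.

```python
# Legal successors for each status; None stands for "no event seen yet".
ALLOWED = {
    None: {"pending"},
    "pending": {"confirmed", "negative", "probable"},
    "confirmed": {"recovered", "dead", "hospitalized"},
}

def is_valid_lifecycle(statuses: list[str]) -> bool:
    """Return True if the statuses form a complete, legal path."""
    previous = None
    for status in statuses:
        # Statuses missing from ALLOWED are terminal and have no successors.
        if status not in ALLOWED.get(previous, set()):
            return False
        previous = status
    # A lone pending event is incomplete: every patient gets at least
    # one UPDATED event after NEW.
    return len(statuses) >= 2

print(is_valid_lifecycle(["pending", "confirmed", "recovered"]))  # prints True
print(is_valid_lifecycle(["pending", "negative", "recovered"]))   # prints False
```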

Total event count per patient ranges from 2 (NEW + single UPDATED) to 3 (NEW + confirmed + outcome). The user-specified generation amount N refers to total events, not patients. Approximate patient count: P ≈ N / 2.2, where 2.2 is the mean number of events per patient lifecycle.

Reference Data

| Event Type | Status Value | Probability Weight | Follows | Delay Range | Description |
|---|---|---|---|---|---|
| NEW | pending | 100% (initial) | - | 0 days | Patient symptomatic, test administered, awaiting result |
| UPDATED | confirmed | 40% | pending | 1 - 14 days | Positive test result returned |
| UPDATED | negative | 55% | pending | 1 - 14 days | Negative test result returned |
| UPDATED | probable | 5% | pending | 1 - 14 days | Probable case, clinical diagnosis without conclusive test |
| UPDATED | recovered | 85% of confirmed | confirmed | 7 - 30 days | Patient recovered and released |
| UPDATED | dead | 2% (age < 60), 8% (age ≥ 60) | confirmed | 5 - 25 days | Patient deceased |
| UPDATED | hospitalized | 13% of confirmed (remainder) | confirmed | 2 - 10 days | Patient hospitalized, outcome pending further events |
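
The weights in this table map directly onto a weighted random choice. The Python sketch below draws one patient lifecycle; it is an illustration, not the generator's actual code. The names `FIRST_TIER`, `fatality_rate`, and `sample_lifecycle` are hypothetical, and treating hospitalized as whatever probability remains after recovered and dead (so 7% rather than 13% for patients aged 60 and over) is an assumption based on the "remainder" note.

```python
import random

FIRST_TIER = {"confirmed": 0.40, "negative": 0.55, "probable": 0.05}
RECOVERED_SHARE = 0.85  # share of confirmed cases that recover

def fatality_rate(age: int) -> float:
    """Age-dependent probability of death given a confirmed case."""
    return 0.08 if age >= 60 else 0.02

def sample_lifecycle(age: int, rng: random.Random) -> list[str]:
    """Return the status sequence for one patient."""
    statuses = ["pending"]
    first = rng.choices(list(FIRST_TIER), weights=list(FIRST_TIER.values()))[0]
    statuses.append(first)
    if first == "confirmed":
        p_dead = fatality_rate(age)
        # Assumption: hospitalized absorbs the remaining probability mass.
        p_hospitalized = 1.0 - RECOVERED_SHARE - p_dead
        outcome = rng.choices(
            ["recovered", "dead", "hospitalized"],
            weights=[RECOVERED_SHARE, p_dead, p_hospitalized],
        )[0]
        statuses.append(outcome)
    return statuses

rng = random.Random(0)  # seeded only to make the example repeatable
runs = [sample_lifecycle(30, rng) for _ in range(20_000)]
confirmed_share = sum(r[1] == "confirmed" for r in runs) / len(runs)
```

With enough draws, `confirmed_share` should land near 0.40.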

Output Field Reference

| Field | Description |
|---|---|
| event_id | UUID v4 - unique per event record |
| patient_id | UUID v4 - consistent across all events for same patient |
| event_type | NEW or UPDATED |
| status | One of: pending, confirmed, negative, probable, recovered, dead, hospitalized |
| patient.first_name | Synthesized from embedded dictionary (~200 entries) |
| patient.last_name | Synthesized from embedded dictionary (~200 entries) |
| patient.age | Weighted normal distribution, μ = 45, σ = 18, clamped 1 - 99 |
| patient.gender | M, F, or X (weighted 48/48/4) |
| patient.phone | US format: (XXX) XXX-XXXX |
| location.city | From embedded US city list (~120 entries) |
| location.state | US state abbreviation |
| location.zip | 5-digit code matching state range |
| timestamp | ISO 8601 within configured date range |
| symptomatic | Boolean - 70% true for confirmed, 30% for negative |
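
A consumer might check incoming records against this field list before processing them. The Python sketch below assumes the dotted names (`patient.age`, `location.zip`) denote nested objects, which the reference does not state explicitly. The helper names and every value in `sample` are made up for illustration.

```python
REQUIRED = [
    "event_id", "patient_id", "event_type", "status",
    "patient.first_name", "patient.last_name", "patient.age",
    "patient.gender", "patient.phone",
    "location.city", "location.state", "location.zip",
    "timestamp", "symptomatic",
]
STATUSES = {"pending", "confirmed", "negative", "probable",
            "recovered", "dead", "hospitalized"}

def has_path(record: dict, dotted: str) -> bool:
    """Check that a dotted path such as 'patient.age' exists in the record."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = [f"missing {p}" for p in REQUIRED if not has_path(record, p)]
    if record.get("event_type") not in {"NEW", "UPDATED"}:
        problems.append("event_type must be NEW or UPDATED")
    if record.get("status") not in STATUSES:
        problems.append("unknown status")
    if record.get("event_type") == "NEW" and record.get("status") != "pending":
        problems.append("NEW events must have status pending")
    return problems

# Hypothetical record; all values are invented for this example.
sample = {
    "event_id": "3f2b8c1e-7a4d-4e9b-9c2a-5d1f0e6b7a88",
    "patient_id": "a1c4e7f0-2b5d-4a8e-8f3b-6c9d0e1f2a3b",
    "event_type": "NEW",
    "status": "pending",
    "patient": {"first_name": "Jordan", "last_name": "Reyes", "age": 52,
                "gender": "F", "phone": "(312) 482-0197"},
    "location": {"city": "Chicago", "state": "IL", "zip": "60614"},
    "timestamp": "2021-03-01T10:00:00Z",
    "symptomatic": True,
}
print(validate(sample))  # prints []
```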

Frequently Asked Questions

Every patient receives a stable patient_id (UUID v4) at creation. The NEW event is always generated first with status pending. Subsequent UPDATED events reference the same patient_id and carry monotonically increasing timestamps. The generator processes all lifecycle stages for a patient before moving to the next, so piping the output preserves causal ordering. If you shuffle the output, you can re-sort by patient_id then timestamp to restore the correct event sequence.
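
The re-sort mentioned above takes one line once timestamps are parsed. This Python sketch assumes every record carries `patient_id` and an ISO 8601 `timestamp` in a uniform format; the helper names are hypothetical.

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp into a datetime."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def restore_order(events: list[dict]) -> list[dict]:
    """Group events by patient, then order each patient's events by time."""
    return sorted(events, key=lambda e: (e["patient_id"], parse_ts(e["timestamp"])))

# Hypothetical shuffled input with two patients, "a" and "b".
shuffled = [
    {"patient_id": "a", "status": "recovered", "timestamp": "2021-03-20T09:00:00Z"},
    {"patient_id": "b", "status": "pending", "timestamp": "2021-03-02T08:00:00Z"},
    {"patient_id": "a", "status": "pending", "timestamp": "2021-03-01T10:00:00Z"},
    {"patient_id": "b", "status": "negative", "timestamp": "2021-03-05T12:30:00Z"},
    {"patient_id": "a", "status": "confirmed", "timestamp": "2021-03-04T15:00:00Z"},
]
ordered = restore_order(shuffled)
print([e["status"] for e in ordered])
# prints ['pending', 'confirmed', 'recovered', 'pending', 'negative']
```

Sorting this way restores each patient's sequence but groups patients together, so it does not reproduce a global chronological order.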
The second-tier transitions (recovered, dead, hospitalized) apply only to patients whose first UPDATED status is confirmed (approximately 40% of all patients). Among those, the fatality rate is age-dependent: 2% for age < 60, 8% for age ≥ 60. The hospitalized category captures the remainder after subtracting dead and recovered percentages. This means roughly 13% of confirmed cases appear as hospitalized with no final resolution, simulating an open-ended data stream where not all outcomes are reported.
Yes. The NDJSON (Newline Delimited JSON) format produces one valid JSON object per line with no wrapping array brackets. This is the line-oriented input expected by kafka-console-producer, by jq (which reads a stream of JSON values by default, for example jq -c "."), and by curl POST loops. The JSON Array format is better suited for file-based import into databases or REST API bulk endpoints. Copy the output or download the file and pipe it directly.
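
The difference between the two formats is small enough to show with Python's standard `json` module. The function names here are hypothetical, and the two sample events are trimmed to a few fields for brevity.

```python
import json

def to_ndjson(events: list[dict]) -> str:
    """One compact JSON object per line, no surrounding brackets."""
    return "".join(json.dumps(e, separators=(",", ":")) + "\n" for e in events)

def to_json_array(events: list[dict]) -> str:
    """A single JSON document holding all events."""
    return json.dumps(events, indent=2)

def read_ndjson(text: str) -> list[dict]:
    """Parse NDJSON line by line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

events = [
    {"event_type": "NEW", "status": "pending"},
    {"event_type": "UPDATED", "status": "negative"},
]
ndjson_text = to_ndjson(events)
print(ndjson_text, end="")
# prints:
# {"event_type":"NEW","status":"pending"}
# {"event_type":"UPDATED","status":"negative"}
```

Because each NDJSON line parses on its own, a consumer can process events as they arrive instead of waiting for a closing bracket.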
Phone numbers follow the US format (XXX) XXX-XXXX with area codes drawn from a realistic range (200-999, excluding reserved prefixes like 555). ZIP codes are generated within plausible ranges for each state (e.g., New York ZIPs start with 100xx-149xx). They are structurally valid but not guaranteed to correspond to real addresses. For privacy reasons, no real PII is embedded in the dictionaries.
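
A rough Python sketch of this kind of structural generation follows. Only the New York ZIP prefix range and the area-code rules come from the text; the exchange range, the function names, and the single-entry `ZIP_PREFIX_RANGES` table are assumptions for illustration.

```python
import random

# Three-digit ZIP prefix range per state. Only New York is taken from the
# text; a fuller table would have to be supplied separately.
ZIP_PREFIX_RANGES = {"NY": (100, 149)}

def fake_phone(rng: random.Random) -> str:
    """Build a US-formatted phone number with an area code in 200-999."""
    area = rng.randint(200, 999)
    while area == 555:  # skip the reserved prefix named in the text
        area = rng.randint(200, 999)
    exchange = rng.randint(200, 999)  # assumption: same range as area codes
    line = rng.randint(0, 9999)
    return f"({area}) {exchange}-{line:04d}"

def fake_zip(state: str, rng: random.Random) -> str:
    """Build a 5-digit ZIP whose prefix falls in the state's range."""
    low, high = ZIP_PREFIX_RANGES[state]
    return f"{rng.randint(low, high)}{rng.randint(0, 99):02d}"

rng = random.Random(7)  # seeded only to make the example repeatable
phone = fake_phone(rng)
zip_code = fake_zip("NY", rng)
```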
Generation above 500 events is offloaded to a Web Worker to prevent UI freezing. The progress bar updates in real time. The practical upper limit is around 100,000 events before browser memory constraints on the resulting string become a concern (approximately 80MB of JSON). For very large datasets, download the file rather than previewing in-browser, as rendering large text blocks into the DOM is the bottleneck, not generation itself.
The start and end dates define the window for NEW event timestamps. Each NEW event receives a random timestamp within this range. Subsequent UPDATED events are offset forward by their delay range (1 to 14 days for test results, 2 to 30 days for outcomes, depending on the status). This means UPDATED events can have timestamps that exceed the configured end date, which mirrors real-world data where a test administered on the last day of a reporting period returns results days later.
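
The timestamp logic can be sketched as follows in Python. The delay ranges are copied from the reference table; the function names and the choice of a uniform draw within each range are assumptions, since the text does not say how the delay is distributed.

```python
import random
from datetime import datetime, timedelta

# Delay ranges in days, from the reference table.
DELAY_DAYS = {
    "confirmed": (1, 14), "negative": (1, 14), "probable": (1, 14),
    "recovered": (7, 30), "dead": (5, 25), "hospitalized": (2, 10),
}

def new_timestamp(start: datetime, end: datetime, rng: random.Random) -> datetime:
    """Pick a NEW event time inside the configured window."""
    span = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span))

def updated_timestamp(previous: datetime, status: str, rng: random.Random) -> datetime:
    """Offset an UPDATED event forward from the previous event."""
    low, high = DELAY_DAYS[status]
    # Assumption: the delay is drawn uniformly within the range.
    return previous + timedelta(days=rng.uniform(low, high))

rng = random.Random(3)  # seeded only to make the example repeatable
start, end = datetime(2021, 1, 1), datetime(2021, 1, 31)
t_new = new_timestamp(start, end, rng)
t_result = updated_timestamp(t_new, "confirmed", rng)
t_outcome = updated_timestamp(t_result, "recovered", rng)
# A test taken at the very end of the window resolves after the end date.
t_late = updated_timestamp(end, "negative", rng)
```

Here `t_late` always falls after `end`, which is the overshoot described above.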