Description
Heterogeneous data from instruments, assays, and public repositories must be carefully integrated to produce tidy, analysis-ready datasets. Students learn to use Python and the Linux shell to extract, transform, merge, and clean data, converting between domain-specific formats and general-purpose formats such as CSV and JSON, with documented schemas and quality checks for downstream visualization and analytics. Working in a command-line environment, they build reproducible workflows with standard utilities and organize projects for traceability. Students also schedule incremental updates and maintain data integrity across them (handling late-arriving records, duplicates, and versioned refreshes), load prepared datasets into visualization tools, and create simple web endpoints to share transformed datasets for defined analytical tasks.

Registration in this course is restricted to students admitted to the Data Visualization in Biological Sciences program.