A hospital uses an electronic health records (EHR) system to collect two types of data:
– Patient information, which includes a patient's name and address.
– Diagnostic tests conducted and the results of these tests.
Patient information is expected to change periodically. Existing diagnostic test data never changes; only new records are added.
The hospital runs an Amazon Redshift cluster with four dc2.large nodes and wants to automate the ingestion of the patient information and diagnostic test data into their respective Amazon Redshift tables for analysis. The EHR system exports data as CSV files to an Amazon S3 bucket daily. Two sets of CSV files are generated: one set contains patient information with updates, deletes, and inserts, and the other contains new diagnostic test data only.
What is the MOST cost-effective solution to meet these requirements?
A. Use Amazon EMR with Apache Hudi. Run daily ETL jobs using Apache Spark and the Amazon Redshift JDBC driver.
B. Use an AWS Glue crawler to catalog the data in Amazon S3. Use Amazon Redshift Spectrum to perform scheduled queries of the data in Amazon S3 and ingest the data into the patient information table and the diagnostic tests table.
C. Use an AWS Lambda function to run a COPY command that appends new diagnostic test data to the diagnostic tests table. Run another COPY command to load the patient information data into the staging tables. Use a stored procedure to handle create, update, and delete operations for the patient information table.
D. Use AWS Database Migration Service (AWS DMS) to collect and process change data capture (CDC) records. Use the COPY command to load patient information data into the staging tables. Use a stored procedure to handle create, update, and delete operations for the patient information table.
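
For concreteness, the load-and-merge pattern described in option C might look like the following minimal sketch. It uses an AWS Lambda handler that submits SQL through the Amazon Redshift Data API (boto3's redshift-data client). Every identifier here is an assumption made for illustration: the cluster name, database, user, IAM role, S3 prefixes, table names, and the stored procedure merge_patient_info are all hypothetical, not taken from the question.

import boto3

redshift_data = boto3.client("redshift-data")

CLUSTER_ID = "ehr-cluster"   # hypothetical cluster identifier
DATABASE = "ehr"             # hypothetical database name
DB_USER = "etl_user"         # hypothetical database user
IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftCopyRole"  # hypothetical

def run_sql(sql):
    # Submit one statement through the Redshift Data API.
    # execute_statement is asynchronous; a production job would poll
    # describe_statement before issuing the next dependent statement.
    return redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        DbUser=DB_USER,
        Sql=sql,
    )["Id"]

def handler(event, context):
    # Diagnostic test data is append-only, so a plain COPY suffices.
    run_sql(
        "COPY diagnostic_tests "
        "FROM 's3://ehr-exports/diagnostic/' "
        f"IAM_ROLE '{IAM_ROLE}' "
        "FORMAT AS CSV;"
    )
    # Patient information carries inserts, updates, and deletes, so it
    # is loaded into a staging table first...
    run_sql(
        "COPY patient_info_staging "
        "FROM 's3://ehr-exports/patient/' "
        f"IAM_ROLE '{IAM_ROLE}' "
        "FORMAT AS CSV;"
    )
    # ...and a stored procedure applies the staged changes to the
    # patient information table (hypothetical procedure name).
    run_sql("CALL merge_patient_info();")

Inside such a stored procedure, the usual Amazon Redshift merge pattern is to delete target rows whose keys appear in the staging table, insert the staged rows, and then truncate the staging table for the next daily run.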