AWS Certified Data Analytics – Specialty DAS-C01 – Question130

A transport company wants to track vehicular movements by capturing geolocation records. Each record is 10 bytes in size, and up to 10,000 records are captured each second. Data transmission delays of a few minutes are acceptable, considering unreliable network conditions. The transport company decided to use Amazon Kinesis Data Streams to ingest the data. The company is looking for a reliable mechanism to send data to Kinesis Data Streams while maximizing the throughput efficiency of the Kinesis shards.
Which solution will meet the company's requirements?

A. Kinesis Agent
B. Kinesis Producer Library (KPL)
C. Kinesis Data Firehose
D. Kinesis SDK

Correct Answer: B
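
Why the KPL maximizes shard efficiency here comes down to arithmetic: the per-shard ingest quotas are 1 MB per second and 1,000 records per second, so 10,000 unaggregated 10-byte records per second would need 10 shards even though the byte volume is only about 100 KB/s. The KPL's record aggregation packs many small user records into a single Kinesis record, and the acceptable delay of a few minutes leaves room for its buffering. A minimal Python sketch of that sizing math (figures taken from the question and the published shard quotas):

    # Shard-sizing arithmetic behind answer B (KPL with record aggregation).
    # The limits below are the published per-shard Kinesis Data Streams quotas.
    RECORD_SIZE_BYTES = 10                # from the question
    RECORDS_PER_SECOND = 10_000           # from the question
    SHARD_BYTES_PER_SECOND = 1_000_000    # 1 MB/s ingest per shard
    SHARD_RECORDS_PER_SECOND = 1_000      # 1,000 PUT records/s per shard

    throughput_bytes = RECORD_SIZE_BYTES * RECORDS_PER_SECOND   # 100 KB/s

    # Without aggregation (e.g., plain SDK PutRecords calls), the record-count
    # limit dominates: 10,000 records/s needs 10 shards even though the byte
    # throughput would fit in one.
    shards_without_aggregation = max(
        -(-throughput_bytes // SHARD_BYTES_PER_SECOND),
        -(-RECORDS_PER_SECOND // SHARD_RECORDS_PER_SECOND),
    )

    # With KPL aggregation, many 10-byte user records are packed into one
    # Kinesis record (up to 1 MB), so only the byte limit matters.
    shards_with_kpl = -(-throughput_bytes // SHARD_BYTES_PER_SECOND)

    print(f"Shards needed without aggregation: {shards_without_aggregation}")  # 10
    print(f"Shards needed with KPL aggregation: {shards_with_kpl}")            # 1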

AWS Certified Data Analytics – Specialty DAS-C01 – Question129

A company owns manufacturing facilities with Internet of Things (IoT) devices installed to monitor safety data.
The company has configured an Amazon Kinesis data stream as a source for an Amazon Kinesis Data Firehose delivery stream, which outputs data to Amazon S3. The company's operations team wants to gain insights from the IoT data to monitor data quality at ingestion. The insights need to be derived in near-real time, and the output must be logged to Amazon DynamoDB for further analysis.
Which solution meets these requirements?

A. Create an Amazon Kinesis Data Analytics for SQL application to read and analyze the data in the data stream. Add an output configuration so that everything written to an in-application stream persists in a DynamoDB table.
B. Create an Amazon Kinesis Data Analytics for SQL application to read and analyze the data in the data stream. Add an output configuration so that everything written to an in-application stream is passed to an AWS Lambda function that saves the data in a DynamoDB table as persistent data.
C. Configure an AWS Lambda function to analyze the data in the Kinesis Data Firehose delivery stream. Save the output to a DynamoDB table.
D. Configure an AWS Lambda function to analyze the data in the Kinesis Data Firehose delivery stream and save the output to an S3 bucket. Schedule an AWS Glue job to periodically copy the data from the bucket to a DynamoDB table.

Correct Answer: B
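
A Kinesis Data Analytics for SQL application has no direct DynamoDB destination, which is why option B routes the in-application stream output through a Lambda function. Below is a minimal Python sketch of such a handler; the table name and the column names emitted by the SQL application are assumptions, and the request/response shape follows the record-acknowledgement contract used for Kinesis Data Analytics Lambda outputs.

    import base64
    import json

    import boto3

    dynamodb = boto3.resource("dynamodb")
    # Assumption: a pre-created DynamoDB table for the data-quality output.
    table = dynamodb.Table("IoTDataQualityMetrics")

    def handler(event, context):
        """Lambda destination for a Kinesis Data Analytics for SQL application.

        KDA delivers batches of in-application stream rows; each record must be
        acknowledged with 'Ok' (or 'DeliveryFailed' to trigger a retry).
        """
        results = []
        for record in event["records"]:
            try:
                payload = json.loads(base64.b64decode(record["data"]))
                # Assumption: the SQL application emits these column names.
                # Numeric values are stored as strings because DynamoDB does
                # not accept Python floats directly.
                table.put_item(Item={
                    "sensor_id": payload["SENSOR_ID"],
                    "window_end": payload["WINDOW_END"],
                    "anomaly_score": str(payload["ANOMALY_SCORE"]),
                })
                results.append({"recordId": record["recordId"], "result": "Ok"})
            except Exception:
                results.append({"recordId": record["recordId"],
                                "result": "DeliveryFailed"})
        return {"records": results}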

AWS Certified Data Analytics – Specialty DAS-C01 – Question128

A public sector organization ingests large datasets from various relational databases into an Amazon S3 data lake on a daily basis. Data analysts need a mechanism to profile the data and diagnose data quality issues after the data is ingested into Amazon S3. The solution should allow the data analysts to visualize and explore the data quality metrics through a user interface.
Which set of steps provide a solution that meets these requirements?

A. Create a new AWS Glue DataBrew dataset for each dataset in the S3 data lake. Create a new DataBrew project for each dataset. Create a profile job for each project and schedule it to run daily. Instruct the data analysts to explore the data quality metrics by using the DataBrew console.
B. Create a new AWS Glue ETL job that uses the Deequ Spark library for data validation and schedule the ETL job to run daily. Store the output of the ETL job within an S3 bucket. Instruct the data analysts to query and visualize the data quality metrics by using the Amazon Athena console.
C. Schedule an AWS Lambda function to run daily by using Amazon EventBridge (Amazon CloudWatch Events). Configure the Lambda function to test the data quality of each object and store the results in an S3 bucket. Create an Amazon QuickSight dashboard to query and visualize the results. Instruct the data analysts to explore the data quality metrics using QuickSight.
D. Schedule an AWS Step Functions workflow to run daily by using Amazon EventBridge (Amazon CloudWatch Events). Configure the steps by using AWS Lambda functions to perform the data quality checks and update the catalog tags in the AWS Glue Data Catalog with the results. Instruct the data analysts to explore the data quality metrics using the Data Catalog console.

Correct Answer: A
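
The DataBrew workflow in option A can also be provisioned through the API. The boto3 sketch below registers one S3 dataset, creates a profile job for it, and schedules the job to run daily; the bucket, prefix, IAM role, and names are placeholders, and in practice this would be repeated per dataset.

    import boto3

    databrew = boto3.client("databrew")

    # Placeholders -- substitute the data lake bucket, prefix, and an IAM role
    # that DataBrew can assume.
    BUCKET = "example-data-lake"
    PREFIX = "ingest/orders/"
    ROLE_ARN = "arn:aws:iam::123456789012:role/DataBrewProfileRole"

    # Register one ingested dataset with DataBrew.
    databrew.create_dataset(
        Name="orders",
        Input={"S3InputDefinition": {"Bucket": BUCKET, "Key": PREFIX}},
    )

    # Profile job: computes completeness, distributions, and other data-quality
    # statistics that analysts can then browse in the DataBrew console.
    databrew.create_profile_job(
        Name="orders-daily-profile",
        DatasetName="orders",
        RoleArn=ROLE_ARN,
        OutputLocation={"Bucket": BUCKET, "Key": "databrew-profiles/orders/"},
    )

    # Run the profile job once a day.
    databrew.create_schedule(
        Name="orders-daily-profile-schedule",
        JobNames=["orders-daily-profile"],
        CronExpression="cron(0 6 * * ? *)",
    )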

AWS Certified Data Analytics – Specialty DAS-C01 – Question127

A company wants to improve user satisfaction for its smart home system by adding more features to its recommendation engine. Each sensor asynchronously pushes its nested JSON data into Amazon Kinesis Data Streams using the Kinesis Producer Library (KPL) in Java. Statistics from a set of failed sensors showed that, when a sensor is malfunctioning, its recorded data is not always sent to the cloud.
The company needs a solution that offers near-real-time analytics on the data from the most updated sensors.
Which solution enables the company to meet these requirements?

A. Set the RecordMaxBufferedTime property of the KPL to "-1" to disable the buffering on the sensor side. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Push the enriched data to a fleet of Kinesis data streams and enable the data transformation feature to flatten the JSON file. Instantiate a dense storage Amazon Redshift cluster and use it as the destination for the Kinesis Data Firehose delivery stream.
B. Update the sensors' code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Direct the output of the Kinesis Data Analytics application to a Kinesis Data Firehose delivery stream, enable the data transformation feature to flatten the JSON file, and set the Kinesis Data Firehose destination to an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster.
C. Set the RecordMaxBufferedTime property of the KPL to "0" to disable the buffering on the sensor side. Connect a dedicated Kinesis Data Firehose delivery stream to each stream and enable the data transformation feature to flatten the JSON file before sending it to an Amazon S3 bucket. Load the S3 data into an Amazon Redshift cluster.
D. Update the sensors' code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use AWS Glue to fetch and process data from the stream using the Kinesis Client Library (KCL). Instantiate an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster and use AWS Lambda to directly push data into it.

Correct Answer: B
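
Option B drops the KPL (and the client-side buffering that can hold back data from failing sensors) in favor of direct PutRecord/PutRecords calls. The scenario specifies the AWS SDK for Java; purely for brevity, the same PutRecords pattern is sketched below with boto3, with the stream name and partition-key choice as assumptions.

    import json
    import uuid

    import boto3

    kinesis = boto3.client("kinesis")
    STREAM_NAME = "smart-home-sensor-data"   # placeholder stream name

    def send_sensor_batch(readings):
        """Synchronous PutRecords call (option B), replacing the KPL's
        asynchronous, buffered writes on the sensor side.

        `readings` is an iterable of nested-JSON-serializable dicts; each record
        gets its own partition key so traffic spreads across shards.
        """
        entries = [
            {
                "Data": json.dumps(reading).encode("utf-8"),
                "PartitionKey": str(uuid.uuid4()),
            }
            for reading in readings
        ]
        response = kinesis.put_records(StreamName=STREAM_NAME, Records=entries)

        # PutRecords is not all-or-nothing: resend only the entries that failed.
        # A production producer would retry with backoff rather than once.
        if response["FailedRecordCount"]:
            failed = [
                entry
                for entry, result in zip(entries, response["Records"])
                if "ErrorCode" in result
            ]
            kinesis.put_records(StreamName=STREAM_NAME, Records=failed)
        return response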

AWS Certified Data Analytics – Specialty DAS-C01 – Question126

A banking company wants to collect large volumes of transactional data using Amazon Kinesis Data Streams for real-time analytics. The company uses PutRecord to send data to Amazon Kinesis, and has observed network outages during certain times of the day. The company wants to obtain exactly-once semantics for the entire processing pipeline.
What should the company do to obtain these characteristics?

A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record.
B. Rely on the processing semantics of Amazon Kinesis Data Analytics to avoid duplicate processing of events.
C. Design the data producer so events are not ingested into Kinesis Data Streams multiple times.
D. Rely on the exactly-once processing semantics of Apache Flink and Apache Spark Streaming included in Amazon EMR.

Correct Answer: A
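
Kinesis Data Streams cannot guarantee exactly-once ingestion (producer retries after a network outage create duplicates), so deduplication has to live in the application, as option A states. A hedged sketch of the pattern: the producer embeds a unique ID before sending, and the consumer uses a conditional DynamoDB write as an idempotency check. The table, stream, and field names are illustrative.

    import json
    import uuid

    import boto3
    from botocore.exceptions import ClientError

    kinesis = boto3.client("kinesis")
    dynamodb = boto3.resource("dynamodb")

    # Assumption: a DynamoDB table keyed on `event_id`, used purely for
    # idempotency bookkeeping.
    dedup_table = dynamodb.Table("processed-transaction-ids")

    def put_transaction(stream_name, transaction):
        """Producer side: embed a unique ID before the record leaves the client,
        so retries after a network outage can be recognized downstream."""
        transaction["event_id"] = str(uuid.uuid4())
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(transaction).encode("utf-8"),
            PartitionKey=transaction["account_id"],   # assumed field
        )

    def process_once(record_payload):
        """Consumer side: the conditional write succeeds only the first time an
        event_id is seen, so duplicate deliveries are dropped."""
        event = json.loads(record_payload)
        try:
            dedup_table.put_item(
                Item={"event_id": event["event_id"]},
                ConditionExpression="attribute_not_exists(event_id)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return  # duplicate -- already processed
            raise
        handle_transaction(event)   # placeholder for the real business logic

    def handle_transaction(event):
        print("processing", event["event_id"])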

AWS Certified Data Analytics – Specialty DAS-C01 – Question125

An ecommerce company uses Amazon Aurora PostgreSQL to process and store live transactional data and uses Amazon Redshift for its data warehouse solution. A nightly ETL job has been implemented to update the Redshift cluster with new data from the PostgreSQL database. The business has grown rapidly, and so have the size and cost of the Redshift cluster. The company's data analytics team needs to create a solution to archive historical data and keep only the most recent 12 months of data in Amazon Redshift to reduce costs. Data analysts should also be able to run analytics queries that effectively combine live transactional data in PostgreSQL, current data in Redshift, and archived historical data.
Which combination of tasks will meet these requirements? (Choose three.)

A. Configure the Amazon Redshift Federated Query feature to query live transactional data in the PostgreSQL database.
B. Configure Amazon Redshift Spectrum to query live transactional data in the PostgreSQL database.
C. Schedule a monthly job to copy data older than 12 months to Amazon S3 by using the UNLOAD command, and then delete that data from the Redshift cluster. Configure Amazon Redshift Spectrum to access historical data in Amazon S3.
D. Schedule a monthly job to copy data older than 12 months to Amazon S3 Glacier Flexible Retrieval by using the UNLOAD command, and then delete that data from the Redshift cluster. Configure Redshift Spectrum to access historical data with S3 Glacier Flexible Retrieval.
E. Create a late-binding view in Amazon Redshift that combines live, current, and historical data from different sources.
F. Create a materialized view in Amazon Redshift that combines live, current, and historical data from different sources.

Correct Answer: ACE
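
A compact sketch of how answers A, C, and E fit together, driven through the Redshift Data API with boto3. The cluster, database, secret, IAM role, bucket, and table names are placeholders, and the federated (apg_live) and Spectrum (spectrum_archive) external schemas are assumed to have been created beforehand with CREATE EXTERNAL SCHEMA ... FROM POSTGRES and ... FROM DATA CATALOG.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # Placeholders -- cluster, database, secret, and IAM role are assumptions.
    CLUSTER = "analytics-cluster"
    DATABASE = "dev"
    SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin"
    IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftSpectrumRole"

    def run(sql):
        """Submit one SQL statement through the Redshift Data API."""
        return redshift_data.execute_statement(
            ClusterIdentifier=CLUSTER, Database=DATABASE,
            SecretArn=SECRET_ARN, Sql=sql,
        )

    # (C) Monthly archive job: UNLOAD rows older than 12 months to S3, then
    # delete them from the cluster. Redshift Spectrum reads the Parquet files.
    run(f"""
        UNLOAD ('SELECT * FROM public.orders
                 WHERE order_date < dateadd(month, -12, current_date)')
        TO 's3://example-archive-bucket/orders/'
        IAM_ROLE '{IAM_ROLE}'
        FORMAT AS PARQUET;
    """)
    run("DELETE FROM public.orders "
        "WHERE order_date < dateadd(month, -12, current_date);")

    # (A) + (C) + (E) Late-binding view that unions live, current, and archived
    # rows; the federated and Spectrum schemas are assumed to already exist.
    run("""
        CREATE VIEW public.orders_all AS
            SELECT * FROM apg_live.orders          -- federated (live) data
            UNION ALL
            SELECT * FROM public.orders            -- last 12 months in Redshift
            UNION ALL
            SELECT * FROM spectrum_archive.orders  -- archived data in S3
        WITH NO SCHEMA BINDING;
    """)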

AWS Certified Data Analytics – Specialty DAS-C01 – Question124

A company that produces network devices has millions of users. Data is collected from the devices on an hourly basis and stored in an Amazon S3 data lake.
The company runs analyses on the last 24 hours of data flow logs for abnormality detection and to troubleshoot and resolve user issues. The company also analyzes historical logs dating back 2 years to discover patterns and look for improvement opportunities.
The data flow logs contain many metrics, such as date, timestamp, source IP, and target IP. There are about 10 billion events every day.
How should this data be stored for optimal performance?

A. In Apache ORC partitioned by date and sorted by source IP
B. In compressed .csv partitioned by date and sorted by source IP
C. In Apache Parquet partitioned by source IP and sorted by date
D. In compressed nested JSON partitioned by source IP and sorted by date

Correct Answer: A
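
ORC (like Parquet) is columnar, so queries read only the needed metrics; partitioning by date prunes both the 24-hour and the 2-year scans to the relevant days, and sorting by source IP within each partition helps predicate filtering on that column. A minimal PySpark sketch of writing the logs that way, assuming placeholder S3 paths and a dt date column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flow-log-writer").getOrCreate()

    # Assumption: raw hourly flow logs land under this prefix with at least
    # these columns: dt (date), ts (timestamp), source_ip, target_ip, ...
    raw = spark.read.json("s3://example-flow-logs/raw/")

    (
        raw
        .repartition("dt")                    # group work by date partition
        .sortWithinPartitions("source_ip")    # files sorted by source IP
        .write
        .mode("append")
        .partitionBy("dt")                    # date partitions prune 24-hour
                                              # and 2-year scans to the days needed
        .orc("s3://example-flow-logs/curated/")
    )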

AWS Certified Data Analytics – Specialty DAS-C01 – Question123

A data analytics specialist is building an automated ETL ingestion pipeline using AWS Glue to ingest compressed files that have been uploaded to an Amazon S3 bucket. The ingestion pipeline should support incremental data processing.
Which AWS Glue feature should the data analytics specialist use to meet this requirement?

A. Workflows
B. Triggers
C. Job bookmarks
D. Classifiers

Correct Answer: C
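
Job bookmarks are what give a Glue job incremental behavior: Glue records which S3 objects each transformation_ctx has already processed and skips them on the next run. A minimal Glue Spark (Python) script sketch follows; the bucket paths are placeholders, and the job must also be started with --job-bookmark-option job-bookmark-enable.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)   # restores bookmark state for this job

    # transformation_ctx is what Glue keys the bookmark on: on the next run,
    # only S3 objects not yet seen under this context are read.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-ingest-bucket/compressed/"]},
        format="json",   # gzip-compressed objects are decompressed automatically
        transformation_ctx="incremental_source",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://example-curated-bucket/output/"},
        format="parquet",
        transformation_ctx="incremental_sink",
    )

    job.commit()   # persists the new bookmark so the next run skips these files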

AWS Certified Data Analytics – Specialty DAS-C01 – Question122

An online food delivery company wants to optimize its storage costs. The company has been collecting operational data for the last 10 years in a data lake that was built on Amazon S3 by using a Standard storage class. The company does not keep data that is older than 7 years. The data analytics team frequently uses data from the past 6 months for reporting and runs queries on data from the last 2 years about once a month. Data that is more than 2 years old is rarely accessed and is only used for audit purposes.
Which combination of solutions will optimize the company's storage costs? (Choose two.)

A. Create an S3 Lifecycle configuration rule to transition data that is older than 6 months to the S3 Standard-Infrequent Access (S3 Standard-IA) storage class. Create another S3 Lifecycle configuration rule to transition data that is older than 2 years to the S3 Glacier Deep Archive storage class.
B. Create an S3 Lifecycle configuration rule to transition data that is older than 6 months to the S3 One Zone-Infrequent Access (S3 One Zone-IA) storage class. Create another S3 Lifecycle configuration rule to transition data that is older than 2 years to the S3 Glacier Flexible Retrieval storage class.
C. Use the S3 Intelligent-Tiering storage class to store data instead of the S3 Standard storage class.
D. Create an S3 Lifecycle expiration rule to delete data that is older than 7 years.
E. Create an S3 Lifecycle configuration rule to transition data that is older than 7 years to the S3 Glacier Deep Archive storage class.

Correct Answer: AD
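
The lifecycle policy described by answers A and D can be expressed as a single rule. A hedged boto3 sketch, with a placeholder bucket name and the 6-month/2-year/7-year boundaries expressed in days:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-operational-data-lake",   # placeholder bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-and-expire-operational-data",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},   # apply to the whole bucket
                    "Transitions": [
                        # Reporting only touches the last 6 months, so older
                        # objects move to Standard-IA ...
                        {"Days": 180, "StorageClass": "STANDARD_IA"},
                        # ... and data older than 2 years (audit only) moves
                        # to Glacier Deep Archive.
                        {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                    # The company keeps nothing older than 7 years.
                    "Expiration": {"Days": 2555},
                }
            ]
        },
    )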

AWS Certified Data Analytics – Specialty DAS-C01 – Question121

A financial company uses Amazon Athena to query data from an Amazon S3 data lake. Files are stored in the S3 data lake in Apache ORC format. Data analysts recently introduced nested fields in the data lake ORC files and noticed that queries are taking longer to run in Athena. A data analyst discovered that more data than what is required is being scanned for the queries.
What is the MOST operationally efficient solution to improve query performance?

A. Flatten nested data and create separate files for each nested dataset.
B. Use the Athena query engine V2 and push the query filter to the source ORC file.
C. Use Apache Parquet format instead of ORC format.
D. Recreate the data partition strategy and further narrow down the data filter criteria.

Correct Answer: B
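
Option B is a configuration change rather than a data rewrite: Athena engine version 2 added support for reading nested schemas selectively, which reduces the data scanned for queries on nested fields. A boto3 sketch of pinning a workgroup to engine version 2; the workgroup name is a placeholder.

    import boto3

    athena = boto3.client("athena")

    # Pin the analysts' workgroup to Athena engine version 2 so queries on the
    # nested ORC fields benefit from reduced scanning. Workgroup name is assumed.
    athena.update_work_group(
        WorkGroup="data-analysts",
        ConfigurationUpdates={
            "EngineVersion": {
                "SelectedEngineVersion": "Athena engine version 2",
            }
        },
    )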