AWS Certified Data Analytics – Specialty DAS-C01 – Question110

A marketing company is storing its campaign response data in Amazon S3. A consistent set of sources has generated the data for each campaign. The data is saved into Amazon S3 as .csv files. A business analyst will use Amazon Athena to analyze each campaign's data. The company needs the cost of ongoing data analysis with Athena to be minimized.
Which combination of actions should a data analytics specialist take to meet these requirements? (Choose two.)

A. Convert the .csv files to Apache Parquet.
B. Convert the .csv files to Apache Avro.
C. Partition the data by campaign.
D. Partition the data by source.
E. Compress the .csv files.

Correct Answer: AC

Explanation:

Reference: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

AWS Certified Data Analytics – Specialty DAS-C01 – Question109

A company wants to improve the data load time of a sales data dashboard. Data has been collected as .csv files and stored within an Amazon S3 bucket that is partitioned by date. The data is then loaded to an Amazon Redshift data warehouse for frequent analysis. The data volume is up to 500 GB per day.
Which solution will improve the data loading performance?

A. Compress .csv files and use an INSERT statement to ingest data into Amazon Redshift.
B. Split large csv files, then use a COPY command to load data into Amazon Redshift.
C. Use Amazon Kinesis Data Firehose to ingest data into Amazon Redshift.
D. Load the .csv files in an unsorted key order and vacuum the table in Amazon Redshift.

Correct Answer: B

Explanation:

Reference: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-multiple-files.html

AWS Certified Data Analytics – Specialty DAS-C01 – Question108

A company wants to run analytics on its Elastic Load Balancing logs stored in Amazon S3. A data analyst needs to be able to query all data from a desired year, month, or day. The data analyst should also be able to query a subset of the columns. The company requires minimal operational overhead and the most cost- effective solution.
Which approach meets these requirements for optimizing and querying the log data?

A. Use an AWS Glue job nightly to transform new log files into .csv format and partition by year, month, and day. Use AWS Glue crawlers to detect new partitions. Use Amazon Athena to query data.
B. Launch a long-running Amazon EMR cluster that continuously transforms new log files from Amazon S3 into its Hadoop Distributed File System (HDFS) storage and partitions by year, month, and day. Use Apache Presto to query the optimized format.
C. Launch a transient Amazon EMR cluster nightly to transform new log files into Apache ORC format and partition by year, month, and day. Use Amazon Redshift Spectrum to query the data.
D. Use an AWS Glue job nightly to transform new log files into Apache Parquet format and partition by year, month, and day. Use AWS Glue crawlers to detect new partitions. Use Amazon Athena to query data.

Correct Answer: D

Explanation:

Reference: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

AWS Certified Data Analytics – Specialty DAS-C01 – Question107

A company has developed several AWS Glue jobs to validate and transform its data from Amazon S3 and load it into Amazon RDS for MySQL in batches once every day. The ETL jobs read the S3 data using a DynamicFrame. Currently, the ETL developers are experiencing challenges in processing only the incremental data on every run, as the AWS Glue job processes all the S3 input data on each run.
Which approach would allow the developers to solve the issue with minimal coding effort?

A. Have the ETL jobs read the data from Amazon S3 using a DataFrame.
B. Enable job bookmarks on the AWS Glue jobs.
C. Create custom logic on the ETL jobs to track the processed S3 objects.
D. Have the ETL jobs delete the processed objects or data from Amazon S3 after each run.

Correct Answer: B

Explanation:

Reference: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

AWS Certified Data Analytics – Specialty DAS-C01 – Question106

A manufacturing company has many IoT devices in different facilities across the world. The company is using Amazon Kinesis Data Streams to collect the data from the devices.
The company's operations team has started to observe many WriteThroughputExceeded exceptions. The operations team determines that the reason is the number of records that are being written to certain shards.
The data contains device ID, capture date, measurement type, measurement value, and facility ID The facility ID is used as the partition key.
Which action will resolve this issue?

A. Change the partition key from facility ID to a randomly generated key.
B. Increase the number of shards.
C. Archive the data on the producers' side.
D. Change the partition key from facility ID to capture date.

Correct Answer: B

Explanation:

Reference: https://aws.amazon.com/premiumsupport/knowledge-center/kinesis-data-stream-throttling/

AWS Certified Data Analytics – Specialty DAS-C01 – Question105

A company uses Amazon OpenSearch Service (Amazon Elasticsearch Service) to store and analyze its website clickstream data. The company ingests 1 TB of data daily using Amazon Kinesis Data Firehose and stores one day's worth of data in an Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster.
The company has very slow query performance on the Amazon OpenSearch Service (Amazon Elasticsearch Service) index and occasionally sees errors from Kinesis Data Firehose when attempting to write to the index.
The Amazon OpenSearch Service (Amazon Elasticsearch Service) cluster has 10 nodes running a single index and 3 dedicated master nodes. Each data node has 1.5 TB of Amazon EBS storage attached and the cluster is configured with 1,000 shards. Occasionally, JVMMemoryPressure errors are found in the cluster logs.
Which solution will improve the performance of Amazon OpenSearch Service (Amazon Elasticsearch Service)?

A. Increase the memory of the Amazon OpenSearch Service (Amazon Elasticsearch Service) master nodes.
B. Decrease the number of Amazon OpenSearch Service (Amazon Elasticsearch Service) data nodes.
C. Decrease the number of Amazon OpenSearch Service (Amazon Elasticsearch Service) shards for the index.
D. Increase the number of Amazon OpenSearch Service (Amazon Elasticsearch Service) shards for the index.

Correct Answer: C

Explanation:

Reference: https://www.bluematador.com/docs/troubleshooting/aws-elasticsearch-jvm-pressure#:~:text=Amazon%20recommends%20keeping%20JVM%20pressure,getting%20into%20a%20red%20state

AWS Certified Data Analytics – Specialty DAS-C01 – Question104

A company has several Amazon EC2 instances sitting behind an Application Load Balancer (ALB). The company wants its IT Infrastructure team to analyze the IP addresses coming into the company's ALB. The ALB is configured to store access logs in Amazon S3. The access logs create about 1 TB of data each day, and access to the data will be infrequent. The company needs a solution that is scalable, cost-effective, and has minimal maintenance requirements.
Which solution meets these requirements?

A. Copy the data into Amazon Redshift and query the data.
B. Use Amazon EMR and Apache Hive to query the S3 data.
C. Use Amazon Athena to query the S3 data.
D. Use Amazon Redshift Spectrum to query the S3 data.

Correct Answer: D

Explanation:

Reference: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-query-s3-data.html

AWS Certified Data Analytics – Specialty DAS-C01 – Question103

A healthcare company ingests patient data from multiple data sources and stores it in an Amazon S3 staging bucket. An AWS Glue ETL job transforms the data, which is written to an S3-based data lake to be queried using Amazon Athena. The company wants to match patient records even when the records do not have a common unique identifier.
Which solution meets this requirement?

A. Use Amazon Macie pattern matching as part of the ETL job.
B. Train and use the AWS Glue PySpark filter class in the ETL job.
C. Partition tables and use the ETL job to partition the data on patient name.
D. Train and use the AWS Glue FindMatches ML transform in the ETL job.

Correct Answer: D

Explanation:

Reference: https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html

AWS Certified Data Analytics – Specialty DAS-C01 – Question102

A company has collected more than 100 TB of log files in the last 24 months. The files are stored as raw text in a dedicated Amazon S3 bucket. Each object has a key of the form year-month-day_log_HHmmss.txt where HHmmss represents the time the log file was initially created. A table was created in Amazon Athena that points to the S3 bucket. One-time queries are run against a subset of columns in the table several times an hour.
A data analyst must make changes to reduce the cost of running these queries. Management wants a solution with minimal maintenance overhead.
Which combination of steps should the data analyst take to meet these requirements? (Choose three.)

A. Convert the log files to Apache Avro format.
B. Add a key prefix of the form date=year-month-day/ to the S3 objects to partition the data.
C. Convert the log files to Apache Parquet format.
D. Add a key prefix of the form year-month-day/ to the S3 objects to partition the data.
E. Drop and recreate the table with the PARTITIONED BY clause. Run the ALTER TABLE ADD PARTITION statement.
F. Drop and recreate the table with the PARTITIONED BY clause. Run the MSCK REPAIR TABLE statement.

Correct Answer: BCF

Explanation:

Reference: https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/
https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html

AWS Certified Data Analytics – Specialty DAS-C01 – Question101

A central government organization is collecting events from various internal applications using Amazon Managed Streaming for Apache Kafka (Amazon MSK). The organization has configured a separate Kafka topic for each application to separate the data. For security reasons, the Kafka cluster has been configured to only allow TLS encrypted data and it encrypts the data at rest.
A recent application update showed that one of the applications was configured incorrectly, resulting in writing data to a Kafka topic that belongs to another application. This resulted in multiple errors in the analytics pipeline as data from different applications appeared on the same topic. After this incident, the organization wants to prevent applications from writing to a topic different than the one they should write to.
Which solution meets these requirements with the least amount of effort?

A. Create a different Amazon EC2 security group for each application. Configure each security group to have access to a specific topic in the Amazon MSK cluster. Attach the security group to each application based on the topic that the applications should read and write to.
B. Install Kafka Connect on each application instance and configure each Kafka Connect instance to write to a specific topic only.
C. Use Kafka ACLs and configure read and write permissions for each topic. Use the distinguished name of the clients' TLS certificates as the principal of the ACL.
D. Create a different Amazon EC2 security group for each application. Create an Amazon MSK cluster and Kafka topic for each application. Configure each security group to have access to the specific cluster.

Correct Answer: C

Explanation:

Reference: https://docs.aws.amazon.com/msk/latest/developerguide/msk-acls.html

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.