AWS Certified Data Analytics – Specialty DAS-C01 – Question040

A gaming company is building a serverless data lake. The company is ingesting streaming data into Amazon Kinesis Data Streams and is writing the data to Amazon S3 through Amazon Kinesis Data Firehose. The company is using 10 MB as the S3 buffer size and is using 90 seconds as the buffer interval. The company runs an AWS Glue ETL job to merge and transform the data to a different format before writing the data back to Amazon S3.
Recently, the company has experienced substantial growth in its data volume. The AWS Glue ETL jobs are frequently failing with an OutOfMemoryError.
Which solutions will resolve this issue without incurring additional costs? (Choose two.)

A. Place the small files into one S3 folder. Define one single table for the small S3 files in AWS Glue Data Catalog. Rerun the AWS Glue ETL jobs against this AWS Glue table.
B. Create an AWS Lambda function to merge the small S3 files and invoke it periodically. Run the AWS Glue ETL jobs after successful completion of the Lambda function.
C. Run the S3DistCp utility in Amazon EMR to merge a large number of small S3 files before running the AWS Glue ETL jobs.
D. Use the groupFiles setting in the AWS Glue ETL job to merge small S3 files and rerun AWS Glue ETL jobs.
E. Update the Kinesis Data Firehose S3 buffer size to 128 MB. Update the buffer interval to 900 seconds.
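
For context on how options D and E operate: the Glue groupFiles/groupSize read options coalesce many small S3 objects into larger in-memory groups so the job does not track every tiny file, and the Firehose buffering hints control how large each object delivered to S3 becomes. A minimal sketch in Python; the bucket path, stream name, and group size are placeholders, not values from the question:

import boto3
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Option D (inside the Glue job script): group small files at read time.
glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-game-events/raw/"],   # placeholder path
        "groupFiles": "inPartition",
        "groupSize": "134217728",                     # ~128 MB per read group
    },
    format="json",
)

# Option E (run separately as an admin change): raise the Firehose S3 buffer.
firehose = boto3.client("firehose")
desc = firehose.describe_delivery_stream(
    DeliveryStreamName="example-game-events"          # placeholder stream name
)["DeliveryStreamDescription"]
firehose.update_destination(
    DeliveryStreamName="example-game-events",
    CurrentDeliveryStreamVersionId=desc["VersionId"],
    DestinationId=desc["Destinations"][0]["DestinationId"],
    ExtendedS3DestinationUpdate={
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900}
    },
)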

AWS Certified Data Analytics – Specialty DAS-C01 – Question039

A company uses Amazon Redshift to store its data. The reporting team runs ad-hoc queries to generate reports from the Amazon Redshift database. The reporting team recently started to experience inconsistencies in report generation. Ad-hoc queries that are used to generate reports, and that typically take minutes to run, can now take hours. A data analytics specialist debugging the issue finds that the ad-hoc queries are stuck in the queue behind long-running queries.
How should the data analytics specialist resolve the issue?

A. Create partitions in the tables queried in ad-hoc queries.
B. Configure automatic workload management (WLM) from the Amazon Redshift console.
C. Create Amazon Simple Queue Service (Amazon SQS) queues with different priorities. Assign queries to a queue based on priority.
D. Run the VACUUM command for all tables in the database.

Correct Answer: B

Explanation:
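
Automatic workload management lets Amazon Redshift manage queue concurrency and memory itself, and short query acceleration routes short ad-hoc queries past long-running ones, so reports stop queuing behind heavy queries. SQS queues (option C) are unrelated to Redshift query scheduling, Redshift tables are not partitioned (option A), and VACUUM (option D) does not address queuing. The console change in option B can also be applied programmatically; a minimal boto3 sketch, assuming a placeholder parameter group name:

import json
import boto3

redshift = boto3.client("redshift")

# Switch the cluster's parameter group to automatic WLM and enable
# short query acceleration so short ad-hoc queries bypass long ones.
wlm_config = [
    {"auto_wlm": True},
    {"short_query_queue": True},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="example-reporting-pg",        # placeholder name
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
# The associated cluster picks up the new WLM configuration after the
# parameter change is applied (a reboot may be required).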

AWS Certified Data Analytics – Specialty DAS-C01 – Question038

A data architect is building an Amazon S3 data lake for a bank. The goal is to provide a single data repository for customer data needs, such as personalized recommendations. The bank uses Amazon Kinesis Data Firehose to ingest customers' personal information, bank accounts, and transactions in near-real time from a transactional relational database. The bank requires all personally identifiable information (PII) that is stored in the AWS Cloud to be masked.
Which solution will meet these requirements?

A. Invoke an AWS Lambda function from Kinesis Data Firehose to mask PII before delivering the data into Amazon S3.
B. Use Amazon Macie, and configure it to discover and mask PII.
C. Enable server-side encryption (SSE) in Amazon S3.
D. Invoke Amazon Comprehend from Kinesis Data Firehose to detect and mask PII before delivering the data into Amazon S3.
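
For context on option A, Kinesis Data Firehose supports data transformation by invoking a Lambda function on each buffered batch before delivery to S3. A minimal sketch of such a masking function; the record layout and the field names (ssn, account_number, email) are assumptions for illustration:

import base64
import json

# Field names are illustrative; real PII fields depend on the source schema.
PII_FIELDS = {"ssn", "account_number", "email"}

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Mask any PII fields present in the record before it reaches S3.
        for field in PII_FIELDS.intersection(payload):
            payload[field] = "****MASKED****"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # 'Ok' | 'Dropped' | 'ProcessingFailed'
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}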

AWS Certified Data Analytics – Specialty DAS-C01 – Question037

A marketing company collects data from third-party providers and uses transient Amazon EMR clusters to process this data. The company wants to host an Apache Hive metastore that is persistent, reliable, and can be accessed by EMR clusters and multiple AWS services and accounts simultaneously. The metastore must also be available at all times.
Which solution meets these requirements with the LEAST operational overhead?

A. Use AWS Glue Data Catalog as the metastore
B. Use an external Amazon EC2 instance running MySQL as the metastore
C. Use Amazon RDS for MySQL as the metastore
D. Use Amazon S3 as the metastore
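
For context on option A, an EMR cluster can use the AWS Glue Data Catalog as its external Hive metastore through a cluster configuration classification, so no database server has to be provisioned or kept running. A minimal boto3 sketch; the cluster name, release label, instance types, and roles are placeholders:

import boto3

emr = boto3.client("emr")

# Point Hive (and Spark SQL) at the Glue Data Catalog, which is serverless,
# always available, and shareable across clusters, services, and accounts.
glue_metastore = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
]

emr.run_job_flow(
    Name="transient-processing-cluster",       # placeholder
    ReleaseLabel="emr-6.10.0",                  # placeholder release
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=glue_metastore,
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # transient cluster
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)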

AWS Certified Data Analytics – Specialty DAS-C01 – Question036

A company is sending historical datasets to Amazon S3 for storage. A data engineer at the company wants to make these datasets available for analysis using Amazon Athena. The engineer also wants to encrypt the Athena query results in an S3 results location by using AWS solutions for encryption. The requirements for encrypting the query results are as follows:
– Use custom keys for encryption of the primary dataset query results.
– Use generic encryption for all other query results.
– Provide an audit trail for the primary dataset queries that shows when the keys were used and by whom.
Which solution meets these requirements?

A. Use server-side encryption with S3 managed encryption keys (SSE-S3) for the primary dataset. Use SSE-S3 for the other datasets.
B. Use server-side encryption with customer-provided encryption keys (SSE-C) for the primary dataset. Use server-side encryption with S3 managed encryption keys (SSE-S3) for the other datasets.
C. Use server-side encryption with AWS KMS managed customer master keys (SSE-KMS CMKs) for the primary dataset. Use server-side encryption with S3 managed encryption keys (SSE-S3) for the other datasets.
D. Use client-side encryption with AWS Key Management Service (AWS KMS) customer managed keys for the primary dataset. Use S3 client-side encryption with client-side keys for the other datasets.
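
For context on option C, Athena query-result encryption is set through the result configuration (per query or per workgroup): SSE-KMS uses an AWS KMS customer master key, and every use of that key is recorded in AWS CloudTrail, which provides the required audit trail, while SSE-S3 provides generic encryption with S3-managed keys. A minimal boto3 sketch; the database, buckets, and key ARN are placeholders:

import boto3

athena = boto3.client("athena")

# Primary dataset: SSE-KMS with a customer managed CMK, so key usage
# (who and when) is logged in AWS CloudTrail.
athena.start_query_execution(
    QueryString="SELECT * FROM primary_dataset LIMIT 10",
    QueryExecutionContext={"Database": "analytics_db"},             # placeholder
    ResultConfiguration={
        "OutputLocation": "s3://example-results/primary/",          # placeholder
        "EncryptionConfiguration": {
            "EncryptionOption": "SSE_KMS",
            "KmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # placeholder
        },
    },
)

# All other datasets: generic SSE-S3 encryption of the results.
athena.start_query_execution(
    QueryString="SELECT * FROM other_dataset LIMIT 10",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={
        "OutputLocation": "s3://example-results/other/",
        "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
    },
)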

AWS Certified Data Analytics – Specialty DAS-C01 – Question035

A manufacturing company uses Amazon Connect to manage its contact center and Salesforce to manage its customer relationship management (CRM) data. The data engineering team must build a pipeline to ingest data from the contact center and CRM system into a data lake that is built on Amazon S3.
What is the MOST efficient way to collect data in the data lake with the LEAST operational overhead?

A. Use Amazon Kinesis Data Streams to ingest Amazon Connect data and Amazon AppFlow to ingest Salesforce data.
B. Use Amazon Kinesis Data Firehose to ingest Amazon Connect data and Amazon Kinesis Data Streams to ingest Salesforce data.
C. Use Amazon Kinesis Data Firehose to ingest Amazon Connect data and Amazon AppFlow to ingest Salesforce data.
D. Use Amazon AppFlow to ingest Amazon Connect data and Amazon Kinesis Data Firehose to ingest Salesforce data.
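
For context on option C, Amazon Connect can stream contact trace records to a Kinesis Data Firehose delivery stream natively, and Amazon AppFlow provides a managed Salesforce-to-S3 flow, so neither side needs custom ingestion code. A minimal boto3 sketch of the Connect half; the instance ID and Firehose ARN are placeholders, and the Salesforce side would be an AppFlow flow built on a Salesforce connector profile:

import boto3

connect = boto3.client("connect")

# Deliver Amazon Connect contact trace records (CTRs) to an existing
# Kinesis Data Firehose delivery stream that lands the data in S3.
connect.associate_instance_storage_config(
    InstanceId="11111111-2222-3333-4444-555555555555",   # placeholder instance ID
    ResourceType="CONTACT_TRACE_RECORDS",
    StorageConfig={
        "StorageType": "KINESIS_FIREHOSE",
        "KinesisFirehoseConfig": {
            "FirehoseArn": "arn:aws:firehose:us-east-1:111122223333:deliverystream/example-connect-ctrs"  # placeholder
        },
    },
)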

AWS Certified Data Analytics – Specialty DAS-C01 – Question034

A company uses the Amazon Kinesis SDK to write data to Kinesis Data Streams. Compliance requirements state that the data must be encrypted at rest using a key that can be rotated. The company wants to meet this encryption requirement with minimal coding effort.
How can these requirements be met?

A. Create a customer master key (CMK) in AWS KMS. Assign the CMK an alias. Use the AWS Encryption SDK, providing it with the key alias to encrypt and decrypt the data.
B. Create a customer master key (CMK) in AWS KMS. Assign the CMK an alias. Enable server-side encryption on the Kinesis data stream using the CMK alias as the KMS master key.
C. Create a customer master key (CMK) in AWS KMS. Create an AWS Lambda function to encrypt and decrypt the data. Set the KMS key ID in the function's environment variables.
D. Enable server-side encryption on the Kinesis data stream using the default KMS key for Kinesis Data Streams.

Correct Answer: B

Explanation:
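
Server-side encryption on a Kinesis data stream encrypts the data at rest with an AWS KMS customer master key; referencing the CMK by alias means the key can be rotated without touching producer or consumer code, so the existing Kinesis SDK writers keep working unchanged. A minimal boto3 sketch, assuming placeholder alias and stream names:

import boto3

kms = boto3.client("kms")
kinesis = boto3.client("kinesis")

# Create a CMK, give it an alias, and enable automatic key rotation.
key_id = kms.create_key(Description="Kinesis stream encryption key")["KeyMetadata"]["KeyId"]
kms.create_alias(AliasName="alias/example-kinesis-cmk", TargetKeyId=key_id)  # placeholder alias
kms.enable_key_rotation(KeyId=key_id)

# Enable server-side encryption on the stream using the CMK alias.
# Producers that use the Kinesis SDK need no code changes.
kinesis.start_stream_encryption(
    StreamName="example-stream",          # placeholder stream name
    EncryptionType="KMS",
    KeyId="alias/example-kinesis-cmk",
)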

AWS Certified Data Analytics – Specialty DAS-C01 – Question033

A data analytics specialist is creating a solution that uses AWS Glue ETL jobs to process .csv and .json files as they arrive in Amazon S3. The data analytics specialist has created separate AWS Glue ETL jobs for processing each file type. The data analytics specialist has also set up an event notification on the S3 bucket for all new object create events. The notification invokes an AWS Lambda function that runs the appropriate AWS Glue ETL job.
The daily number of files is consistent. The files arrive continuously and take 5-10 minutes to process. The data analytics specialist has set up the appropriate permissions for the Lambda function and the AWS Glue ETL jobs to run, but the solution fails in quality testing with the following error:
ConcurrentRunsExceededException
All the files are valid and are in the expected format for processing.
Which set of actions will resolve the error?

A. Create two separate S3 buckets for each file type. Create two separate Lambda functions for the file types and for calls to the corresponding AWS Glue ETL job.
B. Use job bookmarks and turn on continuous logging in each of the AWS Glue ETL job properties.
C. Ensure that the worker type of the AWS Glue ETL job is G.1X or G.2X and that the number of workers is equivalent to the daily number of files to be processed.
D. Increase the maximum number of concurrent runs in the job properties.

Correct Answer: D
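
Explanation:

The Lambda function starts a Glue job run for every new object, and a Glue job allows only one concurrent run by default, so continuously arriving files that each take 5-10 minutes to process quickly trigger ConcurrentRunsExceededException. Raising the job's maximum concurrent runs removes that limit; job bookmarks and continuous logging do not affect concurrency. A minimal boto3 sketch; the job name, role, script location, and worker settings are placeholders, and update_job replaces the whole job definition, so the required fields are supplied along with the change:

import boto3

glue = boto3.client("glue")

# Allow several files to be processed at once by raising the job's
# maximum concurrent runs (the default is 1).
glue.update_job(
    JobName="process-incoming-files",                                # placeholder
    JobUpdate={
        "Role": "arn:aws:iam::111122223333:role/GlueJobRole",        # placeholder
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://example-scripts/process_files.py",  # placeholder
        },
        "ExecutionProperty": {"MaxConcurrentRuns": 20},
        "GlueVersion": "3.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
    },
)
# Repeat the same change for the second file-type job.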

AWS Certified Data Analytics – Specialty DAS-C01 – Question032

A company has 10-15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time query engine. The company wants to transform the data to optimize query runtime and storage costs.
Which option for data format and compression meets these requirements?

A. CSV compressed with zip
B. JSON compressed with bzip2
C. Apache Parquet compressed with Snappy
D. Apache Avro compressed with LZO
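
For context on option C, a columnar format such as Apache Parquet with Snappy compression cuts both the bytes Athena scans per query and the storage footprint, whereas zipped CSV, bzip2 JSON, and LZO Avro remain row-oriented and scan-heavy. One way to convert the existing files is an Athena CTAS statement; a minimal boto3 sketch, with placeholder database, table, and bucket names:

import boto3

athena = boto3.client("athena")

# Convert the raw CSV table to Parquet compressed with Snappy using CTAS.
ctas = """
CREATE TABLE analytics_db.events_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://example-datalake/events-parquet/'
) AS
SELECT * FROM analytics_db.events_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)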

AWS Certified Data Analytics – Specialty DAS-C01 – Question031

A technology company has an application with millions of active users every day. The company queries daily usage data with Amazon Athena to understand how users interact with the application. The data includes the date and time, the location ID, and the services used. The company wants to use Athena to run queries to analyze the data with the lowest latency possible.
Which solution meets these requirements?

A. Store the data in Apache Avro format with the date and time as the partition, with the data sorted by the location ID.
B. Store the data in Apache Parquet format with the date and time as the partition, with the data sorted by the location ID.
C. Store the data in Apache ORC format with the location ID as the partition, with the data sorted by the date and time.
D. Store the data in .csv format with the location ID as the partition, with the data sorted by the date and time.

Correct Answer: B
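
Explanation:

Partitioning on the date and time lets Athena prune partitions for the daily usage queries, the columnar Parquet format minimizes the data scanned, and sorting by location ID within each partition speeds up filters on that column. A minimal PySpark sketch of writing the data that way; the column names and S3 paths are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-to-parquet").getOrCreate()

df = spark.read.json("s3://example-usage/raw/")     # placeholder source path

# Partition by event date, sort by location ID within each partition,
# and write Parquet so Athena can prune partitions and scan fewer columns.
(df.repartition("event_date")                       # assumed date column
   .sortWithinPartitions("location_id")             # assumed location column
   .write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3://example-usage/curated/"))         # placeholder target path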