capepy.aws package
Submodules
capepy.aws.dynamodb module
- class capepy.aws.dynamodb.CannedReportTable(table_name=None)
Bases: Table
A DynamoDB table with a specific structure for managing CAPE reports.
- get_report(report_id)
Retrieve a specific report entry from the table.
- Parameters:
report_id – The id of the report.
- Returns:
The retrieved report item.
- class capepy.aws.dynamodb.CrawlerTable(table_name=None)
Bases: Table
A DynamoDB table with a specific structure for organizing Glue Crawlers.
- get_crawler(bucket_name)
Retrieve a specific crawler from the table.
- Parameters:
bucket_name – The name of the bucket.
- Returns:
The retrieved crawler item.
- class capepy.aws.dynamodb.EtlTable(table_name=None)
Bases: Table
A DynamoDB table with a specific structure for organizing ETL jobs.
- get_etls(bucket_name, prefix)
Retrieve the ETL jobs from the table that match a bucket name and prefix.
- Parameters:
bucket_name – The name of the S3 bucket to get ETL jobs for.
prefix – The required prefix to check for ETL jobs.
- Returns:
The ETL Jobs triggered by the given bucket name and prefix.
- class capepy.aws.dynamodb.PipelineTable(table_name=None)
Bases: Table
A DynamoDB table with a specific structure for organizing analysis pipelines.
- NAME_VER_GSI = 'PipelineNameVerIndex'
- get_pipeline(pipeline_id)
Retrieve a specific pipeline from the table.
- Parameters:
pipeline_id – The id of the pipeline.
- Returns:
The retrieved pipeline item.
- get_pipelines_by_name(pipeline_name, pipeline_version=None)
Retrieve pipelines from the table matching name and optional version.
- Parameters:
pipeline_name – The name of the pipeline.
pipeline_version – The optional version of the pipeline.
- Returns:
The retrieved pipeline item(s) in a list (which will be empty in the case of no match) or None on error.
- class capepy.aws.dynamodb.Table(table_name)
Bases: Boto3Object
An object for working with specific DynamoDB Table structures in the CAPE system.
- name
The name of the table
- table
The table object retrieved with the boto3 dynamodb resource
- get_item(key)
Retrieve an item from the loaded table.
- Parameters:
key – The key of the entry to retrieve from the table.
- Returns:
The item value from the DynamoDB table.
- query_items(index_name, key_cond_exp)
Retrieve items from the loaded table using a secondary index.
- Parameters:
index_name – The name of the index to query.
key_cond_exp – The key condition expression to use in the query.
- Returns:
The items from the DynamoDB table. Valid query results come back as a list (an empty list when nothing matches). None is returned if there is an error.
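Because the documented return value separates "no match" (an empty list) from "error" (None), a plain truthiness check on the result would merge the two cases. The sketch below shows one way a caller might keep them apart. `summarize_query_result` is a hypothetical helper, not part of capepy, and the sample values stand in for whatever `query_items` returned.

```python
def summarize_query_result(items):
    """Classify a query result using the documented return convention.

    Hypothetical helper for illustration; not part of capepy.
    """
    if items is None:
        # None signals that the query itself failed.
        return "error"
    if not items:
        # An empty list is a valid result with no matches.
        return "no match"
    return f"{len(items)} item(s)"


# Stand-in values for the three possible outcomes.
outcomes = [
    summarize_query_result(None),
    summarize_query_result([]),
    summarize_query_result([{"pipeline_name": "example"}]),
]
```

The same check applies to `get_pipelines_by_name`, which documents the same list-or-None convention.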
- class capepy.aws.dynamodb.UserTable(table_name=None)
Bases: Table
A DynamoDB table with a specific structure for managing CAPE users.
- get_user(user_id)
Retrieve a specific user from the table.
- Parameters:
user_id – The id of the user.
- Returns:
The retrieved user attributes.
- class capepy.aws.dynamodb.WorkflowMetaTable(table_name=None)
Bases: Table
A DynamoDB table with a specific structure for Airflow workflow metadata.
This table contains a mapping of workflow DAG IDs to Pipeline IDs for pipelines used in the workflow.
- get_workflow_by_id(dag_id)
Retrieve a specific workflow entry from the table.
- Parameters:
dag_id – The id of the workflow.
- Returns:
The retrieved workflow item.
capepy.aws.glue module
- class capepy.aws.glue.EtlJob
Bases: Boto3Object
An object for creating ETL Jobs for use in AWS Glue
- spark_ctx
The PySpark session
- glue_ctx
The AWS Glue context
- logger
The logger for logging to AWS Glue
- parameters
A dictionary of parameters passed into the job
- get_src_file()
Retrieve the source file from S3 and return its contents as a byte string.
- Raises:
Exception – If the source file is unable to be successfully retrieved from S3.
- Returns:
A byte string of the source file contents
- write_sink_file(sink_data, sink_key=None)
Write data to the sink data file inside the sink S3 bucket as configured by the Glue ETL job.
- Parameters:
sink_data (bytes or seekable file-like object) – Object data to be written to S3.
sink_key (Optional[str]) – The prefix and filename for the new sink data file within the sink S3 bucket.
- Raises:
Exception – If the sink data file is unable to be successfully put into s3.
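Since `sink_data` may be either bytes or a seekable file-like object, the sketch below prepares the same CSV payload in both forms using only the standard library. The rows and variable names are made up for illustration, and the call to `write_sink_file` is left out because it needs a running Glue job.

```python
import csv
import io

# Made-up rows standing in for transformed ETL output.
rows = [["sample_id", "value"], ["s1", "1.5"], ["s2", "2.0"]]

# Build the CSV text in memory with explicit "\n" line endings.
text_buf = io.StringIO()
csv.writer(text_buf, lineterminator="\n").writerows(rows)

# Form 1: a plain bytes object.
as_bytes = text_buf.getvalue().encode("utf-8")

# Form 2: a seekable, binary file-like object wrapping the same data.
as_file = io.BytesIO(as_bytes)
```

Either `as_bytes` or `as_file` matches the parameter types documented above.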
capepy.aws.lambda_ module
- class capepy.aws.lambda_.BucketNotificationRecord(record)
Bases: Record
An object for S3 bucket notification related records passed into AWS Lambda handlers.
- bucket
The name of the bucket
- key
The key into the bucket if relevant
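For orientation, the following sketch shows where the bucket name and object key sit in a standard S3 event notification record. It uses only the standard library. The `extract_bucket_and_key` helper and the sample record are hypothetical and not part of capepy, and `BucketNotificationRecord` may parse the record differently.

```python
from urllib.parse import unquote_plus


def extract_bucket_and_key(record):
    """Pull the bucket name and object key out of one S3 notification record.

    Hypothetical helper for illustration; not part of capepy.
    """
    s3_info = record["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(s3_info["object"]["key"])
    return bucket, key


# A trimmed-down sample record (values are made up).
sample_record = {
    "eventSource": "aws:s3",
    "s3": {
        "bucket": {"name": "example-raw-bucket"},
        "object": {"key": "incoming/run+01/data.csv"},
    },
}

bucket, key = extract_bucket_and_key(sample_record)
```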
- class capepy.aws.lambda_.EtlRecord(record)
Bases: QueueRecord
An object for ETL related records passed into AWS Lambda handlers.
- job
The name of the ETL Job
- bucket
The name of the bucket
- key
The key into the bucket if relevant
- class capepy.aws.lambda_.PipelineRecord(record)
Bases: QueueRecord
An object for pipeline records passed into AWS Lambda handlers.
- name
The name of the analysis pipeline
- version
The version of the analysis pipeline
- parameters
A dictionary of parameters to pass to the analysis pipeline
capepy.aws.meta module
- class capepy.aws.meta.Boto3Object
Bases: object
Contains general resources needed by all AWS utilities for interacting with the boto3 library
- logger
The logger for logging to AWS Glue
- clients
A dictionary of instantiated AWS clients indexed by the name of the AWS service.
- resources
A dictionary of instantiated AWS resources indexed by the name of the AWS service.
- get_client(service_name, **kwargs)
Get a client for the provided service. If one hasn’t been created yet, a new client is set first.
- Parameters:
service_name (str) – The name of the service to retrieve a client for.
**kwargs – Optional keyword arguments passed to boto3.client() if a new client needs to be set.
- Returns:
The boto3 client.
- get_resource(service_name, **kwargs)
Get a resource for the provided service. If one hasn’t been created yet, a new resource is set first.
- Parameters:
service_name (str) – The name of the service to retrieve a resource for.
**kwargs – Optional keyword arguments passed to boto3.resource() if a new resource needs to be set.
- Returns:
The boto3 resource.
- set_client(service_name, **kwargs)
Set a new client for the provided service.
- Parameters:
service_name (str) – The name of the service to instantiate a new client for.
**kwargs – Optional keyword arguments passed to boto3.client().
- set_resource(service_name, **kwargs)
Set a new resource for the provided service.
- Parameters:
service_name (str) – The name of the service to instantiate a new resource for.
**kwargs – Optional keyword arguments passed to boto3.resource().
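The get/set pairs above describe a get-or-create cache keyed by service name. The sketch below illustrates that pattern in isolation, as an assumption about the general shape and not a copy of the capepy implementation. `CachedClients` and `fake_factory` are hypothetical; the factory stands in for `boto3.client` so the example runs without AWS credentials.

```python
class CachedClients:
    """Get-or-create cache keyed by service name (illustrative only)."""

    def __init__(self, factory):
        # `factory` stands in for boto3.client so the sketch runs offline.
        self._factory = factory
        self.clients = {}

    def set_client(self, service_name, **kwargs):
        # Always build a new client and overwrite any cached one.
        self.clients[service_name] = self._factory(service_name, **kwargs)

    def get_client(self, service_name, **kwargs):
        # kwargs are only used when no client is cached yet.
        if service_name not in self.clients:
            self.set_client(service_name, **kwargs)
        return self.clients[service_name]


calls = []


def fake_factory(service_name, **kwargs):
    """Record each construction and return a unique placeholder object."""
    calls.append((service_name, kwargs))
    return object()


cache = CachedClients(fake_factory)
first = cache.get_client("s3", region_name="us-east-1")
second = cache.get_client("s3")  # served from the cache, factory not called
```

One consequence of this design is that keyword arguments passed to a later `get_client` call have no effect once a client is cached; calling `set_client` is the way to replace it.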
capepy.aws.utils module
- capepy.aws.utils.bad_param_response(bad_params)
Gets a response data object and status code when bad params are given.
- Returns:
A tuple containing a response data object and an HTTP 400 status code.
- capepy.aws.utils.decode_error(err)
Decode a client error message from AWS
- Parameters:
err (ClientError) – The ClientError to parse out the error code and message if they are available.
- Returns:
A tuple (code, message) where code is a string containing the error code, and message is a string containing the entire error message.
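A botocore `ClientError` carries its details in a `response` dictionary, with the code and message under `response["Error"]`. The sketch below shows one way to pull out a `(code, message)` tuple from such an object. `decode_error_sketch` and `FakeClientError` are hypothetical stand-ins so the example runs without botocore installed, and the fallback values (`"Unknown"` and `str(err)`) are assumptions, not documented capepy behavior.

```python
class FakeClientError(Exception):
    """Stand-in with the same `response` attribute as botocore's ClientError."""

    def __init__(self, response):
        super().__init__(response)
        self.response = response


def decode_error_sketch(err):
    """Return (code, message), falling back when fields are missing.

    Hypothetical illustration; the fallbacks are assumptions.
    """
    error = getattr(err, "response", {}).get("Error", {})
    code = error.get("Code", "Unknown")
    message = error.get("Message", str(err))
    return code, message


err = FakeClientError(
    {"Error": {"Code": "ResourceNotFoundException",
               "Message": "Requested resource not found"}}
)
decoded = decode_error_sketch(err)
```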
- capepy.aws.utils.json_serialize_the_unserializable(val)
Serialize a value (e.g. Decimal) that is otherwise not json serializable.
Right now this just handles Decimal, but can be updated as needed.
The json library cannot serialize Decimal values, and floating point values coming back from dynamo are Decimal. So this shims them to floats.
- Parameters:
val – The value to serialize.
- Returns:
The serialized value.
- Raises:
TypeError – If even this function cannot serialize the value.
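The description above matches the shape of a `default=` hook for `json.dumps`. The sketch below shows that technique with the standard library only. `decimal_default` is a hypothetical stand-in written for this example; whether the capepy function is meant to be passed as `default=` in exactly this way is an assumption.

```python
import json
from decimal import Decimal


def decimal_default(val):
    """Fallback serializer for json.dumps: convert Decimal to float.

    Hypothetical stand-in for illustration; not the capepy function.
    """
    if isinstance(val, Decimal):
        return float(val)
    # Anything else is still unserializable, so signal that to json.
    raise TypeError(
        f"Object of type {type(val).__name__} is not JSON serializable"
    )


# Numbers read back from DynamoDB arrive as Decimal values.
item = {"id": "abc", "score": Decimal("0.5"), "count": Decimal("3")}
encoded = json.dumps(item, default=decimal_default)
```

Note that converting to float turns whole-number Decimals such as `Decimal("3")` into `3.0`, and very precise values may lose digits.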