It has been a while since my last blog post. This was mainly due to a problem we encountered in a customer's CDK application. At the moment I am still working for an enterprise in the financial sector, where I have joined a data analytics team that is building a data platform in AWS. This data platform uses services like DataSync, S3, Glue, EMR, Athena, Lake Formation and Amazon Managed Workflows for Apache Airflow (MWAA).
The complete application is created with CDK in Python. It is leveraging CDK Pipelines to deploy the application to multiple AWS accounts.
With our CDK data platform application ready in the test AWS account, we wanted to deploy to UAT (acceptance) as well. As we are using CDK Pipelines, that simply means adding an extra stage to the pipeline, where each stage represents an AWS account. Because we were still waiting for the DataSync VM to be deployed on premises, we deployed only a subset of stacks to the UAT account. This subset basically consisted of IAM roles, S3 buckets, and Glue crawlers and jobs.
All was fine until a certain moment. Synthesising was not what it used to be. OK, a bit dramatic here, but it felt like that typical Monday morning when everything went south. Did I forget my coffee? Was it a change in operating system packages, so at least we could blame it on someone/something...?
The only message we got was:
Malformed request, "api" field is required
As this error didn't make any sense to us, we started debugging. Let me take you along on our adventure: what we did and how we more or less arrived at a workaround.
As this post describes our debugging and workaround adventure around the CDK synth issue, no prior experience is needed.
So sit back, take a cup of coffee and follow along.
Real World Scenario
As described in the background section, that typical Monday morning ended with us no longer being able to synthesize our CDK application. Normally our flow was: develop locally, run CDK synth, run the tests to make sure everything is OK, create a pull request, have someone review it (four-eyes principle), merge, and let the pipeline do its magic. Local development was done in a Windows VDI with Visual Studio Code. This time, running the cdk synth command locally resulted in a stack trace:
(.venv) PS H:Coderepository> cdk synth
C:Users******AppDataLocalTemp1tmp5uqjk38xlibprogram.js:9764
        throw new Error('Malformed request, "api" field is required');
        ^
Error: Malformed request, "api" field is required
    at KernelHost.processRequest (C:Users******AppDataLocalTemp1tmp5uqjk38xlibprogram.js:9764:27)
    at KernelHost.run (C:Users******AppDataLocalTemp1tmp5uqjk38xlibprogram.js:9732:22)
    at Immediate.<anonymous> (C:Users******AppDataLocalTemp1tmp5uqjk38xlibprogram.js:9733:46)
    at processImmediate (internal/timers.js:464:21)
So, what now? Well, of course you ask Google what that malformed request error means. The first result is a GitHub issue in the CDK project. What a relief, we were not alone with this issue. But reading through the issue, it looked like the investigation there did not apply to our problem. After all, we were not creating more than 500 resources. The biggest stack had only 115 resources, and all stacks combined stayed under 200. So, what to do next?
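A quick way to check resource counts like these is to count the entries under "Resources" in the synthesized CloudFormation templates. The sketch below is only an illustration, not the tooling we actually used. It assumes that a cdk.out directory from an earlier successful synth is still available, and that templates follow the usual *.template.json naming; the function name count_resources is made up for this example.

```python
import json
from pathlib import Path


def count_resources(cdk_out: str) -> dict[str, int]:
    """Return the number of CloudFormation resources per synthesized template.

    Keys are template paths relative to the cdk.out directory, so templates
    in nested stage assemblies do not overwrite each other.
    """
    root = Path(cdk_out)
    counts: dict[str, int] = {}
    # Search recursively: pipeline stages may be written to nested directories.
    for template_path in sorted(root.rglob("*.template.json")):
        template = json.loads(template_path.read_text(encoding="utf-8"))
        key = template_path.relative_to(root).as_posix()
        # A template without a "Resources" section counts as zero resources.
        counts[key] = len(template.get("Resources", {}))
    return counts
```

Calling count_resources("cdk.out") and summing the values gives the total across all stacks, which can then be compared with the numbers mentioned in the GitHub issue.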
We updated the GitHub issue with our part of the story, so feel free to read up on it there. We also created an AWS Support ticket, as the customer has an enterprise contract with AWS.
In the meantime we tried to work around the problem. One way was to minimize our CloudFormation outputs. As we were about to start performance and security testing of a small piece of the application in UAT, we didn't need to deploy the complete application as deployed in DevTest. So the application was only partly deployed to UAT, with just enough resources to not trigger that malformed request error.
But with this partial deployment we were stuck. Adding extra resources would trigger the malformed request error again. So what was it that triggered this error in the first place? We looked through the codebase for loops or other constructs that could be the cause. Was it the extra AWS account that was added? Was it a Node.js version upgrade, or even the CDK version?
We covered all the root cause scenarios we could think of, but none brought satisfaction or pinpointed the cause of the problem. It felt like we were running in circles. As the AWS Support ticket and the GitHub issue were still open, it was time to look into alternatives.
So we made a list of four alternative options to work around our CDK synth issue with its unknown trigger, each with a rationale, pros and cons.
Option 1: Manually Deploy in UAT/PRD
As we have our codebase available, we can look into deploying resources manually in UAT and eventually PRD.
The work to be done would be rewriting the current app.py file and deploying individual stacks to the UAT and PRD accounts ourselves, instead of letting the CDK pipeline do it for us.
The impediment here is that the developer role does not have access to deploy resources in UAT and PRD. We would need to check with the platform team and security whether we can get a waiver for this. But as it has a big impact on security, it doesn't seem like an option.
Option 2: Use the CodeBuild service to deploy manually to UAT/PRD
So basically this is an extension of option 1. The idea is based on how the accounts are set up at the moment. There is a role available in the development account which the CodeBuild service can assume to deploy resources in the application accounts: DevTest, UAT and PRD. At the moment CDK Pipelines orchestrates the deployments to these accounts and keeps them correct. But it is possible to deploy separate stacks from the CodeBuild service as well. So where we now run the synth action within CodeBuild to generate the CloudFormation templates, we can also run a "cdk deploy" to deploy stacks.
Rewrite the CodeBuild buildspec file in the CDK pipeline so that it deploys the stacks manually. Another option is to create multiple CodeBuild projects that only deploy. The pipeline would then look like:
CodeCommit -> CodeBuild (Synthesizing templates) -> CodeBuild DevTest (deploy templates) -> CodeBuild UAT (deploy templates) -> manual approval -> CodeBuild PRD (deploy templates)
Keeping the CodeBuild buildspec files in sync between the accounts will be manual work; at the moment the CDK pipeline takes that burden off us. I also don't know whether we have the rights in place to run "cdk deploy" from CodeBuild and let the CodeBuild service call the CloudFormation service in UAT and Production. This needs to be tested.
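To make the deploy-only CodeBuild step of this option a bit more concrete, here is a minimal sketch that generates such a buildspec as a Python dictionary. Everything in it is an assumption for illustration: the function name deploy_buildspec, the stack names in the usage note, and the install commands. It also does not answer the open question of whether the CodeBuild role is allowed to call CloudFormation in UAT and PRD.

```python
import shlex


def deploy_buildspec(stacks: list[str]) -> dict:
    """Build a CodeBuild buildspec that runs "cdk deploy" for the given stacks."""
    if not stacks:
        raise ValueError("at least one stack name is required")
    # shlex.join quotes stack names that contain shell-special characters.
    deploy_command = shlex.join(
        ["cdk", "deploy", *stacks, "--require-approval", "never"]
    )
    return {
        "version": "0.2",
        "phases": {
            "install": {
                "commands": [
                    "npm install -g aws-cdk",  # CDK CLI (assumed install step)
                    "pip install -r requirements.txt",  # Python dependencies (assumed)
                ]
            },
            # CodeBuild cannot answer prompts, hence "--require-approval never".
            "build": {"commands": [deploy_command]},
        },
    }
```

For example, deploy_buildspec(["IngestionStack", "MonitoringStack"]) produces a build phase with the single command "cdk deploy IngestionStack MonitoringStack --require-approval never". Generating the buildspec from one function would at least limit the manual work of keeping the per-account variants in sync.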
Option 3: Separate CDK pipelines per stack
We currently have one single codebase and one CDK project, and synthesis is stuck on an unknown limit. As we want to limit manual work as much as possible, we can also create multiple CDK pipelines that are each responsible for a single stack instead of multiple stacks. As an example, at the moment our pipeline has stages, where a stage represents an AWS environment (DevTest, UAT, PRD). Within such a stage, one stack per subset of the application is deployed: consuming, ingestion, integration, monitoring, orchestration, onboarding and prerequisite stacks. So for each stage all of these stacks are deployed. As the number of stacks may contribute to the synth issue, a possible solution would be to give each stack its own pipeline.
Create multiple CDK pipelines that are each responsible for only one stack: a CDK pipeline for the onboarding stack, which is deployed to all AWS accounts (DevTest/UAT/PRD), another for the ingestion part, and so on.
Extra resources will be provisioned, which could mean extra costs: a standard CDK pipeline uses KMS keys, S3 buckets, CodePipeline and CodeBuild. It also leaves the CDK synth issue itself unsolved, so the problem could arise again in the future.
Option 4: Rewrite codebase to TypeScript
As we cannot tell exactly why this CDK synth issue happens, it might be a good idea to use CDK's native implementation in TypeScript instead of Python, so the jsii framework is no longer needed. This means an overhaul of our CDK code to a different language. Luckily we have working Python code in place, so we do not have to investigate whether it will work in TypeScript as well. It is only a matter of rewriting the current code in TypeScript.
Rewrite all code to the TypeScript implementation of CDK.
With all the options listed, it was up to the team and the product owner to make a decision. Options 1 and 2 were basically no option at all, so it was between 3 and 4. For option 4 it was difficult to estimate the time needed for the change, so eventually we went with option 3.
The final solution was to chop up the application. To do so we had to create two more constructs as reusable building blocks:
- secure repository construct
- secure pipeline construct
With those constructs in place it was easy to create a secure CDK project (app) that follows the guidelines set by the security team. As we already had all the code in place, the only thing we needed to change was to make more use of Systems Manager (SSM) parameters to pass certain ARNs between CDK applications.
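Passing ARNs through SSM parameters only works when the producing and the consuming CDK application agree on the parameter names. The sketch below shows one possible naming helper; the function parameter_name, the /application/environment/key layout and the example values are assumptions for illustration, not our actual convention. The commented lines indicate where such a name could be used with the aws_ssm module of CDK.

```python
def parameter_name(application: str, environment: str, key: str) -> str:
    """Return the SSM parameter name both CDK apps agree on.

    Example: parameter_name("ingestion", "uat", "glue-role-arn")
    gives "/ingestion/uat/glue-role-arn".
    """
    parts = [application, environment, key]
    for part in parts:
        # Empty segments or embedded slashes would change the hierarchy.
        if not part or "/" in part:
            raise ValueError(f"invalid name segment: {part!r}")
    return "/" + "/".join(part.lower() for part in parts)


# Producing CDK app (assumed usage, inside a Stack):
#   ssm.StringParameter(
#       self, "GlueRoleArn",
#       parameter_name=parameter_name("ingestion", "uat", "glue-role-arn"),
#       string_value=role.role_arn,
#   )
#
# Consuming CDK app (assumed usage, inside a Stack):
#   role_arn = ssm.StringParameter.value_for_string_parameter(
#       self, parameter_name("ingestion", "uat", "glue-role-arn")
#   )
```

Keeping the helper in shared code means a renamed parameter changes in one place for both applications.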
I will write a follow-up blog post on both constructs, as they came with some challenges as well, especially how to share the constructs across all CDK projects. Small hint: we used Git submodules for that. If you are interested, stay tuned for the next post.
What I tried to write down was a day, or rather a month, in the life of a Cloud Consultant: the impediments we encountered, and the fact that it is not always a happy flow. But looking beyond the problem and finding a solution that works is especially satisfying.
So keep learning and challenging yourself every day!