Working closely with GCP’s Deployment Manager recently, it was really hard not to notice that Google sometimes… makes bugs. Seriously. Not that many (I definitely introduced more myself), but still enough to stumble across them now and then. Within a month I found about four of the most obvious ones, and so did the other members of my team, so bugs in GCP are not uncommon. So, let’s have a look at a few?
Deployment Manager and Cloud Functions
‘v1beta2’ or just ‘v1’?
A friend of mine once said that the difference between a bug and a feature is that a feature is documented, while a bug is not. By this definition, the following finding is definitely a bug.
If you try to find out what resource type you have to use in order to deploy a cloud function via Deployment Manager, cloudfunctions.v1beta2.function will probably be the answer. gcloud deployment-manager types list, as well as the official documentation, will confirm that it is indeed correct.
Well, not quite. There’s another documentation page saying that v1beta2 is deprecated and you should use v1 instead. The trick is that DM has no idea about the v1 resource type. I looked for it everywhere; it’s simply not there. Trying to feed DM the v1 type causes some sort of unknown resource type exception, so I’m definitely not just seeing things.
While not a big deal, it’s still confusing, and the only solution for today is to stick with v1beta2 until it’s completely removed in December. In the meanwhile, it’s worth checking regularly for newer resource type versions and trying to ignore the deprecation warnings along the way.
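For reference, a minimal Deployment Manager config using the v1beta2 type might look roughly like this (a sketch, not my original setup; the function name, bucket and entry point are placeholders):

```yaml
resources:
- name: hello-world-function
  # v1 sounds like the right choice, but DM only recognizes v1beta2 (for now)
  type: cloudfunctions.v1beta2.function
  properties:
    location: us-central1
    function: hello-world
    sourceArchiveUrl: gs://my-bucket/hello-world.zip
    entryPoint: helloWorld
    httpsTrigger: {}
    runtime: nodejs6
```

Swapping the type line to cloudfunctions.v1.function is exactly what produces the unknown resource type error.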
The mystery of the Node.js 8 build
The other cloud function deployment almost made my brain explode.
ERROR: (gcloud.deployment-manager.deployments.create) Error in Operation
"ResourceErrorMessage":"Build failed: Build error details not available"
Hey, what’s going on? If you’re going to fail, why not fail consistently? Why are the build error details not available? It worked just fine before!
The reason was the following. Cloud Functions can use different runtimes, and my initial configuration used the default one, nodejs6. For testing purposes I configured the function to use nodejs8 instead. After all, it’s a hello-world application. It doesn’t use await or anything fancy; it should be able to run on nodejs v0.1-alpha1 or whatever the most ancient version was. However, the nodejs8 runtime is in beta, and that’s enough to fail a Cloud Build step completely. Sometimes. I still have no idea why, but as soon as I reverted back to nodejs6, my unit test started to work again.
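The entire difference between the green and the red builds was a single property in the function’s config (illustrative):

```yaml
properties:
  # runtime: nodejs6   # the default; builds pass
  runtime: nodejs8     # beta at the time; Cloud Build fails, sometimes
```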
Deployment Manager and Dataproc
Don’t delete that service account
Google’s Dataproc is basically a Hadoop cluster deployed on Google Compute Engine’s virtual machines, so theoretically nothing can go wrong here. Well…
Here’s the thing. Dataproc needs two service accounts to run. Usually those accounts exist by default, and they should also have enough permissions to do the Dataproc thing. In my case one of the service accounts had limited permissions, and that was on purpose. So I had to create another one, give it the Dataproc Worker role, accidentally break the configuration, delete the account, and recreate it one more time under the same name.
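In Deployment Manager terms, the account-plus-role part could be sketched like this (hypothetical names; the iamMemberBinding virtual type is one common way to attach a project-level role, assuming it’s available in your project):

```yaml
resources:
- name: dataproc-worker-sa
  type: iam.v1.serviceAccount
  properties:
    accountId: dataproc-worker
    displayName: Dataproc worker account
- name: dataproc-worker-binding
  type: gcp-types/cloudresourcemanager-v1:virtual.projects.iamMemberBinding
  properties:
    resource: my-project-id   # placeholder project ID
    member: serviceAccount:$(ref.dataproc-worker-sa.email)
    role: roles/dataproc.worker
```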
Unfortunately, after those manipulations my Dataproc cluster still couldn’t be created because of the following error:
“ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT:
……. is missing required permissions: [
]. Service Accounts must have either 'Dataproc/Dataproc Worker' role or all
permissions granted by the role.
See https://cloud.google.com/dataproc/docs/concepts/iam for further details.”
Why? I just gave it that role! OK, maybe it takes some time to propagate the changes across the cloud. I gave it a few minutes, checked the existing project-level policies via the command line and the Cloud Console, but everything was correct. The account was there, it had the role, but Dataproc didn’t see it that way.
After spending a little bit of time with Google, the search engine, I found an interesting bug description: it manifests itself in Dataproc when you delete a service account and then create a new one with the same name. Just like I did. And the symptoms were the same. As soon as I created one more service account with a slightly different name, the error vanished.
Your deployment had better succeed
Finally, I managed to mess things up with Dataproc two more times in one go. First, I created a huge Deployment Manager template that brought up a cluster, its service account and a role, networks, and firewall rules in one step. Unfortunately, I missed one or two firewall rules that were supposed to be there, so the cluster wouldn’t work. However, that wasn’t the problem. The problem was that Deployment Manager went into an eternal loop of waiting for the cluster deployment to succeed, which would never happen, as the underlying VMs couldn’t talk to each other. I waited for some time for it to time out, but it didn’t. At that point, deleting the pending deployment wouldn’t work either. Finally, after 30 minutes or so, something did time out internally and the deployment failed.
However, another problem came up. I still couldn’t delete the deployment. DM was detecting some sort of race condition (seriously???) while deleting the service account and refused to continue. There’s a chance that this time it was me. But the strange thing is that after fixing the firewall rules my other deployment could be created and deleted successfully. Only the firewall changed, not the service account setup. And a week later, that broken deployment is still there.
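The missing piece was of this shape: an intra-cluster firewall rule so the Dataproc VMs can reach each other. A hedged sketch (the network reference and source range are placeholders for whatever your template defines):

```yaml
resources:
- name: dataproc-allow-internal
  type: compute.v1.firewall
  properties:
    # assumes a network named dataproc-network is defined elsewhere in the template
    network: $(ref.dataproc-network.selfLink)
    sourceRanges: ["10.128.0.0/9"]
    allowed:
    - IPProtocol: tcp
      ports: ["0-65535"]
    - IPProtocol: udp
      ports: ["0-65535"]
    - IPProtocol: icmp
```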
So this is me and Google. There’s a chance that some of the things I saw are documented somewhere, so those are features, not bugs. Or maybe I’m doing something wrong. But it’s so enjoyable to imagine that it’s not only me who produces bugs that I’ll refuse to admit that might be the case.