I bring you: Epic Fails in Azure Architecture!
What is this? It’s a quick list of things that should be considered when architecting an application for Windows Azure (although much of it applies to other cloud platforms too). Each of these points comes from a real-world code audit where it had been neglected.
Why? From real-world experience, it feels like various aspects of Azure architecture are worth emphasising, and also that the tight integration offered by Visual Studio might be shielding some developers and architects from thinking about the PaaS/IaaS (aka “cloud”) specific aspects of application architecture.
I really enjoy performing code audits/reviews. They’re a chance to get a good look at the approaches and technology choices of another set of developers, which gives you an opportunity to challenge your own choices and approaches, all while (hopefully) adding value to the customer.
However, with some technology stacks it can be a challenge to work out what’s “obvious” and what’s genuinely worth writing about as a gotcha. When writing your report you need to put yourself in someone else’s shoes and think objectively about how you word your recommendations – there’s a big difference between writing a report that suggests ways to make minor improvements and one where you feel the customer has been ripped off and obvious things have been overlooked.
During a recent code audit of an application based heavily on Windows Azure, we found that some of the Azure-specific (and cloud-specific in general) aspects of the architecture had been completely overlooked – to the detriment of both the application and the customer. On top of that, the decisions made by the developers were actually making Azure look bad as a platform. To us as reviewers, some of the things that had been overlooked felt obvious – but maybe they weren’t quite so obvious after all?
One of the attractive things about Azure for .NET developers is how well integrated it is with Visual Studio. That ease of getting started makes it easy to forget you’re building for the cloud, because everything looks much as it does when you’re building for a standard server deployment. With Windows Azure, as with any other cloud technology, there are differences that need to be taken into account in the architecture, and considering them when creating your initial application architecture can save time, money, and frustration.
Transient Connectivity

When your application is hosted on-site you can usually guarantee a pretty reliable connection between the pieces of your internal infrastructure; the main variable is the connection between your application and external users. If you’re using a cloud-based database, however, there are more variables to consider. Small outages in connectivity will occur, and your application must be designed to expect them and handle them gracefully.
On a long enough timeline, Windows Azure applications will encounter dropped connections, and must be able to recover from them.
What to do: Ensure that transient connectivity has been discussed as part of your architecture. Ensure that developers understand the differences in database connectivity when working in the cloud, and that there are scenarios that need to be considered that they might not experience during day to day development (i.e. in their local development environment / using the Azure emulator). If you’re using Entity Framework for your data access layer, then implement the Transient Fault Handling Application Block. If you’re using NHibernate then you’ll want to look at this NHibernate specific version: NHibernate.SqlAzure.
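To make the retry pattern concrete, here’s a minimal hand-rolled sketch of what the Transient Fault Handling Application Block does under the covers. This is for illustration only – the real block adds proper transient-error detection for SQL Azure error codes and configurable back-off strategies, so prefer it over rolling your own:

```csharp
using System;
using System.Threading;

static class TransientRetry
{
    // Retries an operation a fixed number of times when the supplied
    // predicate classifies the failure as transient; rethrows otherwise.
    public static T Execute<T>(Func<T> action, Func<Exception, bool> isTransient,
                               int maxAttempts = 3, int delayMs = 1000)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception ex)
            {
                // Give up on non-transient errors or once we're out of attempts.
                if (!isTransient(ex) || attempt >= maxAttempts) throw;

                // Back off briefly before retrying; the real block also
                // supports incremental and exponential back-off.
                Thread.Sleep(delayMs);
            }
        }
    }
}
```

With the Application Block itself, the equivalent is roughly `new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(new FixedInterval(3, TimeSpan.FromSeconds(1)))` wrapped around your command via `ExecuteAction` – exact type names vary between Enterprise Library versions, so check the docs for the release you’re using.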
More reading: Handling Transient Connections in Windows Azure
Caching

How to approach caching is a topic that can polarise opinion. Some developers like to throw around the phrase “don’t optimize early”, which can mean anything from “I don’t understand caching” to “let’s cache at the end”. That’s all well and good, but in a cloud application, caching data can be the difference between an application that performs well and one that crawls along and fails spectacularly. When your database is on the other side of a network request and subject to changes in service availability, you want to do everything in your power to reduce unnecessary requests to it (on top of this, SQL Azure actively employs connection throttling, which makes this even more important). Caching is an obvious way to do that, so ensure caching is covered in your application’s architecture, even if you put off your full optimisation passes until towards the end of the project.
How much caching to do, and when, will vary from team to team and application to application. What’s key is that your architecture takes caching into consideration in some way from day one, because in a cloud application anything that reduces the number of connections or the amount of database access will greatly improve performance.
What to do: Ensure that caching has been considered in the architecture. Ensure that developers have a basic understanding of how connection management in the cloud differs from a “standard” development paradigm.
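The usual shape for this is the cache-aside pattern: check the cache first, and only go to the database on a miss. A minimal sketch – in a real Azure deployment you’d back this with Windows Azure Caching (or another distributed cache) rather than an in-memory dictionary, so that all role instances share one cache:

```csharp
using System;
using System.Collections.Generic;

// Cache-aside sketch: every cache hit is one fewer (billable,
// throttleable) round trip to SQL Azure.
class CachedRepository
{
    private readonly Dictionary<string, Tuple<DateTime, object>> _cache =
        new Dictionary<string, Tuple<DateTime, object>>();
    private readonly TimeSpan _ttl;

    public int DatabaseHits { get; private set; }  // exposed for illustration

    public CachedRepository(TimeSpan ttl) { _ttl = ttl; }

    public T GetOrLoad<T>(string key, Func<T> loadFromDatabase)
    {
        Tuple<DateTime, object> entry;
        if (_cache.TryGetValue(key, out entry) && entry.Item1 > DateTime.UtcNow)
            return (T)entry.Item2;  // cache hit: no network request at all

        DatabaseHits++;             // cache miss: one real database query
        T value = loadFromDatabase();
        _cache[key] = Tuple.Create(DateTime.UtcNow.Add(_ttl), (object)value);
        return value;
    }
}
```

The TTL is the knob your architecture should make explicit: how stale is each piece of data allowed to be? That’s a per-entity conversation worth having on day one, even if the actual tuning happens later.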
More reading: Windows Azure SQL Database Connection Management
Security

Different people have different opinions when it comes to the cloud and security, and they’re usually polar opposites: either everything in the cloud is completely insecure, or the cloud is a magical land that automatically addresses all their security concerns by making them someone else’s problem. Neither is true. Your applications and data can be as secure as anything you self-host (which is to say both could probably be hacked if someone truly wanted to, but still), but the cloud is not an automatic fix for all your issues, and when designing an application you can’t simply use cloud computing as an excuse for not considering security.
Sensitive information needs to be protected no matter where it is – even on Windows Azure. The data centre itself may be secure, but your application can still be breached. Always assume you’re going to get hacked, and ensure you’ve taken steps to protect sensitive data as much as possible.
SQL Azure also adds another factor to this discussion: it doesn’t give you the option of applying database-wide encryption, meaning that any sensitive information stored in SQL Azure may need an additional application-level layer of encryption. The same applies to BLOB storage – if you’re storing sensitive information there, you need to take additional steps to protect it (and to ensure you’re applying appropriate levels of security at the container level).
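Application-level encryption doesn’t need to be exotic. Here’s a minimal sketch using AES from the .NET framework to encrypt a value before it’s written to SQL Azure or BLOB storage. Key management (certificate store, configuration, etc.) is deliberately out of scope – never hard-code keys in production:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Encrypt sensitive values in the application, since the database
// itself won't do it for you on SQL Azure.
static class SensitiveData
{
    public static byte[] Encrypt(string plaintext, byte[] key)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = key;
            aes.GenerateIV();
            using (var ms = new MemoryStream())
            {
                ms.Write(aes.IV, 0, aes.IV.Length); // prepend IV so Decrypt can find it
                using (var cs = new CryptoStream(ms, aes.CreateEncryptor(), CryptoStreamMode.Write))
                using (var sw = new StreamWriter(cs))
                    sw.Write(plaintext);
                return ms.ToArray();
            }
        }
    }

    public static string Decrypt(byte[] ciphertext, byte[] key)
    {
        using (var aes = Aes.Create())
        {
            aes.Key = key;
            var iv = new byte[aes.BlockSize / 8];   // IV was prepended by Encrypt
            Array.Copy(ciphertext, iv, iv.Length);
            aes.IV = iv;
            using (var ms = new MemoryStream(ciphertext, iv.Length, ciphertext.Length - iv.Length))
            using (var cs = new CryptoStream(ms, aes.CreateDecryptor(), CryptoStreamMode.Read))
            using (var sr = new StreamReader(cs))
                return sr.ReadToEnd();
        }
    }
}
```

This gives you confidentiality for the stored values; depending on your threat model you may also want integrity protection (e.g. an authenticated mode or an HMAC over the ciphertext).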
Storing your data in the cloud does reduce some of the risks associated with physical security, and some associated with non-physical security (i.e. your system will always be patched, reducing risks of exploitations via Windows/framework vulnerabilities), however with any Internet connected application there is always risk of a breach, and as such you should always take steps to protect sensitive information as much as possible.
What to do: Build security into your application from the ground up. Identify sensitive information and where it will be stored (BLOB storage, SQL, etc.) and ensure it is appropriately protected. Run through basic breach scenarios (use some of the many examples out there of sites being compromised) and see how your application would compare.
Service Resilience

Cloud uptime is generally high, and service reliability (and making it someone else’s problem) is often one of the key drivers behind moving to the cloud. But that alone is no excuse for not considering the resilience of your application and building in checks and safeguards to protect it from failures.
When architecting your application, consider the services/components you’re using and where, and run through basic scenarios covering what your stakeholders would expect to happen when a particular service is unavailable. The most obvious example is a web application that uses Azure BLOB storage or the Azure Service Bus – if the web role is running but the Service Bus is unavailable, what would your users see? Would your application handle this gracefully, or fail horribly because it assumes all services are available all the time? Failing hard may be acceptable for your application, but at the least it should be discussed and documented somewhere.
What to do: Know which services are needed by aspects of your application, and how an outage of a particular service will affect things. Where applicable, architect resilience into your application to ensure that outages are handled gracefully. The key takeaway is that while cloud services are usually highly resilient, outages can still occur, and it’s up to you to handle them.
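One simple way to architect for this is to isolate each external dependency behind an interface so the rest of the application can degrade gracefully when it’s down. A sketch, with hypothetical names – the fallback behaviour (queue locally, disable the feature, show a friendly message) is a stakeholder decision, not a technical one:

```csharp
using System;

// Hypothetical abstraction over a Service Bus backed notification sender.
interface INotificationSender { void Send(string message); }

// Stand-in for the real sender during an outage: every call fails.
class AlwaysDownSender : INotificationSender
{
    public void Send(string message)
    {
        throw new TimeoutException("service unavailable");
    }
}

class ResilientNotifier
{
    private readonly INotificationSender _sender;   // e.g. Service Bus backed
    private readonly Action<string> _fallback;      // e.g. local queue / log

    public ResilientNotifier(INotificationSender sender, Action<string> fallback)
    {
        _sender = sender;
        _fallback = fallback;
    }

    // Returns true if sent directly, false if the fallback was used.
    public bool TrySend(string message)
    {
        try
        {
            _sender.Send(message);
            return true;
        }
        catch (Exception)
        {
            // Dependency outage: don't take the whole page down with it.
            _fallback(message);
            return false;
        }
    }
}
```

In production you’d typically pair this with retry logic and alerting rather than silently swallowing every failure – the point is that the decision about what happens during an outage is made deliberately, in one place.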
More reading: Insulating Service Bus Applications Against Service Bus Outages and Disasters, Details of the February 22nd 2013 Windows Azure Storage Disruption
SLA and Multiple Instances

As with many cloud services, Azure’s SLA is often one of the key drivers behind a move to the cloud. The catch is that Azure’s 99.95% SLA (for Web/Worker roles – other services have different percentages) requires each of your roles to run at least 2 instances. 1 instance = no SLA guarantee = you’re missing out on one of the big benefits of running in the cloud.
While most applications will run fine in 2 instances, there are a couple of examples worth highlighting that could cause issues.
1. Session state – there are additional steps you need to take to ensure anything you’re storing in session is available to both of your instances; failure to do so will cause you major issues. There’s some misleading information out there on this topic because the recommended approach has changed, but as at the time of writing it’s fairly simple to enable this via Windows Azure Caching. Basically, a few web.config tags and you should be good to go.
2. Worker roles performing background processing – you may not need the full SLA for your worker roles, in which case you can run a single instance and ignore the SLA. If you do need to run 2 instances of a worker role, you need to ensure it won’t double up on processing. Take a simple example of checking the database for records matching specific criteria and sending an email – with 2 or more instances, each role needs to lock the records it is working on to ensure only one copy of the email is sent.
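A common way to handle the worker role case is to claim pending rows atomically before processing them, so two instances can never pick up the same record. A T-SQL sketch, assuming a hypothetical PendingEmails table with Status, ClaimedBy and ClaimedAt columns:

```sql
-- Atomically claim a batch of unsent emails for this worker instance.
-- READPAST lets a second instance skip rows the first has locked, and
-- OUTPUT returns exactly the rows this instance now owns.
UPDATE TOP (10) PendingEmails WITH (READPAST)
SET    Status    = 'Processing',
       ClaimedBy = @InstanceId,
       ClaimedAt = GETUTCDATE()
OUTPUT inserted.EmailId, inserted.Recipient
WHERE  Status = 'Pending';
```

The worker then processes only the rows returned by the OUTPUT clause. You’d also want a sweep that resets rows stuck in 'Processing' past some timeout, in case an instance dies mid-batch.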
What to do: A 99.95% SLA doesn’t come for free – make sure your architecture is designed with multiple instances in mind. You probably won’t perform all development using 2 instances (why would you?), but you should test early on to make sure that all aspects of your application work with 2 instances of each role.
More reading: How to Use Windows Azure Caching, 2013 Azure recommended Session State for multiple instances
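For reference, the session state change mentioned above boils down to a custom provider entry in web.config. A sketch assuming the Windows Azure Caching session state provider – provider and assembly names vary between SDK releases, and you’ll also need a dataCacheClients section pointing at your caching role, so check the current documentation:

```xml
<sessionState mode="Custom" customProvider="AFCacheSessionStateProvider">
  <providers>
    <!-- Stores session in Windows Azure Caching so it's shared across instances -->
    <add name="AFCacheSessionStateProvider"
         type="Microsoft.Web.DistributedCache.DistributedCacheSessionStateStoreProvider, Microsoft.Web.DistributedCache"
         cacheName="default"
         dataCacheClientName="default" />
  </providers>
</sessionState>
```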
Billing and Pricing
As the Wu-Tang Clan wisely said, “C.R.E.A.M.” – cash rules everything around me – and when you’re being charged for every query or byte stored, it has the potential to rule everything around you and your stakeholders too.
While I wouldn’t advocate architecting solely around billing, you would be wise to factor it into your architecture (or testing) in some way. There may be occasions throughout the development cycle where you’re able to reduce requests made or the amount of data transferred, and so being conscious of billing is going to make your customers a lot happier with the end solution.
What to do: Keep an eye out for aspects of your application which may generate large bills, maintain a general awareness of the pricing structures and perform a basic sanity check during your design phase.
Windows Azure is an excellent cloud platform, and improvements and new features are being added on a regular basis. It’s incredibly easy to get started working with Windows Azure thanks to the excellent tooling and high level of integration, but architects should take some time to ensure they’re architecting in a truly cloud-centric manner.
Failing to fully understand the differences that developing in and for the cloud entails can lead to applications that perform badly, are insecure, and that make the choice of using Windows Azure look like a bad one.
Do you have an item which you think should be on the list? Post it below or contact me on Twitter and I’d be more than happy to update the list.
[SALES PITCH: If you’re working with Azure and want some advice, or if you have an application you feel needs its architecture reviewed, then feel free to get in touch with us at Ignition Development, we’d love to help you out.]
Tags: Microsoft, Windows Azure