Thursday, June 29, 2017

High Availability and Disaster Recovery Options for Dynamics 365 (CRM)

Planning the infrastructure for your Dynamics 365 environment is a critical step in the deployment process, especially when the software is mission critical to business needs and day-to-day operations. In scenarios like these, it is imperative to have high availability and even disaster recovery plans in place. So the question is: What options do you have when it comes to Dynamics 365 on premise? In this article, we will break it down from both an application (Dynamics 365) and database (SQL Server) perspective.

High Availability – Dynamics 365
For most companies, having high availability for their Dynamics 365 deployment is a must and the setup for this is similar to that of other web-based applications. Keep in mind that throughout this section we are strictly referring to Dynamics 365 servers with the “Front End” or “Full” role – in other words, the servers that host the website in IIS. Having multiple “Back End”-only servers (those with the Async and Sandbox services) is also a good idea but those will natively load balance themselves if they are in the same deployment.

  • Network Load Balancing – Probably the most obvious method for high availability is running multiple Dynamics 365 servers in a load-balancing configuration. This will not only help with carrying the day-to-day load of traffic but also serve as the foundation for a good high availability option. If a server were to fail, just pop it out of the NLB pool. Load balancing comes in a few different flavors because of all the applications out there but mainly it comes down to two options – hardware and software.
    • Hardware Load Balancing involves the use of a third-party appliance such as F5 or NetScaler that sits in front of the Dynamics 365 servers to handle which server receives the user requests. While this is the preferred option as it does not create any additional processes for the servers to handle and usually provides a wider array of configuration options (e.g. monitoring, security, filtering, etc…), it can be more difficult and much more costly to implement.
    • Software Load Balancing, on the other hand, is relatively easy to set up and, in the case of Windows NLB, is free, as you have already paid for your Windows Server licenses. Critics of software load balancing will contend that it is not “true” high availability due to the lack of many key features found in hardware load balancing. The fact stands that Windows NLB only tracks whether each host is responding at the cluster level and does not check the health of the Dynamics 365 website itself, so if the application were to fail while the server stayed online, requests would still be sent to the faulted server until it is removed from the pool.
    • As one can probably infer, this is simply a cost vs. ROI decision – which is the case with most IT situations. If the hardware option is being considered, make sure to reach out to your appliance vendor to ensure there are no known issues with it and Dynamics 365. It is also a good idea to inquire about documentation for setup. For more information about load balancing Dynamics 365 in general and the required configurations, check out this TechNet: https://technet.microsoft.com/en-us/library/hh699803.aspx. 
  • Backup Server(s) – While this method may be a little more rigid, it is still effective and again, has a couple options. The meat and potatoes of this route is having a server to fall back to in the event that your primary Dynamics 365 server(s) fails and NLB has not been implemented.
    • VM Snapshots – If you are reading this, you most likely understand the concept of VM snapshots so we will not dive too heavily into this. The idea is to have snapshots taken automatically on a set interval that the Dynamics 365 server can be reverted to, or a new server built from, in the event of failure. Reverting the existing server is the easier option but also the riskier of the two. Spinning up a new server is safer but more involved – and requires the shutdown of the existing server as you will need to use the same host name and IP address.
    • Lying in wait – We have had some customers opt for this route as the return to uptime is a little quicker and/or VMs were not being used. Dynamics 365 deployments can obviously be made up of multiple servers – but not all need to be active or enabled. The concept here is to build a secondary Dynamics 365 server as part of the deployment but leave it disabled until it is needed. If the primary server were to fail, all that would be required is enabling the server via the Deployment Manager and changing DNS (and firewall rules, if applicable).
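To make the load balancing discussion above a little more concrete, here is a minimal Python sketch of the difference a health check makes when picking the next server in a pool. The server names and the probe function are hypothetical, and this is only an illustration of the idea, not how any particular appliance or Windows NLB is implemented.

```python
from typing import Callable, Sequence, Tuple


def next_healthy(servers: Sequence[str],
                 is_healthy: Callable[[str], bool],
                 start: int = 0) -> Tuple[str, int]:
    """Return the first healthy server at or after `start`, plus the next start index.

    Walks the pool once in round-robin order and skips any server that fails
    the probe. Raises RuntimeError if every server in the pool is unhealthy.
    """
    count = len(servers)
    for offset in range(count):
        index = (start + offset) % count
        if is_healthy(servers[index]):
            return servers[index], (index + 1) % count
    raise RuntimeError("no healthy servers in pool")


# Hypothetical pool: the second web server has failed its health probe.
pool = ["crm-web01", "crm-web02", "crm-web03"]
failed = {"crm-web02"}


def probe(server: str) -> bool:
    """Stand-in for a real health check (for example, an HTTP request to the site)."""
    return server not in failed
```

Without the probe, a plain round-robin would keep handing every third request to the failed server, which is the behavior critics of the simpler options tend to point at.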


High Availability – SQL Server
Now that we have covered high availability from an application perspective, what about the database? Don’t worry! Microsoft has that covered too!

  • SQL Server Failover Cluster – There is nothing new about this concept. Two SQL Servers are set up to use a shared disk; when one server goes down, the other picks up the load. When installing Dynamics 365, just make sure to point connection strings to the cluster name and not a single node and you should not have any issues, but be sure to test failover prior to going live. Just to note - although you can install Dynamics 365 to a SQL Server cluster configured for either active-active or active-passive clustering, the cluster will function in an active-passive manner. While clustering is tried and true, Microsoft decided to go a step further with...
  • SQL Server Always On – New to SQL Server 2012 (Enterprise Edition), Always On introduced the concept of Availability Groups. An Availability Group is simply a container for a set of databases that fail over together. For purposes of Dynamics 365 databases, this is important considering there are always at least two databases per deployment. With Always On, the databases in the Availability Groups are synchronized amongst all the configured replicas. We won’t get into the details of setting up Always On (there are tons of articles out there already) but be sure to reference this TechNet for how to configure Dynamics 365 with SQL Always On: https://technet.microsoft.com/en-us/library/jj822357.aspx 


Disaster Recovery – Dynamics 365 and SQL Server
While having a single server fail is far more likely than seeing your entire data center swallowed into a pit only to be burnt up in the Earth’s core, it is a situation that should also be planned for! Unlike the sections regarding high availability, we need to think of a DR scenario in terms of the entire infrastructure – both Dynamics 365 and SQL Server – because if there is a disaster, chances are it is taking both so these plans go hand-in-hand. 

Over the years, we have implemented CRM/Dynamics 365 disaster recovery plans for numerous customers and as a result, have learned (what we consider) the best option – which is what we will be covering. This is not to say there are no other viable methods (such as VM snapshot migrations or SQL Log Shipping), but we will keep focus on our preferred option. In a nutshell, we are going to have a separate, self-contained Dynamics 365 environment in the DR data center that has data synced from the production SQL servers in the primary site.

SQL Server
Let us start from a database perspective since it is a little more complex and setting up Dynamics 365 for DR will require this information. As previously touched on, SQL Always On will come in handy yet again; recall the concept of Availability Groups, as this will play heavily into this strategy.

For DR scenarios with Dynamics 365, we will need two Availability Groups in SQL Always On – one for the MSCRM_CONFIG database to be synchronized between the two primary data center nodes and another to synchronize the Organization_MSCRM database between not only the two primary data center SQL Servers but also a third SQL Server in your DR data center. The reason for this will become clear as you keep reading.
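The two-group layout can be easier to reason about when written down as data. The following Python sketch models it; the node names and Availability Group names are hypothetical, and only the database names come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AvailabilityGroup:
    """A named set of databases that fail over together across some replicas."""
    name: str
    databases: Tuple[str, ...]
    replicas: Tuple[str, ...]


# Hypothetical server names, for illustration only.
PRIMARY_NODES = ("SQL-PROD-01", "SQL-PROD-02")
DR_NODE = "SQL-DR-01"

# Group 1: configuration database, primary data center only.
CONFIG_AG = AvailabilityGroup("AG-CRM-CONFIG", ("MSCRM_CONFIG",), PRIMARY_NODES)

# Group 2: organization database, primary data center plus the DR node.
ORG_AG = AvailabilityGroup(
    "AG-CRM-ORG", ("Organization_MSCRM",), PRIMARY_NODES + (DR_NODE,)
)


def databases_on(node: str, groups: Iterable[AvailabilityGroup]) -> List[str]:
    """List the databases a given node holds a replica of."""
    return sorted(db for g in groups if node in g.replicas for db in g.databases)
```

Calling `databases_on(DR_NODE, [CONFIG_AG, ORG_AG])` returns only the organization database, which is the point of the split: the production MSCRM_CONFIG never lands on the DR SQL Server.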


Dynamics 365
Going into the Dynamics 365 server installation(s), the third SQL node should already be built and ready with SQL Always On. Unlike the installation of the Dynamics 365 servers in the primary data center, you do NOT want to set the SQL connection to use the Availability Group but rather just the SQL Server located in the DR data center. The reasoning behind this is the MSCRM_CONFIG database – this database is specific per deployment of Dynamics 365 and two cannot exist in the same SQL server/instance (nor can the name be changed). Remember that this Dynamics 365 DR deployment is to be isolated from the primary.

At this point, you may be asking yourself how this all comes together. When the primary data center goes down (mainly both primary SQL nodes), the organization database will be failed over to the third node in DR. Once that happens, launch the deployment manager on the Dynamics 365 DR Server and import the organization database into the deployment. That should take about 5 minutes and then Dynamics 365 will be back up and running.

Great! We have the servers all set up in DR and are ready for that giant pit to open up beneath the data center! Now what? RUN THROUGH A TEST DISASTER RECOVERY SCENARIO! This cannot be stressed enough – you do not want to wait for a real disaster to find that some small configuration issue messed up the entire plan.

Monday, June 5, 2017

URL Redirect Based on IP Address

As you are most likely aware, CRM/D365 has two URLs to access it when set up for IFD. The first URL, referred to as the “internal URL”, will allow authentication to be passed through and not display the ADFS login page for credentials. As the name suggests, this URL should only be available from within the network and not exposed out over the internet. The second URL (not surprisingly) referred to as the “external URL”, will not pass through authentication and will display the ADFS login form for users to enter credentials. This URL is the one exposed out over the internet but is also available from within the network.

Customers often like the convenience of not having to enter credentials using the internal URL but do not like having two URLs to use. As a result, we are tasked with providing a solution to this dilemma. Back in the days of ADFS 2.0, we could easily handle this with a much simpler tweak in IIS of the ADFS server. Unfortunately, with the newer versions of ADFS no longer using IIS as its backbone this is no longer an option and a URL Rewrite rule now must be implemented on the CRM/D365 server. Here’s how:

1. Open IIS on the CRM/D365 Front End (web) Server and go to the URL Rewrite module of the CRM Website.


2. Click the link to “Add Rule(s)” and just select “Blank Rule” under Inbound rules.


3. Give the rule an applicable name and in the Match URL section, leave both the “Requested URL” and “Using” fields as their defaults (shown in screen capture). In the “Pattern” field type in (.*).


4. This is where it gets fun. In the Conditions section, you will add three rules in the order shown below.

     a. For the first rule, change “crmorgname” to the actual name of your organization.
     b. For the second rule, change the pattern to match that of your internal IP range.
     c. Just enter the third rule as shown – no changes needed – ([^\.]*)\.(.*)


5. In the Action section, change the “Action type” to “Redirect” and for the Redirect URL enter your internal CRM/D365 URL followed by the syntax shown in the screenshot. Leave “Append query string” checked and set the Redirect type to “Permanent (301)”.


6. Apply the URL Rewrite rule and test it by going to the external URL from a machine that has an IP address matching that of the ranges specified on your network. If everything was set up correctly, the external URL will redirect to the internal URL and pass-through authentication will occur.
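If it helps to see the logic of the rule outside of IIS, here is a rough Python sketch of how the three conditions and the redirect could fit together. Only the third pattern is taken from step 4. The organization name, the internal IP pattern, the internal base URL, the assumption that the conditions are tested against the host header and the client address, and the exact shape of the redirect URL are all hypothetical, so treat this as a way to reason about the rule rather than a copy of it.

```python
import re
from typing import Optional

ORG_NAME = "crmorgname"                              # hypothetical organization name
INTERNAL_IP_PATTERN = r"^10\.0\.\d{1,3}\.\d{1,3}$"   # hypothetical internal range
HOST_PATTERN = r"([^\.]*)\.(.*)"                     # third condition, from step 4
INTERNAL_BASE_URL = "https://crm.domain.com"         # hypothetical internal URL


def redirect_target(host: str, client_ip: str, path: str) -> Optional[str]:
    """Return the internal URL to redirect to, or None if any condition fails."""
    # Condition 1 (assumed): the request is for this organization's external host.
    if not re.match(rf"^{re.escape(ORG_NAME)}\.", host, re.IGNORECASE):
        return None
    # Condition 2 (assumed): the client address falls in the internal range.
    if not re.match(INTERNAL_IP_PATTERN, client_ip):
        return None
    # Condition 3: split the host into its first label and the remainder.
    match = re.match(HOST_PATTERN, host)
    if match is None:
        return None
    org = match.group(1)
    # Assumed redirect shape: internal URL, then organization name, then the path.
    return f"{INTERNAL_BASE_URL}/{org}/{path.lstrip('/')}"
```

An internal client asking for `crmorgname.domain.com` would be sent to the internal URL, while the same request from an outside address returns `None` and is left alone.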

Wednesday, May 17, 2017

Tracing ADFS Logon Failures - Enabling ADFS Auditing

If you are ever faced with a situation where you are seeing a ton of logon failures in your ADFS logs and you’re not sure where they are coming from, you will soon learn that the basic logs do not provide any insight into their origins. But fear not! Some additional auditing can be enabled to help track down your problem child.

You will likely start with “Event 342 – The user name or password is incorrect” in the ADFS Admin logs – this is about as useless as it gets.


In my scenario, I was seeing about 1,500 of those events per minute. Needless to say, something was wrong but we had no idea what could possibly be trying to authenticate that much. Luckily, ADFS has some built-in auditing that can be of more use in situations like this. Start out by opening the ADFS Management Console and choose the option “Edit Federation Service Properties…” (it’s in the column on the right). Once in the properties screen, click on the “Events” tab. Here you should see 5 checkboxes – 2 of which are unchecked. Check those boxes (Success audits and Failure audits) and click OK.


Now that ADFS is set up for auditing, you need to tell the server to allow it. Head on over to the Local Security Policy of the ADFS server – this can be found from the Server Manager. Once here, drop down “Advanced Audit Policy Configuration” and then “System Audit Policies – Local Group Policy Object”. Locate the “Object Access” policy and then open the “Audit Application Generated” subcategory. Check the three boxes on this screen and then click OK.



You should now be all set to revisit your Event Viewer. Look into the Security events under the Windows Logs and you should now see events with ID 411 for “Classic Audit Failure” with the source as “AD FS Auditing”.


Go ahead and open one of those bad boys up…. Ahhhh finally some useful information! A quick look through will reveal the Client IP, which should be the machine sending the invalid authentication requests to ADFS.


The process from here is self-explanatory: head on over to the machine with the offending client IP address and find out what the problem is. In my case, there was a long-forgotten email router still set up which kept trying to access CRM. After figuring out the problem, it is recommended to disable auditing.
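With hundreds of audit failures per minute, opening events one at a time gets old quickly. As a rough sketch, assuming you have exported the event descriptions to plain text and that the client IP shows up in each one as an IPv4 address, a few lines of Python can tally the worst offenders. The function name is my own, and the sample strings in the usage note are simplified stand-ins rather than real event output.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Loose IPv4 matcher; good enough for tallying, not for validation.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def top_client_ips(event_texts: Iterable[str], limit: int = 5) -> List[Tuple[str, int]]:
    """Count how many events mention each IPv4 address and return the most common."""
    counts: Counter = Counter()
    for text in event_texts:
        # Use a set so an address repeated within one event is counted once.
        counts.update(set(IPV4.findall(text)))
    return counts.most_common(limit)
```

For example, feeding it `["Client IP: 10.1.1.5", "Client IP: 10.1.1.5", "Client IP: 10.1.1.9"]` puts 10.1.1.5 at the top of the list with a count of 2. Keep in mind that an event may mention other addresses as well (a proxy, for instance), so the tally is a starting point and not a verdict.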

Wednesday, April 26, 2017

To Run The Email Router Configuration Manager, You Must Be A Local Administrator

Just had an unusual error message pop-up when trying to open the Email Router Configuration Manager following the update to Dynamics 365 (on-prem). It was something I had never seen before with previous versions of CRM – 2016 included. The message read:

To run the Email Router Configuration Manager, you must be a local administrator. Make sure that you are a member of the Administrator group before you start this program.


Well… this was an internal Dynamics 365 environment in which my account was a Domain Admin so I was a bit confused but I played along. Went ahead and added my account in as a local admin on the server and rebooted for good measure. That didn’t help. No matter how much I restarted the router service, rebooted, tried adding permissions, etc… it all fell on deaf ears. Googling the error message resulted in no meaningful results – which usually means 1 of 2 things – my router is completely hosed or I just uncovered a nice new bug with Dynamics 365!

After thinking about it for a few more minutes, I figured why not try running it AS AN ADMINISTRATOR!?!?! Sure enough, 3 clicks later and I was in… Easy workaround and I hope it saves some head-scratching!



Tuesday, March 28, 2017

Yet Another ADFS Looping Issue

I recently applied the Dynamics 365 Update to a CRM 2016 Service Pack 1 environment that was set up for IFD and ran into some unexpected behavior. This environment had about 15 organizations set up in it. When trying to log on using the external URLs (e.g. orgname.domain.com), access worked without issue, but when using the internal URL for any organization (e.g. crm.domain.com/orgname), CRM and ADFS would run in a continuous loop. Sometimes it would error out, sometimes it would just keep looping forever.

Started off by checking event viewers on both the CRM and ADFS server. I noticed that on the times it did error out after a few loops instead of looping nonstop, it resulted in an event log similar to one I saw before caused by a bug with the 0.1 Update for CRM 2016 (http://blog.gagepennisi.com/2016/04/crm-2016-update-01-bug-with-adfs-for.html):

MSIS7042: The same client browser session has made '6' requests in the last '17' seconds. Contact your administrator for details.

So immediately I thought “Great, here we go again.” There was also nothing in the CRM event viewer.

Hours of trying different troubleshooting tactics for similar situations (including this one with the same looping behavior but with the external URL - http://blog.gagepennisi.com/2016/01/adfs-logon-page-loop-issue.html) yielded nothing. Rebuilt the Relying Party Trusts, reconfigured IFD, created a self-signed certificate to verify it wasn’t a certificate issue, etc…

Instead of keeping on with the guess and check, I realized I needed to get more information around the problem so I started with a Fiddler trace – great web debugging tool, if you haven’t heard of it (http://www.telerik.com/fiddler). Unfortunately, in this case all it did was confirm the behavior I was seeing in the browser which was a constant redirect between D365 and ADFS.


After that, I decided to run a platform trace on the CRM server to see if CRM would give me any insight but nothing really stood out to me. Almost at my wit’s end, I decided to rope in my colleague Dan Francis to get another set of eyes on it. After some review of the platform trace he noticed one little line that seemed a bit odd:

>Multi-org sharable cache loading system and non-system metadata with build number 7.0.0.3543 and language 1033

CRM 2016 is version 8.0 and D365 is 8.2 so the fact that we were seeing any reference to 7.0.X.XXXX (CRM 2015) was not in line with what was to be expected. We checked the Deployment Manager and realized there was an old, disabled organization from CRM 2015 that was not upgraded (because it was disabled at the time of the original CRM 2016 upgrade). After some deliberation, we decided to remove that organization from the deployment and boom! The internal URL began working as expected!

Basically what we were able to surmise from this is that when the internal URL is being used, all organizations – whether enabled or disabled – are checked for versioning and the lowest build number is used to load this “multi-org sharable cache” and apparently D365 doesn’t play nicely with older versions hanging around. Moral of the story – update/upgrade all of your orgs or just remove them.
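A quick way to catch this before an update is to compare each organization's version against the deployment's. Here is a small Python sketch, assuming you have noted the organization names and versions by hand (from the Deployment Manager, for instance). The organization names and the 8.2 build number are hypothetical; the 7.0.0.3543 build is the one from the trace above.

```python
from typing import Dict, List, Tuple


def parse_build(build: str) -> Tuple[int, ...]:
    """Turn a dotted build string such as '7.0.0.3543' into a tuple of integers."""
    return tuple(int(part) for part in build.split("."))


def stale_orgs(org_builds: Dict[str, str], deployment_build: str) -> List[str]:
    """Return organizations whose major.minor version is behind the deployment's."""
    target = parse_build(deployment_build)[:2]
    return sorted(
        name for name, build in org_builds.items()
        if parse_build(build)[:2] < target
    )


# Hypothetical inventory: one current organization and one left behind on CRM 2015.
orgs = {"sales": "8.2.0.749", "legacy": "7.0.0.3543"}
```

Running `stale_orgs(orgs, "8.2.0.749")` flags only the `legacy` organization, which is the kind of leftover that caused the loop here.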

Big thanks to Dan the Man!

Friday, January 27, 2017

CRM and TLS 1.0

While performing a routine installation of CRM 2016, I stumbled upon a new error during the system checks at the end of CRM installation wizard. The error was for the SQL Server check and read:

Could not connect to the following SQL Server: 'XXXXXX-SQL02'. Verify that the server is up and running and that you have SQL Server administrative credentials. [DBNETLIB][ConnectionOpen (SECDoClientHandshake()).]SSL Security error.


Having never seen this particular error, I resorted to my friend Google who led me down a wormhole of possible issues and resolutions related to certificates and protocols. Turns out that this was indeed related to a security protocol – TLS 1.0 to be exact. It was disabled on the SQL Server as part of our new security template for new servers. So with the help of one of our Cloud Application Engineers, Spencer Ashworth, we enabled TLS 1.0. This is done through a simple registry change.

Browse to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Server\ and locate the DWORD named Enabled. Set the value to 1, close the registry and then reboot the server.
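If you would rather script the change than click through the registry editor, the same edit can be sketched in Python with the standard library's winreg module. This is only a sketch under a few assumptions: it has to run on the Windows server itself from an elevated prompt, the function name is my own, and the reboot afterwards is still up to you.

```python
import sys

# Registry path from the step above, relative to HKEY_LOCAL_MACHINE.
TLS10_SERVER_KEY = (
    r"SYSTEM\CurrentControlSet\Control\SecurityProviders"
    r"\SCHANNEL\Protocols\TLS 1.0\Server"
)


def set_tls10_server_enabled(enabled: bool = True) -> None:
    """Write the Enabled DWORD (1 or 0) under the TLS 1.0 Server key."""
    if sys.platform != "win32":
        raise OSError("winreg is only available on Windows")
    import winreg  # imported here so this module still loads on other platforms

    # CreateKeyEx opens the key if it exists and creates it otherwise.
    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, TLS10_SERVER_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, "Enabled", 0, winreg.REG_DWORD, 1 if enabled else 0)
```

Calling `set_tls10_server_enabled()` writes a value of 1, matching the manual step.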


Following the reboot and re-run of the CRM installation wizard, the system checks all passed and CRM was able to be successfully installed. Big thanks to Spencer who was the one that recalled CRM’s requirement of TLS 1.0 and knew the resolution!

Thursday, January 12, 2017

CRM Diagnostics Page


When troubleshooting CRM performance issues everyone typically wants to point fingers at the servers – whether it be SQL, CRM, or even SSRS – but a commonly overlooked factor is the network. Not known to everyone is a tool built into CRM called the CRM Diagnostics Page. It is basically a very simple way to check network performance. To access the CRM Diagnostics Page simply go to:

http(s)://<YourCRMServerURL>/tools/diagnostics/diag.aspx

This will land you at a page that looks like the following:



On this page, you can clearly see the series of tests available but what are they?

1. Latency Test
This calculates the average time taken to download a small text file 20 times. The file downloaded from the CRM server is /_static/Tools/Diagnostics/smallfile.txt. CRM is designed to work best with latency under 150 milliseconds. Latency can be a huge factor in performance for offices far away from the CRM server’s location.

2. Bandwidth Test
The bandwidth test downloads image files and records the speeds. The recorded speeds are averaged together and provided in the results summary. Bandwidth should ideally be higher than 50 KB/sec. This test and the Latency Test are the two most helpful on this page.

3. Browser Info
This is a JavaScript pull of the local browser details such as browser name, version, cookie status, operating system and the user-agent string.

4. IP Address
This is the IP address of the client computer as known to the server. The IP address is server-side dynamic and represents the IP address which was used to contact the server.

5. JavaScript tests
The series of JavaScript tests times loops and returns their execution time. They are essentially just memory/CPU stress tests on the client machine.

6. Organization Info
Provides general server information: organization name, time on server and client, and URL.
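The Latency Test described above is simple enough to approximate outside the browser, which can be handy for checking a remote office from a script. Here is a minimal Python sketch that times 20 downloads of the same small file and averages them. The function names are my own, and the plain urllib request is an assumption: your deployment will likely need an authenticated session, which is why the fetch step can be swapped out.

```python
import time
import urllib.request
from statistics import mean
from typing import Callable

# Path of the small file used by the diagnostics page's Latency Test.
SMALL_FILE_PATH = "/_static/Tools/Diagnostics/smallfile.txt"


def fetch(url: str) -> bytes:
    """Download a URL and return its body (no authentication handling)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def average_latency_ms(base_url: str, runs: int = 20,
                       fetcher: Callable[[str], bytes] = fetch) -> float:
    """Average the time, in milliseconds, taken to download the small file `runs` times."""
    url = base_url.rstrip("/") + SMALL_FILE_PATH
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fetcher(url)
        samples.append((time.perf_counter() - start) * 1000)
    return mean(samples)
```

Something like `average_latency_ms("https://<YourCRMServerURL>")` would then give a number to compare against the 150 millisecond guideline.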

Bonus! 
This is also a neat little secret: http(s)://<YourCRMServerURL>/home/home_debug.aspx.
It has some of the same information as the diagnostics page - Not overly helpful but can possibly come in handy troubleshooting. I find it just to be a nice, concise page for deployment information to help with documentation.