Last Tuesday, a friend called to ask if we could help him check a Lync environment in which the Lync service wouldn’t start on one of the Front End servers. This particular environment was familiar to us since we were part of the team that deployed it, so we knew it had only two Front End Servers. Even though this topology isn’t recommended by Microsoft, it is supported:
Topologies and components for Front End Servers, instant messaging, and presence in Lync Server 2013
Additional checks revealed that the affected Front End had been automatically updated (both Windows Server and Lync Server) a month earlier, but the issue only started after a reboot.
In the Event Viewer > Applications and Services Logs > Lync Server, we found nothing relevant; only warning messages, and the same message over and over:
Log Name: Lync Server
Source: LS User Services
Date: 07/10/2014 11:39:26
Event ID: 32169
Task Category: LS User Services
Description: Server startup is being delayed because fabric pool manager is initializing.
Cause: This is normal when Pool is bootstrapped and indicates that the Front-End is waiting for a quorum of other Front-Ends to be started.
If this event recurs persistently, ensure that 85% of the Front-Ends configured for this Pool are up and running. For 2 or 3 machine Pools, initial cold-start of the Pool requires all machines to be started. If multiple Front-Ends have been recently decommissioned, run Reset-CsPoolRegistrarState -ResetType QuorumLossRecovery to enable the Pool to recover from Quorum Loss and make progress.
The Reset-CsPoolRegistrarState cmdlet had already been tried, so we checked the certificates again and everything was OK: Root CA installed, valid certificate, all FQDNs in the SAN, and the CRL.
Just when we were ready to give up and renew the certificates, we decided to have a look at Event Viewer > Windows Logs > System, and there we found an error message:
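For reference, the quorum recovery suggested by event 32169 can be sketched roughly as follows in the Lync Server Management Shell. The pool FQDN is a placeholder and not the name of the environment described here; Get-CsPoolFabricState is included only as a quick way to look at the pool before resetting anything.

```powershell
# Assumption: "pool01.contoso.com" is a placeholder pool FQDN, not from the original environment.
Import-Module Lync
$poolFqdn = "pool01.contoso.com"

# Report the state of the fabric services and replicas for the pool.
Get-CsPoolFabricState -PoolFqdn $poolFqdn

# Attempt the quorum loss recovery named in the event description.
Reset-CsPoolRegistrarState -PoolFqdn $poolFqdn -ResetType QuorumLossRecovery
```

In our case this did not help, because the root cause was not quorum loss.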
Log Name: System
Date: 07/10/2014 11:59:55
Event ID: 36870
Task Category: None
A fatal error occurred when attempting to access the SSL server credential private key. The error code returned from the cryptographic module is 0x8009030D. The internal error state is 10003.
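Both events can also be pulled from PowerShell instead of browsing Event Viewer. This is a minimal sketch that assumes the log names and event IDs shown in the entries above; the number of events returned is arbitrary.

```powershell
# Most recent occurrences of the "fabric pool manager is initializing" warning.
Get-WinEvent -FilterHashtable @{ LogName = 'Lync Server'; Id = 32169 } -MaxEvents 5 -ErrorAction SilentlyContinue |
    Format-List TimeCreated, Id, Message

# Most recent occurrences of the private key access error in the System log.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 36870 } -MaxEvents 5 -ErrorAction SilentlyContinue |
    Format-List TimeCreated, Id, Message
```

If the second query returns events with timestamps that match the failed service starts, the private key permissions are worth checking before renewing any certificate.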
Checking the certificates once more, we noticed that only Local Administrators had permission to access the certificate’s private key. This means that the service account couldn’t read the Lync certificate’s private key. To verify the private key permissions on a certificate, open the Certificates console, right-click the certificate, and then select Manage Private Keys.
After granting the Network Service account permission on the Lync certificate’s private key, we restarted the services and all Lync Front End services started.
Further checking revealed that the certificate should have more permissions than just the one for Network Service. Using the Lync Deployment Wizard to reassign the same certificate grants the necessary permissions for the Lync Server related objects on that certificate, which makes it the preferable way to solve this issue.
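The same check can be approximated from an elevated PowerShell session. The sketch below assumes the certificate uses a legacy CSP key (CNG keys are stored elsewhere and are not handled here), and the thumbprint is a placeholder to be replaced with the one of the Lync certificate.

```powershell
# Assumption: placeholder thumbprint; replace with the thumbprint of the Lync certificate.
$thumbprint = "<certificate thumbprint>"
$cert = Get-Item -Path "Cert:\LocalMachine\My\$thumbprint"

# Assumption: legacy CSP key, whose file lives under the MachineKeys folder.
$keyName = $cert.PrivateKey.CspKeyContainerInfo.UniqueKeyContainerName
$keyPath = Join-Path $env:ProgramData "Microsoft\Crypto\RSA\MachineKeys\$keyName"

# List who has access to the private key file.
(Get-Acl -Path $keyPath).Access |
    Format-Table IdentityReference, FileSystemRights, AccessControlType -AutoSize
```

On the failing server, a listing of this kind would be expected to show Administrators but no entry for NETWORK SERVICE.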
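As a rough command-line equivalent of that manual change, read access can be granted with icacls on the key file. This is only a sketch under the same assumptions as the console steps: a legacy CSP key, a placeholder thumbprint, and an elevated session. It adds just the Network Service entry and nothing else.

```powershell
# Assumption: placeholder thumbprint; replace with the thumbprint of the Lync certificate.
$thumbprint = "<certificate thumbprint>"
$cert = Get-Item -Path "Cert:\LocalMachine\My\$thumbprint"

# Assumption: legacy CSP key stored under the MachineKeys folder.
$keyName = $cert.PrivateKey.CspKeyContainerInfo.UniqueKeyContainerName
$keyPath = Join-Path $env:ProgramData "Microsoft\Crypto\RSA\MachineKeys\$keyName"

# Grant read access on the private key file to the Network Service account.
icacls $keyPath /grant "NT AUTHORITY\NETWORK SERVICE:(R)"
```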
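The certificate step of the Deployment Wizard is a front end for the Set-CsCertificate cmdlet, so the reassignment can probably be scripted as well. The sketch below is an assumption on our part rather than something we ran in this environment: the thumbprint is a placeholder, and the list of usages assumes a single certificate is assigned to all three Front End roles.

```powershell
Import-Module Lync

# Show which certificates are currently assigned to which usages.
Get-CsCertificate

# Assumption: placeholder thumbprint, and one certificate shared by all three usages.
$thumbprint = "<certificate thumbprint>"
Set-CsCertificate -Type Default, WebServicesInternal, WebServicesExternal -Thumbprint $thumbprint
```

Afterwards, checking Manage Private Keys again should confirm whether the expected permissions are back.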
The other Front End was also updated and we got a healthy Lync pool again.
We couldn’t figure out why the permissions on the certificate had changed, but requesting a new certificate would also have solved the issue.
Fun Fact: in the Standard Edition Pool, the warning message is the same:
If this event recurs persistently, ensure that 85% of the Front-Ends configured for this Pool are up and running…