File server stops responding

A client faced an issue with a file server that affected almost all of its employees and kept staff, not only in the IT department, busy for almost two months.

For weeks we were in the dark, with no idea what could be causing the problem, and none of the actions we took were successful.

The issue

Starting in November 2016, the main file server (Windows Server 2008 R2), which hosted the user home drives and program data, suddenly stopped responding. It was impossible to save documents, Microsoft Word stopped responding, users were unable to log on, and so on.

After restarting the server everything worked again.

System and application logs on the server did not show anything suspicious.

The second time this happened was three days later. Again, nothing special was logged, neither on the server itself nor on the VMware host, the network components, or the SAN storage. All other systems worked well.

From then on, the issue occurred at least once a day. As we could not find any correlating indicators, such as heavy RAM or processor usage or the number of users logged in, we had no way to reproduce the problem.

The environment

The file server was a virtual machine on a VMware ESX host. Data was stored on a SAN, and most of the users worked on thin clients with a Citrix desktop.

Troubleshooting

Troubleshooting steps involved all of the system components and therefore many different teams.

More than once we copied the data to new machines with a newer operating system, but the error recurred. Windows Server 2012 R2 at least has an SMB server event log, where we noticed warning event 1020, indicating that communication with the underlying storage took longer than expected (“File system operation has taken longer than expected”). In the meantime we had opened a ticket with Microsoft Premier Support, who advised several tuning steps, which unfortunately did not lead to a solution.

We had all components checked: network devices and environment, storage, ESX hosts, etc.; everything worked normally. No other components showed any problems at the same time either.
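To see whether event 1020 clusters around the times the server hangs, it can help to count the warnings per hour. The following is only a minimal sketch: it assumes the SMB server log has been exported to a CSV file with the columns `TimeCreated` (ISO 8601 timestamps) and `Id`, and both the column names and the sample rows are made up for illustration, not taken from the actual incident.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical export of the SMB server event log; the column names and
# the rows are illustrative assumptions, not data from the real incident.
SAMPLE_CSV = """TimeCreated,Id,Message
2016-11-21T09:05:12,1020,File system operation has taken longer than expected.
2016-11-21T09:47:30,1020,File system operation has taken longer than expected.
2016-11-21T10:02:01,1016,Some unrelated event.
2016-11-21T14:15:44,1020,File system operation has taken longer than expected.
"""


def count_events_per_hour(csv_text, event_id=1020):
    """Count occurrences of one event ID, grouped by the hour they were logged."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Id"].strip() != str(event_id):
            continue  # skip all other events
        logged_at = datetime.fromisoformat(row["TimeCreated"].strip())
        counts[logged_at.strftime("%Y-%m-%d %H:00")] += 1
    return counts


if __name__ == "__main__":
    for hour, count in sorted(count_events_per_hour(SAMPLE_CSV).items()):
        print(hour, count)
```

Comparing the busiest hours with the times users reported the hang shows whether the warning is related to the outage or just background noise.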

The solution

Finally we separated the data using a standalone DFS on different file servers to isolate the possible source of the error. It turned out that there was a folder hosting Word templates with VB script, and every time a user of the development team with write access to the files opened a document based on a template, event 1020 was logged.

Two steps helped solve the problem: optimizing the VB code and removing all write access to the template folders for normal user accounts. Developers now have to use dedicated administrative accounts if they have to replace a template with a newer version.
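To verify the second step, one option is to list the folder's permissions with `icacls` and look for accounts that still hold a write right. The sketch below is an assumption-laden illustration, not the procedure we actually used: the function name, the folder path and the sample output are hypothetical, and it only interprets the simple rights letters F (full control), M (modify) and W (write). Deny entries and granular rights are not handled.

```python
import re

# Simple-rights letters printed by icacls that include write access:
# F = full control, M = modify, W = write.
WRITE_RIGHTS = {"F", "M", "W"}

# Illustrative output in the usual icacls layout; the folder and the
# EXAMPLE domain accounts are made up for this sketch.
SAMPLE_OUTPUT = r"""C:\Templates BUILTIN\Administrators:(OI)(CI)(F)
             EXAMPLE\Developers:(OI)(CI)(M)
             EXAMPLE\Domain Users:(OI)(CI)(RX)

Successfully processed 1 files; Failed processing 0 files
"""


def principals_with_write_access(icacls_output, folder):
    """Return the accounts whose entry on `folder` includes F, M or W."""
    writers = []
    for line in icacls_output.splitlines():
        line = line.strip()
        # The first entry is printed on the same line as the folder path.
        if line.startswith(folder):
            line = line[len(folder):].strip()
        principal, separator, rights = line.partition(":(")
        if not separator:
            continue  # blank line or the summary line at the end
        # Collect every flag in parentheses, e.g. "(OI)(CI)(M)" or "(R,W)".
        flags = set()
        for group in re.findall(r"\(([^)]*)\)", "(" + rights):
            flags.update(part.strip() for part in group.split(","))
        if flags & WRITE_RIGHTS:
            writers.append(principal)
    return writers


if __name__ == "__main__":
    for account in principals_with_write_access(SAMPLE_OUTPUT, r"C:\Templates"):
        print(account)
```

In the sample, the developers group still has modify rights, which is exactly the kind of entry that would need to be replaced by read-only access.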

Conclusion

If you notice the described issue on your file server, especially event 1020 in the SMB log, check whether there are Word templates stored on that server and make sure that normal users have only read access to that folder.
