TL;DR: A multihomed Windows Server 2012 R2 machine seems to use both interfaces for traffic to a specific network. The routing table looks good and tracert behaves as expected. When all other routes to this network are removed, traffic works like a charm, so the connection itself is fine.
We have a backup server (Veeam, Windows Server 2012 R2) that is connected to two different subnets. The basic network schematic is below. The server has an IP in VLAN1 (server VLAN) and an additional IP in VLAN2 (VMware management VLAN). We added both adapters because we wanted to avoid routing traffic through our gateway, which has only a 1G connection. The NIC on VLAN1 has a default gateway set; the NIC on VLAN2 has no default gateway set.
Last week we added a new core switch to the system and moved the connections to the ESX and the backup system to this new switch (10G fiber and copper). Since then we have seen odd behavior: all traffic from the backup server to VLAN2 (not only backup traffic, but also simple network file copies) is split over both networks. In Task Manager I see outgoing traffic on both NICs, about 300 Mbit/s each. Wireshark on BackupProxy shows packets from both NICs of the backup server on port 445 (file copy test). Switch monitoring shows traffic on both interfaces, and the traffic on the VLAN1 interface also passes through the firewall.
The routing table looks good. When I run tracert from the backup server, everything looks as it should: no traffic is routed. But part of the backup data (in both directions) and of the file copies is still routed over the VLAN1 interface. I also tried setting manual persistent routes with a low metric, which did not help.
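To make clear what "the routing table looks good" means here, this is a minimal sketch of how a routing table lookup picks an interface: the longest matching prefix wins, and the metric only breaks ties. The subnets, gateway address, interface names, and the `select_route` helper are all hypothetical, since the real addressing is not given in this post.

```python
import ipaddress

# Hypothetical addressing; the real subnets are not given in the post.
# Each entry: (destination prefix, gateway or None for on-link, interface, metric)
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1", "NIC-VLAN1", 10),   # default route via VLAN1 gateway
    ("192.168.1.0/24", None, "NIC-VLAN1", 10),       # on-link route for VLAN1
    ("192.168.2.0/24", None, "NIC-VLAN2", 10),       # on-link route for VLAN2
]


def select_route(destination, routes=ROUTES):
    """Return the route chosen for a destination IP, or None if nothing matches.

    Longest prefix wins; among equal prefixes the lowest metric wins.
    """
    dest = ipaddress.ip_address(destination)
    candidates = []
    for prefix, gateway, interface, metric in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            candidates.append((network.prefixlen, metric, gateway, interface))
    if not candidates:
        return None
    # Sort key: larger prefix length first, then smaller metric.
    prefixlen, metric, gateway, interface = max(
        candidates, key=lambda c: (c[0], -c[1])
    )
    return {
        "interface": interface,
        "gateway": gateway,
        "prefixlen": prefixlen,
        "metric": metric,
    }


if __name__ == "__main__":
    # A host in the (hypothetical) VLAN2 subnet: on-link via NIC-VLAN2, no gateway.
    print(select_route("192.168.2.50"))
    # Anything else falls back to the default route on NIC-VLAN1.
    print(select_route("8.8.8.8"))
```

Under these assumptions, a VLAN2 destination always resolves to the VLAN2 NIC with no gateway, which matches what tracert shows. The sketch only models the per-destination route lookup; it says nothing about which local address an application binds to or how many connections it opens.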
The connection between the two servers in VLAN2 works: if I remove the gateway on BackupProxy, all traffic goes through VLAN2 and is not routed. But then that server loses its other connectivity (Windows Update, monitoring…), so this is not a solution.
Overview of what I tried so far:
- Added a default gateway for the VLAN2 NIC on the backup server – did not help.
- Removed the default gateway for the NIC on BackupProxy – worked, but is not a solution.
- Manually added persistent routes for VLAN2 traffic – did not help.
- Rebooted all servers – did not help.
- Double- and triple-checked the VLAN configuration on the switches – everything looks good.
- Disabled IPv6, LMHOSTS lookup and NetBIOS over TCP/IP on all servers (suggested in some possible solutions) – did not help.
EDIT: I did some more testing, since we also installed a new 10G card in the backup server for the VLAN1 NIC (the exact same model as the 10G card already used for the VLAN2 NIC). Previously all traffic on VLAN1 was 1G; now it is 10G. The problem seems to be tied to this NIC or to 10G: every time I use the new 10G card on a 10G switch port, the problem starts. With the 10G NIC on a 1G switch port, a 1G NIC on a 10G switch port, or a 1G NIC on a 1G switch port, there are no problems.
EDIT 2: I did some more research and found this forum entry; the problem sounds similar to mine. After a lot of work with Windows support, a MS TCP/IP engineer confirmed that the behavior “is by design, to conform with the RFC’s”. Microsoft suggests creating a firewall rule that blocks outgoing traffic to the other subnet through the wrong interface. I tried this, and it works when workloads are started from this server. But when I copy something from another server to this server, the traffic still gets routed, even though it shouldn’t. When I create an incoming rule to block the wrongly routed traffic, everything stops working entirely. Any idea how to get the routing to work the way I want?