Tuesday, May 21, 2013

Detective story: sendto() and 10014 WSAEFAULT error


From time to time I read The Old New Thing blog (authored by Raymond Chen, one of Microsoft's old-timers). In his brilliant notes, he often describes pathetic customers who observe unexpected Windows API behavior and claim it is Microsoft's bug. What naive people, I would exclaim. And every time, a few paragraphs later, Raymond Chen explains that the root of the issue is the customer's own failure: neglected documentation details, wrong assumptions, etc. Ironically, I recently got some pretty strange behavior from the WinSock API. So what should my first reaction be? Indeed, I started with myself...

Problems, we're at Houston!

It was one of those usual days. Familiar tools, a known environment, and nothing to hint at trouble. I had just completed the migration to IPv6 addresses in an existing codebase and was eager to verify that I had done everything the right way (since all my previous IPv6 experience was purely theoretical).

  • Windows 7:
    • IPv6... success
    • IPv4... success
  • Windows XP:
    • IPv6... success
    • IPv4... BOO! BANG! WOW!
Amazed, I jumped to the logs. Everything was in place except a strange "-1" as the result of the sendto() operation. In turn, errno reported zero. A successful failure? Instantly I recalled "planned unplanned maintenance", another funny story from Raymond, but it did not help in my case.
I quickly tweaked the sources to use WSAGetLastError() instead of the errno macro and got "error 10014", aka WSAEFAULT. O'kay folks, something is wrong with the parameters. But what?
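For the record, here is a minimal sketch of that kind of tweak. It is not my original code: the names last_socket_error() and send_datagram() are made up for illustration, and the #ifdef branches are an assumption added so the same snippet builds outside Windows too. The point is only that WinSock keeps its error code apart from errno, so a failed sendto() has to be followed by WSAGetLastError().

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#else
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
typedef int sock_t;
#endif

/* Hypothetical helper: WinSock does not report through errno,
 * so the last socket error has to be fetched per platform. */
static int last_socket_error(void)
{
#ifdef _WIN32
    return WSAGetLastError();
#else
    return errno;
#endif
}

/* Hypothetical wrapper: sends one datagram and returns 0 on success,
 * or the platform error code (e.g. 10014 for WSAEFAULT on Windows). */
static int send_datagram(sock_t s, const char *msg, int len,
                         const struct sockaddr *to, int tolen)
{
    assert(msg != NULL && to != NULL);

    if (sendto(s, msg, len, 0, to, tolen) < 0) {
        int err = last_socket_error();
        fprintf(stderr, "sendto() failed, error %d\n", err);
        return err;
    }
    return 0;
}
```

On Windows this assumes WSAStartup() has already been called and the program is linked against ws2_32.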

Blame me once, shame on you

The code was as trivial as "Hello World" with minor tweaks, and apparently it had a bug inside. I am always eager to understand what is missing in the code. Especially in trivial code. Especially in code produced by my own hands. Needless to say, it was a challenge accepted!

My first guess: memory layout? Hey, look: the main code with sendto() is in the main module (EXE), and the function returning the destination IP address is in another module (DLL). It seemed I had mixed up different memory managers, or heaps, or something. Quickly I created a local copy of the destination address structure, passed it to sendto()... Success! Look ma, it was easy! Let me show you the difference in memory blocks:

get_multicast_upnp_addr() returned 0x676084E0 as 239.255.255.250
"a" copy is 0x003E2860
get_sockaddr_len() returned 16
get_upnp_discovery_msg() returned 0x6760850C as M-SEARCH * HTTP/1.1
...
You see? get_multicast_upnp_addr() is DLL-located and returns quite a big address, 0x676084E0. Just compare it with the local copy in variable "a", which is as small as 0x003E2860. That huge gap means different memory pools, and this is why WinSock complains with WSAEFAULT. Agreed? Agreed. So this totally expla... Sorry, what did you say? Why doesn't the message body, situated at an equally big address, cause sendto() to fail, you ask? 0x6760850C... Err... Umm... I don't know.

Blame me twice, shame on me

Next day, I moved all the code into one single module. Compiled, launched on WinXP... WSAEFAULT!
I was screwed and smashed. Firewall? Antivirus software? No, no, no. I checked this on a real machine, then on virtual images, then asked other people to confirm that my tiny test program fails on XP. Everybody got the WSAEFAULT result. I started to hear cracking sounds... It was the world crashing around me because I couldn't properly say "Hello World" to it.
A few days passed. Maybe a few weeks. It's really hard to track time while the world is crashing... Then I got a miraculous hope: this is not my fault, this is the compiler! I recalled hearing that compilers may cause side effects through overly aggressive optimization. It rarely happens, but maybe this was my case? I never trusted that MinGW GCC beast, I should have been more careful choosing the toolchain, how dumb I was... Quickly I reconfigured the project flags to "no optimization", launched... same error. Is MinGW innocent? O'kay, let's step into the disassembled WinSock code. The stack trace under the hood:

Thread [1] 0 (Suspended : Signal : SIGSEGV:Segmentation fault)
WSHTCPIP!WSHGetSockaddrType() at 0x71a912f4
0x71a52f9f
WSAConnect() at 0x71ab2fd7
main() at tests_main.c:77 0x401584

Oh my... My first question was "why does it call WSAConnect for a connectionless UDP socket?", but the guys quickly reminded me about the implicit binding that happens when calling sendto() in such conditions. Yep, no problem here, a legit call. But it fails! Lovely Visual Studio, you're my only hope, help me please:
First-chance exception at 0x71a912f4 in SendtoBugXP.exe: 0xC0000005: Access violation writing location 0x00415744.

Please, please, fruity please... The "Autos" debug window told me that location 0x00415744 is the address of the sockaddr_in::sin_zero field of my destination address.

Collapse

I don't remember much about how I figured this out. I'll just put it here:

const struct sockaddr_in addr_global = {
    AF_INET,
    htons(1900),
    {
            htonl(INADDR_UPNP_V4)
    },
    {0},
};
I love perfectly clean code. And here is the price. You see that "const" qualifier at the very beginning, right? This is the cause of all that two-week madness. One single word crashing the world. Funny?
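One way to keep the "const" and still sidestep the crash, in the spirit of the local copy that accidentally worked earlier, is to hand sendto() a writable duplicate. This is only a sketch under assumptions: writable_copy() and sendto_via_copy() are hypothetical names, not part of my original code or of WinSock, and the #ifdef branches are there just so the snippet also builds outside Windows.

```c
#include <assert.h>
#include <string.h>

#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#else
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
typedef int sock_t;
#endif

/* Hypothetical helper: duplicates a possibly read-only address
 * into ordinary writable storage owned by the caller. */
static struct sockaddr_in writable_copy(const struct sockaddr_in *src)
{
    struct sockaddr_in copy;

    assert(src != NULL);
    memcpy(&copy, src, sizeof copy);
    return copy;
}

/* Hypothetical wrapper: sendto() only ever sees the stack copy, so
 * anything that scribbles on sin_zero hits writable memory. */
static int sendto_via_copy(sock_t s, const char *msg, int len,
                           const struct sockaddr_in *dest)
{
    struct sockaddr_in copy = writable_copy(dest);

    return (int)sendto(s, msg, len, 0,
                       (const struct sockaddr *)&copy, (int)sizeof copy);
}
```

Simply dropping the "const" from the global would do as well; the copy just lets the declaration stay as clean as I wanted it.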

Post-mortem

So we've got the root of the problem. Let's inspect WinSock to understand why WinXP causes that strange side effect.

  1. sendto() takes a constant destination address: "const struct sockaddr *to". So far so good. According to the stack trace above, the next call is WSAConnect.

  2. WSAConnect() takes a constant destination address: "const struct sockaddr *name". Still nothing suspicious.

  3. An unknown internal call, just a hex address... Double your attention!

  4. WSHGetSockaddrType() expects a non-constant address: "PSOCKADDR Sockaddr". Shame on you, anonymous dirty type-caster!

As you understand, there is no safe way in C/C++ to turn a constant object into a non-constant one: a cast can strip the qualifier from the pointer, but writing through the result to an object defined as const is undefined behavior. Instead of introducing two versions of the WSHGetSockaddrType function (const and non-const), the WinSock authors decided to use brute force and unconditionally cast from one type to the other. The other bad thing is that WSHGetSockaddrType() calls RtlZeroMemory() to clear the sockaddr_in::sin_zero field. So what happens if one tries to write zeros to globally allocated const memory? An access violation, indeed. I found a bug in Windows...
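The pattern can be imitated in a few lines. Everything below is a stand-in invented for illustration: demo_sockaddr_in, demo_get_sockaddr_type() and demo_sendto() only mimic the shape of the call chain and are not the real WinSock internals. The demo deliberately runs on a writable object, where the cast is harmless and merely clears the padding; pointing the same call at an object defined as const is undefined behavior, and that is the case that blew up on XP.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for sockaddr_in (an assumption, not the real header). */
struct demo_sockaddr_in {
    unsigned short sin_family;
    unsigned short sin_port;
    unsigned int   sin_addr;
    unsigned char  sin_zero[8];
};

/* Hypothetical callee in the style of WSHGetSockaddrType():
 * takes a non-const pointer and zeroes the padding field. */
static void demo_get_sockaddr_type(struct demo_sockaddr_in *addr)
{
    memset(addr->sin_zero, 0, sizeof addr->sin_zero);
}

/* Hypothetical caller in the style of sendto(): promises const
 * in its signature, then casts the promise away. */
static void demo_sendto(const struct demo_sockaddr_in *to)
{
    assert(to != NULL);
    demo_get_sockaddr_type((struct demo_sockaddr_in *)to);
}

/* Returns 1 if the "read-only" call cleared the caller's padding. */
static int demo_padding_is_scrubbed(void)
{
    struct demo_sockaddr_in a;
    size_t i;

    memset(&a, 0, sizeof a);
    memset(a.sin_zero, 0xAA, sizeof a.sin_zero); /* non-zero padding */

    demo_sendto(&a); /* writable object, so the write is allowed */

    for (i = 0; i < sizeof a.sin_zero; i++) {
        if (a.sin_zero[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

With two variants of the callee, one taking a const pointer and never writing, the cast would not be needed at all.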

P.S. See the StackOverflow topic as a part of the community knowledge base on this matter. As you can see there, local const variables are not really protected (implementation-specific? I have to check the C language standard), so only global const variables trigger the bug. And I'm glad to read on MSDN that the WSHGetSockaddrType() function is obsolete for Windows Server 2003, Windows Vista, and later, and is no longer supported. Too late for WinXP, but thanks anyway.

P.P.S. You can try filling the sin_zero field with non-zero values just to observe how sendto() clears it back as a side effect (on WinXP).
