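There is a restriction to keep in mind when using SubclassWindow() of the MFC CWnd class.
Calling CWnd::SubclassWindow() with an hWnd that is already mapped to a CWnd object causes a conflict. This can happen when a general-purpose class that adds functionality through subclassing is integrated at the source level.
Let's look at an example of exactly when this occurs.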
class CExWnd : public CWnd
{
    ...
};
class CSomeTool : public CObject
{
    CExWnd m_exWnd;
    void ExtendWnd(CWnd* pWnd);
};
void CSomeTool::ExtendWnd(CWnd* pWnd)
{
    m_exWnd.SubclassWindow(pWnd->GetSafeHwnd());
}
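CSomeTool::ExtendWnd(CWnd* pWnd) tries to subclass the window handle of the pWnd it receives with a CExWnd. If pWnd is a permanent object, the conflict occurs inside CWnd::SubclassWindow(). Whether a permanent object exists for a handle can be checked with CWnd::FromHandlePermanent(HWND hWnd).
The cause of the conflict lies in the mechanism MFC uses to map CWnd objects to window handles.
Internally, MFC associates window handles with CWnd objects through a hash table called the window map. By the nature of this hash table, only one CWnd object can be mapped to a given window handle.
If a window handle that is already mapped to a CWnd object is subclassed with another CWnd object, here a CExWnd, two CWnd objects end up claiming the same window handle, and that is the conflict in MFC's internal window map.
If this were the only problem, the window map conflict could be avoided by first calling Detach() on the permanent CWnd object obtained from CWnd::FromHandlePermanent(), and only then calling SubclassWindow().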
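The one-object-per-handle rule and the Detach() workaround can be pictured with a toy model. This is only a sketch and not MFC code: ToyHandle, ToyWnd and ToyWindowMap are hypothetical stand-ins written in portable C++, and the real window map is more involved than a single hash table lookup.

```cpp
#include <cassert>
#include <unordered_map>

// Toy stand-ins (NOT MFC types): a handle is just an integer here.
using ToyHandle = int;

struct ToyWnd {
    ToyHandle handle = 0;  // 0 means "not attached to any handle"
};

// Models a permanent window map: at most one object per handle.
class ToyWindowMap {
public:
    // Returns the object mapped to h, or nullptr if there is none.
    ToyWnd* FromHandlePermanent(ToyHandle h) const {
        auto it = map_.find(h);
        return it == map_.end() ? nullptr : it->second;
    }

    // Attaches wnd to h; refuses when h already has an object.
    bool Attach(ToyWnd& wnd, ToyHandle h) {
        if (FromHandlePermanent(h) != nullptr) return false;
        map_[h] = &wnd;
        wnd.handle = h;
        return true;
    }

    // Removes the mapping held by wnd and returns the freed handle.
    ToyHandle Detach(ToyWnd& wnd) {
        assert(wnd.handle != 0);
        ToyHandle h = wnd.handle;
        map_.erase(h);
        wnd.handle = 0;
        return h;
    }

private:
    std::unordered_map<ToyHandle, ToyWnd*> map_;
};
```

In this model a second Attach() on the same handle is rejected until the first object has been detached, which mirrors the map conflict described above.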
However, there is a more serious problem.
MFC uses a single window procedure, AfxWndProc(), for every CWnd class. In other words, the permanent window object to be subclassed uses AfxWndProc(), and so does the subclassing CExWnd. CWnd::SubclassWindow() saves the window procedure of the window passed to it as the previous window procedure (oldWndProc) and installs its own window procedure (AfxWndProc). AfxWndProc() in turn uses CallWindowProc() to call the previous window procedure (oldWndProc) that was in place before subclassing. But the oldWndProc held by the subclassing CExWnd is AfxWndProc itself, so AfxWndProc ends up calling itself without end.
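The loop is easier to see in a stripped-down model. The sketch below is not MFC code: ToySharedProc stands in for the shared AfxWndProc, ToyDefaultProc for a window's original non-MFC procedure, and kMaxDepth is an artificial guard I added so the example terminates instead of overflowing the stack. All names are hypothetical.

```cpp
#include <cassert>

// Toy model (NOT MFC): every wrapper object shares one dispatcher function,
// in the same way every CWnd shares AfxWndProc.
using ToyProc = int (*)(int msg, int depth);

const int kMaxDepth = 1000;      // artificial guard so the sketch terminates

ToyProc g_windowProc = nullptr;  // procedure currently installed on the window
ToyProc g_oldProc = nullptr;     // procedure saved by the subclassing object

// Stand-in for the window's original, non-MFC procedure.
// Returns the depth at which the message finally arrived.
int ToyDefaultProc(int /*msg*/, int depth) {
    return depth;
}

// Stand-in for the shared dispatcher: forwards to the saved procedure.
// Returns -1 when the guard trips, i.e. when the chain never ends.
int ToySharedProc(int msg, int depth) {
    if (depth >= kMaxDepth) return -1;
    assert(g_oldProc != nullptr);
    return g_oldProc(msg, depth + 1);
}

// Stand-in for subclassing: save the current procedure, install the shared one.
void ToySubclass() {
    g_oldProc = g_windowProc;
    g_windowProc = ToySharedProc;
}
```

If g_windowProc starts out as ToyDefaultProc, a message reaches it after one hop. If it starts out as ToySharedProc, which is the case of a window already owned by another wrapper, the saved procedure is the dispatcher itself and only the guard stops the recursion.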
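This is an unavoidable restriction of MFC's internal design.
There are two possible solutions.
The first is to build the general-purpose class as a separate DLL instead of integrating it at the source level.
Built as a DLL, CExWnd uses the MFC window map and AfxWndProc managed in the DLL's context. Even if the window handle passed in belongs to a permanent CWnd object, no conflict occurs as long as that CWnd object lives in the EXE's context.
The second is to write your own callback of type WNDPROC (fnExWndProc) and subclass directly with SetWindowLongPtr(hWnd, GWLP_WNDPROC, fnExWndProc).
This is only natural: subclassing that does not go through MFC has no such problem.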
Friday, September 18, 2009
Wednesday, September 16, 2009
128-bit MMX
I’m quite sure that Intel would not like to see SSE2 named 128-bit MMX. In fact, MMX has a bad reputation: the Intel marketing hype pushed it as a universal solution to multimedia requirements, but at the same time the gaming industry switched from mostly 2D games to Virtual Reality-like 3D games that were not accelerated by MMX. Bad press coverage spread the news that MMX was meaningless as it did not improve the Quake frame rate. That would be correct if the only applications worth running were 3D games, but the overly simplified vision of the world shared by most hardware sites missed several points: in fact, MMX instructions are constantly used to perform a wide array of tasks. Photoshop users surely remember the performance boost given by MMX, but it should be made clear that each time you play an MP3, view a JPEG image in your browser or play an MPEG video, a lot of MMX instructions are executed. Today all multimedia applications are built on MMX instructions, and they are the key to running computing-intensive tasks such as speech recognition on commonplace PCs.
Writing MMX code is still very hard, as you have to go back to assembler, but the performance benefits are rewarding. The support offered by current compilers is bare-bones. There are a few attempts to write C++ compilers that can automatically turn normal C code into vector MMX code, but they deal only with limited-complexity loop vectorization and place too many constraints on the parallelizable code; in general, they appear notably less mature than vectorizing compilers available in the supercomputing domain.
So we cannot expect to have SSE2-enabled compilers anytime soon. This will not stop large companies that sell shrinkwrap software from exploiting SSE2 instructions, as they can afford the required development time, but small-scale software firms are not likely to use SSE2 until better development tools appear. In my opinion, the Pentium 4 scenario closely resembles the Pentium MMX one, where lack of software support made the additional investment for the Pentium MMX over the plain old Pentium quite useless.
We have just analyzed the dark side of SSE2, i.e. difficult programming; now we can go on and delve into the technical details.
SSE2 extends MMX by using 128-bit registers instead of 64-bit ones, effectively doubling the level of parallelism. We may be tempted to replace MMX register names with SSE2 ones (e.g. turning MM0 into XMM0), recompile the code and see it run at twice the speed. Unfortunately, that would not work; in fact, it would not even compile. These are the steps required to migrate MMX code to SSE2:
1) replace MMX register names with SSE2 ones, e.g. MM0 becomes XMM0;
2) replace MOVQ instructions with MOVAPD (if the memory address is 16-byte aligned) or MOVUPD (if the memory address is not aligned);
3) replace PSHUFW, which is an SSE extension to MMX, with a combination of the following instructions: PSHUFHW, PSHUFLW, PSHUFD;
4) replace PSLLQ and PSRLQ with PSLLDQ and PSRLDQ respectively;
5) update loop counters and numeric memory offsets, since we work on 128 bits at once instead of 64.
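As a rough illustration of steps 1, 2 and 5, here is what a migrated byte-add loop might look like. Note the substitution: the sketch uses compiler intrinsics from <emmintrin.h> rather than hand-written assembler, and it falls back to plain scalar code when SSE2 is not available at compile time. AddBytes and SKETCH_HAVE_SSE2 are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Adds src into dst byte by byte with wrap-around (the PADDB operation).
// Hypothetical helper for illustration; n may be any length.
void AddBytes(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    // The loop advances 16 bytes per iteration, where MMX code advanced 8.
    for (; i + 16 <= n; i += 16) {
        // Unaligned loads and stores keep the sketch safe for any pointer.
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_add_epi8(a, b));
    }
#endif
    // Scalar tail (and the whole job when SSE2 is unavailable).
    for (; i < n; ++i) {
        dst[i] = static_cast<std::uint8_t>(dst[i] + src[i]);
    }
}
```

The unaligned load and store intrinsics are the conservative choice here; with buffers known to be 16-byte aligned, the aligned variants (_mm_load_si128 and _mm_store_si128) could be used instead.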
Looks easy, doesn’t it? Actually, it is not that simple. Replacing 64-bit shifts with 128-bit ones is trivial, but SSE2 expects memory references to be 16-byte aligned: while the MOVUPD instruction lets you load unaligned memory blocks at the expense of poor performance (so it should not be used unless strictly necessary), every instruction that uses a memory source operand, e.g. a PADDB MM0,[EAX], is a troublesome spot. Using unaligned memory references raises a General Protection fault, and avoiding the GPF requires quite a lot of work. First of all, the memory allocators used in current compilers do not align data blocks on 16-byte boundaries, so you will have to build a wrapper function around malloc() that allocates a slightly larger block than required and correctly aligns the resulting pointer (note: the Processor Pack for Visual C++ features an _aligned_malloc() function that supports user-definable alignment of allocated blocks). Then you will have to find all the lines in your source code where the data blocks that are processed with SSE2 instructions get allocated, and replace the standard allocation call with a call to your wrapper function: this is fairly easy if you have access to all the source code of your app, but impossible when third-party libraries allocate misaligned memory blocks; in this case, contact the software vendor and ask for an update.
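A wrapper of the kind described here could look roughly like the following. It is a minimal sketch under my own assumptions: SketchAlignedAlloc and SketchAlignedFree are hypothetical names, the alignment must be a power of two no smaller than a pointer, and size overflow is not checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocates size bytes whose address is a multiple of alignment.
// Over-allocates, rounds the pointer up, and remembers the original pointer
// just below the aligned block so it can be freed later.
void* SketchAlignedAlloc(std::size_t size, std::size_t alignment = 16) {
    assert(alignment >= alignof(void*));
    assert((alignment & (alignment - 1)) == 0);  // power of two

    // Worst-case padding plus room for the stored original pointer.
    void* raw = std::malloc(size + alignment - 1 + sizeof(void*));
    if (raw == nullptr) return nullptr;

    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;

    // Stash the pointer malloc() returned right before the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Frees a block obtained from SketchAlignedAlloc(); nullptr is allowed.
void SketchAlignedFree(void* p) {
    if (p != nullptr) std::free(reinterpret_cast<void**>(p)[-1]);
}
```

Blocks from this wrapper must be released with the matching free function, never with free() directly, because the pointer handed to the caller is not the one malloc() returned.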
If your MMX routine spills some variables onto the stack, we are in for more trouble, as we have to force the alignment of the stack, and it requires the modification of the entry and exit code of the routine.
The easiest way to fix a PSHUFW instruction is to split it in two, a PSHUFHW and a PSHUFLW, operating on the high and low 64-bit halves of the 128-bit register respectively.
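To check that the split really behaves like one PSHUFW per 64-bit half, the three instructions can be modelled in scalar code. This is a sketch of the instruction semantics as I understand them from Intel's documentation, not real SIMD code; the Model* function names and the Words4/Words8 types are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Words4 = std::array<std::uint16_t, 4>;  // one 64-bit MMX register
using Words8 = std::array<std::uint16_t, 8>;  // one 128-bit XMM register

// Scalar model of PSHUFW: result word i is source word ((imm8 >> 2*i) & 3).
Words4 ModelPshufw(const Words4& src, unsigned imm8) {
    assert(imm8 < 256);
    Words4 out{};
    for (unsigned i = 0; i < 4; ++i) out[i] = src[(imm8 >> (2 * i)) & 3];
    return out;
}

// Scalar model of PSHUFLW: shuffles words 0..3, copies words 4..7 unchanged.
Words8 ModelPshuflw(const Words8& src, unsigned imm8) {
    assert(imm8 < 256);
    Words8 out = src;
    for (unsigned i = 0; i < 4; ++i) out[i] = src[(imm8 >> (2 * i)) & 3];
    return out;
}

// Scalar model of PSHUFHW: shuffles words 4..7, copies words 0..3 unchanged.
Words8 ModelPshufhw(const Words8& src, unsigned imm8) {
    assert(imm8 < 256);
    Words8 out = src;
    for (unsigned i = 0; i < 4; ++i) out[4 + i] = src[4 + ((imm8 >> (2 * i)) & 3)];
    return out;
}

// The replacement pair: the same imm8 applied to the low half, then the high half.
Words8 ModelSplitShuffle(const Words8& src, unsigned imm8) {
    return ModelPshufhw(ModelPshuflw(src, imm8), imm8);
}
```

With the immediate 0x1B, which reverses four words, the pair turns the words 0..7 into 3,2,1,0,7,6,5,4: each half is reversed on its own, exactly as two independent PSHUFW operations would do. A shuffle that has to move words across the two halves needs PSHUFD or additional instructions.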
Here is the list of SSE2 instructions that extend MMX (adapted from Intel’s documentation):
Monday, September 14, 2009
load image from bitmap file
The following code uses the LoadImage API to load a bitmap as a DIBSection and creates a palette from the DIBSection's color table. If there is no color table, a halftone palette is used:
BOOL LoadBitmapFromBMPFile( LPTSTR szFileName, HBITMAP *phBitmap,
                            HPALETTE *phPalette )
{
    BITMAP bm;
    *phBitmap = NULL;
    *phPalette = NULL;
    // Use LoadImage() to get the image loaded into a DIBSection
    *phBitmap = (HBITMAP)LoadImage( NULL, szFileName, IMAGE_BITMAP, 0, 0,
                LR_CREATEDIBSECTION | LR_DEFAULTSIZE | LR_LOADFROMFILE );
    if( *phBitmap == NULL )
        return FALSE;
    // Get the color depth of the DIBSection
    GetObject(*phBitmap, sizeof(BITMAP), &bm );
    // If the DIBSection is 256 color or less, it has a color table
    if( ( bm.bmBitsPixel * bm.bmPlanes ) <= 8 )
    {
        HDC hMemDC;
        HBITMAP hOldBitmap;
        RGBQUAD rgb[256];
        LPLOGPALETTE pLogPal;
        WORD i;
        // Create a memory DC and select the DIBSection into it
        hMemDC = CreateCompatibleDC( NULL );
        hOldBitmap = (HBITMAP)SelectObject( hMemDC, *phBitmap );
        // Get the DIBSection's color table
        GetDIBColorTable( hMemDC, 0, 256, rgb );
        // Create a palette from the color table
        pLogPal = (LOGPALETTE *)malloc( sizeof(LOGPALETTE) + (256*sizeof(PALETTEENTRY)) );
        pLogPal->palVersion = 0x300;
        pLogPal->palNumEntries = 256;
        for(i=0;i<256;i++)
        {
            pLogPal->palPalEntry[i].peRed = rgb[i].rgbRed;
            pLogPal->palPalEntry[i].peGreen = rgb[i].rgbGreen;
            pLogPal->palPalEntry[i].peBlue = rgb[i].rgbBlue;
            pLogPal->palPalEntry[i].peFlags = 0;
        }
        *phPalette = CreatePalette( pLogPal );
        // Clean up
        free( pLogPal );
        SelectObject( hMemDC, hOldBitmap );
        DeleteDC( hMemDC );
    }
    else // It has no color table, so use a halftone palette
    {
        HDC hRefDC;
        hRefDC = GetDC( NULL );
        *phPalette = CreateHalftonePalette( hRefDC );
        ReleaseDC( NULL, hRefDC );
    }
    return TRUE;
}
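One detail of the function above is worth isolating: the palette is filled field by field rather than with a block copy, because RGBQUAD stores its channels in blue, green, red order while PALETTEENTRY stores them in red, green, blue order. The sketch below models just that conversion and the color-depth test in portable C++. ToyRgbQuad and ToyPaletteEntry are hypothetical stand-ins that mirror the field order of the Win32 structs, so that the example does not need <windows.h>.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the field order of the Win32 structs (assumption:
// the real RGBQUAD and PALETTEENTRY are deliberately not used here).
struct ToyRgbQuad      { std::uint8_t blue, green, red, reserved; };
struct ToyPaletteEntry { std::uint8_t red, green, blue, flags; };

// True when a bitmap of this depth carries a color table (8 bits or less).
bool HasColorTable(unsigned bitsPerPixel, unsigned planes) {
    return bitsPerPixel * planes <= 8;
}

// Converts color-table entries to palette entries, swapping the channel order.
std::vector<ToyPaletteEntry> ToPaletteEntries(const std::vector<ToyRgbQuad>& table) {
    assert(table.size() <= 256);  // a color table has at most 256 entries
    std::vector<ToyPaletteEntry> out;
    out.reserve(table.size());
    for (const ToyRgbQuad& q : table) {
        out.push_back(ToyPaletteEntry{q.red, q.green, q.blue, 0});
    }
    return out;
}
```

A raw memcpy between the two layouts would silently swap red and blue, which is why the per-entry loop is the safer pattern.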
The following code shows how to use the LoadBitmapFromBMPFile function:
case WM_PAINT:
{
    PAINTSTRUCT ps;
    HBITMAP hBitmap, hOldBitmap;
    HPALETTE hPalette, hOldPalette;
    HDC hDC, hMemDC;
    BITMAP bm;
    hDC = BeginPaint( hWnd, &ps );
    if( LoadBitmapFromBMPFile( szFileName, &hBitmap, &hPalette ) )
    {
        GetObject( hBitmap, sizeof(BITMAP), &bm );
        hMemDC = CreateCompatibleDC( hDC );
        hOldBitmap = (HBITMAP)SelectObject( hMemDC, hBitmap );
        hOldPalette = SelectPalette( hDC, hPalette, FALSE );
        RealizePalette( hDC );
        BitBlt( hDC, 0, 0, bm.bmWidth, bm.bmHeight,
                hMemDC, 0, 0, SRCCOPY );
        SelectObject( hMemDC, hOldBitmap );
        DeleteDC( hMemDC );  // release the memory DC created above
        DeleteObject( hBitmap );
        SelectPalette( hDC, hOldPalette, FALSE );
        DeleteObject( hPalette );
    }
    EndPaint( hWnd, &ps );
}
break;