@SicongLiu2000
Contributor

Add nvarchar (SQL_C_WCHAR) Support to .NET Core C# Language Extension

Summary

This PR adds full support for nvarchar(n) and nchar(n) SQL data types (SQL_C_WCHAR) to the .NET Core C# language extension. Previously, the extension only supported ANSI character types (varchar/char). With this change, Unicode string data can now be passed to and returned from C# external scripts via sp_execute_external_script.

Why This Change Is Needed

The C# language extension previously lacked support for Unicode character types, which limited its ability to handle multilingual data. Many SQL Server applications use nvarchar columns to store Unicode text (e.g., Chinese, Arabic, Cyrillic characters). This change enables full Unicode support for:

  • Input data columns (@input_data_1)
  • Output data columns (result sets)
  • Input/output parameters (@params)

What Changed

Core Data Type Support

| File | Changes |
| --- | --- |
| src/managed/utils/Sql.cs | Added DotNetWChar to DataTypeSize dictionary with MinUtf16CharSize (2 bytes) |
| src/managed/utils/InteropUtils.cs | Added UTF16PtrToStr() overloads for converting unmanaged UTF-16 strings to managed strings |
| src/managed/utils/DataSetUtils.cs | Added UTF8ByteSplitToArray() method for proper multi-byte UTF-8 character handling |

Input Data Handling

| File | Changes |
| --- | --- |
| src/managed/CSharpInputDataSet.cs | Added case for SqlDataType.DotNetWChar that reads UTF-16 encoded data and correctly handles byte-to-character length conversion |
| src/managed/CSharpDataSet.cs | Exposed Columns property to allow metadata propagation to output dataset |
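
As a rough illustration of the byte-to-character conversion mentioned above, here is a minimal sketch. It is not the code from this PR: `ReadUtf16Column` and `Demo` are hypothetical names, and the sketch assumes rows are packed back to back in the buffer, that each length is a byte count, and that NULL rows are marked with SQL_NULL_DATA (-1) and occupy no bytes.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf16InputSketch
{
    private const int SqlNullData = -1; // ODBC SQL_NULL_DATA marker

    // Hypothetical helper: decodes rows packed back to back in a UTF-16 buffer.
    // byteLengths[i] is the row length in BYTES, or SqlNullData for a NULL row.
    public static string[] ReadUtf16Column(IntPtr buffer, int[] byteLengths)
    {
        var rows = new string[byteLengths.Length];
        int byteOffset = 0;
        for (int i = 0; i < byteLengths.Length; i++)
        {
            if (byteLengths[i] == SqlNullData)
            {
                continue; // leave the entry null; assumed to occupy no bytes
            }

            // PtrToStringUni expects a character count, so divide the byte length by 2.
            int charCount = byteLengths[i] / sizeof(char);
            rows[i] = Marshal.PtrToStringUni(IntPtr.Add(buffer, byteOffset), charCount);
            byteOffset += byteLengths[i];
        }
        return rows;
    }

    // Builds a small unmanaged buffer and reads it back.
    public static string[] Demo()
    {
        byte[] first = Encoding.Unicode.GetBytes("中文");    // 2 chars, 4 bytes
        byte[] second = Encoding.Unicode.GetBytes("Привет"); // 6 chars, 12 bytes
        IntPtr buffer = Marshal.AllocHGlobal(first.Length + second.Length);
        try
        {
            Marshal.Copy(first, 0, buffer, first.Length);
            Marshal.Copy(second, 0, IntPtr.Add(buffer, first.Length), second.Length);
            return ReadUtf16Column(buffer, new[] { first.Length, SqlNullData, second.Length });
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}
```

Passing the byte length straight through as a character count would read twice as much memory as the row owns, which is the mistake the conversion guards against.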

Output Data Handling

| File | Changes |
| --- | --- |
| src/managed/CSharpOutputDataSet.cs | Modified ExtractColumns() to accept input column metadata for preserving nvarchar types; added DotNetWChar case in ExtractColumn() to emit UTF-16 data; added GetUnicodeStringArray() method for building UTF-16 output buffers; updated GetStrLenNullMap() to report correct byte lengths for UTF-16 strings; fixed pointer pinning to use GCHandle.Alloc() with GCHandleType.Pinned instead of fixed statements |
| src/managed/CSharpSession.cs | Pass input column metadata to ExtractColumns() to preserve data type information |
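
To make the "correct byte lengths for UTF-16 strings" point concrete, here is a minimal sketch of building an output buffer and its length map. It is an illustration under assumptions, not the PR's GetUnicodeStringArray() or GetStrLenNullMap(): `BuildUtf16Column` is a hypothetical name, and NULL rows are assumed to be reported as SQL_NULL_DATA (-1) without contributing any characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf16OutputSketch
{
    private const int SqlNullData = -1; // ODBC SQL_NULL_DATA marker

    // Hypothetical helper: concatenates all rows into one UTF-16 char buffer
    // and reports each row's length in BYTES (chars * 2), or SqlNullData for null.
    public static (char[] Data, int[] ByteLengths) BuildUtf16Column(IReadOnlyList<string> rows)
    {
        var data = new StringBuilder();
        var byteLengths = new int[rows.Count];
        for (int i = 0; i < rows.Count; i++)
        {
            if (rows[i] == null)
            {
                byteLengths[i] = SqlNullData;
                continue;
            }

            data.Append(rows[i]);
            // string.Length counts UTF-16 code units, so a surrogate pair counts as 2.
            byteLengths[i] = rows[i].Length * sizeof(char);
        }
        return (data.ToString().ToCharArray(), byteLengths);
    }
}
```

For example, `BuildUtf16Column(new[] { "中文", null, "😀" })` yields byte lengths 4, -1 and 4: the emoji is a single code point but two UTF-16 code units.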

Parameter Handling

| File | Changes |
| --- | --- |
| src/managed/CSharpParamContainer.cs | Added DotNetWChar case in AddParamValue() for reading Unicode input parameters; added DotNetWChar case in ReplaceParamValue() for writing Unicode output parameters; added ReplaceUnicodeStringParam() method for UTF-16 byte conversion |

Documentation

| File | Changes |
| --- | --- |
| README.md | Updated supported data types list to include SQL_C_WCHAR and nvarchar(n) |

Test Coverage

| File | Changes |
| --- | --- |
| test/include/CSharpExtensionApiTests.h | Added GetWStringOutputParam() test helper declaration |
| test/src/managed/CSharpTestExecutor.cs | Added CSharpTestExecutorWStringParam class for Unicode output parameter testing |
| test/src/native/CSharpInitParamTests.cpp | Expanded InitWStringParamTest with comprehensive nchar/nvarchar test cases including Unicode characters (Chinese, Cyrillic) |
| test/src/native/CSharpGetOutputParamTests.cpp | Added GetWStringOutputParamTest and GetWStringOutputParam() helper for testing Unicode output parameters |

Bug Fixes Included

UTF-8 Multi-byte Character Handling

Fixed a bug in CSharpInputDataSet.cs where multi-byte UTF-8 characters (e.g., Euro symbol € = 3 bytes) were incorrectly split. The previous implementation:

  1. Converted the entire UTF-8 byte buffer to a .NET string
  2. Used Substring() with byte lengths

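This failed because a multi-byte UTF-8 sequence decodes to a single character in a .NET (UTF-16) string, or to a surrogate pair for code points outside the Basic Multilingual Plane. The UTF-8 byte lengths therefore no longer match character positions, and Substring() splits at the wrong offsets.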

Fix: Added UTF8ByteSplitToArray() in DataSetUtils.cs that processes raw UTF-8 bytes directly, splitting by byte offsets first and then decoding each segment independently.
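
The split-by-bytes-then-decode idea can be sketched as follows. This is not the PR's UTF8ByteSplitToArray(); `SplitThenDecode` is a hypothetical name, and the sketch assumes per-row byte lengths with SQL_NULL_DATA (-1) marking NULL rows that occupy no bytes.

```csharp
using System;
using System.Text;

public static class Utf8SplitSketch
{
    private const int SqlNullData = -1; // ODBC SQL_NULL_DATA marker

    // Hypothetical helper: slices the raw UTF-8 buffer by byte offsets first,
    // then decodes each slice on its own, so byte lengths are never applied
    // to an already decoded string.
    public static string[] SplitThenDecode(byte[] buffer, int[] byteLengths)
    {
        var rows = new string[byteLengths.Length];
        int byteOffset = 0;
        for (int i = 0; i < byteLengths.Length; i++)
        {
            if (byteLengths[i] == SqlNullData)
            {
                continue; // leave the entry null
            }

            rows[i] = Encoding.UTF8.GetString(buffer, byteOffset, byteLengths[i]);
            byteOffset += byteLengths[i];
        }
        return rows;
    }

    public static string[] Demo()
    {
        // "€" is 3 bytes in UTF-8, "ab" is 2 bytes: 5 bytes in total.
        byte[] buffer = Encoding.UTF8.GetBytes("€ab");
        return SplitThenDecode(buffer, new[] { 3, 2 });
    }
}
```

In this example the whole buffer decodes to a 3-character string, so applying the byte lengths 3 and 2 with Substring() would return "€ab" for the first row and run past the end of the string for the second. Slicing the bytes first gives "€" and "ab".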

Pointer Pinning Memory Safety

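Fixed potential memory corruption in CSharpOutputDataSet.RetrieveColumns() by replacing fixed statements with GCHandle.Alloc(..., GCHandleType.Pinned) so that arrays remain pinned for the lifetime of the native call. A fixed statement pins its target only for the duration of its block, whereas a pinned GCHandle keeps the array in place until Free() is called.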
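
A minimal sketch of the GCHandle-based pinning pattern is shown below. The `PinnedArray` wrapper and `PinningDemo` are hypothetical and are not part of this PR; they only illustrate how the pin can outlive a single lexical block and be released explicitly.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper: keeps an array pinned until Dispose() is called.
public sealed class PinnedArray : IDisposable
{
    private GCHandle _handle;

    public PinnedArray(Array array)
    {
        // The GC will not move the array while this handle is allocated.
        _handle = GCHandle.Alloc(array, GCHandleType.Pinned);
    }

    // Address of the first element; valid only until Dispose().
    public IntPtr Address => _handle.AddrOfPinnedObject();

    public void Dispose()
    {
        if (_handle.IsAllocated)
        {
            _handle.Free();
        }
    }
}

public static class PinningDemo
{
    public static int ReadSecondElement()
    {
        int[] values = { 1, 2, 3 };
        using (var pinned = new PinnedArray(values))
        {
            // Stand-in for a native call that reads through the raw pointer.
            return Marshal.ReadInt32(pinned.Address, sizeof(int));
        }
    }
}
```

The trade-off is that the handle must be freed explicitly; a leaked pinned handle keeps the array alive and immovable, so a try/finally or using block around the native call is the usual companion to this pattern.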

Testing

Unit Tests (Native C++)

All 54 native unit tests pass, including new nvarchar-specific tests:

[==========] 54 tests from 1 test suite ran.
[  PASSED  ] 54 tests.

Key tests:

  • InitWStringParamTest - Tests nchar/nvarchar input parameters with various sizes and Unicode characters
  • GetWStringOutputParamTest - Tests nvarchar output parameters with truncation and null handling
  • GetStringResultsTest - Validates UTF-8 string handling (includes multi-byte character fix)

E2E Tests (TestShell)

All 6 E2E tests pass:

✓ EmptyInputDataWithPassThroughScript
✓ EmptyPayLoad
✓ InvalidScript
✓ NullOutputData
✓ EmptyStringPayload
✓ NvarcharPassthrough

The NvarcharPassthrough test specifically validates end-to-end nvarchar column handling.

Data Flow Diagram

SQL Server                    C# Extension
-----------                   ------------
nvarchar column    -->    UTF-16 bytes (SQL_C_WCHAR)
                               |
                               v
                          CSharpInputDataSet.AddColumns()
                               |
                               v
                          Interop.UTF16PtrToStr()
                               |
                               v
                          StringDataFrameColumn (managed string)
                               |
                               v
                          User DataFrame processing
                               |
                               v
                          CSharpOutputDataSet.ExtractColumns()
                               |
                               v
                          GetUnicodeStringArray() --> UTF-16 char[]
                               |
                               v
                          UTF-16 bytes    -->    nvarchar result

Breaking Changes

None. This is a backwards-compatible addition. Existing varchar/char columns continue to work as before.

Dependencies

No new dependencies added.

How to Test Manually

-- Create the external language (if not already created)
CREATE EXTERNAL LANGUAGE Dotnet
FROM (
    CONTENT = N'<path-to>\dotnet-core-CSharp-lang-extension.zip',
    FILE_NAME = 'nativecsharpextension.dll'
);

-- Test nvarchar pass-through
EXEC sp_execute_external_script
    @language = N'Dotnet',
    @script = N'YourLibrary.YourExecutorClass',
    @input_data_1 = N'SELECT N''Hello Unicode: 中文'' AS TextValue',
    @output_data_1_name = N'OutputDataSet'
WITH RESULT SETS ((TextValue NVARCHAR(100)));

Checklist

  • Code compiles without warnings
  • All existing unit tests pass
  • All existing E2E tests pass
  • New unit tests added for nvarchar parameter handling
  • Documentation updated (README.md)
  • No breaking changes to existing functionality

Copilot AI left a comment

Pull request overview

This PR adds comprehensive support for Unicode string data types (nvarchar and nchar, SQL_C_WCHAR) to the .NET Core C# language extension for SQL Server. Previously, the extension only supported ANSI character types (varchar/char). The implementation enables full Unicode data exchange through input columns, output columns, and input/output parameters.

Key Changes:

  • Added SQL_C_WCHAR data type support with UTF-16 encoding/decoding throughout the data pipeline
  • Implemented UTF-8 byte-level string splitting to fix multi-byte character handling bugs
  • Updated memory pinning strategy from fixed statements to GCHandle.Alloc with GCHandleType.Pinned for proper lifetime management
  • Added comprehensive test coverage for nvarchar parameters with Unicode characters (Chinese, Cyrillic)

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 12 comments.

Show a summary per file
| File | Description |
| --- | --- |
| Sql.cs | Added DotNetWChar enum value and MinUtf16CharSize constant to DataTypeSize dictionary |
| InteropUtils.cs | Added UTF16PtrToStr() overloads for converting unmanaged UTF-16 pointers to managed strings |
| DataSetUtils.cs | Added UTF8ByteSplitToArray() method to correctly split UTF-8 byte buffers without character/byte offset issues |
| CSharpInputDataSet.cs | Added DotNetWChar case for reading UTF-16 input data with proper byte-to-character conversion; updated DotNetChar to use byte-level splitting |
| CSharpOutputDataSet.cs | Modified ExtractColumns() to accept input column metadata; added DotNetWChar case with GetUnicodeStringArray(); updated GC pinning from fixed to GCHandle |
| CSharpParamContainer.cs | Added DotNetWChar cases in AddParamValue() and ReplaceParamValue(); added ReplaceUnicodeStringParam() helper method |
| CSharpSession.cs | Updated to pass input column metadata to ExtractColumns() for type preservation |
| CSharpDataSet.cs | Exposed Columns property to enable metadata propagation |
| README.md | Updated supported data types list to include SQL_C_WCHAR and nvarchar(n) |
| CSharpExtensionApiTests.h | Added GetWStringOutputParam() test helper declaration |
| CSharpTestExecutor.cs | Added CSharpTestExecutorWStringParam class for Unicode output parameter testing |
| CSharpInitParamTests.cpp | Expanded InitWStringParamTest with comprehensive nchar/nvarchar test scenarios including Unicode characters |
| CSharpGetOutputParamTests.cpp | Added GetWStringOutputParamTest and GetWStringOutputParam() helper for validating Unicode output parameters |

@monamaki
Contributor

> Fixed a bug in CSharpInputDataSet.cs where multi-byte UTF-8 characters (e.g., Euro symbol € = 3 bytes) were incorrectly split. The previous implementation: Converted the entire UTF-8 byte buffer to a .NET string

What will happen in the case of a very large string? What happened before this change, and what happens after it?

@monamaki
Contributor

How did you find the following issue?

> Fixed a bug in CSharpInputDataSet.cs where multi-byte UTF-8 characters (e.g., Euro symbol € = 3 bytes) were incorrectly split.

Do we have a test to cover it now?

Contributor

@monamaki monamaki left a comment

Please see inline comments.

@SicongLiu2000
Contributor Author

> Fixed a bug in CSharpInputDataSet.cs where multi-byte UTF-8 characters (e.g., Euro symbol € = 3 bytes) were incorrectly split. The previous implementation: Converted the entire UTF-8 byte buffer to a .NET string
>
> What will happen in the case of a very large string? What happened before this change, and what happens after it?

The previous implementation converted the entire UTF-8 byte buffer to a single .NET string, then tried to chunk it by character positions. This caused two issues:

  1. Multi-byte character corruption: The lengths in strLenOrNullMap are UTF-8 byte lengths, but after converting to a .NET string (UTF-16), character positions no longer align with those byte boundaries. For example, "€" is 3 bytes in UTF-8 but 1 character in .NET, so the chunking logic would split at the wrong positions.
  2. Memory inefficiency for large strings: A 1 GB UTF-8 buffer of mostly single-byte (ASCII) text would create a ~2 GB intermediate string (since .NET strings are UTF-16), causing ~3 GB peak memory usage before the string was even chunked.

The fix processes each row's bytes individually using the exact byte length from strLenOrNullMap[i], which correctly handles multi-byte characters and avoids creating a large intermediate string.

Contributor

@Aniruddh25 Aniruddh25 left a comment

Waiting on fixes for some of the Copilot suggestions.

Changed void*[] and int*[] arrays to IntPtr[] arrays because pointer
arrays are not blittable and cannot be pinned with GCHandleType.Pinned
on .NET 6. IntPtr is blittable and works correctly across all .NET
versions.
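
As a rough illustration of the IntPtr[] approach described in this commit message, the sketch below pins two data arrays, stores their addresses in an IntPtr[] table, and pins the table itself. `ColumnPointerSketch` and its contents are hypothetical and are not taken from the PR; the Marshal reads stand in for native code that would receive the table as a `void**`.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ColumnPointerSketch
{
    public static int ReadThroughPointerTable()
    {
        int[] column0 = { 10, 20 };
        int[] column1 = { 30, 40 };

        // Pin each column so its address stays valid.
        GCHandle h0 = GCHandle.Alloc(column0, GCHandleType.Pinned);
        GCHandle h1 = GCHandle.Alloc(column1, GCHandleType.Pinned);

        // Store the addresses as IntPtr values, then pin the table as well.
        IntPtr[] table = { h0.AddrOfPinnedObject(), h1.AddrOfPinnedObject() };
        GCHandle hTable = GCHandle.Alloc(table, GCHandleType.Pinned);
        try
        {
            // What native code would receive as the pointer-to-pointers argument.
            IntPtr tableAddress = hTable.AddrOfPinnedObject();

            // Follow the second table entry, then read that column's second element.
            IntPtr secondColumn = Marshal.ReadIntPtr(tableAddress, IntPtr.Size);
            return Marshal.ReadInt32(secondColumn, sizeof(int));
        }
        finally
        {
            hTable.Free();
            h1.Free();
            h0.Free();
        }
    }
}
```

Here `ReadThroughPointerTable()` returns 40. For brevity the sketch does not guard against an exception between the individual Alloc calls; production code would want each handle released even if a later allocation fails.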