Problem
Statement
Testing Azure Data Lake (ADL)
U-SQL script is little different than testing t-sql or C# code for various
reasons. Just want to share what options
my team tried and what we chose. You can understand these options and decide
which one to pick up based on your needs.
In my case one of the source data
was in the form of XML, so apart from USQL scripts we had to develop custom
extractors, reducers and UDF for further processing. For testing of any
reducers, UDF, and custom extractors, you may not have any issues, it is normal
automation of C#.Net code. But for USQL procedures or scripts, it is little
different.
What makes it different:
·
USQL Job – Every script submission is a Job with
4 steps (Prepare, Queue, execute and Finalize)
·
Very limited DML - USQL does not support Update
or Delete DML, · Initialize and Cleanup for tests
·
Invoking USQL procedure from C# code is not
possible.
·
Output is always a file.
·
Debugging USQL is little tricky.
·
USQL is still maturing language
Pre-requisite: –
LocalRunHelper.exe
(ADL Tools for Visual Studio) is the solution for testing USQL scripts. It
enables to have your code tested before running it on the Azure Data Lake.
Testing USQL on
ADL will be costly and it is obvious we want to test it locally (on developer’s
computer) before trying out anything on ADL.
To test the USQL locally or on the build agent, we have to use
LocalRunHelper executable which comes as part of Azure Data Lake tools (ADL
Tools).
·
Like SSDT for SQL, we have Azure Data Lake and
Stream Analytics VSTS extension from Microsoft, which enables easy authoring,
debugging and tuning of Azure Data Lake Analytics queries, including U-SQL
scripts and Azure Stream Analytics jobs.
o You
can refer this link for initial set up - https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-sdk
o VS2013/2015
- https://www.microsoft.com/en-us/download/details.aspx?id=49504
o VS2017
- Azure Data Lake Tools for Visual Studio is supported in Visual Studio 2017
15.3 or above. You can install this
using Visual Studio installer.
·
We have
started using ADL Tools version 2.2.9000.1 and recently upgraded to 2.3.3000.5
(latest version as of 7/15/2018 is 2.3.4000.3).
You can create
a USQL test project and USQL unit test file
You may need to copy the necessary files (ADL Tools and CppSDK) as part of the solution and
have your test project App.config refers to this path.
<add key="dataRoot" value="%UserProfile%\AppData\Local\USQLDataRoot\" />
<add key="CppSdkDir" value="{AppDomain.BaseDirectory}\..\..\..\..\..\External\ADLTools\ScopeCppSDK" />
<add key="USQLDir" value="{AppDomain.BaseDirectory}\..\..\..\..\..\External\ADLTools\2.3.3000.5\U-SQLSDK" /><add key="postDeploymentScripts" value="{AppDomain.BaseDirectory}\..\..\..\..\ADLA\Scripts\Post-Deployment\" />
<add key="localTestScripts" value="{AppDomain.BaseDirectory}\..\..\..\TestUSqlScripts\" />
<add key="logFilePath" value="%UserProfile%\AppData\Local\USQLDataRoot\LocalRunLog.txt" />
</appSettings>
Local
environment setup: -
Before running the
tests locally or on build agent, you will have to create database, tables,
procedures, etc. If there are any custom
assemblies used, make sure the register them.
Finally copy the test data and expected results file under ‘USQLDataRoot’
folder path.
Note: Make sure
all team members using default data root path.
We used PS
script for initial environment setup and copying the test data and expected
results.
param
(
$cppSDKPath = "..\..\..\External\ADLTools\ScopeCppSDK",
$localRunSDKPath = "..\..\..\External\ADLTools\2.3.3000.2\U-SQLSDK",
$buildConfiguration = "debug"
)
# Resolve relative paths
$cppSDKFullPath = resolve-path $cppSDKPath
$localRunSDKFullPath = resolve-path $localRunSDKPath
$dataRootPath =
Join-Path $ENV:UserProfile "AppData\Local\USQLDataRoot"
$workDirPath =
Join-Path $ENV:UserProfile "AppData\Local\USQLDataRoot"
$localRunHelperExe = Join-Path $localRunSDKFullPath "LocalRunHelper.exe"
$intialLoadDataRootPath =
Join-Path $dataRootPath "Data"
if(-Not (Test-Path $intialLoadDataRootPath))
{
mkdir $intialLoadDataRootPath
}
if(-Not (Test-Path $workDirPath))
{
mkdir $workDirPath
}
# Copy data files for initial load.
Copy-Item "..\..\ADLA\Data\*.csv" $intialLoadDataRootPath -Recurse -Force
# Initialize database and schema
$output = [string] (& $localRunHelperExe run -Script "..\..\ADLA\database\InitializeDB.usql" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
Write-Host "$output"
#### Deploy tables
$excludelist = @('initializedb.usql')
$varfiles = get-childitem -path "..\..\ADLA\dbo\Tables\*" -filter "*.usql" -recurse -exclude $excludelist
$varfileset = foreach ($file in $varfiles)
{
write-host "file- '$file'"
$output = [string] (& $localrunhelperexe run -Script "$file" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
write-host "$output"
}
##### Copy and Register Assemblies
##### RegisterAssembly.usql has
the code to register the assembly in the database
$assemblyPath = "..\..\Extractors\bin\" + $buildConfiguration + "\*"
Copy-Item $assemblyPath $dataRootPath -Force
$output = [string] (& $localRunHelperExe run -Script "RegisterAssembly.usql" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
Write-Host "$output"
### Deploy usql procedures except for the exclude list
$excludelist = @('initializedb.usql')
$varfiles = get-childitem -path "..\..\ADLA\gtr\Procedures\*" -filter "*.usql" -recurse -exclude $excludelist
$varfileset = foreach ($file in $varfiles)
{
write-host "file- '$file'"
$output = [string] (& $localrunhelperexe run -Script "$file" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
write-host "$output"
}
# Copy Test data
Copy-Item "..\TestData" $dataRootPath -Recurse -Force
Copy-Item "..\ExpectedResult" $dataRootPath -Recurse -Force
# Create TestResults directory
$testResultsPath = Join-Path $dataRootPath "\TestResults"
if(-Not (Test-Path $testResultsPath))
{
mkdir $testResultsPath
}
Options
considered: -
We have considered few options for test approach and data
validations. Want to share those options and you can chose which works for you
or leave some comments for improvements.
Test options
1.
Complete all the (test) data processing thru
power shell and then run tests for just for data validation.
You will run all the data
processing procedures (to be tested) using PS script and then run the tests.
Test will include only USQL test script (extract scripts to create the test
result file), data comparison logic and asserts. in test project.
2.
Run USQL as part of test itself and have asserts
to confirm procedure ran successfully, and then run the data validation scripts
as part of same test.
We went with this approach;
samples tests below provide more details.
In either case, have to make sure
to minimize the number of calls to the procedures before running the tests. Try
to combine the test data for multiple tests and process them once. Otherwise
running tests may take hours to complete.
Data Validation options
1.
Compare Expected and Actual result files using after
hash. Below sample code gives the methods for file asserts.
/// <summary>/// Class with methods for file asserts
/// </summary>
public static class FileAssert
{
public static string GetFileHash(string filename)
{
Assert.IsTrue(File.Exists(filename));
using (var hash = new SHA1Managed())
{
var clearBytes = File.ReadAllBytes(filename);
var hashedBytes = hash.ComputeHash(clearBytes);
return ConvertBytesToHex(hashedBytes);
}
}
static string ConvertBytesToHex(byte[] bytes)
{var sb = new StringBuilder();
for (var i = 0; i < bytes.Length; i++)
{
sb.Append(bytes[i].ToString("x"));
}
return sb.ToString();
}
public static void
AreEqual(string filename1, string filename2)
{string hash1 = GetFileHash(filename1);
string hash2 = GetFileHash(filename2);
Assert.AreEqual(hash1, hash2);
}
public static void
AreEqual(long hash1, long hash2,string message){
Assert.AreEqual(hash1, hash2, message);
}
}
2.
Have USQL to do data compare - we created the
test USQL scripts to use FULL OUTER JOIN with expected and actual result set.
In this case any mismatch in data will be return to output file and we can
check the file size to make sure test script did not output any rows.
Sample Test USQL script: -
USE DATABASE [adladb];// Test script for SampleTable
DECLARE @ExpectedDataFile string = @"/ExpectedResult/SampleTable/TestER.csv";
DECLARE @ActualDataFile string = @"/TestResults/TestResult_FullOuter.csv";
DECLARE @ActualResultFile string = @"/TestResults/TestResult.csv";
@er = EXTRACT
Id int,
Name string,
Fare decimal?,
SliceDate DateTime
FROM @ExpectedDataFile
USING Extractors.Csv();
@ar =
SELECT act.Id,act.Name,
act.Fare,
act.SliceDate
FROM adladb.dbo.SampleTable AS act;
@res =
SELECT *
FROM @ar;
OUTPUT @res
TO @ActualResultFile
USING Outputters.Csv();
//FULL OUTER JOIN
@ar =SELECT act.*, exp.Id
FROM adladb.dbo.SampleTable AS act
FULL OUTER JOIN
@er AS exp
ON act.Id == exp.Id
AND act.Name == exp.Name
WHERE (act.Id IS NULL OR exp.Id IS NULL)
OR ( act.Fare != exp.Fare OR
string.IsNullOrEmpty(act.Name) != string.IsNullOrEmpty(exp.Name)
);
@res =
SELECT *FROM @ar;
OUTPUT @res
TO @ActualDataFile
USING Outputters.Csv();
Note: Above script outputs both test result (read as is) and test result from Full Outer join. You may not need both, just one of them depending on the approach you chose.
3.
Compare Cell by Cell comparison of .csv
(expected and actual files). This will
give you options to find where and which data causing the failure. We used
below code to do the comparison. We have just used for sample tests, so feel
free to use and share your feedback for any improvements.
/// <summary>
/// Compares the csv files data cell by
cell/// </summary>
public class FileCompare
{
public static DiffData CsvFileCompare(string expectedFile, string actualFile)
{
var expectedFileReader = new StreamReader(expectedFile);
var actualFileReader = new StreamReader(actualFile);
var diffData = new DiffData();
if (expectedFileReader.BaseStream.Length == 0 || actualFileReader.BaseStream.Length == 0)
{
var rowDiffData = new RowDiffData();
rowDiffData.RowNumber = 0;
rowDiffData.ColumnNumber = 0;
rowDiffData.ExpectedData = "Either expected file or actual file is empty";
rowDiffData.ActualData = "Either expected file or actual file is empty";
diffData.Differences.Add(rowDiffData);
return diffData;
}
var rowNumber = 0;
while (!expectedFileReader.EndOfStream && !actualFileReader.EndOfStream)
{
var expectedRecord = expectedFileReader.ReadLine();
var expectedSplitString = expectedRecord.Split(new string[] { "," }, StringSplitOptions.None);
var actualRecord = actualFileReader.ReadLine();
var actualSplitString = actualRecord.Split(new string[] { "," }, StringSplitOptions.None);
var expectedRecordColumnCount = expectedSplitString.Length;
var actualRecordColumnCount = actualSplitString.Length;
var minColumnsForComparisonLength = (expectedRecordColumnCount < actualRecordColumnCount) ? expectedRecordColumnCount : actualRecordColumnCount;
for (var i = 0; i < minColumnsForComparisonLength; i++)
{var expectedData = expectedSplitString[i];
var actualData = actualSplitString[i];
if (!expectedData.Equals(actualData))
{var rowDiffData = new RowDiffData();
rowDiffData.RowNumber = rowNumber;
rowDiffData.ColumnNumber = i;
rowDiffData.ExpectedData = expectedData;
rowDiffData.ActualData = actualData;
diffData.Differences.Add(rowDiffData);
}
}
if (!(expectedRecordColumnCount == actualRecordColumnCount))
{
var pendingColumnsSplitString = (expectedRecordColumnCount < actualRecordColumnCount) ? actualSplitString : expectedSplitString;
var isExpectedPending = (expectedRecordColumnCount < actualRecordColumnCount) ? false : true;
for (var i = minColumnsForComparisonLength; i < pendingColumnsSplitString.Length; i++)
{
var pendingColumnData = pendingColumnsSplitString[i];
var rowDiffData = new RowDiffData();
rowDiffData.RowNumber = rowNumber;
rowDiffData.ColumnNumber = i;
if (isExpectedPending)
{
rowDiffData.ExpectedData = pendingColumnData;
rowDiffData.ActualData = "Actual file does not have this column";
}
else
{
rowDiffData.ExpectedData = "Expected file does not have this column";
rowDiffData.ActualData = pendingColumnData;
}
diffData.Differences.Add(rowDiffData);
}
}
rowNumber++;
}
while (!expectedFileReader.EndOfStream)
{var expectedRecord = expectedFileReader.ReadLine();
var expectedSplitString = expectedRecord.Split(new string[] { "," }, StringSplitOptions.None);
for (var i = 0; i < expectedSplitString.Length; i++)
{var expectedData = expectedSplitString[i];
var rowDiffData = new RowDiffData();
rowDiffData.RowNumber = rowNumber;
rowDiffData.ColumnNumber = i;
rowDiffData.ExpectedData = expectedData;
rowDiffData.ActualData = "Actual file does not have this row";
diffData.Differences.Add(rowDiffData);
}
}
while (!actualFileReader.EndOfStream)
{var actualRecord = actualFileReader.ReadLine();
var actualSplitString = actualRecord.Split(new string[] { "," }, StringSplitOptions.None);
for (var i = 0; i < actualSplitString.Length; i++)
{
var actualData = actualSplitString[i];
var rowDiffData = new RowDiffData();
rowDiffData.RowNumber = rowNumber;
rowDiffData.ColumnNumber = i;
rowDiffData.ExpectedData = "Expected file does not have this row";
rowDiffData.ActualData = actualData;
diffData.Differences.Add(rowDiffData);
}
}
return diffData;
}
}
public class DiffData
{public DiffData()
{
Differences = new List<RowDiffData>();
}
public string Summary
{get
{
if (Differences.Count == 0)
{
return "No differences found";
}
return $"{Differences.Count} differences found";
}
}
public List<RowDiffData> Differences
{ get; set; }
}
public class RowDiffData
{public int RowNumber { get; set; }
public int ColumnNumber { get; set; }
public string ExpectedData { get; set; }
public string ActualData { get; set; }
}
Sample Test: -
Below class initialize method truncated the
table(s) and invokes to procedure to be tested and processes the test data.
// Use ClassInitialize to run code before running the first test
in the class[ClassInitialize()]
public static void MyClassInitialize(TestContext testContext)
{
// Truncate the table(s) as needed.
bool isSuccess = TestUtils.TryRunUSQLScript(ConfigurationUtility.localRunHelperPath, ConfigurationUtility.localInvokeScripts + @"TruncateTable.usql", ConfigurationUtility.dataRootPath, ConfigurationUtility.cppSdkDir, ConfigurationUtility.logFilePath);
// Run the script which will invoke the procedure to be tested
isSuccess =
TestUtils.TryRunUSQLScript(ConfigurationUtility.localRunHelperPath,
ConfigurationUtility.localInvokeScripts + @"Invoke_spProcessADLA.usql", ConfigurationUtility.dataRootPath,
ConfigurationUtility.cppSdkDir, ConfigurationUtility.logFilePath);Assert.IsTrue(isSuccess);
}
NOTE:
Scripts which invokes the procedure in ALDA database is part of your
USQL test project. These scripts are used only to call the procedures to be
tested. Below test method has sample
code for using asserts for all three options discussed in above section, you
may not do all of them.
[TestMethod]
public void TestspProcessXMLtoADLA_AirTicket(){
string ExpectedResult = ConfigurationUtility.dataRootPath + @"ExpectedResult\TesttER.csv";
string outputFile1 = ConfigurationUtility.dataRootPath + @"TestResults\TestResult.csv";
//// Delete an existing output file to avoid false results
if (File.Exists(outputFile1)){
File.Delete(outputFile1);
}
////Delete the
actual result file
string outputFile2 =
ConfigurationUtility.dataRootPath + @"TestResults\TestResult_FO.csv";//// Delete an existing output file to avoid false results
if (File.Exists(outputFile2))
{
File.Delete(outputFile2);
}
////Execute the test script which creates the test result file(s)
bool isSuccess = TestUtils.TryRunUSQLScript(ConfigurationUtility.localRunHelperPath, ConfigurationUtility.localTestScripts + @"Test.usql", ConfigurationUtility.dataRootPath, ConfigurationUtility.cppSdkDir, ConfigurationUtility.logFilePath);
Assert.IsTrue(isSuccess);
//Hash Compare files
string ActualResult = outputFile1;FileAssert.AreEqual(ExpectedResult, ActualResult);
FileInfo fi = new FileInfo(ExpectedResult);
long size = fi.Length;
Assert.IsTrue(size > 0); // Just to make sure file is not empty
fi = new FileInfo(outputFile2);
size =
fi.Length;FileAssert.AreEqual(size, 0); // FULL OUTER should result empty result set
//Assert using FileCompare
DiffData
dataDiff = FileCompare.CsvFileCompare(ExpectedResult, ActualResult);Assert.AreEqual(dataDiff.Differences.Count, 0);
}
It is better if you run all the USQL scripts once before
running tests to makes sure no compilation issues. If you want to debug any
script or if you want to run it from command prompt, you can use
LocalRunHelper.exe
From Command prompt check the available options for
LocalRunHelper.
You can also run it from the folder where you copied to
include in your solution. From command prompt you can use below command to run
a USQL script.
LocalRunHelper run
-Script
C:\ADL\Code\USQLTest\TestUsqlScripts\Test_spLoadConfigurationData.usql -DataRoot C:\Users\xxxxx\AppData\Local\USQLDataRoot\
-CppSDK C:\ADL\External\ADLTools\ScopeCppSDK
-MessageOut C:\Users\sobulis\AppData\Local\USQLDataRoot\LocalRunLog.txt
Utilities: -
We have used
few other utility classes to help testing./// <summary>
/// AppSettings values
/// </summary>
public static class ConfigurationUtility
{
public static string dataRootPath = Environment.ExpandEnvironmentVariables(ConfigurationManager.AppSettings["dataRoot"]);
public static string cppSdkDir = Path.GetFullPath(ConfigurationManager.AppSettings["CppSdkDir"]);
public static string uSQLDir = Path.GetFullPath(ConfigurationManager.AppSettings["USQLDir"]);
public static string localRunHelperPath = uSQLDir + @"\LocalRunHelper.exe";
public static string postDeploymentScripts = Path.GetFullPath(ConfigurationManager.AppSettings["postDeploymentScripts"]);
public static string localTestScripts = Path.GetFullPath(ConfigurationManager.AppSettings["localTestScripts"]);
public static string logFilePath = Environment.ExpandEnvironmentVariables(ConfigurationManager.AppSettings["logFilePath"]);
}
Below code
actually executes to USQL script by using the LocalRunHelper.exe
/// <summary>/// Class contains the methods to execute USQL script using LocalRunHelper
/// </summary>
public class TestUtils
{
public static bool TryRunUSQLScript(string localRunHelperPath, string scriptPath, string dataRoot, string cppSDKDir, string logFilePath)
{
string command = "run";
int exitCode = Execute(localRunHelperPath, command, scriptPath, dataRoot, cppSDKDir, logFilePath);
return exitCode == 0;
}
private static int Execute(string localRunHelperPath, string command, string scriptPath, string dataRoot, string cppSDKDir, string logFilePath)
{
Process process = new Process();
process.StartInfo.FileName = localRunHelperPath;
string[] input = new string[9];
input[0] = command;
input[1] = "-Script";
input[2] = scriptPath;
input[3] = "-DataRoot";
input[4] = dataRoot;
input[5] = "-CppSDK";
input[6] = cppSDKDir;
input[7] = "-MessageOut";
input[8] = logFilePath;
process.StartInfo.Arguments = genArg(input);
process.Start();
process.WaitForExit();
return process.ExitCode;
}
private static string genArg(string[] input)
{if (input == null || input.Length == 0)
{
return null;
}
return string.Join(" ", input);
}
}
Issues faced: -
·
Recently some folks in my team faced
problem running the USQL automation tests (using LocalRunHelper) after
upgrading to ADL tools version 2.3.4 Thought it could be helpful if anyone
facing similar problem.
“*** Error : (-1,-1) 'E_CSC_SYSTEM_INTERNAL: Internal
error! The ObjectManager found an invalid number of fixups. This usually
indicates a problem in the Formatter.'
*** Compile failed !”
·
When we submit a job (just ‘select’ from a table
to .csv file) from visual studio, we got empty result set. But when we ran same
script from PowerShell or Test, able to
get the result. It helped us to understand the versions are causing
issues and we thought cleaning up datarootpath might help.
After
trying out few unsuccessful options, decided to clean up the files in
USQLDataRoot including localrunmetadata and _catalog_ folder. Still, when I
submit a job to create a database, there was no error, but it didn’t create the
database. We had some power shell scripts to setup the database and other
objects. Ran the powershell script, which created the database and procedures
using localRunHelper. Then we were able
to run the tests successfully. One more thing to double check, make sure build
platform is set to “x64”.
If you get any issues similar to it, please
share it here https://stackoverflow.com/questions/51181503/usql-unit-testing-with-adl-tools-for-vs-2017-error-after-upgrading-to-2-3-4000
Things to note:
·
U-SQL SDK only support x64 environment, make
sure to set build platform target as x64 and set your test environment as x64.· Make sure to copy all dependency files (correct versions) as part of your solution.
· Do not change default data root path.
· Before running any USQL script from test, make sure it runs successfully. Exclude any audit columns like UpdatedTime, ProcessedOn, etc from data comparison.
Next Steps.
·
Explore the options to reduce the test execution
time.· Improve the File compare utility based on feedback from teams.
References: -
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-sdkhttps://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-data-lake-tools-local-run
Contributors: -
Reviewers: -