Sunday, July 22, 2018

USQL (Azure Data Lake) Test Automation



Problem Statement

Testing Azure Data Lake (ADL) U-SQL script is little different than testing t-sql or C# code for various reasons.  Just want to share what options my team tried and what we chose. You can understand these options and decide which one to pick up based on your needs.

In my case one of the source data was in the form of XML, so apart from USQL scripts we had to develop custom extractors, reducers and UDF for further processing. For testing of any reducers, UDF, and custom extractors, you may not have any issues, it is normal automation of C#.Net code. But for USQL procedures or scripts, it is little different. 

What makes it different: 
·        USQL Job – Every script submission is a Job with 4 steps (Prepare, Queue, execute and Finalize)
      ·        Very limited DML - USQL does not support Update or Delete DML,
      ·        Initialize and Cleanup​ for tests 
·        Invoking USQL procedure​ from C# code is not possible.
·        Output is always a file.
·        Debugging​ USQL is little tricky.
·        USQL is still maturing language 

Pre-requisite: –

LocalRunHelper.exe (ADL Tools for Visual Studio) is the solution for testing USQL scripts. It enables to have your code tested before running it on the Azure Data Lake.

Testing USQL on ADL will be costly and it is obvious we want to test it locally (on developer’s computer) before trying out anything on ADL.  To test the USQL locally or on the build agent, we have to use LocalRunHelper executable which comes as part of Azure Data Lake tools (ADL Tools).

·        Like SSDT for SQL, we have Azure Data Lake and Stream Analytics VSTS extension from Microsoft, which enables easy authoring, debugging and tuning of Azure Data Lake Analytics queries, including U-SQL scripts and Azure Stream Analytics jobs.



o   VS2017 - Azure Data Lake Tools for Visual Studio is supported in Visual Studio 2017 15.3 or above.  You can install this using Visual Studio installer.

·          We have started using ADL Tools version 2.2.9000.1 and recently upgraded to 2.3.3000.5 (latest version as of 7/15/2018 is 2.3.4000.3). 




You can create a USQL test project and USQL unit test file


USQL Project

















You may need to copy the necessary files (ADL Tools and CppSDK) as part of the solution and 
have your test project App.config refers to this path. 

Sample appsetting in App.config file…
  <appSettings>
    <add key="dataRoot" value="%UserProfile%\AppData\Local\USQLDataRoot\" />





    <add key="CppSdkDir" value="{AppDomain.BaseDirectory}\..\..\..\..\..\External\ADLTools\ScopeCppSDK" />
    <add key="USQLDir" value="{AppDomain.BaseDirectory}\..\..\..\..\..\External\ADLTools\2.3.3000.5\U-SQLSDK" />
    <add key="postDeploymentScripts" value="{AppDomain.BaseDirectory}\..\..\..\..\ADLA\Scripts\Post-Deployment\" />
    <add key="localTestScripts" value="{AppDomain.BaseDirectory}\..\..\..\TestUSqlScripts\" />
   <add key="logFilePath" value="%UserProfile%\AppData\Local\USQLDataRoot\LocalRunLog.txt" />
  </appSettings>


Local environment setup: -





Before running the tests locally or on build agent, you will have to create database, tables, procedures, etc.  If there are any custom assemblies used, make sure the register them.  Finally copy the test data and expected results file under ‘USQLDataRoot’ folder path.
Note: Make sure all team members using default data root path.

We used PS script for initial environment setup and copying the test data and expected results.
       param
       (
         $cppSDKPath = "..\..\..\External\ADLTools\ScopeCppSDK",
         $localRunSDKPath = "..\..\..\External\ADLTools\2.3.3000.2\U-SQLSDK",
         $buildConfiguration = "debug"
       )
       # Resolve relative paths
       $cppSDKFullPath = resolve-path $cppSDKPath
       $localRunSDKFullPath = resolve-path $localRunSDKPath
       $dataRootPath =  Join-Path $ENV:UserProfile "AppData\Local\USQLDataRoot"
       $workDirPath =  Join-Path $ENV:UserProfile "AppData\Local\USQLDataRoot"
       $localRunHelperExe = Join-Path $localRunSDKFullPath "LocalRunHelper.exe"
       $intialLoadDataRootPath =  Join-Path $dataRootPath "Data"

       if(-Not (Test-Path $intialLoadDataRootPath))
       {
             mkdir $intialLoadDataRootPath
       }
       if(-Not (Test-Path $workDirPath))
       {
             mkdir $workDirPath
       }
       # Copy data files for initial load.
       Copy-Item "..\..\ADLA\Data\*.csv" $intialLoadDataRootPath -Recurse -Force
       # Initialize database and schema
       $output = [string] (& $localRunHelperExe run -Script "..\..\ADLA\database\InitializeDB.usql" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
       Write-Host "$output"
    
       #### Deploy tables
       $excludelist = @('initializedb.usql')
       $varfiles = get-childitem -path "..\..\ADLA\dbo\Tables\*" -filter "*.usql" -recurse -exclude $excludelist
       $varfileset = foreach ($file in $varfiles)
       {
             write-host "file- '$file'"
          $output = [string] (& $localrunhelperexe run -Script "$file" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
            write-host "$output"
       }
    
       ##### Copy and Register Assemblies
           ##### RegisterAssembly.usql has the code to register the assembly in the database
       $assemblyPath = "..\..\Extractors\bin\" + $buildConfiguration + "\*" 
       Copy-Item $assemblyPath $dataRootPath -Force
       $output = [string] (& $localRunHelperExe run -Script "RegisterAssembly.usql" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
       Write-Host "$output"
      ### Deploy usql procedures except for the exclude list
       $excludelist = @('initializedb.usql')
    $varfiles = get-childitem -path "..\..\ADLA\gtr\Procedures\*" -filter "*.usql" -recurse -exclude $excludelist
       $varfileset = foreach ($file in $varfiles)
       {
             write-host "file- '$file'"
             $output = [string] (& $localrunhelperexe run -Script "$file" -WorkDir $workDirPath -DataRoot $dataRootPath -CppSdk $cppSDKFullPath 2>&1)
             write-host "$output"
       }
       # Copy Test data
       Copy-Item "..\TestData" $dataRootPath -Recurse -Force
       Copy-Item "..\ExpectedResult" $dataRootPath -Recurse -Force
       # Create TestResults directory   
       $testResultsPath = Join-Path $dataRootPath "\TestResults"
       if(-Not (Test-Path $testResultsPath))
       {
             mkdir $testResultsPath
       }

Options considered: -
We have considered few options for test approach and data validations. Want to share those options and you can chose which works for you or leave some comments for improvements.

Test options

1.      Complete all the (test) data processing thru power shell and then run tests for just for data validation.

You will run all the data processing procedures (to be tested) using PS script and then run the tests. Test will include only USQL test script (extract scripts to create the test result file), data comparison logic and asserts. in test project.

2.      Run USQL as part of test itself and have asserts to confirm procedure ran successfully, and then run the data validation scripts as part of same test.
           We went with this approach; samples tests below provide more details. 

In either case, have to make sure to minimize the number of calls to the procedures before running the tests. Try to combine the test data for multiple tests and process them once. Otherwise running tests may take hours to complete.

Data Validation options

1.      Compare Expected and Actual result files using after hash. Below sample code gives the methods for file asserts.
       /// <summary>
    /// Class with methods for file asserts
    /// </summary>
    public static class FileAssert
    {
        public static string GetFileHash(string filename)
        {
            Assert.IsTrue(File.Exists(filename));
            using (var hash = new SHA1Managed())
            {
                var clearBytes = File.ReadAllBytes(filename);
                var hashedBytes = hash.ComputeHash(clearBytes);
                return ConvertBytesToHex(hashedBytes);
            }
        }

        static string ConvertBytesToHex(byte[] bytes)
        {
            var sb = new StringBuilder();
            for (var i = 0; i < bytes.Length; i++)
            {
                sb.Append(bytes[i].ToString("x"));
            }
            return sb.ToString();
        }

        public static void AreEqual(string filename1, string filename2)
        {
            string hash1 = GetFileHash(filename1);
            string hash2 = GetFileHash(filename2);
            Assert.AreEqual(hash1, hash2);

        }
        public static void AreEqual(long hash1, long hash2,string message)
        {
           Assert.AreEqual(hash1, hash2, message);
        }
    }

2.      Have USQL to do data compare - we created the test USQL scripts to use FULL OUTER JOIN with expected and actual result set. In this case any mismatch in data will be return to output file and we can check the file size to make sure test script did not output any rows.

Sample Test USQL script: -
           USE DATABASE [adladb];
           // Test script for SampleTable
      DECLARE @ExpectedDataFile string =  @"/ExpectedResult/SampleTable/TestER.csv";
      DECLARE @ActualDataFile string =  @"/TestResults/TestResult_FullOuter.csv";
          DECLARE @ActualResultFile string =  @"/TestResults/TestResult.csv";
     @er = EXTRACT
            Id int,
            Name string,
            Fare decimal?,
            SliceDate DateTime
    FROM @ExpectedDataFile
    USING Extractors.Csv();

@ar =
 SELECT      act.Id,
            act.Name,
            act.Fare,
            act.SliceDate
    FROM adladb.dbo.SampleTable AS act;

@res =
    SELECT *
    FROM @ar;
OUTPUT @res
   TO @ActualResultFile
    USING Outputters.Csv();

//FULL OUTER JOIN
@ar =
SELECT  act.*, exp.Id
    FROM adladb.dbo.SampleTable AS act
    FULL OUTER JOIN
             @er AS exp
         ON  act.Id == exp.Id
         AND act.Name == exp.Name
       WHERE (act.Id IS NULL OR exp.Id IS NULL)
       OR  ( act.Fare != exp.Fare  OR
            string.IsNullOrEmpty(act.Name) != string.IsNullOrEmpty(exp.Name)
 );

@res =
    SELECT *
    FROM @ar;
OUTPUT @res
   TO @ActualDataFile
    USING Outputters.Csv();

Note: Above script outputs both test result (read as is) and test result from Full Outer join. You may not need both, just one of them depending on the approach you chose.

3.      Compare Cell by Cell comparison of .csv (expected and actual files).  This will give you options to find where and which data causing the failure. We used below code to do the comparison. We have just used for sample tests, so feel free to use and share your feedback for any improvements.

    /// <summary>
    /// Compares the csv files data cell by cell
    /// </summary>
public class FileCompare
    {
        public static DiffData CsvFileCompare(string expectedFile, string actualFile)
        {
            var expectedFileReader = new StreamReader(expectedFile);
            var actualFileReader = new StreamReader(actualFile);
            var diffData = new DiffData();
            if (expectedFileReader.BaseStream.Length == 0 || actualFileReader.BaseStream.Length == 0)
            {
                var rowDiffData = new RowDiffData();
                rowDiffData.RowNumber = 0;
                rowDiffData.ColumnNumber = 0;
                rowDiffData.ExpectedData = "Either expected file or actual file is empty";
                rowDiffData.ActualData = "Either expected file or actual file is empty";
                diffData.Differences.Add(rowDiffData);
                return diffData;
            }
            var rowNumber = 0;
            while (!expectedFileReader.EndOfStream && !actualFileReader.EndOfStream)
            {
               var expectedRecord = expectedFileReader.ReadLine();
                var expectedSplitString = expectedRecord.Split(new string[] { "," }, StringSplitOptions.None);
                var actualRecord = actualFileReader.ReadLine();
                var actualSplitString = actualRecord.Split(new string[] { "," }, StringSplitOptions.None);
                var expectedRecordColumnCount = expectedSplitString.Length;
                var actualRecordColumnCount = actualSplitString.Length;
                var minColumnsForComparisonLength = (expectedRecordColumnCount < actualRecordColumnCount) ? expectedRecordColumnCount : actualRecordColumnCount;

                for (var i = 0; i < minColumnsForComparisonLength; i++)
                {
                    var expectedData = expectedSplitString[i];
                    var actualData = actualSplitString[i];

                   if (!expectedData.Equals(actualData))
                    {
                       var rowDiffData = new RowDiffData();
                       rowDiffData.RowNumber = rowNumber;
                        rowDiffData.ColumnNumber = i;
                        rowDiffData.ExpectedData = expectedData;
                        rowDiffData.ActualData = actualData;
                        diffData.Differences.Add(rowDiffData);
                    }
                }
                if (!(expectedRecordColumnCount == actualRecordColumnCount))
                {
                    var pendingColumnsSplitString = (expectedRecordColumnCount < actualRecordColumnCount) ? actualSplitString : expectedSplitString;
                    var isExpectedPending = (expectedRecordColumnCount < actualRecordColumnCount) ? false : true;
                    for (var i = minColumnsForComparisonLength; i < pendingColumnsSplitString.Length; i++)
                    {
                       var pendingColumnData = pendingColumnsSplitString[i];
                        var rowDiffData = new RowDiffData();
                        rowDiffData.RowNumber = rowNumber;
                        rowDiffData.ColumnNumber = i;
                        if (isExpectedPending)
                        {
                            rowDiffData.ExpectedData = pendingColumnData;
                            rowDiffData.ActualData = "Actual file does not have this column";
                        }
                        else
                        {
                            rowDiffData.ExpectedData = "Expected file does not have this column";
                            rowDiffData.ActualData = pendingColumnData;
                        }
                        diffData.Differences.Add(rowDiffData);
                    }
                }
                rowNumber++;
            }

            while (!expectedFileReader.EndOfStream)
            {
                var expectedRecord = expectedFileReader.ReadLine();
                var expectedSplitString = expectedRecord.Split(new string[] { "," }, StringSplitOptions.None);

               for (var i = 0; i < expectedSplitString.Length; i++)
                {
                    var expectedData = expectedSplitString[i];
                    var rowDiffData = new RowDiffData();
                    rowDiffData.RowNumber = rowNumber;
                    rowDiffData.ColumnNumber = i;
                    rowDiffData.ExpectedData = expectedData;
                    rowDiffData.ActualData = "Actual file does not have this row";
                    diffData.Differences.Add(rowDiffData);
                }
            }

            while (!actualFileReader.EndOfStream)
            {
                var actualRecord = actualFileReader.ReadLine();
                var actualSplitString = actualRecord.Split(new string[] { "," }, StringSplitOptions.None);

                for (var i = 0; i < actualSplitString.Length; i++)
                {
                    var actualData = actualSplitString[i];
                    var rowDiffData = new RowDiffData();
                    rowDiffData.RowNumber = rowNumber;
                    rowDiffData.ColumnNumber = i;
                    rowDiffData.ExpectedData = "Expected file does not have this row";
                    rowDiffData.ActualData = actualData;
                    diffData.Differences.Add(rowDiffData);
                }
            }
            return diffData;
        }
  
    }

    public class DiffData
    {
        public DiffData()
        {
            Differences = new List<RowDiffData>();
        }

        public string Summary
        {
            get
            {
                if (Differences.Count == 0)
                {
                    return "No differences found";
                }
                return $"{Differences.Count} differences found";
            }
        }

        public List<RowDiffData> Differences { get; set; }
    }

    public class RowDiffData
    {
       public int RowNumber { get; set; }
       public int ColumnNumber { get; set; }
       public string ExpectedData { get; set; }
       public string ActualData { get; set; }
    }

Sample Test: -

Below class initialize method truncated the table(s) and invokes to procedure to be tested and processes the test data.
        // Use ClassInitialize to run code before running the first test in the class
        [ClassInitialize()]
        public static void MyClassInitialize(TestContext testContext)
        {
            // Truncate the table(s) as needed.
            bool isSuccess = TestUtils.TryRunUSQLScript(ConfigurationUtility.localRunHelperPath, ConfigurationUtility.localInvokeScripts + @"TruncateTable.usql", ConfigurationUtility.dataRootPath, ConfigurationUtility.cppSdkDir, ConfigurationUtility.logFilePath);

            // Run the script which will invoke the procedure to be tested
        isSuccess = TestUtils.TryRunUSQLScript(ConfigurationUtility.localRunHelperPath, ConfigurationUtility.localInvokeScripts + @"Invoke_spProcessADLA.usql", ConfigurationUtility.dataRootPath, ConfigurationUtility.cppSdkDir, ConfigurationUtility.logFilePath);
            Assert.IsTrue(isSuccess);
        }

NOTE:  Scripts which invokes the procedure in ALDA database is part of your USQL test project. These scripts are used only to call the procedures to be tested.  Below test method has sample code for using asserts for all three options discussed in above section, you may not do all of them.

       [TestMethod]
        public void TestspProcessXMLtoADLA_AirTicket()
        {
            string ExpectedResult = ConfigurationUtility.dataRootPath + @"ExpectedResult\TesttER.csv";
           string outputFile1 = ConfigurationUtility.dataRootPath + @"TestResults\TestResult.csv";

            //// Delete an existing output file to avoid false results
            if (File.Exists(outputFile1))
            {
                File.Delete(outputFile1);
            }

            ////Delete the actual result file
            string outputFile2 = ConfigurationUtility.dataRootPath + @"TestResults\TestResult_FO.csv";

            //// Delete an existing output file to avoid false results
            if (File.Exists(outputFile2))
            {
                File.Delete(outputFile2);
            }

            ////Execute the test script which creates the test result file(s)
            bool isSuccess = TestUtils.TryRunUSQLScript(ConfigurationUtility.localRunHelperPath, ConfigurationUtility.localTestScripts + @"Test.usql", ConfigurationUtility.dataRootPath, ConfigurationUtility.cppSdkDir, ConfigurationUtility.logFilePath);

           Assert.IsTrue(isSuccess);

            //Hash Compare files
            string ActualResult = outputFile1;
            FileAssert.AreEqual(ExpectedResult, ActualResult);
            FileInfo fi = new FileInfo(ExpectedResult);
            long size = fi.Length;
            Assert.IsTrue(size > 0); // Just to make sure file is not empty

            fi = new FileInfo(outputFile2);
            size = fi.Length;
            FileAssert.AreEqual(size, 0); // FULL OUTER should result empty result set

            //Assert using FileCompare
            DiffData dataDiff = FileCompare.CsvFileCompare(ExpectedResult, ActualResult);
            Assert.AreEqual(dataDiff.Differences.Count, 0);
        }

It is better if you run all the USQL scripts once before running tests to makes sure no compilation issues. If you want to debug any script or if you want to run it from command prompt, you can use LocalRunHelper.exe

From Command prompt check the available options for LocalRunHelper.

Sample parameters: -

You can also run it from the folder where you copied to include in your solution. From command prompt you can use below command to run a USQL script. 

LocalRunHelper run
-Script C:\ADL\Code\USQLTest\TestUsqlScripts\Test_spLoadConfigurationData.usql
-DataRoot C:\Users\xxxxx\AppData\Local\USQLDataRoot\
-CppSDK C:\ADL\External\ADLTools\ScopeCppSDK
-MessageOut C:\Users\sobulis\AppData\Local\USQLDataRoot\LocalRunLog.txt

Utilities: -
We have used few other utility classes to help testing.
    /// <summary>
    /// AppSettings values
    /// </summary>
    public static class ConfigurationUtility 
    {
        public static string dataRootPath = Environment.ExpandEnvironmentVariables(ConfigurationManager.AppSettings["dataRoot"]);
        public static string cppSdkDir = Path.GetFullPath(ConfigurationManager.AppSettings["CppSdkDir"]);
        public static string uSQLDir = Path.GetFullPath(ConfigurationManager.AppSettings["USQLDir"]);
        public static string localRunHelperPath = uSQLDir + @"\LocalRunHelper.exe";
        public static string postDeploymentScripts = Path.GetFullPath(ConfigurationManager.AppSettings["postDeploymentScripts"]);
        public static string localTestScripts = Path.GetFullPath(ConfigurationManager.AppSettings["localTestScripts"]);
        public static string logFilePath = Environment.ExpandEnvironmentVariables(ConfigurationManager.AppSettings["logFilePath"]);
   }

Below code actually executes to USQL script by using the LocalRunHelper.exe
    /// <summary>
    /// Class contains the methods to execute USQL script using LocalRunHelper
    /// </summary>
    public class TestUtils
    {
        public static bool TryRunUSQLScript(string localRunHelperPath, string scriptPath, string dataRoot, string cppSDKDir, string logFilePath)
        {
            string command = "run";
            int exitCode = Execute(localRunHelperPath, command, scriptPath, dataRoot, cppSDKDir, logFilePath);
            return exitCode == 0;
        }

        private static int Execute(string localRunHelperPath, string command, string scriptPath, string dataRoot, string cppSDKDir, string logFilePath)
        {
            Process process = new Process();
            process.StartInfo.FileName = localRunHelperPath;
            string[] input = new string[9];
            input[0] = command;
            input[1] = "-Script";
            input[2] = scriptPath;
            input[3] = "-DataRoot";
            input[4] = dataRoot;
            input[5] = "-CppSDK";
            input[6] = cppSDKDir;
            input[7] = "-MessageOut";
            input[8] = logFilePath;
            process.StartInfo.Arguments = genArg(input);
            process.Start();
            process.WaitForExit();
            return process.ExitCode;
        }

        private static string genArg(string[] input)
        {
            if (input == null || input.Length == 0)
            {
                return null;
            }
            return string.Join(" ", input);
        }
    }


Issues faced: -

·        Recently some folks in my team faced problem running the USQL automation tests (using LocalRunHelper) after upgrading to ADL tools version 2.3.4 Thought it could be helpful if anyone facing similar problem.

“*** Error : (-1,-1) 'E_CSC_SYSTEM_INTERNAL: Internal error! The ObjectManager found an invalid number of fixups. This usually indicates a problem in the Formatter.'
      *** Compile failed !”

·        When we submit a job (just ‘select’ from a table to .csv file) from visual studio, we got empty result set. But when we ran same script from PowerShell or Test, able to  get the result. It helped us to understand the versions are causing issues and we thought cleaning up datarootpath might help.

After trying out few unsuccessful options, decided to clean up the files in USQLDataRoot including localrunmetadata and _catalog_ folder. Still, when I submit a job to create a database, there was no error, but it didn’t create the database. We had some power shell scripts to setup the database and other objects. Ran the powershell script, which created the database and procedures using localRunHelper.  Then we were able to run the tests successfully. One more thing to double check, make sure build platform is set to “x64”.




Things to note: ​
      ·        U-SQL SDK only support x64 environment, make sure to set build platform target as x64 and set your test environment as x64​.

      ·        Make sure to copy all dependency files (correct versions) as part of your solution. ​
       ·        Do not change default data root path.
      ·        Before running any USQL script from test, make sure it runs successfully.  Exclude any audit columns like UpdatedTime, ProcessedOn, etc from data comparison.

Next Steps.
       ·        Explore the options to reduce the test execution time.
       ·        Improve the File compare utility based on feedback from teams.

References: -
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-sdk
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-data-lake-tools-local-run

Contributors: -

Reviewers: -