Microsoft has taken another step towards making analysis of big data easier with the introduction of their U-SQL language, the new query language designed to run on the Azure Data Lake Store. Announced in September of this year, the Data Lake Store is a huge storage space for analyzing large quantities of unstructured data. The preview version of the Data Lake was released at the start of December, offering users a new, faster and simpler method of analyzing data of any kind.
Microsoft is clearly aiming to champion big data and analytics: back in July, CEO Satya Nadella announced Cortana Analytics, a means for managing both on-premises and cloud data. The suite is integrated with the Windows 10 virtual assistant of the same name (itself named after the AI character in the Halo franchise). Offered as a monthly subscription, Cortana Analytics boasts the following features:
- Actionable analytics
- Perceptual intelligence
- Fast and flexible
- Secure and scalable
- Personal digital assistant – Cortana
With Cortana Analytics offering predictive analytics and the Azure Data Lake providing plentiful storage for all your unstructured data, you would be forgiven for thinking Microsoft had no need for a new language as well, especially considering that SQL, an established query language, is still used by many developers.
What’s wrong with SQL?
Standard SQL-based languages are easy to use, familiar to a wide range of developers, and a powerful tool for many types of analytics. Because they are declarative, scaling, parallel execution and optimization are handled for you.
The issue, however, is that their extensibility models and their support for unstructured data and files are often ‘bolted on’, and as such are a lot more difficult to use. Tasks such as exploring the data in a file, for example, become a lot more time-consuming, because you have to create catalog objects to describe the file data or remote sources before you can query them.
Although SQL-based languages are capable, solutions built on them can be complex both to build and to maintain, and their programming models vary in consistency. They demand a lot of time and effort, and even then a complete final result is not guaranteed.
Introducing U-SQL
Addressing these issues, Microsoft built their U-SQL language “from the ground up”. U-SQL is an evolution of the declarative SQL language, allowing for native extensibility through user code written in C#. This allows for total unification in a number of areas: unifying the declarative and custom imperative coding experience, and unifying the experience around extending your language capabilities.
U-SQL is a bit like a giant fishing rod. The Azure Data Lake has been on the radar for a while, but with the addition of U-SQL there is now clarity on how useful information can be extracted from the petabytes (one petabyte is 1 million GB) of corporate data in the lake.
Simplifying big data analysis
“We’ve heard that many data engineers struggle to process data with today’s tools… Microsoft’s goal is to make big data technology simpler and more accessible to the greatest number of people possible.”
-Oliver Chiu, Microsoft Product Marketing Manager for Hadoop
Most organizations already have at least some form of big data (customer purchasing records, audio and media files, a myriad of other files) but lack the means to make any use of it. U-SQL aims to change that. By combining standard SQL keywords with C# expressions, a programmer can arrange data from an unstructured source and query it with SQL in a single script. Users can then aggregate the data into the desired form and write the output to a file or table. Plus, U-SQL will be instantly familiar to those who have some experience with SQL and C#, minimizing the time a developer would otherwise spend learning additional languages.
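To make that flow a little more concrete, here is a rough Python analogy of its three steps: apply a schema to raw text as it is read, aggregate the rows, and write the result out. To be clear, this is not U-SQL and not Microsoft’s API; the sample data, the function names (`extract`, `total_by`, `output`) and the schema are all hypothetical, and the `transform` argument merely stands in for the kind of inline C# expression a U-SQL script could use.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input: tab-separated text with no schema attached.
RAW_LOG = "alice\ten-gb\t30\nbob\ten-us\t45\ncarol\ten-gb\t12\n"

# Hypothetical schema, applied only when the data is read.
SCHEMA = [("user", str), ("region", str), ("duration", int)]


def extract(raw, schema):
    """Turn raw delimited text into typed rows (the 'extract' step)."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t")
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        yield {name: convert(value)
               for (name, convert), value in zip(schema, fields)}


def total_by(rows, key, value, transform=lambda v: v):
    """Group rows by transform(row[key]) and sum row[value] per group."""
    totals = defaultdict(int)
    for row in rows:
        totals[transform(row[key])] += row[value]
    return dict(totals)


def output(totals):
    """Serialize the aggregated totals as tab-separated text, sorted by key."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for group in sorted(totals):
        writer.writerow([group, totals[group]])
    return buffer.getvalue()


rows = extract(RAW_LOG, SCHEMA)
# str.upper plays the role of a user-supplied expression inside the query.
totals = total_by(rows, "region", "duration", transform=str.upper)
report = output(totals)
print(report)
```

The point of the analogy is the shape of the script rather than the code itself: the schema is declared at read time instead of in a catalog beforehand, and custom logic sits directly inside the query step.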
When it comes to the core capabilities of the language, some of the standout areas are as follows:
- Unifies declarative queries with the expressiveness of your user code.
- Unifies the querying of both structured and unstructured data.
- Unifies local and remote queries.
- Increases productivity and agility from the beginning.
Not quite perfect
There are a couple of limitations when it comes to U-SQL and the Azure Data Lake Store, as the Data Lake is unable to address all big data use cases. When it comes to machine learning or stream processing within the Microsoft cloud, you will have to familiarize yourself with other Azure technologies.
The specialized nature of U-SQL also means that at this time, it’s unknown whether it will be available for non-Azure (or non-Microsoft) platforms. All we can do is hope the answer is yes!