LINQ Deep Dive: Part I
Tips for Delving in to LINQ
LINQ (language-integrated query) enables developers to work at a higher abstraction layer by making querying a first-class citizen in C#. LINQ provides the same query model for different domains, such as LINQ to Objects, LINQ to XML, and LINQ to SQL.
Although the querying concept is the same at the syntax level, we are dealing with different domains that have totally different behaviors, which LINQ cannot abstract. For instance, when you write a LINQ query that is applied on SQL Server, you rarely consider optimizing the query. SQL Server has a built-in optimizer to execute the query to yield maximum performance in the most efficient manner. However, LINQ to Objects does not have that richness; therefore, it is left to the developer to write queries that can be executed by the compiler in the most efficient manner. I뭠l share in this two-part series some of my experiences and pain that I뭭e encountered working with LINQ on a daily basis. I뭠l also demonstrate some common gotchas to avoid, as well as hidden features in LINQ that can help you write concise and expedient LINQ queries.
Tip One: Apply Filters Before a Join to Improve Query Performance
A LINQ to SQL query is translated into SQL by the LINQ provider and sent to SQL Server for execution. As a developer, you don뭪 have to worry about the order in which the join and filter clause needs to be applied to return data faster. SQL Server has a built-in query optimizer that looks at query execution plans and indexes to determine the fastest possible way to read data, such as doing a table scan or using an index or hash join to retrieve data. LINQ to Objects queries are not endowed with any such optimizers. Therefore, it is essential to understand query behaviors to try to limit the result by applying filters before join clauses, causing the projected output to be joined against a smaller subset. Joining against fewer rows offers better performance. Figure 1 shows an example that illustrates performance benefits gained by putting a where clause before joins.
Figure 1: Applying a where clause before a join makes queries run faster.
Figure 1 contains two integer collections, nums1 and nums2, which are generated using the Range extension method on the Enumerable class. The Range method creates a sequence of numbers specified within a range. The LINQ query in Figure 1 uses a join clause to find common integers in nums1 and nums2. Because I뭢 only looking for numbers that fall below the 500,000 range, I also applied a where filter to limit my search. To calculate query execution time, I recorded the start and end time of each LINQ query. Both queries in Figure 1 accomplish the same task, the difference being the first query applies the where clause after joining the two collections, and the second query limits the rows to be joined with the nums2 collection by applying the where clause before the join statement. Figure 2 shows the time taken by both queries.
Figure 2: Results from executing the query in Figure 1 show an increase of performance when you apply a where clause before the join statement.
Results from Figure 2 confirm there is a magnitude of speed difference when you apply a where clause before the join statement. Applying the where clause before took only 6.8 seconds, compared to the original query, which took 16.5 seconds. Remember, when executing queries on in-memory objects, understanding your data is essential so you can limit the search results before joining it to another table. Most of these capabilities are provided for you by SQL Server when you write LINQ to SQL queries, but in LINQ to Objects, developers are responsible for writing queries that can perform optimally.
Tip Two: Filter LINQ Queries Using Object Reference Comparison
You are not required to explicitly specify the primary key column in the where clause of the query if the LINQ query is filtered based on a primary key. Comparing object references will ensure that the query you send to the database gets filtered based on the primary key column. This sometimes removes noise from the code and makes it easier to read. Figure 3 demonstrates using object comparison to apply a where clause to retrieve only products in the Beverages category.
Figure 3: Applying a where filter to return products in the Beverages category.
As Figure 3 shows, I returned products that belong to the Beverages category. Instead of filtering products based on CategoryID, I filtered the results by comparing the category on the product object with the category I retrieved from the database. The object reference comparison gets converted to a primary key comparison on SQL Server. This can be confirmed by the SQL capture displayed below the LINQ query in Figure 3. Notice that the Products table is filtered based on the CategoryID column.
Tip Three: Apply Aggregates to Empty Collections in LINQ to SQL Queries
LINQ to SQL has support for aggregate operators, like Sum, Count, and Average. For collections implementing IQueryable, aggregate operations are performed on the database. Certain aggregate operators like Sum and Max would cause LINQ to SQL queries to throw an exception. For instance, to determine the total sales an employee has made, you would group the Orders table by employee and sum the order total for each order to get their TotalSales. But what happens when there are employees in the database who have not created an order? Those employees will not have any records present in the Orders table, and SQL Server will return null for their sum (instead of 0).
The implementation on SQL Server for calculating the sum on empty rows differs from how LINQ to Objects applies Sum on empty collections. For instance, if an employee has not created an Order, getting its Orders would return an empty collection of Orders. When you apply Sum on an empty collection, you would end up with 0 totals. However, for the same scenario, SQL Server would return a null value for TotalSales. So be aware that if the aggregation is performed on SQL Server, the type that is going to hold the sum must be a nullabe type. If the query is performed in memory, you can be assured that results will not be null. Let뭩 look at an example of comparing both LINQ to SQL and LINQ to Objects implementations for getting total revenue for each employee (see Figure 4).
Figure 4: An aggregate operation applied on an empty collection on the database requires casting to nullable types.
As Figure 4 shows, I calculated the total revenue generated by each employee. In the first section of the code, I preloaded all the Orders and OrderDetails for all employees by making use of DataLoadOptions. By using DataLoadOptions, I뭢 telling the LINQ to SQL query engine that while you are retrieving employees, also fetch all the orders for that employee, and, for each Order, also retrieve its OrderDetails. By preloading all the orders from the database, I can force all the aggregate operations to be performed in memory. The query in Figure 4 displays the employee name, and for each employee I am flattening the list of OrderDetails. Because each Order has many OrderDetails, I make use of the SelectMany operator to flatten the hierarchy of OrderDetails into one collection that contains all the OrderDetails for all the orders created by a particular Employee. Once I뭭e obtained all the OrderDetails, I sum the OrderDetails total for each order by multiplying the quantity and price available on the OrderDetails entity.
Figure 5 shows the output of printing the anonymous type on the screen. Notice that we have an employee named zeeshan that has no orders created; hence, the sales total shows a value of 0. For the same query to work in LINQ to SQL, I must cast results returned from the Sum operation to a nullable decimal type. SQL Server returns null for TotalSales when there is no order created by an employee. If you do not make your TotalSales nullable, LINQ to SQL will throw an exception complaining null values cannot be assigned to types that are not defined as nullable.
Figure 5: Results from a Sum operation applied on SQL Server using LINQ to SQL and in memory using LINQ to Objects.
Tip Four: Delay Loading a Property in LINQ to SQL
LINQ to SQL by default supports lazy loading of entity refs and EntitySets. This lazy-loading behavior allows the developer to load only the data needed for processing. If you don뭪 require additional collections, there is no need to incur the performance degradation and memory consumption for loading additional collections from the database.
But what happens if there are certain columns in your table entity that are expensive to load and you rarely need them in your user interface. It would be beneficial in these scenarios to take advantage of lazy loading on certain columns of the table. One way to achieve this is to write a LINQ query that returns an anonymous type that contains only the data you need to display on the Presentation layer. However, an anonymous type cannot be passed across tiers in a strongly typed manner. Another option is to partly populate your LINQ entity with only columns you need to display. In this option, you must at least populate your LINQ entity with a column marked as the primary key, because object tracking uses primary key columns to track entities. Both options have weaknesses, because if you don뭪 load the column at the time you run the query, those columns never get loaded � and you don뭪 get a second chance to load the property unless you call Refresh to refresh the entity from the database.
The LINQ to SQL designer provides a Delay Loaded property on every column in the table. When you set the Delay Loaded property to True on a column, the LINQ to SQL engine will exclude that column from the query when it makes a request to the database. Only when you access that column will LINQ make a request to SQL Server to get only that column value. It is useful in scenarios when you have defined columns as XML data type, varchar(max), or varbinary. Figure 6 shows setting the Delay Loaded property on the LINQ to SQL designer.
Figure 6: Code generated by the LINQ to SQL designer when you set Delay Loaded to True.
Figure 6 shows a preview where I marked the Picture column with Delay Loaded set to True. On changing the setting in the Properties window, the LINQ to SQL designer will create a private variable of type System.Data.Linq<T>, where T is the type of property (Binary in this case). System.Data.Linq.Link has the smartness to enable lazy loading on the property. After setting the Picture column to be lazy loaded, querying the Categories table will exclude the Picture column. Figure 7 confirms the behavior by printing the SQL sent to the database. Notice that our query does not include any reference to the Picture column.
Figure 7: The SQL query sent by LINQ to SQL excluded the Picture column.
Setting lazy loading in the designer is a good option, but it applies lazy-loading behavior for the property for all the queries in your application. What happens when you have a genuine need to have the properties not be lazy loaded on certain queries. In those cases you can use the familiar LoadWith option to load properties. LoadWith not only works with foreign key tables, but you also can use that to load lazy-loaded properties. Figure 8 demonstrates loading the Picture property when the Category entity is retrieved. Therefore, this time, my SQL-generated code includes the Picture column, despite the fact that the Picture property on the Category entity has deferred-loading enabled.
Figure 8: The Picture column is loaded when we fetch the Category entity from the database.
Tip Five: Use Table-valued Functions with Eager Loading Turned On
User-defined functions provide composability for LINQ to SQL queries compared to stored procedures. When you call a stored procedure from your datacontext, the collection returned is IEnumerable<T>. If further transformations are applied on the query, such as a where or order by clause, they get executed in memory. These kinds of queries are deceiving, because they compile fine and do not throw exceptions at run time. Because part of the query gets executed on SQL Server and any transformations applied are done in memory, you could end up bringing lots of rows on the Web server, which could increase your network traffic and increase memory consumption. However, table-valued functions return IQueryable, and are marked with the IsComposable attribute set to True. This allows us to further compose our queries by joining it to another table and adding where, order by, or any other LINQ operators. Then the entire query gets sent to SQL Server for execution.
Table-valued functions work well, but they tend to give incorrect results if you turn on eager loading for child collections. For instance, if I have a table-valued function that returns categories and I뭢 eager-loading products for the category as well, then when I print the categories it will return the correct number of categories, but they all would be the same. Figure 9 shows a table-valued function named BvgAndSeaFood that returns two categories (Beverages and Produce). To use the function, I dragged the function on the categories table to the LINQ to SQL designer to indicate that my function returns a collection of Category objects. In the query, I made use of DataLoadOptions to eagerly-load products for the categories returned from my function named BvgAndSeaFood. Once I뭭e configured my DataLoadOptions, I assigned the object to the datacontext. On printing the categories to the output screen, both times the Beverages category is printed. If you don뭪 use DataLoadOptions to preload products, the results printed on the console will be Beverages and Produce, as dictated by my table-valued functions. This leads me to believe there is something wrong with the implementation of table-valued functions when you turn on eager loading.
Figure 9: This query returns the Beverages category both times, even though our function dictates returning Beverages and Produce.
As mentioned earlier, one way to get around the problem is to join the Category table and select a category from the Category table. Figure 10 shows the working version where I get correct categories with no duplicates printed on the console window.
Figure 10: Joining categories returned from the function to the Category table removes duplicates of the Beverages category.
Tip Six: Put Joins in the Correct Order in a LINQ to Objects Query
If you are doing lots of joins in memory using LINQ to Objects, you must to be aware of the order in which joins are applied. For instance, if you have two sequences in memory and you have to join those sequences, always use the smaller sequence on the right side of your query and the larger sequence on the left side. It makes a considerable difference in performance. Figure 11 shows the time taken to join a large collection to a small collection. To create my first array, num1, I used the Range operator. The Range operator takes in the start position from where to start generating numbers; the second parameter indicates how many numbers to generate. Looking at Figure 11, you can see that num1 is a large array. I then created my second array using the same Range operator, but gave it a small count, which resulted in a smaller array.
Figure 11: A join is optimized because the smaller array is on the right side and the larger array is on the left side.
I then joined both the sequences with the smaller sequence nums2 on the right side and the larger sequence nums1 on the left side. The time taken to execute the query was 0.18 seconds. If I were to rewrite the query slightly differently by putting the smaller array on the left and the larger array on the right, it would dramatically increase the query execution time to 2.6 seconds (see Figure 12). Not only does the query time increase, but the memory consumption goes up because the LINQ query engine must load the entire collection on the right side � and if that happens to be a larger collection, it increases the query time and occupies more memory.
Figure 12: The larger array on the right side of the join causes the query time to increase.
Tip Seven: Compose a LINQ Query Inside a Loop
LINQ queries can be composed by dynamically building a lambda statement and appending it to the existing query. For instance, you can loop through an array and for each iteration build a lambda statement and add it to an existing query. If the lambda statement makes use of a looping variable, results would be inconsistent if the looping variable is not captured correctly. These side effects return misleading results and introduce bugs that are very difficult to find.
The problem can be explained by a simple demonstration. In Figure 13, I have an array of cities containing London and Paris. I also have a collection of customers where some customers are from London and some are from Paris. I am looping through my array of cities and building a LINQ query. Inside the loop, I am creating a lambda statement that matches the city in the array and picking the first customer. When I run the query I expect to get two customers, one from London and one from Paris. However, the output on the console window only shows a customer with CustID:3 from Paris. The result is incorrect because the variable city remains in scope and changes with every iteration of the loop. The last iteration of the loop returns the value of Paris, so all predicates using the city parameter in their lambda expression get affected and use the new value of Paris.
Figure 13: The side effects if the looping variable is not captured correctly when building LINQ queries dynamically.
To solve the problem we must declare another variable inside the loop that captures the value of the looping variable. When the looping variable changes as you iterate the loop, our local variable is not affected. The local variable goes out of scope at the end of each iteration of the loop and, hence, is not affected by the changing value of the looping variable. Consequently, with each iteration of the loop we create a lambda statement that captures a different local variable.
To avoid the side effect of the city variable being changed, I can declare another variable, _city, inside the loop, which captures the value of the city at that moment and passes that to the lambda statements (see Figure 14). Once that iteration ends, the local variable _city goes out of scope and does not cause any side effects. The output printed on the console confirms we are getting two customers, one from London and one from Paris.
Figure 14: Declare a local variable to avoid the side effects of looping variables and, thus, return correct customers from London and Paris.
LINQ queries make the code more readable because the code specifies what needs to be done and leaves to the compiler the underlying details of how it needs to be done. Although LINQ abstracts different programming domains, such as LINQ to Objects, LINQ to XML, and LINQ to SQL, it cannot hide the complexity and different behavior of each domain. It is essential that we understand how each domain interprets the LINQ query and ensure that a particular query works optimally for the domain we are targeting.
In this installment we explored different query behaviors specific to each domain, such as how where clauses are executed in memory as compared to being applied on the database, and what performance tradeoffs we must consider. We also learned how aggregate behavior varies in each domain where LINQ to Objects would return 0 for empty collections and LINQ to SQL would return null. It is important that we understand these differences and ensure that our query returns correct results for the specific domain we are targeting.
To improve query performance we learned how to delay-load a particular column in the database that is not used often. We also discussed some of the constraints of using table-valued functions when eager-loading is turned on, and how to overcome those inconsistencies. We also talked about how to put joins in the proper order to ensure a query runs faster and does not occupy lots of memory. Finally, we learned that when you are composing queries inside a loop, you must take precaution to capture the looping variable correctly; otherwise, your LINQ expressions will be filtering data based on the last value of the looping variable, resulting in incorrect results.
We뭠l continue our exploration in Part II by introducing users to caching features offered by the object tracking service, and in which scenarios caching is used and how it ensures object reference consistency. We뭠l also discuss how LINQ to SQL automatically maps dynamic SQL to entities, and how a non-default constructor can prevent composing queries from other queries. We뭠l also learn how to understand and identify if the query is being sent to a database for execution, or if it is being applied in memory.
Source code accompanying this article is available for download.
Zeeshan Hirani is a senior developer at http://CheaperThanDirt.com. He specializes mainly in ASP.NET, AJAX, and leveraging OR-mapping solutions like LINQ and Entity Framework in business applications. He has written several articles for http://CodeProject.com and is an active member in the user-group community in Dallas/Fort Worth. He is a frequent blogger. One of his greatest achievements was writing a 500-page learning guide that walks you through different concepts and modeling scenarios supported by Entity Framework. The learning guide can be downloaded athttp://weblogs.asp.net/zeeshanhirani/archive/2008/12/18/my-christmas-present-to-the-entity-framework-community.aspx. You can find his blog at http://weblogs.asp.net/zeeshanhirani. Contact him with any questions at email@example.com.