Optimize IN clause with LINQ Entity Framework with Ranges

Problem

Suppose you have this simple query :

var ids = new long[]{1,3,4,5,8,9,10,78,34,23...};
Context.MyTable.Where(a=>ids.Contains(a)).ToList()

if ids contains one million number, Linq provider will write a giant query like :

SELECT id, field1, field2 WHERE
id IN (1,3,4,5,8,9,10,78,34,23 ...) // until one million numbers

You will have an exception because it’s too much for the SQL provider to execute this query.
You’ll probably think about these 2 solutions :

  • Inserting all theses datas in a temporary table (with a third party library), then joining
  • Splitting your queries in multiple chunks

Both are bad solutions because you have to Upload millions of number to the SQL Server.

Solution

To resolve this problem, you have to group the numbers by ranges.

Range 0 : 1,2,3,4,5
Range 1 : 19,20
Range 2 : 45,46,47,48,49,50
…..
Range 10 : 125558,125559,125560,125561,125562

IList<Range> GroupArrayToRanges(IEnumerable<int> values)
{
    var res = values.Order().Select((num, index) =>
    new
    {
        Id = num,
        Rank = index,
        Group = num - index
    })
    .GroupBy(a => a.Group)
    .Select(a => new Range(a.Min(a => a.Id), a.Max(a => a.Id)))
    .ToList();
    return res;
}

Now you have the groups, you need to rewrite the correct LINQ query.

var ids = new int[]{1,3,4,5,8,9,10,78,34,23...};
var ranges = GroupArrayToRanges(ids);
Context.MyTable.Where(a=>
   (a>=ranges[0].Start && a<=ranges[0].End) 
|| (a>=ranges[1].Start && a<=ranges[1].End)
|| (a>=ranges[2].Start && a<=ranges[2].End)
...
|| (a>=ranges[10].Start && a<=ranges[10].End)  
).ToList()

Building this query dynamically would need the use of Expression Trees because of logical OR.
If we want to build this query without Expression, we need to convert to an AND query, by including the whole range first and excluding ranges like this :

var ids = new int[]{1,3,4,5,8,9,10,78,34,23...};
var ranges = GroupArrayToRanges(ids);
Context.MyTable.Where(a=>
   (a>=ranges[0].Start && a<=ranges[10].End) // include whole range
&& !(a>ranges[0].End && a<ranges[1].Start) // exclude
&& !(a>ranges[1].End && a<ranges[2].Start) // exclude
&& !(a>ranges[2].End && a<ranges[3].Start) // exclude
...
&& !(a>ranges[9].End && a<ranges[10].Start)  
).ToList()

Now you can enjoy an elegant solution without uploading tons of data.
The only drawback of the solution is it will not work if your data cannot be grouped by interval like 1,3,6,8,12,14,16,29,31 => it will create 9 groups…
But if you have 1,2,3,4…..1001045 data, that will perfectly fit !

Go further with SQL and Linq

This is the equivalent way to do in SQL

SELECT 
     MIN([t].[Id]) AS [Min], 
     MAX([t].[Id]) AS [Max]
FROM (
    SELECT [s].[Id], [s].[Id] - (
        SELECT COUNT(*)
        FROM [SuperTable] AS [s0]
        WHERE [s0].[Id] < [s].[Id]) AS [Key]
    FROM [SuperTable] AS [s]
) AS [t]
GROUP BY [t].[Key]

-- if you really need performance, you would better go with ROW_NUMBER

SELECT 
     MIN([t].[Id]) AS [Min], 
     MAX([t].[Id]) AS [Max]
FROM (
    SELECT 
	[s].[Id], 
	s.Id - (ROW_NUMBER() OVER(ORDER BY id ASC)) AS [Key]
    FROM [SuperTable] AS [s]

) AS [t]
GROUP BY [t].[Key]

-- if you have an old MYSQL server, you can emulate ROW_NUMBER with a session variable

SELECT 
     MIN(t.id) AS Min, 
     MAX(t.id) AS Max
FROM (
    SELECT 
	s.id, 
	s.id - s.RowNumber as `Key`
    FROM (    
	     SELECT id,  ((@row_number := @row_number + 1)) as RowNumber
		FROM
		T_USERS
        ORDER BY id
    ) AS s   
) AS t
GROUP BY t.Key;

And the Linq (SQL transcriptable) Syntax (without ROW_NUMBER function as it’s not supported by current LINQ providers):

var q = from s in context.Table
        orderby s.Id
        select new
        {
            Id = s.Id,           
            Rank = (from o in context.SuperTables where o.Id < s.Id select o).Count(),
            Group = s.Id - (from o in context.SuperTables where o.Id < s.Id select o).Count(),
        } into sub2
        group sub2 by sub2.Group into sub3
        select new
        {
            Min = sub3.Min(a => a.Id),
            Max = sub3.Max(a => a.Id)
        };

In a non SQL context, you would probably rewrite Select by using Select((s,index)=> syntax to perform ranking like ROW_NUMBER, it’s more efficient.

int[] data = new int[] { 1, 2, 3, 4, 5, 7, 8, 10, 11, 12 };

var res = data.Order().Select((num, index) =>
    new
    {
        Id = num,
        Rank = index,
        Group = num - index
    })
    .GroupBy(a => a.Group)
    .Select(a =>
    new
    {
        Min = a.Min(a => a.Id),
        Max = a.Max(a => a.Id)
    })
    .ToList();

Update : Use ROW_NUMBER within Entity Framework Core

I developed my own extension to support Row_Number within EntityFramework Core for SqlServer,Sqlite,Mysql,PostreSql here :

https://www.nuget.org/packages?q=Webrox.EntityFrameworkCore
Github repository : https://github.com/Poppyto/Webrox.EntityFrameworkCore

Then use it like that in order to get the id intervals :

 var res = context.Table
.Select((a, index) =>
 new
 {
	 Id = a.Id,
	 Group = a.Id - EF.Functions.RowNumber(EF.Functions.OrderBy(a.Id))
 })
 .GroupBy(a => a.Group)
 .Select(a =>
 new
 {
	 Min = a.Min(a => a.Id),
	 Max = a.Max(a => a.Id)
 })
 .ToList();
This entry was posted in Non classé and tagged , , , , , . Bookmark the permalink.

Comments are closed.