HPE Ezmeral Software platform

MapR-DB JSON query with UNION ALLs seems to sequentially read from table

 
MarkRajcok
Occasional Advisor

MapR-DB JSON query with UNION ALLs seems to sequentially read from table

I have a MapR-DB JSON table that I'm querying as follows:

 

select date_as_int, sum(xflows), sum(xpackets), sum(xbytes), count(distinct(ip)) 
from (
  select ip, cast(split(_id, '-')[2] as INT) as date_as_int, f as xflows, p as xpackets, b as xbytes from dfs.`/my_tbl` where (_id between '6-1723-1625947200' and '6-1723-1626058801') 
  UNION ALL
  ...
) group by date_as_int;

 

There are 20 "select ip ..." subselects in the query (i.e., the "..." part has the other 19).  Each subselect scans a different set of contiguous _id values.
Table rows are small.  Here's a sample:

 

|             _id             |    b     |  f  |   p   |    ip    |
+-----------------------------+----------+-----+-------+----------+
| 6-5500-1625097600-1.2.3.255 | 38414458 | 261 | 47396 | 16909311 |

 

The first major fragment in the query profile (03-xx-xx) indicates that there is only one minor fragment, hence it is performing all 20 subselects (i.e., all of the MapRDB scans, the projects and the union all operators associated with those subselects).  This is really slowing down the query. 

The 20 subselects match about 1.4 M records, which are aggregated down into the final 32 records/results.  It is taking more than 10 seconds for the first major fragment (03-xx-xx) to finish, which seems to imply that the 20 MapRDB table scans are happening in sequence, rather than in parallel.

Is there a Drill planner or execution option that could make the MapRDB table scans (i.e., the subselects) for this query run in parallel?
Or, is there a way to rewrite the query to "trick" the optimizer into doing the subselect work in parallel?

The table has about 105M rows, a physical size of 5.8GB, 4 regions (region size of 4G). 
I also tried a table with a region size of 256M, which has 28 regions. 
The query runs the same (i.e., slow) against both tables.

2 REPLIES 2
tdunning
HPE Pro

Re: MapR-DB JSON query with UNION ALLs seems to sequentially read from table

Mark,

What happens with trial queries that are rewritten forms of your query?

- first, what happens if you rewrite this as a simple aggregation over the union all rather than the nested query?

- second, what happens if you rewrite the UNION ALL into a sub-query that uses a WHERE exp1 OR exp2 ? 

- third, what happens if you rewrite the nested queries using common table expressions (CTE):

with t1 as (
  select ip, cast(split(_id, '-')[2] as INT) as date_as_int, f as xflows, p as xpackets, b as xbytes from dfs.`/my_tbl` where (_id between '6-1723-1625947200' and '6-1723-1626058801') 
  UNION ALL
  ...
),
select date_as_int, sum(...), sum(...) from t1 group by date_as_int

 

My own expectation is the options #1 and #3 will not matter at all, but #2 might make a big difference. 

My guess is that your UNION is inside a view which might make this rewrite hard, but let's get some progress before facing that challenge.

 

I work for HPE
MarkRajcok
Occasional Advisor

Re: MapR-DB JSON query with UNION ALLs seems to sequentially read from table

Hi Ted,
Thanks for responding.

I'm not sure how to try #1.  Note that the count(distinct(ip)) in the outer query prevents us from doing any aggregation in the subselects.  Our query is doing the following: given "n" port-protocol pairs, find all scan records (in my_tbl) that match those pairs for a certain timeframe (the 3rd part of the _id field is the timestamp), then sum (per hour) the flows, packets, and bytes of those records, and (this is the "kicker" so to speak) get a count of the unique IPs.  So if IP 1.2.3.4 was in hour 02 for two different port-protocol pairs, that IP would get counted once (not twice).

Regarding #2--where expr1 or expr2--we discovered a few months ago that Drill doesn't optimize this well for MapR-DB _id matching.  If expr1 is _id between '1-200' and '1-201' and expr2 is _id between '900-200' and '900-202', then Drill sets up the scan from the lowest value (1-200) all the way through the highest value (900-202), rather than issue two separate scans for the two distinct ranges. 
(This is why we constructed our query to use UNION ALLs, to force Drill to do the two separate small scans, which is much more efficient than the one (potentially) huge scan.)   More details on #2 are below, as I think you'll find it interesting.

I tried #3 (CTE), and only one minor fragment is still used in the first major fragement/stage.

More details on #2:

explain plan for select ... from (
select ip, cast(split(_id, '-')[2] as INT) as date_as_int, f as xflows, p as xpackets, b as xbytes from dfs.`/my_tbl` where (_id between '1-200' and '1-202') or (_id between '900-200' and '900-202')
) group by date_as_int;
...
"tableName" : "/my_tbl",
"indexDesc" : null,
"startRow" : "AzEtMjAw",
"stopRow" : "AzkwMC0yMDIA",  // <--- this is likely very far from startRow, hence a scan of a signficant portion of the table

explain plan for select _id from dfs.`/my_tbl` where _id between '1-200' and '1-202';
...
"startRow" : "AzEtMjAw",
"stopRow" : "AzEtMjAyAA==",   // very close to startRow

explain plan for select _id from dfs.`/my_tbl` where _id between '900-200' and '900-202';

"startRow" : "AzkwMC0yMDA=",
"stopRow" : "AzkwMC0yMDIA",  // close to startRow, but far from the 1-200 query's start row

So from the 3 explains, it looks like Drill scans the entire _id range (1-200 through 900-202) when using OR.