- Community Home
- >
- Software
- >
- HPE Ezmeral Software platform
- >
- MapR-DB JSON query with UNION ALLs seems to sequen...
Categories
Company
Local Language
Forums
Discussions
Forums
- Data Protection and Retention
- Entry Storage Systems
- Legacy
- Midrange and Enterprise Storage
- Storage Networking
- HPE Nimble Storage
Discussions
Discussions
Discussions
Forums
Discussions
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
- BladeSystem Infrastructure and Application Solutions
- Appliance Servers
- Alpha Servers
- BackOffice Products
- Internet Products
- HPE 9000 and HPE e3000 Servers
- Networking
- Netservers
- Secure OS Software for Linux
- Server Management (Insight Manager 7)
- Windows Server 2003
- Operating System - Tru64 Unix
- ProLiant Deployment and Provisioning
- Linux-Based Community / Regional
- Microsoft System Center Integration
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Discussion Boards
Community
Resources
Forums
Blogs
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-02-2021 03:40 PM - last edited on 08-03-2021 03:42 PM by support_s
08-02-2021 03:40 PM - last edited on 08-03-2021 03:42 PM by support_s
MapR-DB JSON query with UNION ALLs seems to sequentially read from table
I have a MapR-DB JSON table that I'm querying as follows:
select date_as_int, sum(xflows), sum(xpackets), sum(xbytes), count(distinct(ip))
from (
select ip, cast(split(_id, '-')[2] as INT) as date_as_int, f as xflows, p as xpackets, b as xbytes from dfs.`/my_tbl` where (_id between '6-1723-1625947200' and '6-1723-1626058801')
UNION ALL
...
) group by date_as_int;
There are 20 "select ip ..." subselects in the query (i.e., the "..." part has the other 19). Each subselect scans a different set of contiguous _id values.
Table rows are small. Here's a sample:
| _id | b | f | p | ip |
+-----------------------------+----------+-----+-------+----------+
| 6-5500-1625097600-1.2.3.255 | 38414458 | 261 | 47396 | 16909311 |
The first major fragment in the query profile (03-xx-xx) indicates that there is only one minor fragment, hence it is performing all 20 subselects (i.e., all of the MapRDB scans, the projects and the union all operators associated with those subselects). This is really slowing down the query.
The 20 subselects match about 1.4 M records, which are aggregated down into the final 32 records/results. It is taking more than 10 seconds for the first major fragment (03-xx-xx) to finish, which seems to imply that the 20 MapRDB table scans are happening in sequence, rather than in parallel.
Is there a Drill planner or execution option that could make the MapRDB table scans (i.e., the subselects) for this query run in parallel?
Or, is there a way to rewrite the query to "trick" the optimizer into doing the subselect work in parallel?
The table has about 105M rows, a physical size of 5.8GB, 4 regions (region size of 4G).
I also tried a table with a region size of 256M, which has 28 regions.
The query runs the same (i.e., slow) against both tables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-02-2021 08:59 PM
08-02-2021 08:59 PM
Re: MapR-DB JSON query with UNION ALLs seems to sequentially read from table
Mark,
What happens with trial queries that are rewritten forms of your query?
- first, what happens if you rewrite this as a simple aggregation over the union all rather than the nested query?
- second, what happens if you rewrite the UNION ALL into a sub-query that uses a WHERE exp1 OR exp2 ?
- third, what happens if you rewrite the nested queries using common table expressions (CTE):
with t1 as (
select ip, cast(split(_id, '-')[2] as INT) as date_as_int, f as xflows, p as xpackets, b as xbytes from dfs.`/my_tbl` where (_id between '6-1723-1625947200' and '6-1723-1626058801')
UNION ALL
...
),
select date_as_int, sum(...), sum(...) from t1 group by date_as_int
My own expectation is the options #1 and #3 will not matter at all, but #2 might make a big difference.
My guess is that your UNION is inside a view which might make this rewrite hard, but let's get some progress before facing that challenge.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-03-2021 04:29 PM - edited 08-04-2021 09:23 AM
08-03-2021 04:29 PM - edited 08-04-2021 09:23 AM
Re: MapR-DB JSON query with UNION ALLs seems to sequentially read from table
Hi Ted,
Thanks for responding.
I'm not sure how to try #1. Note that the count(distinct(ip)) in the outer query prevents us from doing any aggregation in the subselects. Our query is doing the following: given "n" port-protocol pairs, find all scan records (in my_tbl) that match those pairs for a certain timeframe (the 3rd part of the _id field is the timestamp), then sum (per hour) the flows, packets, and bytes of those records, and (this is the "kicker" so to speak) get a count of the unique IPs. So if IP 1.2.3.4 was in hour 02 for two different port-protocol pairs, that IP would get counted once (not twice).
Regarding #2--where expr1 or expr2--we discovered a few months ago that Drill doesn't optimize this well for MapR-DB _id matching. If expr1 is _id between '1-200' and '1-201' and expr2 is _id between '900-200' and '900-202', then Drill sets up the scan from the lowest value (1-200) all the way through the highest value (900-202), rather than issue two separate scans for the two distinct ranges.
(This is why we constructed our query to use UNION ALLs, to force Drill to do the two separate small scans, which is much more efficient than the one (potentially) huge scan.) More details on #2 are below, as I think you'll find it interesting.
I tried #3 (CTE), and only one minor fragment is still used in the first major fragement/stage.
More details on #2:
explain plan for select ... from (
select ip, cast(split(_id, '-')[2] as INT) as date_as_int, f as xflows, p as xpackets, b as xbytes from dfs.`/my_tbl` where (_id between '1-200' and '1-202') or (_id between '900-200' and '900-202')
) group by date_as_int;
...
"tableName" : "/my_tbl",
"indexDesc" : null,
"startRow" : "AzEtMjAw",
"stopRow" : "AzkwMC0yMDIA", // <--- this is likely very far from startRow, hence a scan of a signficant portion of the table
explain plan for select _id from dfs.`/my_tbl` where _id between '1-200' and '1-202';
...
"startRow" : "AzEtMjAw",
"stopRow" : "AzEtMjAyAA==", // very close to startRow
explain plan for select _id from dfs.`/my_tbl` where _id between '900-200' and '900-202';
"startRow" : "AzkwMC0yMDA=",
"stopRow" : "AzkwMC0yMDIA", // close to startRow, but far from the 1-200 query's start row
So from the 3 explains, it looks like Drill scans the entire _id range (1-200 through 900-202) when using OR.