【hive】nested select * from(select * from order by..)tmp多级嵌套问题

#0.缘起
现象描述：执行如下sql语句，因为含有`order by limit`，正常执行时(asc)，结果是`增序`的；加上代理且开启脱敏后顺序变成`降序`,执行图如下:
```
select * from (
select t.rn
  , t.rec_site_name
    , t.ratio_site_over
    , t.onum / 10000 onum
    , t.num_site_over / 10000 num_site_over
    , t.stipulated_deadline
    , b.categorycode
    , c.sum_latetime / c.num_latecar
    , c.sum_waittime / c.num_latecar
    , c.num_latecar / c.num_latecar
from yygk.yygk_top30_submit_abnormal_site t
left join dim.baseorganize b on t.rec_site_name = b.fullname and b.deletionstatecode = 0
left join tmp.tmp_jean_sitecarinfo c on b.leaguename = c.full_name 
where t.ds = '20200610'
order by categorycode asc, rn asc
limit 1000
) dataclub_sql_limit300000 limit 300000;
```
> hive直接执行

<center>
![正常](https://leanote.com/api/file/getImage?fileId=5ee1f9c3ab64416aa70008a8)
</center>

> proxy-hive开启脱敏

<center>
![yichang](https://leanote.com/api/file/getImage?fileId=5ee1f9ddab64416aa70008aa)
</center>
# 1.原生hive二级嵌套本来就会改变顺序结构(不通过代理，直连hive)

在对hive进行脱敏时，针对hive sql中含有`order by/group by/sort by`等语句时，会在`原语句`的基础上添加`select * from(原语句) tmp`,但客户如果执行一级嵌套语句(含有order by limit)时，改写后将变成二级嵌套，此时顺序会变掉,传统oracle数据库也是同样问题，顺序也会变掉。

一级嵌套
```
select * from(select .. from.. order by ... limit ...)tmp limit ..
```
二级嵌套
```
select * from(select * from(select .. from.. order by ...limit ...)tmp limit ...) tmp
```
## 1.1 原本语句
```
SELECT id,phone,email  FROM sen_t1 ORDER BY id desc;
```
<center>
![0](https://leanote.com/api/file/getImage?fileId=5ee2d319ab644168ab0012be)
</center>

## 1.2 一级嵌套

```
select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp limit 5;
```
<center>
![1](https://leanote.com/api/file/getImage?fileId=5ee2d383ab644168ab0012c2)
</center>

## 1.3 二级嵌套
```
select * from(select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp limit 5)tmp2;
```
<center>
![2](https://leanote.com/api/file/getImage?fileId=5ee2d41eab644168ab0012ca)
</center>
**NOTICE:** **`此时顺序已经变化，由降序变为增序`**

---

**tips:如果把一级嵌套里limit 5 去掉，结果又是正常顺序**
```
select * from(select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp )tmp2;
```
<center>
![1](https://leanote.com/api/file/getImage?fileId=5ee2d545ab644168ab0012d3)
</center>

---

观察这两个语句①` explain select * from(select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp limit 5)tmp2;` and ②`explain select * from(select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp )tmp2;`这两个语句的执行计划，可以看到①因为里面多了`limit 5`字段而多了一次mapreduce过程，且在多的一次mapreduce过程中,Reduce Output Operator中sort order: 为空，这可能是导致①返回结果顺序变化的原因。

> 执行计划 explain select * from(select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp limit 5)tmp2;
```
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: sen_t1
            Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), phone (type: string), email (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: int)
                sort order: -
                Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
                TopN Hash Memory Usage: 0.1
                value expressions: _col1 (type: string), _col2 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 5
            Statistics: Num rows: 5 Data size: 370 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              sort order: 
              Statistics: Num rows: 5 Data size: 370 Basic stats: COMPLETE Column stats: NONE
              TopN Hash Memory Usage: 0.1
              value expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: int), VALUE._col1 (type: string), VALUE._col2 (type: string)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 5 Data size: 370 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 5
            Statistics: Num rows: 5 Data size: 370 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 5 Data size: 370 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```

>执行计划:explain select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp limit 5;
```
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: sen_t1
            Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), phone (type: string), email (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: int)
                sort order: -
                Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
                TopN Hash Memory Usage: 0.1
                value expressions: _col1 (type: string), _col2 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 5
            Statistics: Num rows: 5 Data size: 370 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 5 Data size: 370 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
    Fetch Operator
      limit: 5
      Processor Tree:
        ListSink
```

>执行计划:explain select * from(select * from (SELECT id,phone,email  FROM sen_t1 ORDER BY id desc )tmp )tmp2;
```
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: sen_t1
            Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: int), phone (type: string), email (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: int)
                sort order: -
                Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string), _col2 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: int), VALUE._col0 (type: string), VALUE._col1 (type: string)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 8 Data size: 592 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```

信息安全从业人员^_^

导航

最近发表

友情链接