一个最近的开发需求,导出某ES索引上某嵌套字段值

promotion_details.promotion_name

符合以“官方立减”文字结尾的所有数据导出,因为es索引的结构的关系,目前es索引结构的原因,数据需要处理下之后再导出,目前es索引数据的结构大致如下

{
  "_index": "my_elasticsearch_index",
  "_type": "order",
  "_id": "3602478673110456764",
  "_version": 1699331966001,
  "found": true,
  "_source": {
    "adjust_fee": "0.00",
    "alipay_point": "0",
    "available_confirm_fee": "0.00",
    "buyer_alipay_no": "****",
    ......
    "oms_orders": [
      {
        "cart_item_no": "3602478673111456764",
        "order_item_id": "210100024285392788",
        ......
      },
      ......
    ],
    "promotion_details": [
      {
        "promotion_name": "2023天猫双11抢先购官方立减",
        ......
      },
    ],
    "sub_orders": [
      {
        "oid": "3602478673111456764",
        ......
      },
      ......
    ],
    "tid": "3602478673110456764",
    "trade_from": "WAP,WAP",
    "type": "fixed",
    ......
  }
}

tid为主订单单号、sub_orders.oid为子单单号、oms_orders.cart_item_no对应等于sub_orders.oid,有一条oid子单记录就对应有一条cart_item_no记录。

而数据导出到excel需要将一条tid主订单记录展开为对应n条子单oid记录,即如上这条es索引数据需要解开为子单信息如下,有几条子单信息就展开为几条数据。

[
  {
    "oid": "3602478673111456764",
    "tid": "3602478673110456764",
    "cart_item_no": "3602478673111456764",
    "order_item_id": "210100024285392788",
    "promotion_details": [
      {
        "promotion_name": "2023天猫双11抢先购官方立减",
        ......
      },
      ......
    ],
    "adjust_fee": "0.00",
    "alipay_point": "0",
    "available_confirm_fee": "0.00",
    "buyer_alipay_no": "****",
    "trade_from": "WAP,WAP",
    "type": "fixed",
    ......
  },
  {
    "oid": "3602478673112456764",
    "tid": "3602478673110456764",
    ......
  }
]

所以,首先我需要将这部分数据从es中筛选出来,之后再重新进行数据清洗成所需格式之后,再进行导出到excel操作

DSL语句

首先需要筛选所需数据,则写出对应DSL语句。因为 promotion_details.promotion_name 字段的类型为keyword,进行模糊匹配则需要使用wildcard语句(https://blog.csdn.net/weixin_43859729/article/details/108134329),如果你使用的text类型的字段,则需要是match语句匹配模糊搜索(https://zhuanlan.zhihu.com/p/592767668

又因为是在嵌套结构中所以最终写出的DSL语句是,

get /my_elasticsearch_index/my_elasticsearch_index_type/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "pay_time": {
              "gte": "2023-10-30 00:00:01.0",
              "lte": "2023-11-16 00:01:00.0"
            }
          }
        },
        {
          "nested": {
            "path": "promotion_details",
            "query": {
              "bool": {
                "must": [
                  {
                    "wildcard": {
                      "promotion_details.promotion_name": "*官方立减"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

将es导出的数据同步到HIVE表中,此处使用公司平台专用同步工具,不做赘述

HIVE数据JSON数组展开

接下来需要将HIVE表数据中的json数组展开。因为数据结构的关联关系,所以需要先展开子单字段,将本来一行的主单数据展开为多行的子单数据。之后再次对oms_orders字段再次进行展开,并对展开后的结果进行过滤,过滤掉展开后oid子单号不等于cart_item_no的记录即可。

在这里对HIVE表数据进行展开需要先对HIVE表中的sub_orders字段的JSON数组进行解析,比如我们有如下这样的sub_orders字段

select '[{"oid":"3602478673111456764","total_fee":"46.60"},{"oid":"3602478673111456765","total_fee":"11.31"},{"oid":"3602478673111456766","total_fee":"0.89"}]' AS sub_orders

接下来我们需要对sub_orders字段进行解析,思路很清楚

1.使用regexp_replace函数去除掉开头的“[{”字符,并去除掉结尾的“}]”字符

2.对剩下的字符串使用split函数用“},{”字符串进行分割

3.对分割后的array结果使用explode函数进行展开

可以得到如下结果

SELECT CONCAT('{', SUB_ORDER, '}') AS SUB_ORDER
FROM (
         SELECT explode(split(regexp_replace(regexp_replace(A.SUB_ORDERS, '\\[\\{', ''), '}]', ''),
                              '},\\{')) AS SUB_ORDER
         FROM (
                  SELECT '[{"oid":"3602478673111456764","total_fee":"46.60"},{"oid":"3602478673111456765","total_fee":"11.31"},{"oid":"3602478673111456766","total_fee":"0.89"}]' AS SUB_ORDERS
              ) A
     ) B

关联主单

子单数据展开之后,就可以跟主单所需字段进行关联。此处需要使用 HIVE的lateral view

lateral view语法基本使用方法和作用可以看下这样的demo代码

select 'cheungq' as id,code
lateral view explode(split('1,2,3,4',',')) as code

那么,我们可以将主单和子单信息按照如下语句的方式执行组合起来

SELECT TID,
       STATUS,
       PAY_TIME,
       concat('{', SUB_ORDER, '}') as SUB_ORDER
FROM (
         SELECT TID,
                PAY_TIME,
                SUB_ORDERS,
                STATUS
         FROM EXPORT_ORDER_TEMP
     ) A
         LATERAL VIEW
             explode(split(
                     regexp_replace(regexp_replace(SUB_ORDERS, '\\[\\{', ''), '}]', ''),
                     '},\\{')) TMP_SUB_ORDERS as SUB_ORDER

执行后可得到主单tid、status、pay_time和分别sub_order组合的结果

至此一步,我们则需要在此结果上再次使用get_json_object函数,获取json对象中的某个字段的值(《最近写HIVE SQL的一点笔记【1】》中有相关提及)

则,我们对SQL再次修改如下

SELECT TID,
       get_json_object(sub_order, '$.oid')    AS OID,
       get_json_object(sub_order, '$.status') AS SUB_STATUS,
       STATUS,
       PAY_TIME
FROM (
         SELECT TID,
                STATUS,
                PAY_TIME,
                concat('{', SUB_ORDER, '}') as SUB_ORDER
         FROM (
                  SELECT TID,
                         PAY_TIME,
                         SUB_ORDERS,
                         STATUS
                  FROM EXPORT_ORDER_TEMP
              ) A
                  LATERAL VIEW
                      explode(split(
                              regexp_replace(regexp_replace(SUB_ORDERS, '\\[\\{', ''), '}]', ''),
                              '},\\{')) TMP_SUB_ORDERS as SUB_ORDER
     ) B

至此,我们可以把子订单信息展开并和主订单信息组合成一条hiveSQL数据

再次关联cart_item_no记录

有了以上oid展开的操作之后,我们可以在此结果之后再次继续相同的操作,将oms_orders字段同样展开处理。但是展开之后有一个问题,之前在ElasticSearch中的关联关系是oms_orders.cart_item_no等于sub_orders.oid,是一一对应的关系,而直接用LATERAL VIEW展开关联之后必然有不对应的数据关系。所以我们仍需要对展开后的结果做一个过滤的处理,判断展开后的oid=cart_item_no的记录进行保留

SELECT TID,
       OID,
       SUB_STATUS,
       STATUS,
       PAY_TIME,
       OMS_ORDER_ID
FROM (
         SELECT TID,
                OID,
                SUB_STATUS,
                STATUS,
                PAY_TIME,
                get_json_object(OMS_ORDER, '$.cart_item_no')           as CART_ITEM_NO,
                NVL(get_json_object(OMS_ORDER, '$.order_id'), '')      as OMS_ORDER_ID
         FROM (
                  SELECT TID,
                         OID,
                         SUB_STATUS,
                         STATUS,
                         PAY_TIME,
                         concat('{', OMS_ORDER, '}') as OMS_ORDER
                  FROM (
                           SELECT TID,
                                  get_json_object(sub_order, '$.oid')               as OID,
                                  get_json_object(sub_order, '$.status')            as SUB_STATUS,
                                  STATUS,
                                  PAY_TIME,
                                  OMS_ORDERS
                           FROM (
                                    SELECT TID,
                                           STATUS,
                                           PAY_TIME,
                                           OMS_ORDERS,
                                           concat('{', SUB_ORDER, '}') as SUB_ORDER
                                    FROM (
                                             SELECT TID,
                                                    PAY_TIME,
                                                    NVL(OMS_ORDERS, '[{}]') AS OMS_ORDERS,
                                                    SUB_ORDERS,
                                                    STATUS
                                             FROM EXPORT_ORDER_TEMP
                                         ) A
                                             LATERAL VIEW
                                                 explode(split(
                                                         regexp_replace(regexp_replace(SUB_ORDERS, '\\[\\{', ''), '}]', ''),
                                                         '},\\{')) TMP_SUB_ORDERS as SUB_ORDER
                                ) B
                       ) C
                           LATERAL VIEW
                               explode(split(regexp_replace(regexp_replace(OMS_ORDERS, '\\[\\{', ''), '}]', ''),
                                             '},\\{')) TMP_OMS_ORDERS as OMS_ORDER
              ) D
     ) E
WHERE (E.OID = E.CART_ITEM_NO OR E.CART_ITEM_NO IS NULL)

我们对比下es中原来的数据可以看下确认下正确性

{
  "_source": {
    "oms_orders": [
      {
        "order_item_id": "210100021805281686",
        "order_id": "210100021805261686"
      },
      {
        "cart_item_no": "3626335551669894932",
        "order_id": "210100021805261686"
      }
    ],
    "pay_time": "2023-11-11 23:59:50.0",
    "status": "WAIT_BUYER_CONFIRM_GOODS",
    "sub_orders": [
      {
        "oid": "3626335551669894932",
        "status": "WAIT_BUYER_CONFIRM_GOODS",
        ......
      },
      {
        "oid": "3626335551670894932",
        "status": "WAIT_BUYER_CONFIRM_GOODS",
        ......
      }
    ],
    "tid": "3626335551668894932",
    ......
  }
}