一个最近的开发需求,导出某ES索引上某嵌套字段值
promotion_details.promotion_name
符合以“官方立减”文字结尾的所有数据导出,因为es索引的结构的关系,目前es索引结构的原因,数据需要处理下之后再导出,目前es索引数据的结构大致如下
{
"_index": "my_elasticsearch_index",
"_type": "order",
"_id": "3602478673110456764",
"_version": 1699331966001,
"found": true,
"_source": {
"adjust_fee": "0.00",
"alipay_point": "0",
"available_confirm_fee": "0.00",
"buyer_alipay_no": "****",
......
"oms_orders": [
{
"cart_item_no": "3602478673111456764",
"order_item_id": "210100024285392788",
......
},
......
],
"promotion_details": [
{
"promotion_name": "2023天猫双11抢先购官方立减",
......
},
],
"sub_orders": [
{
"oid": "3602478673111456764",
......
},
......
],
"tid": "3602478673110456764",
"trade_from": "WAP,WAP",
"type": "fixed",
......
}
}
tid为主订单单号、sub_orders.oid为子单单号、oms_orders.cart_item_no对应等于sub_orders.oid,有一条oid子单记录就对应有一条cart_item_no记录。
而数据导出到excel需要将一条tid主订单记录展开为对应n条子单oid记录,即如上这条es索引数据需要解开为子单信息如下,有几条子单信息就展开为几条数据。
[
{
"oid": "3602478673111456764",
"tid": "3602478673110456764",
"cart_item_no": "3602478673111456764",
"order_item_id": "210100024285392788",
"promotion_details": [
{
"promotion_name": "2023天猫双11抢先购官方立减",
......
},
......
],
"adjust_fee": "0.00",
"alipay_point": "0",
"available_confirm_fee": "0.00",
"buyer_alipay_no": "****",
"trade_from": "WAP,WAP",
"type": "fixed",
......
},
{
"oid": "3602478673112456764",
"tid": "3602478673110456764",
......
}
]
所以,首先我需要将这部分数据从es中筛选出来,之后再重新进行数据清洗成所需格式之后,再进行导出到excel操作
DSL语句
首先需要筛选所需数据,则写出对应DSL语句。因为 promotion_details.promotion_name
字段的类型为keyword,进行模糊匹配则需要使用wildcard
语句(https://blog.csdn.net/weixin_43859729/article/details/108134329),如果你使用的text类型的字段,则需要是match语句匹配模糊搜索(https://zhuanlan.zhihu.com/p/592767668)
又因为是在嵌套结构中所以最终写出的DSL语句是,
get /my_elasticsearch_index/my_elasticsearch_index_type/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"pay_time": {
"gte": "2023-10-30 00:00:01.0",
"lte": "2023-11-16 00:01:00.0"
}
}
},
{
"nested": {
"path": "promotion_details",
"query": {
"bool": {
"must": [
{
"wildcard": {
"promotion_details.promotion_name": "*官方立减"
}
}
]
}
}
}
}
]
}
}
}
将es导出的数据同步到HIVE表中,此处使用公司平台专用同步工具,不做赘述
HIVE数据JSON数组展开
接下来需要将HIVE表数据中的json数组展开。因为数据结构的关联关系,所以需要先展开子单字段,将本来一行的主单数据展开为多行的子单数据。之后再次对oms_orders字段再次进行展开,并对展开后的结果进行过滤,过滤掉展开后oid子单号不等于cart_item_no的记录即可。
在这里对HIVE表数据进行展开需要先对HIVE表中的sub_orders字段的JSON数组进行解析,比如我们有如下这样的sub_orders字段
select '[{"oid":"3602478673111456764","total_fee":"46.60"},{"oid":"3602478673111456765","total_fee":"11.31"},{"oid":"3602478673111456766","total_fee":"0.89"}]' AS sub_orders
接下来我们需要对sub_orders字段进行解析,思路很清楚
1.使用regexp_replace函数去除掉开头的“[{”
字符,并去除掉结尾的“}]”
字符
2.对剩下的字符串使用split函数用“},{”
字符串进行分割
3.对分割后的array结果使用explode函数进行展开
可以得到如下结果
SELECT CONCAT('{', SUB_ORDER, '}') AS SUB_ORDER
FROM (
SELECT explode(split(regexp_replace(regexp_replace(A.SUB_ORDERS, '\\[\\{', ''), '}]', ''),
'},\\{')) AS SUB_ORDER
FROM (
SELECT '[{"oid":"3602478673111456764","total_fee":"46.60"},{"oid":"3602478673111456765","total_fee":"11.31"},{"oid":"3602478673111456766","total_fee":"0.89"}]' AS SUB_ORDERS
) A
) B
关联主单
子单数据展开之后,就可以跟主单所需字段进行关联。此处需要使用 HIVE的lateral view
lateral view
语法基本使用方法和作用可以看下这样的demo代码
select 'cheungq' as id,code
lateral view explode(split('1,2,3,4',',')) as code
那么,我们可以将主单和子单信息按照如下语句的方式执行组合起来
SELECT TID,
STATUS,
PAY_TIME,
concat('{', SUB_ORDER, '}') as SUB_ORDER
FROM (
SELECT TID,
PAY_TIME,
SUB_ORDERS,
STATUS
FROM EXPORT_ORDER_TEMP
) A
LATERAL VIEW
explode(split(
regexp_replace(regexp_replace(SUB_ORDERS, '\\[\\{', ''), '}]', ''),
'},\\{')) TMP_SUB_ORDERS as SUB_ORDER
执行后可得到主单tid、status、pay_time和分别sub_order组合的结果
至此一步,我们则需要在此结果上再次使用get_json_object函数,获取json对象中的某个字段的值(《最近写HIVE SQL的一点笔记【1】》中有相关提及)
则,我们对SQL再次修改如下
SELECT TID,
get_json_object(sub_order, '$.oid') AS OID,
get_json_object(sub_order, '$.status') AS SUB_STATUS,
STATUS,
PAY_TIME
FROM (
SELECT TID,
STATUS,
PAY_TIME,
concat('{', SUB_ORDER, '}') as SUB_ORDER
FROM (
SELECT TID,
PAY_TIME,
SUB_ORDERS,
STATUS
FROM EXPORT_ORDER_TEMP
) A
LATERAL VIEW
explode(split(
regexp_replace(regexp_replace(SUB_ORDERS, '\\[\\{', ''), '}]', ''),
'},\\{')) TMP_SUB_ORDERS as SUB_ORDER
) B
至此,我们可以把子订单信息展开并和主订单信息组合成一条hiveSQL数据
再次关联cart_item_no记录
有了以上oid展开的操作之后,我们可以在此结果之后再次继续相同的操作,将oms_orders字段同样展开处理。但是展开之后有一个问题,之前在ElasticSearch中的关联关系是oms_orders.cart_item_no等于sub_orders.oid,是一一对应的关系,而直接用LATERAL VIEW展开关联之后必然有不对应的数据关系。所以我们仍需要对展开后的结果做一个过滤的处理,判断展开后的oid=cart_item_no的记录进行保留
SELECT TID,
OID,
SUB_STATUS,
STATUS,
PAY_TIME,
OMS_ORDER_ID
FROM (
SELECT TID,
OID,
SUB_STATUS,
STATUS,
PAY_TIME,
get_json_object(OMS_ORDER, '$.cart_item_no') as CART_ITEM_NO,
NVL(get_json_object(OMS_ORDER, '$.order_id'), '') as OMS_ORDER_ID
FROM (
SELECT TID,
OID,
SUB_STATUS,
STATUS,
PAY_TIME,
concat('{', OMS_ORDER, '}') as OMS_ORDER
FROM (
SELECT TID,
get_json_object(sub_order, '$.oid') as OID,
get_json_object(sub_order, '$.status') as SUB_STATUS,
STATUS,
PAY_TIME,
OMS_ORDERS
FROM (
SELECT TID,
STATUS,
PAY_TIME,
OMS_ORDERS,
concat('{', SUB_ORDER, '}') as SUB_ORDER
FROM (
SELECT TID,
PAY_TIME,
NVL(OMS_ORDERS, '[{}]') AS OMS_ORDERS,
SUB_ORDERS,
STATUS
FROM EXPORT_ORDER_TEMP
) A
LATERAL VIEW
explode(split(
regexp_replace(regexp_replace(SUB_ORDERS, '\\[\\{', ''), '}]', ''),
'},\\{')) TMP_SUB_ORDERS as SUB_ORDER
) B
) C
LATERAL VIEW
explode(split(regexp_replace(regexp_replace(OMS_ORDERS, '\\[\\{', ''), '}]', ''),
'},\\{')) TMP_OMS_ORDERS as OMS_ORDER
) D
) E
WHERE (E.OID = E.CART_ITEM_NO OR E.CART_ITEM_NO IS NULL)
我们对比下es中原来的数据可以看下确认下正确性
{
"_source": {
"oms_orders": [
{
"order_item_id": "210100021805281686",
"order_id": "210100021805261686"
},
{
"cart_item_no": "3626335551669894932",
"order_id": "210100021805261686"
}
],
"pay_time": "2023-11-11 23:59:50.0",
"status": "WAIT_BUYER_CONFIRM_GOODS",
"sub_orders": [
{
"oid": "3626335551669894932",
"status": "WAIT_BUYER_CONFIRM_GOODS",
......
},
{
"oid": "3626335551670894932",
"status": "WAIT_BUYER_CONFIRM_GOODS",
......
}
],
"tid": "3626335551668894932",
......
}
}
发表评论