Data Lake Insight (DLI) is a serverless data processing and analysis service fully compatible with Apache Spark and Apache Flink ecosystems. It frees you from managing any servers.
DLI supports standard SQL and is compatible with Spark SQL and Flink SQL. It also supports multiple access modes, and is compatible with mainstream data formats. DLI supports SQL statements and Spark applications for heterogeneous data sources, including CloudTable, RDS, GaussDB(DWS), CSS, OBS, custom databases on ECSs, and offline databases.
You can query and analyze heterogeneous data sources such as RDS and GaussDB(DWS) on the cloud through multiple access methods, including the visualized interface, RESTful APIs, JDBC, and Beeline. DLI is compatible with mainstream data formats, including CSV, JSON, Parquet, and ORC.
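As an illustration of this format flexibility, the same tabular records can be serialized as CSV or JSON Lines before being stored for analysis. The sketch below uses only the Python standard library and is independent of DLI itself; the sample records are hypothetical:

```python
import csv
import io
import json

# Sample records of the kind DLI-compatible formats can represent.
rows = [
    {"id": 1, "name": "alice", "score": 98},
    {"id": 2, "name": "bob", "score": 87},
]

# Serialize to CSV (header row plus one line per record).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Serialize the same records to JSON Lines, a common data lake layout.
json_lines = "\n".join(json.dumps(r) for r in rows)

print(csv_text)
print(json_lines)
```

Either representation describes the same table, so a query engine that understands both can analyze the data without a conversion step.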
DLI is interconnected with OBS for data analysis. In this architecture, where storage and compute are decoupled, resources of the two types are charged separately, helping you reduce costs and improve resource utilization.
When you create an OBS bucket for DLI on the console, you can choose single-AZ or multi-AZ storage for data redundancy. Multi-AZ storage keeps data copies across multiple availability zones for higher availability, while single-AZ storage keeps data within one availability zone at a lower cost.
Elastic resource pools support the CCE cluster architecture for heterogeneous resources so you can centrally manage and allocate them. For details, see Elastic Resource Pool.
Elastic resource pools have the following advantages:
Resources of different queues are isolated to reduce the impact on each other.
SQL jobs can run on independent Spark instances, reducing mutual impacts between jobs.
The queue quota is updated in real time based on workload and priority.
Using elastic resource pools has the following advantages.

| Advantage | No Elastic Resource Pool | Use Elastic Resource Pool |
|---|---|---|
| Efficiency | You need to set scaling tasks repeatedly to improve the resource utilization. | Dynamic scaling can be done in seconds. |
| Resource utilization | Resources cannot be shared among queues. For example, if one queue has idle CUs while another is heavily loaded, you can only scale up the overloaded queue. | Queues added to the same elastic resource pool can share compute resources. |
| | When you set a data source, you must allocate a different network segment to each queue, which consumes a large number of VPC network segments. | You can add multiple general-purpose queues in the same elastic resource pool to one network segment, simplifying the data source configuration. |
| Resource allocation | If resources are insufficient for scale-out tasks of multiple queues, some queues will fail to be scaled out. | You can set a priority for each queue in the elastic resource pool based on peak hours to ensure proper resource allocation. |
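The priority-based allocation described above can be sketched as a simple model: when a pool's free CUs cannot satisfy every pending scale-out request, higher-priority queues are served first. The queue names, priorities, and CU figures below are hypothetical illustrations, not DLI defaults:

```python
def allocate_cus(pool_free_cus, requests):
    """Grant CU scale-out requests in descending priority order.

    requests: list of (queue_name, priority, requested_cus) tuples.
    Returns a dict mapping queue_name -> granted CUs.
    """
    granted = {}
    # Serve higher priority first; ties keep submission order (sorted is stable).
    for name, _priority, need in sorted(requests, key=lambda r: -r[1]):
        give = min(need, pool_free_cus)
        granted[name] = give
        pool_free_cus -= give
    return granted

# Hypothetical pool with 64 free CUs and three competing queues.
result = allocate_cus(64, [
    ("etl_queue", 1, 32),    # low priority
    ("bi_queue", 3, 48),     # high priority, served first
    ("adhoc_queue", 2, 32),  # medium priority
])
print(result)  # bi_queue gets 48 CUs, adhoc_queue 16, etl_queue 0
```

Under this model, a queue that peaks during business hours can be given a higher priority so that contention during peaks resolves in its favor rather than failing scale-out for every queue.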
DLI is a serverless big data query and analysis service. It has the following advantages:
A web-based service management platform is provided. You can access DLI using the management console or HTTPS-based APIs, or connect to the DLI server through the JDBC client.
You can submit SQL, Spark, or Flink jobs on the DLI management console.
Log in to the management console and choose Data Analysis > Data Lake Insight.
If you need to integrate DLI into a third-party system for secondary development, you can call DLI APIs to use the service.
For details, see Data Lake Insight API Reference.
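At the HTTP level, a DLI API call is an ordinary authenticated HTTPS request. The sketch below only constructs the request; the endpoint host, project ID, token, and request body are placeholders, and the exact URIs and payload fields should be taken from the Data Lake Insight API Reference:

```python
import json
import urllib.request

# Placeholder values; substitute your real regional endpoint, project ID,
# and an IAM token obtained from the identity service.
endpoint = "https://dli.example-region.myhuaweicloud.com"
project_id = "your-project-id"
token = "your-iam-token"

body = {"sql": "SELECT 1", "currentdb": "default"}  # illustrative payload

req = urllib.request.Request(
    url=f"{endpoint}/v1.0/{project_id}/jobs/submit-job",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json", "X-Auth-Token": token},
    method="POST",
)
# urllib.request.urlopen(req) would send the request; it is omitted here so
# the sketch stays runnable without credentials or network access.
print(req.full_url, req.get_method())
```

The `X-Auth-Token` header carries the IAM token that authenticates the caller; any HTTP client in a third-party system can issue the same request.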