Data Management and Analytics Platform
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
Data Lakes
^^^^^^^^^^

A data lake is a new type of centralized repository that can store both
structured and unstructured data at any scale, without requiring the data
to be structured first.

First-generation data lakes use the distributed architecture of the
open-source Apache Hadoop ecosystem. They run on commodity hardware in
on-premises data centers to store and process large amounts of raw data.
The Hadoop Distributed File System (HDFS) enables customers to store data
in its native form. Administrators of first-generation data lakes must
handle complex tasks such as capacity planning, resource allocation, and
performance optimization. Due to this complexity, slow time to value, and
heavy system management workloads, many on-premises data lake projects
failed to meet expectations.

Next-Generation Data Lakes
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Next-generation data lakes are based on cloud object storage. The
cloud provides diverse high-performance, scalable, and reliable
analytics engines and huge economies of scale, making data lakes more
cost-effective and scalable.

Open Telekom Cloud's next-generation data lakes are built on Object Storage
Service (OBS) and feature storage-compute decoupling. This means that
compute and storage resources can be scaled separately, which avoids the
unbalanced allocation of compute and storage resources that occurs when
both are tied to a single node.

A data lake is a big data platform that converges data sources in
various formats within an enterprise. It provides data and compute power
under strict data permission and resource controls. A data lake is large
and is typically accompanied by multiple small data marts. The most
notable characteristic of a data lake is that one piece of data can be
analyzed in multiple ways.

The evolution of data lakes is divided into three phases:

- Offline data lake: Data is imported to the data lake more than 15
  minutes after the data is generated.

- Real-time data lake: Data is imported to the data lake in real time
  (usually in less than one minute) or quasi real time (1 to 15
  minutes) after the data is generated.

- Logical data lake: Data is integrated into a virtual data lake formed
  by multiple physically isolated data platforms.

.. image:: ../../assets/caf/image54.png

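The latency thresholds that separate offline, real-time, and quasi-real-time
ingestion can be expressed as a small classifier. The following is only an
illustrative sketch: the function name ``classify_ingestion`` and the returned
labels are hypothetical, and the handling of the exact 1-minute and 15-minute
boundaries is an assumption based on the wording above. A logical data lake is
defined by its architecture rather than by latency, so it is not covered here.

.. code-block:: python

   from datetime import timedelta

   # Thresholds taken from the phase descriptions above.
   REAL_TIME_LIMIT = timedelta(minutes=1)
   QUASI_REAL_TIME_LIMIT = timedelta(minutes=15)


   def classify_ingestion(delay: timedelta) -> str:
       """Classify the delay between data generation and import (hypothetical helper)."""
       if delay < timedelta(0):
           raise ValueError("delay must not be negative")
       if delay < REAL_TIME_LIMIT:
           return "real-time"
       # Assumption: a delay of exactly 15 minutes still counts as quasi real time.
       if delay <= QUASI_REAL_TIME_LIMIT:
           return "quasi-real-time"
       return "offline"

For example, ``classify_ingestion(timedelta(minutes=5))`` falls into the
quasi-real-time band, while a delay of one hour is classified as offline.
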
Specialized data marts store data in specific formats for query and
analysis in specific scenarios. They are an important supplement to a
data lake. Customers may choose different data marts to meet their
varied data analysis requirements. For example, customers who have
ultra-high performance requirements may choose real-time OLAP or
in-memory databases, and those who prioritize their existing
applications' requirements may choose search databases.

If customer data is used only for certain types of query and analysis,
specialized data marts can operate without a data lake. However, such
cases are now rare, and specialized data marts are usually used together
with a data lake.

Apart from the basic characteristics of data warehouses, data warehouse
marts have the following characteristics:

- They are small and flexible, and can be organized in various ways,
  such as by application, department, or region.
- They are generally defined, designed, implemented, managed, and
  maintained by business departments.
- They can be implemented quickly at a low cost. Investment can be paid
  back in a short period of time.
- They integrate a wide range of tools.

To reduce costs, it is recommended that the source data and detail data
be stored in OBS and that the summary data be stored in GaussDB(DWS).

Specialized data marts are classified into real-time and offline marts
based on application scenarios. Real-time marts are used together with
Kafka and Flink. A typical example is a mart for querying bank
transactions.

Open Telekom Cloud FusionInsight Intelligent Data Lake
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
This next-generation data lake takes full advantage of cloud-native
capabilities, such as fast deployment, auto scaling, almost infinite
scalability, cost-effective storage-compute decoupling, and serverless
data analytics services. It aims to provide enterprises with a highly
scalable, available, and intelligent next-generation data lake
ecosystem, helping enterprises reduce O&M time and costs and allowing
them to devote more resources to data analysis and business.

.. image:: ../../assets/caf/image55.png

FusionInsight provides extensive analytics services that adapt to all
types of data analysis scenarios and enable organizations of all sizes
and industries to reshape their business. Open Telekom Cloud provides
cost-effective and scalable dedicated services throughout data
collection, data management, data storage, data analysis, log analysis,
stream analysis, and machine learning (ML).

To process data on a big data platform, you must first integrate the
data into the platform. You can use different data integration tools
based on the data type. For example, you can use Data Ingestion Service
(DIS) to import data in real time, Cloud Data Migration (CDM) to move
massive amounts of on-premises data to Open Telekom Cloud, and Data
Replication Service (DRS) to migrate databases.

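The tool selection described above can be summarized as a simple lookup. The
following is a minimal sketch only: the source-kind keys and the function name
``pick_integration_tool`` are hypothetical labels chosen for illustration and
do not correspond to any Open Telekom Cloud API.

.. code-block:: python

   # Mapping of data source kinds (hypothetical labels) to the integration
   # services named in the text above.
   INTEGRATION_TOOLS = {
       "real_time_stream": "Data Ingestion Service (DIS)",
       "bulk_on_premises": "Cloud Data Migration (CDM)",
       "database": "Data Replication Service (DRS)",
   }


   def pick_integration_tool(source_kind: str) -> str:
       """Return the integration service suggested for a source kind."""
       try:
           return INTEGRATION_TOOLS[source_kind]
       except KeyError:
           known = ", ".join(sorted(INTEGRATION_TOOLS))
           raise ValueError(
               f"unknown source kind {source_kind!r}; expected one of: {known}"
           ) from None

Keeping the mapping in one dictionary makes it straightforward to review the
choice of tool per source when planning a migration.
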
DIS enables you to easily collect, process, and distribute real-time
streaming data so that you can quickly respond to new information. DIS
can be interconnected with a wide range of third-party data collection
tools and provides various cloud service connectors, agents, and SDKs.
DIS is applicable to scenarios such as device monitoring, real-time
recommendations, and log analysis in industries such as IoT, Internet,
and media. For details, visit
https://www.huaweicloud.com/intl/en-us/product/dis.html.

For details about CDM and DRS, see section "4.2.4.4 Big Data Migration
Tools".

.. tip::

   It is recommended that you store the data migrated to the cloud in
   OBS. If the data is small in size and needs to be processed in a
   timely manner, you can also store it in HDFS.

OBS is an object-based storage service that provides secure, reliable,
and low-cost data storage with unlimited capacity. OBS provides
various storage types to meet customer requirements. For details, visit
https://www.huaweicloud.com/intl/en-us/product/obs.html.

For data computing, we provide different components for different
scenarios. You can use Data Ingestion Service (DIS) for stream
processing, MapReduce Service (MRS) or Data Lake Insight (DLI) for
offline batch processing, CloudTable for real-time query,
GaussDB(DWS) for interactive analysis or BI analysis, and Cloud
Search Service (CSS) for search.

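The scenario-to-service guidance above can be captured as a lookup table when
sketching a platform design. This is an illustrative example only: the
workload keys and the helper ``services_for`` are hypothetical names, and the
service abbreviations are taken directly from the paragraph above.

.. code-block:: python

   # Workload labels (hypothetical) mapped to the services named in the text.
   COMPUTE_SERVICES = {
       "stream_processing": ["DIS"],
       "offline_batch": ["MRS", "DLI"],
       "real_time_query": ["CloudTable"],
       "interactive_or_bi_analysis": ["GaussDB(DWS)"],
       "search": ["CSS"],
   }


   def services_for(workloads):
       """Return candidate services for the given workloads, without duplicates."""
       selected = []
       for workload in workloads:
           if workload not in COMPUTE_SERVICES:
               raise ValueError(f"unknown workload: {workload!r}")
           for service in COMPUTE_SERVICES[workload]:
               # Preserve first-seen order while skipping repeats.
               if service not in selected:
                   selected.append(service)
       return selected

For a platform that needs offline batch processing and search,
``services_for(["offline_batch", "search"])`` lists MRS and DLI as batch
candidates, followed by CSS.
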
Big data analysis results can be used for enterprise management,
including report analysis, OLAP analysis, track mining, and user
tagging, helping enterprises make informed business decisions.

.. toctree::
   :maxdepth: 1