Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="mrs_01_1976"></a>
<h1 class="topictitle1">Data Serialization</h1>
<div id="body1595920217123"><div class="section" id="mrs_01_1976__s21e0dab541ed4cd49e4e07bf16abda26"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_1976__a1e0406d3db894f178c8c8086a17193e4">Spark supports the following serializers:</p>
<ul id="mrs_01_1976__u796a65b121304f5cae42eb015578399f"><li id="mrs_01_1976__l1214dea554124628b5fd8a7e47e1f15c">JavaSerializer</li><li id="mrs_01_1976__l3781007eab774adbb9944a021c6201ca">KryoSerializer</li></ul>
<p id="mrs_01_1976__a9ba1a4dba14c4f0baeb46f9601227965">Data serialization affects Spark application performance. For certain data formats, KryoSerializer can be up to 10 times faster than JavaSerializer. For Int data, the performance gain is negligible.</p>
<p id="mrs_01_1976__a940a2cfbb43341e8a433d078ad8ccef3">KryoSerializer depends on Twitter's Chill library. Not all Java Serializable classes are supported by KryoSerializer out of the box, so classes must be registered manually.</p>
<p id="mrs_01_1976__afa63445d6b1c44e9ba06fcb4cba9e4a7">Serialization covers both task serialization and data serialization. Only JavaSerializer can be used for Spark task serialization. Either JavaSerializer or KryoSerializer can be used for data serialization.</p>
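<p>To make the cost of Java serialization more concrete, the following minimal sketch uses only the JDK's standard object streams, which is the mechanism JavaSerializer relies on. It serializes one small object and shows that the stream is much larger than the two <b>int</b> fields it carries, partly because the class name and field descriptions are written into the stream. The <b>Point</b> class and all method names are hypothetical and exist only for this illustration; the sketch does not measure Spark or Kryo.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class JavaSerializationOverhead {

    // Hypothetical record type, used only for this illustration.
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Serializes one object with standard Java serialization.
    static byte[] serialize(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads back an object written by serialize().
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the class name appears as text inside the serialized bytes.
    static boolean containsClassName(byte[] bytes, Class<?> type) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII names stay searchable.
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        return text.contains(type.getName());
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new Point(3, 4));
        System.out.println("field payload bytes: " + 2 * Integer.BYTES);
        System.out.println("serialized bytes:    " + bytes.length);
        System.out.println("class name in stream: " + containsClassName(bytes, Point.class));
    }
}
```

<p>Running <b>main</b> prints the size of the raw fields next to the size of the serialized stream, so the per-object metadata overhead is visible directly.</p>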
</div>
<div class="section" id="mrs_01_1976__s89239366cce041b2850b40860e388a60"><h4 class="sectiontitle">Procedure</h4><p id="mrs_01_1976__afcbf858da84c48518498291642e557ba">When a Spark program runs, a large volume of data is serialized during shuffle and RDD caching. By default, JavaSerializer is used. You can configure KryoSerializer as the data serializer to improve serialization performance.</p>
<p id="mrs_01_1976__a854ecc88afce4738b17494a8f0f729cc">Add the following code to enable KryoSerializer:</p>
<ul id="mrs_01_1976__ub06bda66098c4942b8cbc8ed716daa41"><li id="mrs_01_1976__l35a17d640714461e8b84ad6d7dd61c47">Implement a class registrator and manually register the classes.<pre class="screen" id="mrs_01_1976__sd66269dabe674925918ec8eb4c6af884">package com.etl.common;

import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;

public class DemoRegistrator implements KryoRegistrator
{
    @Override
    public void registerClasses(Kryo kryo)
    {
        // AggrateKey and AggrateValue are example classes. Register your own custom classes here.
        kryo.register(AggrateKey.class);
        kryo.register(AggrateValue.class);
    }
}</pre>
<p id="mrs_01_1976__acebb66ff12ea46aaa56bb9f226531d07">You can configure <span class="parmname" id="mrs_01_1976__pdf98ef62d92b4208827b8c617e644c33"><b>spark.kryo.registrationRequired</b></span> on the Spark client to specify whether classes must be registered with Kryo. If it is set to false (the default), Kryo writes the full class name along with each object of an unregistered class, which can cause significant performance overhead. If <span class="parmname" id="mrs_01_1976__pb98a1f76dd924a7f8a83afbf6d65aad2"><b>spark.kryo.registrationRequired</b></span><strong id="mrs_01_1976__aa3e3bf6e1d00429ab1d02022ddeaccaa"> </strong>is set to <strong id="mrs_01_1976__af4bfdaa9b9e04469b34031de7fe3fa6a">true</strong>, you must register every serialized class manually: for a class that is not registered, Kryo throws an exception instead of writing the class name. Comparing <strong id="mrs_01_1976__a980a7f2e16f94bc780f794fe2ad0ff13">true</strong> with <strong id="mrs_01_1976__a248984a3ddab4be98c4ef0c5ebe1779d">false</strong>, <strong id="mrs_01_1976__a1b444208687441b1995bbbe43d51dd65">true</strong> gives better performance because no class names are written.</p>
</li><li id="mrs_01_1976__l6a8054def6b84b0eb3b8ed33afad015f">Configure KryoSerializer as the data serializer and specify the class registrator.<pre class="screen" id="mrs_01_1976__s7e5a33e8e8fc458e8bfacc1f065cae3b">import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.etl.common.DemoRegistrator")</pre>
</li></ul>
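<p>The effect of registration and of <b>spark.kryo.registrationRequired</b> described in the steps above can be illustrated with a toy model. The sketch below is not Kryo's implementation or wire format: the <b>RegistrationSketch</b> class, its ID scheme, and the fixed 4-byte ID size are all assumptions made for illustration. It only shows the trade-off: a registered class is identified by a small ID, an unregistered class is identified by its full name on every object, and strict mode rejects unregistered classes.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of class registration. NOT Kryo's actual implementation or wire format.
final class RegistrationSketch {

    // Registered classes and the small integer ID assigned to each (hypothetical scheme).
    private final Map<Class<?>, Integer> ids = new LinkedHashMap<>();

    // Plays the role of spark.kryo.registrationRequired in this model.
    private final boolean registrationRequired;

    RegistrationSketch(boolean registrationRequired) {
        this.registrationRequired = registrationRequired;
    }

    // Assigns the next free ID to the class if it has none yet.
    void register(Class<?> type) {
        ids.putIfAbsent(type, ids.size());
    }

    // Returns how many bytes this model spends identifying the class of one object.
    int classHeaderBytes(Class<?> type) {
        if (ids.containsKey(type)) {
            // Registered: a fixed-size ID (assumed to be 4 bytes here).
            return Integer.BYTES;
        }
        if (registrationRequired) {
            // Strict mode: unregistered classes are rejected.
            throw new IllegalArgumentException("Class is not registered: " + type.getName());
        }
        // Lenient mode: the full class name accompanies every object.
        return type.getName().getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        RegistrationSketch lenient = new RegistrationSketch(false);
        lenient.register(java.util.ArrayList.class);
        System.out.println("registered:   " + lenient.classHeaderBytes(java.util.ArrayList.class) + " bytes");
        System.out.println("unregistered: " + lenient.classHeaderBytes(java.util.LinkedList.class) + " bytes");

        RegistrationSketch strict = new RegistrationSketch(true);
        try {
            strict.classHeaderBytes(java.util.LinkedList.class);
        } catch (IllegalArgumentException e) {
            System.out.println("strict mode:  " + e.getMessage());
        }
    }
}
```

<p>In this model the per-object cost of an unregistered class grows with the length of its name, which is why registering frequently serialized classes pays off most for small objects.</p>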
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1975.html">Spark Core Tuning</a></div>
</div>
</div>